<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>莫叶何竹🍀</title>
        <link>http://www.myhz0606.com/</link>
        <description>A site generated by NotionNext</description>
        <lastBuildDate>Sun, 14 Jun 2026 13:18:33 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>zh-CN</language>
        <copyright>All rights reserved 2026, 莫叶何竹🍀</copyright>
        <item>
            <title><![CDATA[Step by Step: Understanding ROPE]]></title>
            <link>http://www.myhz0606.com/article/rope</link>
            <guid>http://www.myhz0606.com/article/rope</guid>
            <pubDate>Tue, 23 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[ROPE is the positional encoding most commonly used today in both LLMs and VLLMs. This post walks step by step through my understanding of ROPE.]]></description>
            <content:encoded><![CDATA[<div id="notion-article" class="mx-auto overflow-hidden "><main class="notion light-mode notion-page notion-block-27c3c18ff81c800f8a3cf83fcd3b3158"><div class="notion-viewport"></div><div class="notion-collection-page-properties"></div><div class="notion-text notion-block-2d23c18ff81c802ea020fdbb57644d64">paper: <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://arxiv.org/abs/2104.09864">https://arxiv.org/abs/2104.09864</a></div><div class="notion-text notion-block-2d23c18ff81c80689339ef2d290e8375">ROPE is the positional encoding most commonly used today in both LLMs and VLLMs. This post walks step by step through my understanding of ROPE; corrections are welcome.</div><div class="notion-text notion-block-2d23c18ff81c803b8ac6e86222126a1c">Attention is computed as follows, assuming the current sequence length is</div><div class="notion-text notion-block-2d23c18ff81c8017b59de813e4d31e9a">In attention, each score depends on the dot product between the query and key vectors at the respective positions. Consider the query vector at position <!-- --> and the key vector at position <!-- -->, denoted</div><div class="notion-text notion-block-2d23c18ff81c801aa1acf5d9a6c631bc">The raw <!-- --> <b>carry no positional information</b>, so we need a position-dependent mapping that makes the dot product depend on the relative position <!-- -->.</div><div class="notion-text notion-block-2d23c18ff81c8018960ef885777f70a2">Concretely:</div><div class="notion-text notion-block-2d23c18ff81c80ad8ea5c692336594bf">We want to find a mapping</div><div class="notion-text notion-block-2d23c18ff81c80c39596f5fb8d82b890">Its inputs are 1) the vector at the current position and 2) the position itself; its output is the vector with positional information added. After the <!-- --> vectors pass through this mapping, their dot product should satisfy:</div><div class="notion-text notion-block-2d23c18ff81c804cbef0c6209a3c3d2c">How can we design this mapping so that the dot product of the vectors depends on the relative position <!-- -->?
A sum of positions is easy to obtain: introduce the exponential function</div><div class="notion-text notion-block-2d23c18ff81c80ab9aacfaac72496813">
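A minimal sketch of this step in scalar form. The symbols $q$, $k$ (a query and a key component), $m$, $n$ (their positions) and the map $f$ are notation assumed here for illustration only:
$$
f(q,m)=q\,e^{m},\qquad f(k,n)=k\,e^{n}\quad\Longrightarrow\quad f(q,m)\,f(k,n)=qk\,e^{m+n}
$$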
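With the operation above, multiplying two vectors that each carry an absolute position yields a result that carries the accumulated position. This is already very close to the goal.</div><div class="notion-text notion-block-2d23c18ff81c8077b900f69134b16eff">The core requirement: how do we turn the <!-- --> in the expression above into</div><div class="notion-text notion-block-2d23c18ff81c80f09691ebef70eba107">This brings to mind a known fact: the dot product of two complex vectors involves a conjugate, and that conjugate supplies exactly the minus sign we want.</div><div class="notion-text notion-block-2d23c18ff81c80739139e91cd9dda735">In fields such as quantum mechanics and signal processing, the dot product of complex vectors is defined with a conjugate so that the dot product of a vector with itself is a non-negative real number:</div><div class="notion-text notion-block-2d23c18ff81c8044af4bc4de60b150d4"> where <!-- --> is the complex conjugate of <!-- -->. PS: for <!-- -->, its complex conjugate is</div><div class="notion-text notion-block-2d23c18ff81c80d39ac0f6919ff6d10a">Returning to the expression above, we introduce the complex number <!-- --> so that the absolute position enters through the argument (phase) of a complex factor. Then, by the complex dot-product rule above:</div><div class="notion-text notion-block-2d23c18ff81c801ab4d9cd3af6ba8b78">The expression above already injects relative-position information through an absolute positional encoding, which is very close to the final form of ROPE.</div><div class="notion-text notion-block-2d23c18ff81c80d48393ed65bcecea1c">To refine this further: given <!-- -->, we can view <!-- --> as <!-- --> complex vectors, so the expression above can be written as:</div><div class="notion-text notion-block-2d23c18ff81c8061bf56f6bd5abd7247">This is already the final form of ROPE. It is called ROPE (rotary position embedding) because of the geometric meaning of complex multiplication:</div><div class="notion-text notion-block-2d23c18ff81c80928989f2cfdbb31553">for each element, <!-- --> can be understood as rotating <!-- -->.</div><div class="notion-text notion-block-2d23c18ff81c808faf9feaf86a2646ba">Recall the different representations of a complex number:</div><div class="notion-text notion-block-2d23c18ff81c806d81e5f13fdace1bff">Therefore</div><div class="notion-text notion-block-2d23c18ff81c80f6ad75c588e3d0278b">The expression above is an element-wise multiplication; for <!-- -->, the results can all be concatenated:</div><div class="notion-text notion-block-2d23c18ff81c80c98f3dc11f2d970668">Similarly, we obtain</div><div class="notion-blank notion-block-2d23c18ff81c80a2b1e9cf657de89331"> </div><div class="notion-text notion-block-2d23c18ff81c803da8fef951a5880964"></div><div class="notion-text notion-block-2d23c18ff81c80028c28eb2442e55ff8">Let us verify:</div><div class="notion-text notion-block-2d23c18ff81c802c856edc84afc91835">As we can see, the dot product of two vectors that each carry an absolute positional encoding contains relative-position information.</div><blockquote class="notion-quote notion-block-2d23c18ff81c80219e46fe1793afe212"><div>Note: the form above requires the channel count of <!-- --> to be divisible by <!-- -->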
</div></blockquote><div class="notion-blank notion-block-2d23c18ff81c80dc94b9cc868601a1a7"> </div><div class="notion-text notion-block-2d23c18ff81c80e98721f1749c30a53b">one more thing</div><div class="notion-text notion-block-2d23c18ff81c80c5b808ce919a1cfb7a">In the derivation above, we chose to group <!-- --> from top to bottom with a stride of 2 into <!-- --> groups of <!-- --> vectors, treat each group as the vector form of a complex number, and assign it a rotation matrix (i.e. <!-- -->). We are free to change this grouping rule. The one most common in practice groups the <!-- --> vectors together, which has some advantages in code simplicity and efficiency. In that case</div><div class="notion-text notion-block-2d23c18ff81c807ca485dd9fe197f8fa">is an element-wise multiplication, with the channel order changed accordingly,</div><div class="notion-text notion-block-3743c18ff81c80909573e02538fc7cd1">and the result after the dot product is equivalent:</div><div class="notion-text notion-block-2d23c18ff81c8028ae04d5140cf64fce">Source code usually uses this implementation.</div><div class="notion-blank notion-block-2d23c18ff81c8032965cdc08c2709b30"> </div><div class="notion-text notion-block-2d23c18ff81c80e88be9c02f16f954fc">The 2D form of ROPE</div><div class="notion-text notion-block-2d23c18ff81c809bae8cc3f143a27290">The goal becomes finding a <!-- --> such that</div><div class="notion-text notion-block-2d23c18ff81c808f83cfefe1a70915af">If we construct it by direct analogy with 1D ROPE:</div><div class="notion-text notion-block-2d23c18ff81c8021becff68d2f2c106f"></div><div class="notion-text notion-block-2d23c18ff81c80048d75c3c17b5e6f2e">it can only establish <!-- -->, which is not what we want.</div><div class="notion-text notion-block-2d23c18ff81c80dc92d9fb6695de79af">From the preceding sections we already have a map <!-- --> such that</div><div class="notion-text notion-block-2d23c18ff81c804b9290e52245778952">The core idea: half of the channels encode the <!-- --> position and the other half encode the <!-- --> position. In essence, this applies 1D ROPE separately in two orthogonal subspaces.</div><blockquote class="notion-quote notion-block-2d23c18ff81c8036ac72f9b21bbd3b2f"><div>Note: the form above requires the channel count of <!-- --> to be divisible by <!-- --></div></blockquote><div class="notion-text notion-block-2d23c18ff81c80cf8e78f9c4e675b750">ROPE has many other variants, such as 3D-ROPE and partial ROPE. Once the essence is understood, they are easy to grasp by analogy.</div></main></div>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A New Paradigm for Instance Segmentation: A Technical Analysis of Falcon Perception]]></title>
            <link>http://www.myhz0606.com/article/falcon_perception</link>
            <guid>http://www.myhz0606.com/article/falcon_perception</guid>
            <pubDate>Fri, 29 May 2026 16:00:00 GMT</pubDate>
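            <description><![CDATA[This paper discusses an interesting question: do dense perception tasks really need an encoder-decoder structure?
For tasks such as open-vocabulary detection, promptable segmentation, and OCR, the common approach today is roughly:
• first extract image features with a vision backbone
• then use a separate decoder or late-fusion module to turn those features into task outputs
This paradigm has proven effective in practice, but its drawbacks are also clear: it involves more modules, vision and language interact late, and overall system complexity is higher.
To suit the characteristics of dense perception, the authors propose three key designs: 1) Unified Dense Transformer with Hybrid Attention Mask; 2) Chain-of-Perception; 3) Specialized heads]]></description>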
            <content:encoded><![CDATA[<div id="notion-article" class="mx-auto overflow-hidden "><main class="notion light-mode notion-page notion-block-3653c18ff81c804990ffe5bb035482b6"><div class="notion-viewport"></div><div class="notion-collection-page-properties"><div class="notion-collection-row"><div class="notion-collection-row-body"><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M7 13A6 6 0 107 1a6 6 0 000 12zM3.751 5.323A.2.2 0 013.909 5h6.182a.2.2 0 01.158.323L7.158 9.297a.2.2 0 01-.316 0L3.751 5.323z"></path></svg><div class="notion-collection-column-title-body">type</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-select"><div class="notion-property-select-item notion-item-purple">Post</div></span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M7 13A6 6 0 107 1a6 6 0 000 12zM3.751 5.323A.2.2 0 013.909 5h6.182a.2.2 0 01.158.323L7.158 9.297a.2.2 0 01-.316 0L3.751 5.323z"></path></svg><div class="notion-collection-column-title-body">status</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-select"><div class="notion-property-select-item notion-item-red">Published</div></span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M10.889 5.5H3.11v1.556h7.778V5.5zm1.555-4.444h-.777V0H10.11v1.056H3.89V0H2.333v1.056h-.777c-.864 0-1.548.7-1.548 1.555L0 12.5c0 .856.692 1.5 1.556 1.5h10.888C13.3 14 14 13.356 14 12.5V2.611c0-.855-.7-1.555-1.556-1.555zm0 11.444H1.556V3.944h10.888V12.5zM8.556 8.611H3.11v1.556h5.445V8.61z"></path></svg><div class="notion-collection-column-title-body">date</div></div><div 
class="notion-collection-row-value"><span class="notion-property notion-property-date">May 30, 2026</span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M7 4.568a.5.5 0 00-.5-.5h-6a.5.5 0 00-.5.5v1.046a.5.5 0 00.5.5h6a.5.5 0 00.5-.5V4.568zM.5 1a.5.5 0 00-.5.5v1.045a.5.5 0 00.5.5h12a.5.5 0 00.5-.5V1.5a.5.5 0 00-.5-.5H.5zM0 8.682a.5.5 0 00.5.5h11a.5.5 0 00.5-.5V7.636a.5.5 0 00-.5-.5H.5a.5.5 0 00-.5.5v1.046zm0 3.068a.5.5 0 00.5.5h9a.5.5 0 00.5-.5v-1.045a.5.5 0 00-.5-.5h-9a.5.5 0 00-.5.5v1.045z"></path></svg><div class="notion-collection-column-title-body">slug</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-text">falcon_perception</span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M7 4.568a.5.5 0 00-.5-.5h-6a.5.5 0 00-.5.5v1.046a.5.5 0 00.5.5h6a.5.5 0 00.5-.5V4.568zM.5 1a.5.5 0 00-.5.5v1.045a.5.5 0 00.5.5h12a.5.5 0 00.5-.5V1.5a.5.5 0 00-.5-.5H.5zM0 8.682a.5.5 0 00.5.5h11a.5.5 0 00.5-.5V7.636a.5.5 0 00-.5-.5H.5a.5.5 0 00-.5.5v1.046zm0 3.068a.5.5 0 00.5.5h9a.5.5 0 00.5-.5v-1.045a.5.5 0 00-.5-.5h-9a.5.5 0 00-.5.5v1.045z"></path></svg><div class="notion-collection-column-title-body">summary</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-text">This paper discusses an interesting question: do dense perception tasks really need an encoder-decoder structure?
For tasks such as open-vocabulary detection, promptable segmentation, and OCR, the common approach today is roughly:
• first extract image features with a vision backbone
• then use a separate decoder or late-fusion module to turn those features into task outputs
This paradigm has proven effective in practice, but its drawbacks are also clear: it involves more modules, vision and language interact late, and overall system complexity is higher.
To suit the characteristics of dense perception, the authors propose three key designs: 1) Unified Dense Transformer with Hybrid Attention Mask; 2) Chain-of-Perception; 3) Specialized heads</span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M4 3a1 1 0 011-1h7a1 1 0 110 2H5a1 1 0 01-1-1zm0 4a1 1 0 011-1h7a1 1 0 110 2H5a1 1 0 01-1-1zm0 4a1 1 0 011-1h7a1 1 0 110 2H5a1 1 0 01-1-1zM2 4a1 1 0 110-2 1 1 0 010 2zm0 4a1 1 0 110-2 1 1 0 010 2zm0 4a1 1 0 110-2 1 1 0 010 2z"></path></svg><div class="notion-collection-column-title-body">tags</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-multi_select"><div class="notion-property-multi_select-item notion-item-brown">Multimodal</div><div class="notion-property-multi_select-item notion-item-red">Image Segmentation</div></span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M7 13A6 6 0 107 1a6 6 0 000 12zM3.751 5.323A.2.2 0 013.909 5h6.182a.2.2 0 01.158.323L7.158 9.297a.2.2 0 01-.316 0L3.751 5.323z"></path></svg><div class="notion-collection-column-title-body">category</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-select"><div class="notion-property-select-item notion-item-purple">Learning Notes</div></span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M7 4.568a.5.5 0 00-.5-.5h-6a.5.5 0 00-.5.5v1.046a.5.5 0 00.5.5h6a.5.5 0 00.5-.5V4.568zM.5 1a.5.5 0 00-.5.5v1.045a.5.5 0 00.5.5h12a.5.5 0 00.5-.5V1.5a.5.5 0 00-.5-.5H.5zM0 8.682a.5.5 0 00.5.5h11a.5.5 0 00.5-.5V7.636a.5.5 0 00-.5-.5H.5a.5.5 0 00-.5.5v1.046zm0 3.068a.5.5 0 00.5.5h9a.5.5 0 00.5-.5v-1.045a.5.5 0 00-.5-.5h-9a.5.5 0 00-.5.5v1.045z"></path></svg><div 
class="notion-collection-column-title-body">icon</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-text"></span></div></div><div class="notion-collection-row-property"><div class="notion-collection-column-title"><svg viewBox="0 0 14 14" class="notion-collection-column-title-icon"><path d="M7 4.568a.5.5 0 00-.5-.5h-6a.5.5 0 00-.5.5v1.046a.5.5 0 00.5.5h6a.5.5 0 00.5-.5V4.568zM.5 1a.5.5 0 00-.5.5v1.045a.5.5 0 00.5.5h12a.5.5 0 00.5-.5V1.5a.5.5 0 00-.5-.5H.5zM0 8.682a.5.5 0 00.5.5h11a.5.5 0 00.5-.5V7.636a.5.5 0 00-.5-.5H.5a.5.5 0 00-.5.5v1.046zm0 3.068a.5.5 0 00.5.5h9a.5.5 0 00.5-.5v-1.045a.5.5 0 00-.5-.5h-9a.5.5 0 00-.5.5v1.045z"></path></svg><div class="notion-collection-column-title-body">password</div></div><div class="notion-collection-row-value"><span class="notion-property notion-property-text"></span></div></div></div></div></div><table class="notion-simple-table notion-block-3703c18ff81c80c98dd9c10dc1c4abc1"><tbody><tr class="notion-simple-table-row notion-simple-table-header-row notion-block-3703c18ff81c80b7bf5fd096b52a6486"><td class="" style="width:120px"><div class="notion-simple-table-cell">ㅤ</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">info</div></td></tr><tr class="notion-simple-table-row notion-block-3703c18ff81c805e9c0ce5531912a6ff"><td class="" style="width:120px"><div class="notion-simple-table-cell">paper</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell"><a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://arxiv.org/abs/2603.27365">https://arxiv.org/abs/2603.27365</a></div></td></tr><tr class="notion-simple-table-row notion-block-3703c18ff81c806c9592dec13f0cedf5"><td class="" style="width:120px"><div class="notion-simple-table-cell">code</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell"><a target="_blank" rel="noopener noreferrer" class="notion-link" 
href="https://github.com/tiiuae/Falcon-Perception">https://github.com/tiiuae/Falcon-Perception</a></div></td></tr><tr class="notion-simple-table-row notion-block-3703c18ff81c80418afcc9e399271983"><td class="" style="width:120px"><div class="notion-simple-table-cell">org</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">TII Falcon Vision Team</div></td></tr><tr class="notion-simple-table-row notion-block-3703c18ff81c8048aeb7c3029244664f"><td class="" style="width:120px"><div class="notion-simple-table-cell">model</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell"><a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://huggingface.co/tiiuae/Falcon-Perception">https://huggingface.co/tiiuae/Falcon-Perception</a></div></td></tr><tr class="notion-simple-table-row notion-block-3703c18ff81c8086aeeef552d5f91fd3"><td class="" style="width:120px"><div class="notion-simple-table-cell">benchmark</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell"><a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://huggingface.co/datasets/tiiuae/PBench">https://huggingface.co/datasets/tiiuae/PBench</a></div></td></tr></tbody></table><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-3703c18ff81c80d8ae04ff6225b8eb2a" data-id="3703c18ff81c80d8ae04ff6225b8eb2a"><span><div id="3703c18ff81c80d8ae04ff6225b8eb2a" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80d8ae04ff6225b8eb2a" title="1 Motivation"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 
0z"></path></svg></a><span class="notion-h-title">1 Motivation</span></span></h2><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3703c18ff81c80bf86b8c21a50ee3bb4"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://file.notion.so/f/f/843944ec-ceca-4d0d-9958-d8fc309e4037/c880e0f1-946d-4a5a-9d8a-4c05c78a5f24/falcon_perception.gif?table=block&amp;id=3703c18f-f81c-80bf-86b8-c21a50ee3bb4&amp;spaceId=843944ec-ceca-4d0d-9958-d8fc309e4037&amp;expirationTimestamp=1781467200000&amp;signature=6hdyizwre7MDevJ49ouKG1Zg2I2tDJ2Hw_fYy-FcD5Q&amp;t=3703c18f-f81c-80bf-86b8-c21a50ee3bb4" alt="notion image" loading="lazy" decoding="async"/></div></figure><div class="notion-text notion-block-3703c18ff81c807fa9e4f444a99c4d78">This paper discusses an interesting question: do dense perception tasks really need an encoder-decoder structure?</div><div class="notion-text notion-block-3703c18ff81c80c5827ef84fc3db633c">For tasks such as open-vocabulary detection, promptable segmentation, and OCR, the common approach today is roughly:</div><ul class="notion-list notion-list-disc notion-block-3703c18ff81c80d3aee2f694122fb2bd"><li>first extract image features with a vision backbone</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c8001bdeefb34197dd250"><li>then use a separate decoder or late-fusion module to turn those features into task outputs</li></ul><div class="notion-text notion-block-3703c18ff81c809db093c21825c29672">This paradigm has proven effective in practice, but its drawbacks are also clear: it involves more modules, vision and language interact late, and overall system complexity is higher.</div><div class="notion-text notion-block-3703c18ff81c80baba46f9660472fe65">To suit the characteristics of dense perception, the authors propose three key designs: 1) Unified Dense Transformer with Hybrid Attention Mask; 2) Chain-of-Perception; 3) Specialized heads</div><div class="notion-text notion-block-3703c18ff81c80cabb15da3ab9d77498">Let us look at each in detail.</div><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-3703c18ff81c8016a839dc2143732d3f" data-id="3703c18ff81c8016a839dc2143732d3f"><span><div id="3703c18ff81c8016a839dc2143732d3f" 
class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c8016a839dc2143732d3f" title="2 Method"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2 Method</span></span></h2><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3703c18ff81c80799888e817dac6921b" data-id="3703c18ff81c80799888e817dac6921b"><span><div id="3703c18ff81c80799888e817dac6921b" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80799888e817dac6921b" title="2.1 Problem Definition (Chain-of-Perception)"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.1 Problem Definition (Chain-of-Perception)</span></span></h3><div class="notion-text notion-block-3703c18ff81c8067854cc1bd052d4cb6">In dense perception, the number of instances in an image can range from zero to several hundred. The key is to design an output interface that both supports variable-length instances and keeps the decoding cost from becoming too high.</div><div class="notion-text notion-block-3703c18ff81c80578f1fd97a923baad6">Falcon Perception generates variable-length instances with an autoregressive architecture. If bboxes, polygons, or masks were all discretized into ordinary tokens, every instance would produce a long output sequence, and the decoding cost would grow quickly with the number of instances. To avoid this, Falcon Perception introduces a generation scheme called Chain-of-Perception.</div><div class="notion-text 
notion-block-3703c18ff81c80a592e9fa08cdd9ac82">Concretely, suppose the prefix fed to the autoregressive model looks like:</div><div class="notion-text notion-block-3703c18ff81c8080a626e2ed5f4d7433">Its output sequence is</div><div class="notion-text notion-block-3703c18ff81c803ba39cf49386e6d60b">As the above shows, on the segmentation path each instance consists of three trigger tokens: <code class="notion-inline-code">&lt;coord&gt;</code> <code class="notion-inline-code">&lt;size&gt;</code> <code class="notion-inline-code">&lt;seg&gt;</code>.</div><ul class="notion-list notion-list-disc notion-block-3703c18ff81c806bba44f12c7ad52a9a"><li><code class="notion-inline-code">&lt;coord&gt;</code> represents the center coordinates of the instance</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c804684a7f639521e35f6"><li><code class="notion-inline-code">&lt;size&gt;</code> represents the width and height of the instance</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c80838470e15427a9d530"><li><code class="notion-inline-code">&lt;seg&gt;</code> marks the position where the instance's mask query is generated</li></ul><div class="notion-text notion-block-3703c18ff81c8007809bdb3b4fc2e758">With this approach, the model does not write the actual mask point by point into the text sequence. Instead, it compresses each instance's output interface into a few structured tokens, and the continuous coordinates and high-resolution masks are left to the specialized heads to decode. In addition, the generation order <code class="notion-inline-code">&lt;coord&gt;</code> -&gt; <code class="notion-inline-code">&lt;size&gt;</code> -&gt; <code class="notion-inline-code">&lt;seg&gt;</code> requires the model to resolve spatial ambiguity first and pixel-level details second. This is a coarse-to-fine curriculum, which makes training relatively stable.</div><blockquote class="notion-quote notion-block-3703c18ff81c8054b9bce678a14f6258"><div>Decoding the actual instance mask from <code class="notion-inline-code">&lt;coord&gt;</code> <code class="notion-inline-code">&lt;size&gt;</code> <code class="notion-inline-code">&lt;seg&gt;</code> requires an additional special head, described in detail in a later section.</div></blockquote><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3703c18ff81c809aa662fc248bb53b53" data-id="3703c18ff81c809aa662fc248bb53b53"><span><div id="3703c18ff81c809aa662fc248bb53b53" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c809aa662fc248bb53b53" title="2.2 Model Architecture"><svg viewBox="0 0 
16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.2 Model Architecture</span></span></h3><div class="notion-text notion-block-3703c18ff81c80eaaa78cd5c279655d3">Architecturally, the authors drop the separate encoder and, similar to Fuyu-8B, patchify the image, concatenate it with the text, and feed the result into a single Transformer. Several design details are worth noting:</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c80758689df55265e6761" data-id="3703c18ff81c80758689df55265e6761"><span><div id="3703c18ff81c80758689df55265e6761" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80758689df55265e6761" title="2.2.1 Input Preparation"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.2.1 Input Preparation</span></span></h4><div class="notion-text notion-block-3703c18ff81c80d0b988f04ad7e68f1c">Consider a concrete example:</div><div class="notion-text notion-block-3703c18ff81c80ad85bece2b00ae92a2"><code class="notion-inline-code">&lt;|image|&gt;Segment these expressions in the image:&lt;|start_of_query|&gt;all objects&lt;|REF_SEG|&gt;</code></div><div class="notion-text notion-block-3703c18ff81c80f7b15ef26724ac768c"><code class="notion-inline-code">&lt;|image|&gt;</code> is the placeholder for the image tokens.</div><div class="notion-text 
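notion-block-3703c18ff81c80579a25dc703cbb8bb1">Given an input image, it is first split into patches, passed through a linear layer, placed at the placeholder position, and concatenated with the text tokens: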
$$
X=[\underbrace{v_{1},\ldots,v_{N}}_{\mathrm{Visual~Embeddings}},\underbrace{t_{1},\ldots,t_{L}}_{\mathrm{Text~Embeddings}}]
$$</div><blockquote class="notion-quote notion-block-3703c18ff81c809384d7ddb64e8b1e33"><div>Note 1: A conventional multimodal model feeds the visual embeddings into an encoder and concatenates the encoder output with the text embeddings; Falcon Perception skips this step.</div></blockquote><blockquote class="notion-quote notion-block-3703c18ff81c80dfa606f6de3189c718"><div>Note 2: The image tokens are wrapped by tokens marking the start and end of the image. The model also adopts the register tokens used in DINOv2 to improve how well the patch tokens aggregate information.</div></blockquote><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c80feb0cde85d3d8a8062" data-id="3703c18ff81c80feb0cde85d3d8a8062"><span><div id="3703c18ff81c80feb0cde85d3d8a8062" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80feb0cde85d3d8a8062" title="2.2.2 hybrid attention mask"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.2.2 hybrid attention mask</span></span></h4><div class="notion-text notion-block-3703c18ff81c8058ad30e2f358604243">Falcon Perception's hybrid attention mask can be understood as bidirectional attention over the image prefix followed by causal attention. Specifically:</div><div class="notion-text notion-block-3703c18ff81c8014bda1f18517e05c0c">Image tokens attend to each other bidirectionally, while text and task positions use causal attention; text/task tokens can see the complete image prefix.</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3703c18ff81c80d38983d421e6da429d"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" 
src="https://www.notion.so/image/attachment%3A98593fc4-761f-48ec-89fc-5d43dd53c36f%3Aimage-20260526093150639.png?table=block&amp;id=3703c18f-f81c-80d3-8983-d421e6da429d&amp;t=3703c18f-f81c-80d3-8983-d421e6da429d" alt="notion image" loading="lazy" decoding="async"/></div></figure><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c80e3be74de77aca848a3" data-id="3703c18ff81c80e3be74de77aca848a3"><span><div id="3703c18ff81c80e3be74de77aca848a3" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80e3be74de77aca848a3" title="2.2.3 3D RoPE"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.2.3 3D RoPE</span></span></h4><div class="notion-text notion-block-3703c18ff81c8001a161ef7c5e6e3a66">Falcon Perception's 3D-RoPE differs from the MRoPE implementation in the Qwen series. It splits the q and k channels in half: one half applies 2D-RoPE to encode spatial position, and the other half applies 1D-RoPE to encode sequence/temporal position. A few small details:</div><ol start="1" class="notion-list notion-list-numbered notion-block-3703c18ff81c80149710d973ebe3aaad" style="list-style-type:decimal"><li>Temporal dimension: tokens inside an image block essentially share the same temporal ID, and subsequent text tokens continue incrementing from there.</li></ol><ol start="2" class="notion-list notion-list-numbered notion-block-3703c18ff81c80afbaa7c40e621f14db" style="list-style-type:decimal"><li>Spatial dimension: only real image patches are assigned 2D spatial coordinates; text and other non-spatial tokens have 2D position 0.</li></ol><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3703c18ff81c80e3a6f6fcb7e0a9436f" data-id="3703c18ff81c80e3a6f6fcb7e0a9436f"><span><div id="3703c18ff81c80e3a6f6fcb7e0a9436f" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80e3a6f6fcb7e0a9436f" title="2.3 Special 
Head"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.3 Special Head</span></span></h3><div class="notion-text notion-block-3703c18ff81c808b87b7cf68ea7ef854">Sections 2.1 and 2.2 covered how Falcon Perception formulates the dense perception problem and how it places image patches, the text prompt, and task tokens into a single Transformer. This section adds the last piece of the puzzle: how the special tokens <code class="notion-inline-code">&lt;coord&gt;</code> <code class="notion-inline-code">&lt;size&gt;</code> <code class="notion-inline-code">&lt;seg&gt;</code> predicted by the model are turned into bboxes and masks.</div><div class="notion-text notion-block-3703c18ff81c808f8a8fc9a85eef4bda">One key point: the LM head only predicts what the next token is. However, <code class="notion-inline-code">&lt;coord&gt;</code>, <code class="notion-inline-code">&lt;size&gt;</code>, and <code class="notion-inline-code">&lt;seg&gt;</code> are not ordinary text content; they act more like triggers that invoke different heads.</div><div class="notion-text notion-block-3703c18ff81c8049ae07f161c4208c9b">The corresponding modules in the code are concentrated in:</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c803f9de7db30d395e4ca" data-id="3703c18ff81c803f9de7db30d395e4ca"><span><div id="3703c18ff81c803f9de7db30d395e4ca" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c803f9de7db30d395e4ca" title="2.3.1 Coord / Size Head"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 
4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.3.1 Coord / Size Head</span></span></h4><div class="notion-text notion-block-3703c18ff81c803ab769c8780f5db18c">Many autoregressive vision models treat coordinates as ordinary tokens and add them to the vocabulary, for example by defining a set of tokens from 0 to 999 to represent coordinates:</div><div class="notion-text notion-block-3703c18ff81c8064b357f0fed10d7acf">Detection and segmentation then become token prediction problems, and the coordinates are parsed in post-processing.</div><div class="notion-text notion-block-3703c18ff81c80489f8ff70e36b5b734">This scheme is simple to implement and has been validated in many models, such as Pix2Seq for detection and the location-token designs in some OCR/document-parsing models (e.g. MinerU). But it has several clear drawbacks:</div><ol start="1" class="notion-list notion-list-numbered notion-block-3703c18ff81c80e692dac57461f5b440" style="list-style-type:decimal"><li>Coordinate-token embeddings are discrete, so the spatial relationship between coordinates can only be learned implicitly.</li></ol><ol start="2" class="notion-list notion-list-numbered notion-block-3703c18ff81c80d0818cc4402d57173c" style="list-style-type:decimal"><li>Coordinate precision is limited by the number of bins: too few bins are coarse, while too many enlarge the vocabulary or make sequence modeling harder.</li></ol><ol start="3" class="notion-list notion-list-numbered notion-block-3703c18ff81c80b1b0a6f71368970d1a" style="list-style-type:decimal"><li>If mask boundaries are also converted into token sequences, the dense output becomes very long, which does not suit scenes with many instances.</li></ol><div class="notion-text notion-block-3703c18ff81c80199a05d5939e16db36">Falcon Perception takes a different compromise: the sequence still contains only the two special tokens <code class="notion-inline-code">&lt;coord&gt;</code> and <code class="notion-inline-code">&lt;size&gt;</code>, but the actual continuous coordinates are predicted by an additional bbox head.</div><div class="notion-text notion-block-3703c18ff81c8036bdaff4ac1a81cbb1">Four related modules can be seen in the model initialization. Code location: <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L317">https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L317</a></div><div class="notion-text notion-block-3703c18ff81c80b192a9f90c03e23450">These four modules fall into two groups:</div><ul class="notion-list notion-list-disc notion-block-3703c18ff81c80c6a477d0fd6df43818"><li>encoder: encodes the continuous coordinates/sizes already obtained back into the Transformer input stream.</li></ul><ul class="notion-list 
notion-list-disc notion-block-3703c18ff81c80f6bac7cce474b5d3a3"><li>decoder: predicts continuous coordinates/sizes from the Transformer hidden state.</li></ul><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c805193d6e6e15e44aa48" data-id="3703c18ff81c805193d6e6e15e44aa48"><span><div id="3703c18ff81c805193d6e6e15e44aa48" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c805193d6e6e15e44aa48" title="Fourier Encoder: injecting continuous values into the token embedding"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">Fourier Encoder: injecting continuous values into the token embedding</span></span></h4><div class="notion-text notion-block-3703c18ff81c8047af37fb1efcbe95ae">During training, the target sequence contains <code class="notion-inline-code">&lt;coord&gt;</code> and <code class="notion-inline-code">&lt;size&gt;</code> positions, but the model must not see only a static special-token embedding there. Otherwise the subsequent <code class="notion-inline-code">&lt;size&gt;</code> and <code class="notion-inline-code">&lt;seg&gt;</code> would not know where the center point predicted or annotated in the previous step is. In other words, under the conventional approach <code class="notion-inline-code">&lt;coord&gt;</code> maps to that token's own embedding, which carries no instance-specific information, so it has to be replaced with an embedding of the actual coordinates. <code class="notion-inline-code">&lt;coord&gt;</code> can therefore be viewed as a placeholder.</div><div class="notion-text notion-block-3703c18ff81c8045ba14c8454bcc4961">So how is this coordinate embedding constructed?</div><div class="notion-text notion-block-3703c18ff81c8046a0e8ce60f77112bd">Encoding coordinates:</div><div class="notion-text notion-block-3703c18ff81c80abaa7dc5bf72393af2">Falcon Perception builds the coordinate embedding with Fourier feature mapping (the approach used in NeRF). The core idea is to represent coordinates with frequency patterns over a continuous space. In its standard form the mapping is γ(x) = [cos(2πBx), sin(2πBx)], where:</div><ul class="notion-list notion-list-disc 
notion-block-3703c18ff81c809f9283d6c7485cef76"><li>x is the coordinate input</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c80c7a8aad537c1724176"><li>B is a random frequency matrix</li></ul><div class="notion-text notion-block-3703c18ff81c80df9b7ac68ab6008bb4">This turns coordinate learning into a frequency-pattern matching problem.</div><div class="notion-text notion-block-3703c18ff81c80e989ffc5cdde177528">In the code this corresponds to <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L34">https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L34</a>:</div><div class="notion-text notion-block-3703c18ff81c80fea3e1dfa1d09ac553">Decoding coordinates:</div><div class="notion-text notion-block-3703c18ff81c80569fcdd1daf52104d9">At inference time, the model first samples the next token through the LM head:</div><div class="notion-text notion-block-3703c18ff81c804e8ae1d51f690484d9">If the sampled token is <code class="notion-inline-code">&lt;coord&gt;</code> or <code class="notion-inline-code">&lt;size&gt;</code>, it then calls:</div><div class="notion-text notion-block-3703c18ff81c80f8b8d4e9e88fa45b76">Internally, the current hidden state is fed into two MLP decoders:</div><div class="notion-text notion-block-3703c18ff81c80ffa167f801a38b4d7c"><a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L705">https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L705</a></div><div class="notion-text notion-block-3703c18ff81c80e2912dcb993c9ad00c">The coordinates <code class="notion-inline-code">x, y</code> are predicted as normalized bins. The sizes <code class="notion-inline-code">h, w</code> do not use linear bins; <code class="notion-inline-code">process_sizes</code> maps them onto a log2 scale:</div><div class="notion-text notion-block-3703c18ff81c808a8256c508db22def3">This is a reasonable choice because object sizes naturally span several scales: differences between small objects need finer granularity, whereas for large objects the relative proportion matters more.</div><div class="notion-text notion-block-3703c18ff81c80f0b57af3f36014e8ab">Once <code class="notion-inline-code">xy</code> or <code 
class="notion-inline-code">hw</code> is obtained, it is re-encoded into the input sequence on the next forward pass:</div><div class="notion-text notion-block-3703c18ff81c80399aa2e3e0ce0125b1">The complete pipeline:</div><div class="notion-text notion-block-3703c18ff81c806cb4dbcce863a519f1">Step 1: the LM head predicts &lt;coord&gt;</div><div class="notion-text notion-block-3703c18ff81c80c18511dfedbd153fa4">Step 2: the hidden state is fed into the coord decoder, which outputs the center point</div><div class="notion-text notion-block-3703c18ff81c80abaf59c1b96c90d47e">Step 3: the Fourier encoder encodes the center point into an embedding and writes it into the next step's input</div><div class="notion-text notion-block-3703c18ff81c806f9dd9c4fe40e4dd12">Step 4: the LM head predicts &lt;size&gt;</div><div class="notion-text notion-block-3703c18ff81c80d3b970d4c382c12b17">Step 5: the hidden state is fed into the size decoder, which outputs the width and height</div><div class="notion-text notion-block-3703c18ff81c80fe9fddf1bed6cb594b">Step 6: the Fourier encoder encodes the width and height into an embedding and writes it into the next step's input</div><div class="notion-text notion-block-3703c18ff81c8004abfac5ca730cba63">Step 7: the LM head predicts &lt;seg&gt;</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c807ba17dcb6f294614db" data-id="3703c18ff81c807ba17dcb6f294614db"><span><div id="3703c18ff81c807ba17dcb6f294614db" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c807ba17dcb6f294614db" title="2.3.2 Segmentation Head"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.3.2 Segmentation Head</span></span></h4><div class="notion-text notion-block-3703c18ff81c80169707d83b634ce856">&lt;seg&gt; can likewise be understood as a placeholder: an additional head is needed to extract the actual mask. Generating a high-resolution mask from the hidden state of a single token alone is difficult, so Falcon Perception generates the mask with a dot product.</div><div class="notion-text notion-block-3703c18ff81c80d789aad3e0168d962e">Specifically, during the prefill stage, all hidden states at the image-patch token positions are fed, together with the original image, into a model that produces a feature map <code class="notion-inline-code">V* ∈ R^{H×W×d}</code> (the approach from the <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://arxiv.org/abs/2510.12764">AnyUP</a> paper). Code location: <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/batch_inference.py#L248">https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/batch_inference.py#L248</a></div><div class="notion-text notion-block-3703c18ff81c802d8455f40e0bbadfd5">During generation, if the current token is &lt;seg&gt;, its hidden state goes through a linear transform and is then dot-multiplied with the feature map above to produce the mask. Code location: <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/aux_output.py#L210">https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/aux_output.py#L210</a></div><div class="notion-text notion-block-3703c18ff81c8027a1eac16b0188740d">
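The dot-product mask decoding described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the repository's code: the function name decode_mask, the projection matrix proj_weight, the sigmoid, and the 0.5 threshold are all assumptions made for the example.

```python
import math

def decode_mask(seg_hidden, proj_weight, feature_map, threshold=0.5):
    """Sketch: turn one seg-token hidden state into a per-pixel mask.

    seg_hidden:  list of d_model floats (hidden state at the seg token)
    proj_weight: d_model x d matrix (hypothetical linear transform)
    feature_map: H x W x d nested list (upsampled per-pixel features)
    """
    d = len(proj_weight[0])
    # Linear transform: query[j] = sum_i seg_hidden[i] * proj_weight[i][j]
    query = [
        sum(h * row[j] for h, row in zip(seg_hidden, proj_weight))
        for j in range(d)
    ]
    probs = []
    for row in feature_map:
        prob_row = []
        for pixel in row:
            # Dot product between the query and this pixel's feature vector
            logit = sum(q * f for q, f in zip(query, pixel))
            prob_row.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
        probs.append(prob_row)
    # Assumed binarization step: keep pixels at or above the threshold
    mask = [[p >= threshold for p in prob_row] for prob_row in probs]
    return probs, mask

# Toy usage: 4-dim hidden state, 2-dim feature space, 2x2 feature map.
seg_hidden = [1.0, 1.0, 1.0, 1.0]
proj_weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
feature_map = [[[1.0, 1.0], [-1.0, -1.0]], [[0.5, 2.0], [-3.0, 1.0]]]
probs, mask = decode_mask(seg_hidden, proj_weight, feature_map)
```

The point of the design is that the feature map is computed once in prefill, so each additional instance only costs one projection and one dot product per pixel.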
Mask2Former needs a complex Hungarian matching step to resolve instance ambiguity. By contrast, the &lt;coord&gt;&lt;size&gt;&lt;seg&gt; chain designed by Falcon Perception refers to each instance unambiguously.</div><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3703c18ff81c80b99043e90efc84931a" data-id="3703c18ff81c80b99043e90efc84931a"><span><div id="3703c18ff81c80b99043e90efc84931a" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80b99043e90efc84931a" title="2.4 Training"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.4 Training</span></span></h3><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c80babde6ef824941824c" data-id="3703c18ff81c80babde6ef824941824c"><span><div id="3703c18ff81c80babde6ef824941824c" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80babde6ef824941824c" title="2.4.1 Overview of the training objective"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.4.1 Overview of the training objective</span></span></h4><div class="notion-text notion-block-3703c18ff81c8024a30fd6f4227ea24c">With the structure above in mind, the training logic follows naturally: Falcon Perception does not turn dense perception entirely into ordinary language modeling, but uses a hybrid formulation of autoregressive tokens plus special-head decoding.</div><div class="notion-text 
notion-block-3703c18ff81c80449001cc43b374a5d9">In other words, the Transformer backbone is trained with next-token prediction, but at the <code class="notion-inline-code">&lt;coord&gt;</code>, <code class="notion-inline-code">&lt;size&gt;</code>, and <code class="notion-inline-code">&lt;seg&gt;</code> positions the model additionally receives the corresponding coordinate, size, and mask supervision.</div><div class="notion-text notion-block-3703c18ff81c80618d73f10723498007">The objective function consists of the following parts:</div><div class="notion-text notion-block-3703c18ff81c803fae23cdc5ef0dffee">
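As a schematic only, the total objective can be written as a weighted sum of the five loss terms described in this section. The symbols and the weights λ below are notation assumed here for illustration; the paper's exact weighting is not reproduced.

```latex
\mathcal{L} = \mathcal{L}_{\text{lm}}
  + \lambda_{\text{coord}}\,\mathcal{L}_{\text{coord}}
  + \lambda_{\text{size}}\,\mathcal{L}_{\text{size}}
  + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}}
  + \lambda_{\text{aux}}\,\mathcal{L}_{\text{aux}}
```
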
Specifically:</div><ul class="notion-list notion-list-disc notion-block-3703c18ff81c8085b5f9ca2f54c2d47c"><li>LM loss: cross entropy over ordinary tokens, covering the generation of text tokens as well as the special tokens &lt;coord&gt;, &lt;size&gt;, and &lt;seg&gt;.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c801ab361fd27bdfc2e31"><li>Coordinate loss: cross entropy of the coordinate head. In the code, coord_decoder outputs 2 × Nbins logits, with 2048 bins per dimension by default.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c80109c36ff3f37fe5689"><li>Size loss: cross entropy of the size head. Note that sizes are binned on a log2 scale rather than in linear space.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c805a8cb9cdb1b64b37c2"><li>Mask loss: mask supervision using focal loss + dice loss.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c8099b76fdc1ce0e51f98"><li>Auxiliary loss: mainly distillation/feature-alignment losses, such as Gram feature alignment.</li></ul><div class="notion-text notion-block-3703c18ff81c8033b158caee72b16349">So although coordinates and sizes end up as continuous values, they are still trained as bin-classification problems; the mask is not converted into a token sequence but is supervised directly with pixel-level losses. This design avoids forcing high-dimensional dense output into the language sequence.</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-3703c18ff81c8026b624d7af11dad6f1" data-id="3703c18ff81c8026b624d7af11dad6f1"><span><div id="3703c18ff81c8026b624d7af11dad6f1" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c8026b624d7af11dad6f1" title="2.4.2 More training details"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.4.2 More training details</span></span></h4><div class="notion-text notion-block-3703c18ff81c8029ae14fdfa93e6e2de">(1) Multi-teacher distillation</div><div class="notion-text 
notion-block-3703c18ff81c800ea540f39abc2840a4">There are two teacher models: DINOv3-ViT-H and SigLIP2-So400m.</div><div class="notion-text notion-block-3703c18ff81c80b186acc51374a44b7d">To train early-fusion perception models well, the authors initialize the model weights with a multi-teacher distillation pipeline. The motivation is to combine the strengths of different vision backbones.</div><div class="notion-text notion-block-3703c18ff81c804cbb83e3ca089741af">The paper does not disclose many details about this step, but a few points are clear:</div><ul class="notion-list notion-list-disc notion-block-3703c18ff81c806aaf78f30f8d101f5f"><li>The distillation is not plain logits distillation.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c80ffa3efce740e4685e0"><li>One core goal is for the model to inherit DINOv3-ViT-H's strong local-feature extraction, which matters for segmentation.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c8038a824ee92a8983f5a"><li>The second goal is to inherit the image-text embedding alignment of SigLIP2-So400m, which helps open-vocabulary expression understanding.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c807486e5e36651a66335"><li>Distillation data: OpenLVD200m, 9 million high-resolution scraped images, 11 million SAM dataset images, and 5 million documents.</li></ul><ul class="notion-list notion-list-disc notion-block-3703c18ff81c8049880bda38fb6aa0bc"><li>Training setup:</li><ul class="notion-list notion-list-disc notion-block-3703c18ff81c8049880bda38fb6aa0bc"><li>a multi-resolution stage, up to <code class="notion-inline-code">1024×1024</code> pixels, for about 200k steps</li><li>Muon as the optimizer</li><li>4×8 A100 GPU nodes, with sequence packing</li><li>local batch size of 6, maximum sequence length of 4096 tokens</li></ul></ul><div class="notion-text notion-block-3703c18ff81c80c2ad34f0dca8c917be">(2) Choice of optimizer</div><div class="notion-text notion-block-3703c18ff81c8005aa47f6b1d1d68ed7">Falcon Perception ran an ablation comparing the Muon and AdamW optimizers. In the perception setting, Muon has a clear advantage.</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3703c18ff81c80248112c91b956cf8bb"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img 
style="object-fit:cover" src="https://www.notion.so/image/attachment%3Abf4ca0f4-4752-4e41-9b3e-8d227f8c438b%3Aoptimizer_compare.png?table=block&amp;id=3703c18f-f81c-8024-8112-c91b956cf8bb&amp;t=3703c18f-f81c-8024-8112-c91b956cf8bb" alt="notion image" loading="lazy" decoding="async"/></div></figure><div class="notion-text notion-block-3703c18ff81c807fad94fcafb820e7b7">(3) Masking order</div><div class="notion-text notion-block-3703c18ff81c80ccba4bc4b264cd8ae4">When there are multiple instances, how should the target sequence be ordered? (In essence, an image with multiple instances admits many valid trajectories.) The authors compared random order, ordering by size from largest to smallest, and raster order (top to bottom, left to right), and found that raster order works best. Intuitively, it constrains the trajectory a priori and so shrinks the space of trajectories the model has to fit.</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3703c18ff81c8059b0a5eece4f5631f9"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3Aaee09a21-0b18-447b-a20c-42b4a977cd1c%3Amasking_order.png?table=block&amp;id=3703c18f-f81c-8059-b0a5-eece4f5631f9&amp;t=3703c18f-f81c-8059-b0a5-eece4f5631f9" alt="notion image" loading="lazy" decoding="async"/></div></figure><div class="notion-text notion-block-3703c18ff81c807fbbe7ed33a4fa5b22">See the original paper for more details.</div><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-3703c18ff81c80fcb634d15d0dc71e70" data-id="3703c18ff81c80fcb634d15d0dc71e70"><span><div id="3703c18ff81c80fcb634d15d0dc71e70" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80fcb634d15d0dc71e70" title="3 Results"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 
0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3 Results</span></span></h2><div class="notion-text notion-block-3703c18ff81c80009c19fa603dc35920">In my view, three points stand out in the results.</div><div class="notion-text notion-block-3703c18ff81c80d2b0f9d8339854693d">First, Falcon Perception has strong mask quality on SA-Co: its Macro-F1 reaches 68.0, above SAM3's 62.3. Its presence calibration, however, is clearly weaker than SAM3's; for example, the Average MCC is 0.64 vs 0.82.</div><div class="notion-text notion-block-3703c18ff81c8024bee6f818df6bc6c0">Second, it has a clear advantage on PBench (the benchmark proposed in the paper). PBench puts more emphasis on attributes, OCR, spatial relations, and dense scenes, all of which require stronger joint vision-language-spatial modeling. In the paper, Falcon Perception averages 57.0 against 44.4 for SAM3; on dense scenes the scores are 72.6 vs 58.4.</div><div class="notion-text notion-block-3703c18ff81c80c59abcd33f371f5fad">Third, sampling improves results considerably (similar to do-sample in language models: the same input is run several times with non-greedy sampling). On SA-Co, cgF1 rises from 34.7 for the baseline to 54.3 at Pass@8. This suggests that the model's distribution contains many correct answers that greedy decoding does not always pick.</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3703c18ff81c8045a1efd8fd732ed659"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3A62b52a48-b27c-4e2c-8544-6a4b48ea4714%3Aresult_q.png?table=block&amp;id=3703c18f-f81c-8045-a1ef-d8fd732ed659&amp;t=3703c18f-f81c-8045-a1ef-d8fd732ed659" alt="notion image" loading="lazy" decoding="async"/></div></figure><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3703c18ff81c80d6be51d3ada73b9236"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3A237748c9-abab-4a9b-944d-e66b3147246a%3Aresult.png?table=block&amp;id=3703c18f-f81c-80d6-be51-d3ada73b9236&amp;t=3703c18f-f81c-80d6-be51-d3ada73b9236" alt="notion image" loading="lazy" decoding="async"/></div></figure><div class="notion-text 
notion-block-3703c18ff81c80a093bce7da91cdc54b">The authors also explored applying the Falcon-Perception architecture to OCR and proposed <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://huggingface.co/tiiuae/Falcon-OCR">Falcon-OCR-0.3B</a>. In my own tests there is still room for improvement, possibly because the training data is not large enough.</div><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-3703c18ff81c80518952d8638cb6b9f8" data-id="3703c18ff81c80518952d8638cb6b9f8"><span><div id="3703c18ff81c80518952d8638cb6b9f8" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3703c18ff81c80518952d8638cb6b9f8" title="4 Summary"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">4 Summary</span></span></h2><div class="notion-text notion-block-3703c18ff81c80859adbc7db63fcd579">This post walked through the technical principles of Falcon-Perception. Overall, the work in the paper is very solid in code, data, and experiments alike, and it is well worth reading.</div><div class="notion-blank notion-block-3703c18ff81c80808914c745e37d8587"> </div></main></div>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[RL Study Notes (006): An Analysis of PPO]]></title>
            <link>http://www.myhz0606.com/article/ppo</link>
            <guid>http://www.myhz0606.com/article/ppo</guid>
            <pubDate>Tue, 12 May 2026 16:00:00 GMT</pubDate>
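            <description><![CDATA[Starting from the motivation behind PPO, this post systematically introduces its two variants, PPO-Penalty and PPO-CLIP. Overall, PPO approximates TRPO's trust-region constraint in a simpler way, keeping policy updates stable while greatly reducing computational complexity.]]></description>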
            <content:encoded><![CDATA[<div id="notion-article" class="mx-auto overflow-hidden "><main class="notion light-mode notion-page notion-block-3293c18ff81c80d5947cf72676a47ab3"><div class="notion-viewport"></div><div class="notion-collection-page-properties"></div><div class="notion-text notion-block-35f3c18ff81c80cb9e15cf07afec1830">Prerequisite reading: <a target="_blank" rel="noopener noreferrer" class="notion-link" href="http://myhz0606.com/article/trpo">http://myhz0606.com/article/trpo</a></div><h3 class="notion-h notion-h2 notion-h-indent-0 notion-block-35f3c18ff81c80628c27fd0c5730503a" data-id="35f3c18ff81c80628c27fd0c5730503a"><span><div id="35f3c18ff81c80628c27fd0c5730503a" class="notion-header-anchor"></div><a class="notion-hash-link" href="#35f3c18ff81c80628c27fd0c5730503a" title="1 Introduction"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">1 Introduction</span></span></h3><div class="notion-text notion-block-35f3c18ff81c806c931fe4b0ad479fb4">Let us recap the TRPO pipeline. With vanilla policy gradient, the non-convexity of the objective makes it hard to choose a good update step size. TRPO's core goal is to take the largest possible step in a direction that improves performance. To that end, TRPO starts from the performance improvement lemma and, combining an approximation of the new state distribution by the old one with importance sampling, rewrites the optimization objective:</div><div class="notion-text notion-block-35f3c18ff81c8060a4d9e4df6554b2e8">It then proves that the rewritten objective is a lower bound on the original objective.</div><ul class="notion-list notion-list-disc notion-block-35f3c18ff81c805380f1e1ad8e9f5ae4"><li></li></ul><div class="notion-text notion-block-35f3c18ff81c80488caaf57f03fb52e2">Optimizing this lower bound guarantees that the true policy performance is monotonically non-decreasing.</div><div class="notion-text notion-block-35f3c18ff81c80daa7d1da7339372d04">In practice, however, this elegant theoretical formula faces two problems:</div><ul class="notion-list notion-list-disc 
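notion-block-35f3c18ff81c80b395b5d965a0c410a3"><li>A single iteration has only a finite number of trajectories, so it is hard to traverse all states to find the maximum KL divergence.</li></ul><ul class="notion-list notion-list-disc notion-block-35f3c18ff81c80b88e59dc93281b7706"><li>The penalty coefficient is too large, which makes the gradient of the KL-divergence term too large and the model updates overly conservative.</li></ul><div class="notion-text notion-block-35f3c18ff81c802082a9ce34941e4ac6">TRPO therefore makes two engineering adjustments:</div><ul class="notion-list notion-list-disc notion-block-35f3c18ff81c80b5a666f2de854717c2"><li>Using Lagrangian duality, it turns the penalty term into a constraint.</li></ul><ul class="notion-list notion-list-disc notion-block-35f3c18ff81c8070adc1fb2cc6ce766f"><li>It uses the average KL instead of the max. Concretely, rather than requiring the maximum KL over states to stay within the trust-region radius, it requires the expected KL to do so.</li></ul><div class="notion-text notion-block-35f3c18ff81c80118f5fc61f57bc4cac">This yields TRPO's practical optimization objective:</div><div class="notion-text notion-block-35f3c18ff81c809d8242d91368e8eec1">which is equivalent to optimizing:</div><blockquote class="notion-quote notion-block-35f3c18ff81c8007a6d5cf5b0ce8243c"><div>⚠️ Note the notation: the symbol for the advantage function must be kept distinct from the one for the trust-region radius.</div></blockquote><div class="notion-text notion-block-35f3c18ff81c809c8f10e03425af2281">Finally, a first-order Taylor approximation of the objective and a second-order approximation of the constraint turn a complex nonlinear optimization problem into a classic linear objective with a quadratic constraint. Its analytic solution is obtained with Lagrange multipliers and computed with the conjugate gradient method. Because the earlier engineering tricks and approximations partly break the theoretical monotonic-improvement guarantee, TRPO also adds a backtracking line search.</div><div class="notion-text notion-block-35f3c18ff81c8091a0dbea0771243fec">Overall, TRPO is mathematically elegant and systematically addresses both the direction and the step size of policy updates in reinforcement learning. As the brief recap shows, though, its implementation is rather involved: every policy update runs the CG algorithm to get the update direction and then a line search to determine the step size.</div><div class="notion-text notion-block-35f3c18ff81c8099bd37c9966f06cb5d">This is the core motivation behind PPO: to address TRPO's high computational cost while preserving monotonic policy improvement as far as possible.</div><h3 class="notion-h notion-h2 notion-h-indent-0 notion-block-35f3c18ff81c80ccb7f0e7b466fe2fb6" data-id="35f3c18ff81c80ccb7f0e7b466fe2fb6"><span><div id="35f3c18ff81c80ccb7f0e7b466fe2fb6" class="notion-header-anchor"></div><a class="notion-hash-link" href="#35f3c18ff81c80ccb7f0e7b466fe2fb6" title="2 The PPO algorithm"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 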
00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2 The PPO algorithm</span></span></h3><div class="notion-text notion-block-35f3c18ff81c80459d32f3b820f3d790">The authors propose two variants: PPO-Penalty and PPO-CLIP.</div><h4 class="notion-h notion-h3 notion-h-indent-1 notion-block-35f3c18ff81c8005874dc7c76822e5a0" data-id="35f3c18ff81c8005874dc7c76822e5a0"><span><div id="35f3c18ff81c8005874dc7c76822e5a0" class="notion-header-anchor"></div><a class="notion-hash-link" href="#35f3c18ff81c8005874dc7c76822e5a0" title="2.1 PPO-Penalty"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title"><b>2.1 PPO-Penalty</b></span></span></h4><div class="notion-text notion-block-35f3c18ff81c80088d11ecf70a076f92">From the TRPO derivation, optimizing the following expression guarantees that the new policy's performance is monotonically non-decreasing:</div><div class="notion-text notion-block-35f3c18ff81c80c6952af4f126ebf632">PPO keeps TRPO's engineering trick of replacing the max KL with the average KL, but does not use Lagrangian duality to turn the penalty into a constraint. Instead, the authors introduce a hyperparameter β (in the PPO paper's notation) to replace the theoretical penalty coefficient.</div><div class="notion-text notion-block-35f3c18ff81c8011b38bcf2883737f2f">This gives the PPO-Penalty objective, written in the PPO paper's notation as L^KLPEN(θ) = Ê_t[ r_t(θ) Â_t − β · KL[π_θold(·|s_t), π_θ(·|s_t)] ].</div><div class="notion-text notion-block-35f3c18ff81c80b2acf2f57e4d10a785">Choosing β is difficult, so the authors adjust it dynamically. First let d = Ê_t[ KL[π_θold(·|s_t), π_θ(·|s_t)] ] and let d_targ be the target KL value; then:</div><ul class="notion-list notion-list-disc notion-block-35f3c18ff81c804fbd64db6b6f9b5968"><li>If d &lt; d_targ / 1.5, the policy update is rather conservative and the expected KL divergence is below the target; to speed up learning the penalty can be reduced: β ← β / 2.</li></ul><ul class="notion-list notion-list-disc 
notion-block-35f3c18ff81c801290f7f491dc0776a9"><li>If d &gt; 1.5 · d_targ (the expected KL divergence exceeds 1.5 times its target), the policy is changing too drastically and has left the safe zone; to stabilize learning the penalty is increased: β ← 2β.</li></ul><ul class="notion-list notion-list-disc notion-block-35f3c18ff81c8069909add55d96920cb"><li>If d_targ / 1.5 ≤ d ≤ 1.5 · d_targ, the KL divergence between the new and old policies is within the safe zone and β needs no adjustment.</li></ul><h4 class="notion-h notion-h3 notion-h-indent-1 notion-block-35f3c18ff81c80bfb740e7a6c49926a7" data-id="35f3c18ff81c80bfb740e7a6c49926a7"><span><div id="35f3c18ff81c80bfb740e7a6c49926a7" class="notion-header-anchor"></div><a class="notion-hash-link" href="#35f3c18ff81c80bfb740e7a6c49926a7" title="2.2 PPO-CLIP"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title"><b>2.2 PPO-CLIP</b></span></span></h4><div class="notion-text notion-block-35f3c18ff81c8045b667e8df04a8dde2">PPO-CLIP is the other variant of PPO (and the mainstream one). It defines the trust region through a pessimistic estimate (lower-bound optimization). In the PPO paper's notation its form is L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ], where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t).</div><div class="notion-text notion-block-35f3c18ff81c800991a0e5bb05ef7324">The intuition behind this objective can be discussed case by case:</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-35f3c18ff81c802f89a5d52f76a4afb9"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3A593da3a8-b0ce-439f-88b4-0ea281d18bc5%3Aimage.png?table=block&amp;id=35f3c18f-f81c-802f-89a5-d52f76a4afb9&amp;t=35f3c18f-f81c-802f-89a5-d52f76a4afb9" alt="notion image" loading="lazy" decoding="async"/></div></figure><table class="notion-simple-table notion-block-35f3c18ff81c803ab7b4efe71688f056"><tbody><tr 
class="notion-simple-table-row notion-simple-table-header-row notion-block-35f3c18ff81c8059ac9fcda252b5f18b"><td class="" style="width:131.6640625px"><div class="notion-simple-table-cell">Range of r_t(θ)</div></td><td class="" style="width:110.6640625px"><div class="notion-simple-table-cell">Sign of Â_t</div></td><td class="" style="width:159.6640625px"><div class="notion-simple-table-cell">Value of the min term</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">has gradient?</div></td><td class="" style="width:302.46875px"><div class="notion-simple-table-cell">Intuition</div></td></tr><tr class="notion-simple-table-row notion-block-35f3c18ff81c802ba12bf73a0fbab2ab"><td class="" style="width:131.6640625px"><div class="notion-simple-table-cell">1 − ε ≤ r_t(θ) ≤ 1 + ε</div></td><td class="" style="width:110.6640625px"><div class="notion-simple-table-cell">Â_t &gt; 0</div></td><td class="" style="width:159.6640625px"><div class="notion-simple-table-cell">r_t(θ) Â_t</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">True</div></td><td class="" style="width:302.46875px"><div class="notion-simple-table-cell">Within the target interval: the new and old policies behave similarly, so optimization proceeds normally.</div></td></tr><tr class="notion-simple-table-row notion-block-35f3c18ff81c80bd9fa3c1247868832d"><td class="" style="width:131.6640625px"><div class="notion-simple-table-cell">1 − ε ≤ r_t(θ) ≤ 1 + ε</div></td><td class="" style="width:110.6640625px"><div class="notion-simple-table-cell">Â_t &lt; 0</div></td><td class="" style="width:159.6640625px"><div class="notion-simple-table-cell">r_t(θ) Â_t</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">True</div></td><td class="" style="width:302.46875px"><div class="notion-simple-table-cell">Within the target interval: the new and old policies behave similarly, so optimization proceeds normally.</div></td></tr><tr class="notion-simple-table-row notion-block-35f3c18ff81c808a8756d1752fe8d8cf"><td class="" style="width:131.6640625px"><div class="notion-simple-table-cell">r_t(θ) &lt; 1 − ε</div></td><td class="" style="width:110.6640625px"><div class="notion-simple-table-cell">Â_t &gt; 0</div></td><td class="" style="width:159.6640625px"><div class="notion-simple-table-cell">r_t(θ) Â_t</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">True</div></td><td class="" style="width:302.46875px"><div class="notion-simple-table-cell">Â_t &gt; 0 means this is a good action with positive return, yet the new policy is reluctant to take it (its ratio to the old policy is below 1 − ε), so the new policy still needs to be optimized.</div></td></tr><tr class="notion-simple-table-row notion-block-35f3c18ff81c8040babadff6aedac181"><td class="" style="width:131.6640625px"><div class="notion-simple-table-cell">r_t(θ) &lt; 1 − ε</div></td><td class="" style="width:110.6640625px"><div class="notion-simple-table-cell">Â_t &lt; 0</div></td><td class="" style="width:159.6640625px"><div class="notion-simple-table-cell">(1 − ε) Â_t</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">False</div></td><td class="" style="width:302.46875px"><div class="notion-simple-table-cell">A bad action that the new policy already takes with much lower probability than the old policy: the new policy is good enough in this case and needs no further optimization.</div></td></tr><tr class="notion-simple-table-row notion-block-35f3c18ff81c80bf9077d469c6861987"><td class="" style="width:131.6640625px"><div class="notion-simple-table-cell">r_t(θ) &gt; 1 + ε</div></td><td class="" style="width:110.6640625px"><div class="notion-simple-table-cell">Â_t &gt; 0</div></td><td class="" style="width:159.6640625px"><div class="notion-simple-table-cell">(1 + ε) Â_t</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">False</div></td><td class="" style="width:302.46875px"><div class="notion-simple-table-cell">A good action that the new policy already takes with much higher probability than the old policy: the new policy is good enough in this case and needs no further optimization.</div></td></tr><tr class="notion-simple-table-row notion-block-35f3c18ff81c80d89617cc45f42c11fc"><td class="" style="width:131.6640625px"><div class="notion-simple-table-cell">r_t(θ) &gt; 1 + ε</div></td><td class="" style="width:110.6640625px"><div class="notion-simple-table-cell">Â_t &lt; 0</div></td><td class="" style="width:159.6640625px"><div class="notion-simple-table-cell">r_t(θ) Â_t</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">True</div></td><td class="" style="width:302.46875px"><div class="notion-simple-table-cell">Â_t &lt; 0 means this action has negative return and should be suppressed, yet the new policy strongly favors it (its ratio to the old policy is above 1 + ε), so the new policy still needs to be optimized.</div></td></tr></tbody></table><div class="notion-text notion-block-35f3c18ff81c80c8a589ef4d927ab2f0">Although PPO-CLIP lacks the rigorous theoretical derivation behind PPO-Penalty, in practice it has the advantage in both efficiency and results.</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-35f3c18ff81c800eb30cf868d922653c"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3A2b927ce9-7d68-44ee-9b49-88f22a92af09%3Aimage.png?table=block&amp;id=35f3c18f-f81c-800e-b30c-f868d922653c&amp;t=35f3c18f-f81c-800e-b30c-f868d922653c" alt="notion image" loading="lazy" decoding="async"/></div></figure><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-35f3c18ff81c80148f6ceff0a83e0090"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3Aa46f082d-6fa7-40f4-8f09-35fb1ee6ce8d%3Aimage.png?table=block&amp;id=35f3c18f-f81c-8014-8f6c-eff0a83e0090&amp;t=35f3c18f-f81c-8014-8f6c-eff0a83e0090" alt="notion image" loading="lazy" decoding="async"/></div></figure><h3 class="notion-h notion-h2 notion-h-indent-0 notion-block-35f3c18ff81c807bbc57ce055b74f7ae" data-id="35f3c18ff81c807bbc57ce055b74f7ae"><span><div id="35f3c18ff81c807bbc57ce055b74f7ae" class="notion-header-anchor"></div><a class="notion-hash-link" href="#35f3c18ff81c807bbc57ce055b74f7ae" title="Summary"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 
4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">小结</span></span></h3><div class="notion-text notion-block-35f3c18ff81c8066a03bc765f824ae1d">本文从PPO提出的motivation出发，系统介绍了PPO-Penalty和PPO-CLIP两种变体。总体来说，PPO 用更简单的方式近似TRPO的trust region约束，在保证策略更新稳定性的同时，大幅降低计算复杂度。由于笔者水平有限，若有不当之处，欢迎指出～</div><div class="notion-blank notion-block-35f3c18ff81c80f4b684fb2e0db3074c"> </div></main></div>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[MinerU2.5-Pro: Technical Summary]]></title>
            <link>http://www.myhz0606.com/article/mineru2d5_pro</link>
            <guid>http://www.myhz0606.com/article/mineru2d5_pro</guid>
            <pubDate>Tue, 14 Apr 2026 16:00:00 GMT</pubDate>
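            <description><![CDATA[MinerU recently released an updated Pro model. With the model architecture kept unchanged, it achieves a significant accuracy gain by improving the data engineering and the training strategy.]]></description>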
            <content:encoded><![CDATA[<div id="notion-article" class="mx-auto overflow-hidden "><main class="notion light-mode notion-page notion-block-3433c18ff81c80b7a88ddb904daba031"><div class="notion-viewport"></div><div class="notion-collection-page-properties"></div><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-3433c18ff81c80c0b7a5d163229219d7" data-id="3433c18ff81c80c0b7a5d163229219d7"><span><div id="3433c18ff81c80c0b7a5d163229219d7" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3433c18ff81c80c0b7a5d163229219d7" title="1. 前言"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">1. 
前言</span></span></h2><div class="notion-text notion-block-3433c18ff81c8015a311f20aaee98bda">MinerU最近发布了一个更新的Pro模型。在模型架构不变的约束下，通过优化数据工程与训练策略，实现了显著的精度提升。</div><div class="notion-text notion-block-3433c18ff81c801c8d69dd37fa350fbd">论文的核心动机在于：不同架构的SOTA的模型在面临困难样本（hard samples）时，往往呈现出相似的错误模式。这一现象表明，OCR 性能的瓶颈可能并不主要来自模型架构本身，而更可能受限于训练数据的质量与构成。MinerU在OmniDocbench中新增了hard sample，用于更有针对性的评估对于此类复杂场景的模型表现。</div><div class="notion-text notion-block-3433c18ff81c80f482dcd03cb312ac42">模型仓库：<a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B">https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B</a></div><div class="notion-text notion-block-3433c18ff81c80908f25cb378eec2a62">数据集地址：<a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://huggingface.co/datasets/opendatalab/OmniDocBench">https://huggingface.co/datasets/opendatalab/OmniDocBench</a></div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3433c18ff81c809aac55e1967ea4aeac"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3Aba5e888c-a928-4434-976a-a8efd81d05a4%3Aimage.png?table=block&amp;id=3433c18f-f81c-809a-ac55-e1967ea4aeac&amp;t=3433c18f-f81c-809a-ac55-e1967ea4aeac" alt="notion image" loading="lazy" decoding="async"/></div></figure><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-3433c18ff81c800eac5eefe9b4a00d04" data-id="3433c18ff81c800eac5eefe9b4a00d04"><span><div id="3433c18ff81c800eac5eefe9b4a00d04" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3433c18ff81c800eac5eefe9b4a00d04" title="2. 
数据工程"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2. 数据工程</span></span></h2><div class="notion-text notion-block-3433c18ff81c801e89fac43f76f11bd4">MinerU的数据工程主要分为4个阶段</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c807eb204f3588cf6d402"><li>Diversity-and-Difficulty-Aware Sampling</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80199af7da36adcd586c"><li>Cross-Model Consistency Verification</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8050b4a9ca1c8fdf1cfd"><li>Judge-and-Refine pipeline</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8029a80edb90684d404b"><li>Targeted Expert Annotation</li></ul><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3433c18ff81c8045b637e11190e22da1"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3A266f9480-44fe-43a2-b84f-ab381ed3cbcf%3Aimage_1775783331427.png?table=block&amp;id=3433c18f-f81c-8045-b637-e11190e22da1&amp;t=3433c18f-f81c-8045-b637-e11190e22da1" alt="notion image" loading="lazy" decoding="async"/></div></figure><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3433c18ff81c8051b698fbc6444e7738" data-id="3433c18ff81c8051b698fbc6444e7738"><span><div id="3433c18ff81c8051b698fbc6444e7738" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3433c18ff81c8051b698fbc6444e7738" title="2.1 
Diversity-and-Difficulty-Aware Sampling (DDAS)"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.1 Diversity-and-Difficulty-Aware Sampling (DDAS)</span></span></h3><div class="notion-text notion-block-3433c18ff81c8063bd58edb17fc96d62">Core goal: address the long-tailed distribution of the training data. Typically, academic papers and single-column reports make up a large share of the data pool, while nested tables, complex formula layouts and multi-column layouts are rare, so the model recognizes these low-frequency cases poorly. The procedure is as follows.</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3433c18ff81c80598463fa07f7b521d6"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3A0ddf3915-3abb-4337-829b-4b4d96aaac1d%3Aimage_1775783048466.png?table=block&amp;id=3433c18f-f81c-8059-8463-fa07f7b521d6&amp;t=3433c18f-f81c-8059-8463-fa07f7b521d6" alt="notion image" loading="lazy" decoding="async"/></div></figure><div class="notion-text notion-block-3433c18ff81c80a59090eb5a19aa3746"><b>stage1 Page-level sampling:</b></div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c808b9247cf7ccbd3b1a4"><li>Extract an embedding (ViT base) for every PDF page and cluster the embeddings;</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c800896b5c1447f744a68"><li>Sample <b>uniformly</b> from each cluster to obtain an initial candidate set (this guarantees coverage of the distribution);</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c809ca5ece5fb44e2b6e6"><li>Obtain a difficulty label for every page in the initial candidate set via CMCV;</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80bb85f3f86094f5112e"><li>Assign each cluster a sampling weight based on the distribution of its difficulty labels:</li><ul class="notion-list 
notion-list-disc notion-block-3433c18ff81c80bb85f3f86094f5112e"><li>high share of easy samples → lower the weight</li><li>diverse difficulty distribution → raise the weight</li><li>many invalid pages (blank pages, non-target languages) → filter out</li></ul></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c806ab30fd49ac4b2a140"><li>Using the adjusted cluster weights, run a weighted resample from the original data pool to expand the candidate set;</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c800093f2dda0481133ab"><li>Then fill in the difficulty labels so that every sample has one.</li></ul><div class="notion-text notion-block-3433c18ff81c80099bb9c74b1e42b65e"><b>stage2 Element-level sampling:</b></div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c807ca635fbdf7ab8c80b"><li>Run layout detection with MinerU2.5 and paddleOCR VL on the page candidate set from stage1 to obtain block-level math/table/text data;</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c804c8ff2c5a56f787882"><li>Cluster the math, table and text blocks separately;</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c805a94ace9bf751b4396"><li>Run block-level CMCV to obtain a difficulty label for each element.</li></ul><div class="notion-text notion-block-3433c18ff81c80a49bc5f18ab186f2c5"><b>stage3 Sampling:</b> After stage1 and stage2 we have layout/table/text/math data, and each data point carries two kinds of information: its cluster and its difficulty. Sampling is now done in the joint cluster-difficulty space.</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8099ad7be17f1967da31"><li>Along the diversity dimension (addresses the long tail):</li><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8099ad7be17f1967da31"><li>downsample large clusters;</li><li>upsample small clusters;</li></ul></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80fb8530c74fd5a806a6"><li>Along the difficulty dimension (strengthens the training signal):</li><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80fb8530c74fd5a806a6"><li>increase the weight of medium/hard data</li></ul></ul><div class="notion-text notion-block-3433c18ff81c80128579f02883c097f4">The result is training data that is task-balanced, highly diverse and highly informative.</div><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3433c18ff81c80a4a849f584e75fe8a5" data-id="3433c18ff81c80a4a849f584e75fe8a5"><span><div 
id="3433c18ff81c80a4a849f584e75fe8a5" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3433c18ff81c80a4a849f584e75fe8a5" title="2.2 Cross-Model Consistency Verification (CMCV)"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.2 Cross-Model Consistency Verification (CMCV)</span></span></h3><div class="notion-text notion-block-3433c18ff81c8070aa43fa2233bb79ca">DDAS depends on a difficulty label for each sample. How is that difficulty obtained? This is the question CMCV answers.</div><div class="notion-text notion-block-3433c18ff81c80adbe37d0163252cfcb">Previously, the IMIC strategy of MinerU2.5 and the UACS strategy of paddle VL-1.5 measured difficulty by the consistency of repeated inference runs of a single model. This approach has a limitation:</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80f083a2c0d466eabdec"><li>The resulting difficulty mainly reflects the "blind spots" of that particular model rather than a model-independent notion of difficulty.</li></ul><div class="notion-text notion-block-3433c18ff81c800ca4a2f4d1f2df643b">Model blind spots can be pre-annotated by voting across multiple models. For cases that are difficult in general, however, every model may perform poorly, and additional human intervention may be needed.
To characterize difficulty better, the authors introduce the CMCV evaluation strategy, whose core idea is to measure difficulty by the disagreement among the predictions of multiple models:</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8006a1b9c0c33721adaa"><li>If several models give similar predictions for the same case, it is an easy case.</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c800bade7e86e59d9af51"><li>If their predictions differ, the case is more difficult the larger the disagreement.</li></ul><blockquote class="notion-quote notion-block-3433c18ff81c80ba901fc7f157362ec7"><div>💡The authors use MinerU2.5, PaddleOCR-VL and Qwen3-VL-30B as the parsing models.</div><div class="notion-text notion-block-3433c18ff81c80cca4e2ca4b49f11e68">Whether two results are similar is measured with CDM for formulas, TEDS for tables and edit distance for text.</div></blockquote><div class="notion-text notion-block-3433c18ff81c80b88927fd3615fb59b5">The authors define three difficulty levels:</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80eea26fdd954c7874b1"><li>Easy: the MinerU2.5 result is similar to that of at least one external model. For data labeled easy, the prediction can be used directly as the label.</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8036b4decb6372c094af"><li>Medium: the MinerU2.5 result is similar to none of the external models, but the external models agree with each other. The external models' output can serve as the pseudo-label for such cases.</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c800abe63d91c64447e21"><li>Hard: no two models give similar results. Such data is very valuable but expensive to annotate; it is handled by Judge-and-Refine correction and annotation by external experts.</li></ul><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3433c18ff81c80ada594c0c595dfd860" data-id="3433c18ff81c80ada594c0c595dfd860"><span><div id="3433c18ff81c80ada594c0c595dfd860" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3433c18ff81c80ada594c0c595dfd860" title="2.3 Judge-and-Refine Annotation Pipeline"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span 
class="notion-h-title">2.3 Judge-and-Refine Annotation Pipeline</span></span></h3><div class="notion-text notion-block-3433c18ff81c80e2bba2eefe017c368c">这个阶段主要为了解决复杂度为hard数据的自动标注问题，核心思想是render-then-verify。具体做法：</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80379a75ec5001a98443"><li>将识别结果渲染成图片；</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c804ab2d7fa247740bb24"><li>将原始图片和渲染的图片送入到Qwen3-VL-235B中进行judge and refine；（这样做的动机在于若将图和预测序列送入到模型，会存在跨模态信息的gap，都转为视觉空间有助于模型捕获差异）</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80878f73fa04eac93b83"><li>重复上述两个过程，直至完成停止条件。
对于Judge-and-Refine也无法修复的case转入到2.4专家标注。</li></ul><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-3433c18ff81c806083c4f6d2d8129331" data-id="3433c18ff81c806083c4f6d2d8129331"><span><div id="3433c18ff81c806083c4f6d2d8129331" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3433c18ff81c806083c4f6d2d8129331" title="2.4 Targeted Expert Annotation"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2.4 Targeted Expert Annotation</span></span></h3><div class="notion-text notion-block-3433c18ff81c80f798eecf530f61ca8e">在经过 DDAS 采样、CMCV 难度分层以及 Judge-and-Refine 修正之后，Easy 和 Medium 样本已经具备可靠标注，但仍有一部分 Hard 样本超出了自动修正能力。针对这些样本，我们引入专家人工标注以保证最终质量。</div><div class="notion-text notion-block-3433c18ff81c80e88521eddfb0f83366">标注预算的分配基于 Judge-and-Refine 的中间输出结果进行优先级排序：</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8035bec2cff38317607e"><li>Judge 置信度高但 Refine 修正不确定的样本优先级最高。此类样本的错误位置已由 Judge 定位，标注人员只需进行局部修正，从而最大化标注效率。</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c801a8cd9d74994cc9992"><li>优先覆盖当前模型最薄弱的子任务类别。这样可以在有限标注预算下最大化对整体性能的边际提升</li></ul><div class="notion-text notion-block-3433c18ff81c8091b474f93405384224">人工标注采用“AI 预标注 + 专家审核与修正”的工作流程。在预标注阶段使用 Gemini 3 Pro。此外，还通过自动化质量评估工具进一步保证标注一致性。</div><div class="notion-text notion-block-3433c18ff81c808a828efc31a06eb7b7">最终，该数据引擎产出一个分层数据集：</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c8008ab6ccb8950024a33"><li>约65.5M条 Easy 和 Medium 样本，通过 CMCV 自动标注，用于 Stage 1 预训练；</li></ul><ul class="notion-list notion-list-disc 
notion-block-3433c18ff81c80f0835ee17e5f270bed"><li>19.2 万条专家标注的 Hard 样本，用于 Stage 2 微调以及 Stage 3 的 GRPO 对齐。</li></ul><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-3433c18ff81c80ddaaf9d71db425cd72" data-id="3433c18ff81c80ddaaf9d71db425cd72"><span><div id="3433c18ff81c80ddaaf9d71db425cd72" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3433c18ff81c80ddaaf9d71db425cd72" title="3. 训练策略"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3. 训练策略</span></span></h2><div class="notion-text notion-block-3433c18ff81c80019911d879b73836b0">采用3阶段渐进式训练范式。模型架构依旧沿用MinerU2.5:</div><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80008142df15466af75c"><li>encoder: NaViT 675M</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c808187c5e78f35b96dc8"><li>decoder: Qwen2-0.5B
The initial weights are transferred from MinerU2.5.
The training configuration is given in the table below:</li></ul><div class="notion-text notion-block-3433c18ff81c809bb496d03cf60769cc">
A few points are worth noting:</div><figure class="notion-asset-wrapper notion-asset-wrapper-image notion-block-3433c18ff81c8007b3eed24ab459ea63"><div style="position:relative;display:flex;justify-content:center;align-self:center;width:100%;max-width:100%;flex-direction:column;height:100%"><img style="object-fit:cover" src="https://www.notion.so/image/attachment%3A4253089b-05b7-4b04-80ed-3d8607645d24%3Aimage_1776129879115.png?table=block&amp;id=3433c18f-f81c-8007-b3ee-d24ab459ea63&amp;t=3433c18f-f81c-8007-b3ee-d24ab459ea63" alt="notion image" loading="lazy" decoding="async"/></div></figure><ul class="notion-list notion-list-disc notion-block-3433c18ff81c801d807ed0c981598f14"><li>The distribution of the 65.5M training samples is shown in the table below.</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c80f38b3dd3c678d297c5"><li>Stage 2 adds a 3.9M dataset sampled from the 65.5M samples. This avoids the catastrophic forgetting that fine-tuning on only a small number of samples can cause. The sampling rule guarantees a share of hard samples, which strengthens the model in difficult scenarios.</li></ul><ul class="notion-list notion-list-disc notion-block-3433c18ff81c802d9f86d8c8c1fcd244"><li>The GRPO rewards match the evaluation metrics: edit distance for text recognition, CDM for formula recognition, TEDS for tables and IoU for layout.</li></ul><table class="notion-simple-table notion-block-3433c18ff81c801ab5c8d1b8b12ab59d"><tbody><tr class="notion-simple-table-row notion-simple-table-header-row notion-block-3433c18ff81c803b977fd9ca23cb4554"><td class="" style="width:120px"><div class="notion-simple-table-cell">task</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">count</div></td></tr><tr class="notion-simple-table-row notion-block-3433c18ff81c8007b9c5dce57602c12e"><td class="" style="width:120px"><div class="notion-simple-table-cell">text recognition</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">21M</div></td></tr><tr class="notion-simple-table-row notion-block-3433c18ff81c802188a7d2a0af86cc3e"><td class="" style="width:120px"><div class="notion-simple-table-cell">layout analysis</div></td><td class="" style="width:120px"><div 
class="notion-simple-table-cell">14M</div></td></tr><tr class="notion-simple-table-row notion-block-3433c18ff81c8035ae32d79df7cd59d7"><td class="" style="width:120px"><div class="notion-simple-table-cell">formula recognition</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">13M</div></td></tr><tr class="notion-simple-table-row notion-block-3433c18ff81c80b28bc9c4ab9143dd8f"><td class="" style="width:120px"><div class="notion-simple-table-cell">table recognition</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">11.5M</div></td></tr><tr class="notion-simple-table-row notion-block-3433c18ff81c8006a8d0cfd8f7a2192e"><td class="" style="width:120px"><div class="notion-simple-table-cell">Other (chats, etc.)</div></td><td class="" style="width:120px"><div class="notion-simple-table-cell">6M</div></td></tr></tbody></table></main></div>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[RL Study Notes (001): Basic Concepts and the Bellman Equation]]></title>
            <link>http://www.myhz0606.com/article/RL_001</link>
            <guid>http://www.myhz0606.com/article/RL_001</guid>
            <pubDate>Sat, 30 Aug 2025 16:00:00 GMT</pubDate>
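            <description><![CDATA[This article systematically introduces the core concepts of reinforcement learning, including states, actions, policies, rewards and returns. It then shows how to model a reinforcement learning problem mathematically as a Markov decision process (MDP), and finally derives the Bellman equations for the state-value function and the action-value function.]]></description>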
            <content:encoded><![CDATA[<div id="notion-article" class="mx-auto overflow-hidden "><main class="notion light-mode notion-page notion-block-2603c18ff81c805699f6d1b78c40a910"><div class="notion-viewport"></div><div class="notion-collection-page-properties"></div><div class="notion-text notion-block-2613c18ff81c801ca3cfdfd2b2955052">参考教材：
1. <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf">《Reinforcement Learning: An Introduction》</a>
2. <a target="_blank" rel="noopener noreferrer" class="notion-link" href="https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning">《Mathematical Foundations of Reinforcement Learning》</a></div><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-2613c18ff81c8033ac70ecd917b76ab5" data-id="2613c18ff81c8033ac70ecd917b76ab5"><span><div id="2613c18ff81c8033ac70ecd917b76ab5" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c8033ac70ecd917b76ab5" title="1 Basic Concepts of Reinforcement Learning"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">1 Basic Concepts of Reinforcement Learning</span></span></h2><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-2613c18ff81c8021a58aef79ef8e1c7c" data-id="2613c18ff81c8021a58aef79ef8e1c7c"><span><div id="2613c18ff81c8021a58aef79ef8e1c7c" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c8021a58aef79ef8e1c7c" title="1.1 Terminology"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">1.1 Terminology</span></span></h3><div class="notion-text 
notion-block-2613c18ff81c80cfb917eaa8852ef8c4">在强化学习中，可以将学习过程看作<b>智能体（agent）与环境（environment）的交互过程</b>。智能体能够感知环境状态、做出决策、执行动作，从而影响环境并获得奖励。为了更好地描述这一过程，通常定义以下基本概念：</div><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80c9977dd46c6e79c034"><li><b>状态（State, </b><em><b>S</b></em><b>）</b>：表示环境的随机变量。某个具体的状态记为小写 <!-- -->，其中 <!-- -->是<b>状态空间</b>。</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80bbb8bddb72e6d183f5"><li><b>动作（Action, </b><em><b>A</b></em><b>）</b>：表示智能体可采取的动作的随机变量。具体动作记为<!-- -->，其中<!-- -->是<b>动作空间</b>。</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80338cc2e095bfbf4d63"><li><b>状态转移（State Transition）</b>：当智能体在时刻<!-- -->处于状态<!-- --> 并采取动作<!-- -->后，环境转移到新的状态<!-- -->，记为</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c802687b5f4bc5d2fc004"><li><b>策略（Policy, </b><em></em><b>）</b>：描述智能体的决策规则，是一个条件概率分布, 它表示在状态<!-- -->下选择动作<!-- -->的概率：</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80bab54ce490518ac8ff"><li><b>奖励（Reward, </b><em></em><b>）</b>：环境在状态<!-- -->下对智能体采取动作<!-- -->所反馈的随机变量。奖励的一个取值记为<!-- -->，<!-- --> 是奖励空间，常写作<!-- -->。</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80c290cdc9a8f584d4e8"><li><b>轨迹（Trajectory, </b><em><b>τ</b></em><b>）</b>：智能体与环境交互所产生的状态-动作-奖励序列，例如：</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80578676c9b3c8dbb834"><li><b>回报（Return, </b><em><b>Gt</b></em><b>）</b>：从时刻 <em>t</em> 开始累积的奖励，用于衡量长期收益。</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c8086bad2c44f635d2c68"><li><b>即时奖励（Immediate Reward）</b>：当前时刻的奖励<!-- -->。</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c805fa388d7558f25d43c"><li><b>未来奖励（Future Reward）</b>：从未来时刻累积的奖励。</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c803ab3eefa60a4ac61d6"><li><b>折扣回报（Discounted Return）</b>：为了避免无限轨迹回报发散，并调节对近期与远期奖励的重视程度，引入折扣因子<!-- -->：</li></ul><ul 
class="notion-list notion-list-disc notion-block-2613c18ff81c80b682a4de42d6b9fca8"><li><b>回合（Episode）</b>：从初始状态到终止状态的完整过程，即一条有限长度的轨迹。</li><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80b682a4de42d6b9fca8"><li><b>Episodic Task（回合式任务）</b>：具有明确起点和终点。</li><li><b>Continuing Task（持续性任务）</b>：没有终止状态，交互过程无限持续。</li></ul></ul><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-2613c18ff81c80c6933fcc0e3199d96c" data-id="2613c18ff81c80c6933fcc0e3199d96c"><span><div id="2613c18ff81c80c6933fcc0e3199d96c" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c80c6933fcc0e3199d96c" title="2 马尔可夫决策过程"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">2 马尔可夫决策过程</span></span></h2><div class="notion-text notion-block-2613c18ff81c804dbcd8c08ed6967521">通常我们会用马尔可夫决策过程(markov decision process （MDP）)来来描述强化学习。MDP是描述随机系统的一般框架。它由以下关键要素组成</div><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80b88206e13bbff993c6"><li>集合（Sets）</li><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80b88206e13bbff993c6"><li>状态空间集合<!-- -->。所有状态的集合</li><li>动作空间集合<!-- -->。与每一个状态<em></em>关联的动作集合，<em></em>是随机变量的一个值</li></ul></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c8003a3aadbd1850548ca"><li>奖励空间集合<!-- -->。与每一个<!-- -->关联的所有奖励集合</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80ebb43bc8ba01d0f0b9"><li>模型（model）</li><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80ebb43bc8ba01d0f0b9"><li>状态转移概率，</li><li>奖励概率，<em></em></li></ul></ul><ul 
class="notion-list notion-list-disc notion-block-2613c18ff81c8049b27fdb9f42996cab"><li>策略（policy）：在状态<em></em>下，agent采取动作<em>a</em>的概率<em></em></li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80d39603f4e77347e110"><li>马尔可夫性质(markov property)：下一个状态和奖励仅依赖当前时刻的状态和动作，而与之前的状态和动作无关</li></ul><blockquote class="notion-quote notion-block-2613c18ff81c809ab20ae40571cfef8e"><div>马尔可夫过程与马尔可夫决策过程有什么联系？
当马尔可夫决策过程中的<b>策略</b>确定了，此时的马尔可夫决策过程就退化成了马尔科夫奖励过程，如果奖励也去掉，那么就进一步退化为马尔可夫过程。</div></blockquote><h2 class="notion-h notion-h1 notion-h-indent-0 notion-block-2613c18ff81c80be9fa4fbe977aa2918" data-id="2613c18ff81c80be9fa4fbe977aa2918"><span><div id="2613c18ff81c80be9fa4fbe977aa2918" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c80be9fa4fbe977aa2918" title="3 贝尔曼方程（Bellman Equation）"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3 贝尔曼方程（Bellman Equation）</span></span></h2><div class="notion-text notion-block-2613c18ff81c80e784e5d2be938e0976">在强化学习中，马尔可夫决策过程（MDP）是建模智能体与环境交互的核心<b>数学模型</b>，而贝尔曼方程（Bellman Equation）是求解 MDP 的<b>关键工具</b>。
The core goal of reinforcement learning is to learn, through the interaction between the agent and the environment, an optimal policy that maximizes the cumulative return.
In practice, because state transitions are stochastic, many trajectories can start from the same state and each has a different return, <b>so the optimization objective is usually to maximize the expected return</b>.</div><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-2613c18ff81c8033ba9fe3e8650bbc60" data-id="2613c18ff81c8033ba9fe3e8650bbc60"><span><div id="2613c18ff81c8033ba9fe3e8650bbc60" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c8033ba9fe3e8650bbc60" title="3.1 State-Value Function and the Bellman Equation"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3.1 State-Value Function and the Bellman Equation</span></span></h3><div class="notion-text notion-block-2613c18ff81c8053b460c3f20d74626b">Under a given policy<!-- -->, the expected cumulative return obtained when starting from a given state is defined as the <b>state-value function</b>, also called the state value for short.</div><div class="notion-text notion-block-2613c18ff81c80f091adcc0ca6dc0ba7">⚠️In this expression, the uppercase symbol<em></em> is a random variable and the lowercase symbol<em></em> is a value taken by that random variable.</div><div class="notion-text notion-block-2613c18ff81c8046bcc6fa3ac87161c1">The optimal policy<!-- --> can be defined as the policy that maximizes the state-value function.</div><div class="notion-text notion-block-2613c18ff81c804c9dd6da5c15a149e6">The discounted return is defined as:</div><ul class="notion-list notion-list-disc notion-block-2613c18ff81c808cbfd6d154be41c917"><li>The first term is the immediate reward.</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80adba71c8ef3c413b44"><li>The remaining part is the future reward.</li></ul><div class="notion-text notion-block-2613c18ff81c80f1b99ce6fe00a17918">Using the definition of the discounted return, the state-value function splits into the expectation of the immediate reward and the expectation of the future return:</div><div class="notion-text notion-block-2613c18ff81c805785a6e15436eeda07">First, consider the first term, the expected immediate reward:</div><div class="notion-text notion-block-2613c18ff81c803da359ec6ea7d4d95b">Next, consider the second term, the expected future return:</div><div class="notion-text 
notion-block-2613c18ff81c80d6917bfe7ee6760a5f">将结果带入原式中，我们得到值函数形式如下，此时的值函数方程也称为<span class="notion-yellow_background">贝尔曼方程</span>。</div><div class="notion-text notion-block-2613c18ff81c808e9101d0571535486d">贝尔曼方程描述了不同状态值之间的关系。式中<span style="padding:0.5em"></span>代表系统模型。通过联立所有状态值<!-- -->我们可以求解出贝尔曼方程所有状态的值。</div><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-2613c18ff81c803fa577cef77b1a0fe9" data-id="2613c18ff81c803fa577cef77b1a0fe9"><span><div id="2613c18ff81c803fa577cef77b1a0fe9" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c803fa577cef77b1a0fe9" title="3.2 贝尔曼方程的联合概率形式"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3.2 贝尔曼方程的联合概率形式</span></span></h3><div class="notion-text notion-block-2613c18ff81c80218f86c23e41c9179a">已知
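A sketch of the identities this step relies on, assuming the model is given as the joint distribution p(s', r | s, a). The symbols s', r, a, π, γ and v_π follow common textbook notation and are assumptions here. The transition and reward probabilities are marginals of that joint distribution, and substituting them into the Bellman equation gives the joint-probability form:

```latex
% Marginals of the joint model p(s', r | s, a)
\[ p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a), \qquad p(r \mid s, a) = \sum_{s'} p(s', r \mid s, a) \]
% Bellman equation written with the two marginals
\[ v_\pi(s) = \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big] \]
% The same equation after substituting the marginals
\[ v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma\, v_\pi(s') \big] \]
```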
</div><div class="notion-text notion-block-2613c18ff81c800eb13fe48193c7b130">Substituting into the original Bellman equation gives</div><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-2613c18ff81c8016a35cc202bae4ef36" data-id="2613c18ff81c8016a35cc202bae4ef36"><span><div id="2613c18ff81c8016a35cc202bae4ef36" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c8016a35cc202bae4ef36" title="3.3 Vector form of the Bellman equation"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3.3 Vector form of the Bellman equation</span></span></h3><div class="notion-text notion-block-2613c18ff81c8095af99ec3bc30f5384">As noted above, a value function (Bellman equation) can be written for every state<!-- -->. Assuming a finite number of states<!-- -->, these equations form the following system:</div><div class="notion-text notion-block-2613c18ff81c80f28ec3f1b2b0c12a91">For convenience, define the vectorized quantities</div><div class="notion-text notion-block-2613c18ff81c80979298c0a8c9b3429f">The Bellman equation can then be written as</div><div class="notion-text notion-block-2613c18ff81c804380e9d64bf3ce4b47">Stacking it over every state<!-- --> gives the system</div><div class="notion-text notion-block-2613c18ff81c80f0b04de0b12d0cc6e4">In matrix form, this system can be written as</div><div class="notion-text notion-block-2613c18ff81c80c086e8c19669a8d01e">that is,</div><div class="notion-text notion-block-2613c18ff81c8037971beaa804614425">
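The matrix-vector form is a linear system in the state-value vector. As a minimal sketch, assuming it reads v = r + gamma * P v, the code below solves it directly and by fixed-point iteration on a made-up 3-state example. The names P, r, gamma and both helper functions are illustrative assumptions, not taken from the article.

```python
def mat_vec(P, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def solve_closed_form(P, r, gamma):
    """Solve (I - gamma * P) v = r by Gauss-Jordan elimination with partial pivoting."""
    n = len(r)
    # Build the augmented matrix [I - gamma * P | r].
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)] + [float(r[i])]
         for i in range(n)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda i: abs(A[i][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [x / pivot_value for x in A[col]]
        # Eliminate this column from every other row.
        for i in range(n):
            if i != col:
                factor = A[i][col]
                A[i] = [x - factor * y for x, y in zip(A[i], A[col])]
    return [row[n] for row in A]


def solve_iterative(P, r, gamma, tol=1e-10, max_iter=10000):
    """Iterate v_next = r + gamma * P v from v = 0 until the update is below tol."""
    v = [0.0] * len(r)
    for _ in range(max_iter):
        v_next = [r_i + gamma * pv_i for r_i, pv_i in zip(r, mat_vec(P, v))]
        converged = tol > max(abs(a - b) for a, b in zip(v_next, v))
        v = v_next
        if converged:
            break
    return v


# Hypothetical 3-state chain under a fixed policy; state 2 is absorbing with reward 1.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
r = [0.0, 0.0, 1.0]
gamma = 0.9

v_closed = solve_closed_form(P, r, gamma)
v_iter = solve_iterative(P, r, gamma)
print(v_closed)
print(v_iter)
```

Both routines should agree up to the iteration tolerance. The direct solve costs roughly cubic time in the number of states, while each iteration step only needs one matrix-vector product, which is why the iterative route is usually preferred for large state spaces.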
</div><div class="notion-text notion-block-2613c18ff81c80eb9c6cea0ed5c3f675">The vector form of the Bellman equation above is essentially a system of linear equations, and there are two ways to solve it.</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-2613c18ff81c8037a980c6e97faa712e" data-id="2613c18ff81c8037a980c6e97faa712e"><span><div id="2613c18ff81c8037a980c6e97faa712e" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c8037a980c6e97faa712e" title="3.3.1 Solving the vector-form Bellman equation"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3.3.1 Solving the vector-form Bellman equation</span></span></h4><div class="notion-text notion-block-2613c18ff81c80348cfdeeb1a2e0c1f2"><b>3.3.1.1 Closed-form solution</b></div><div class="notion-text notion-block-2613c18ff81c80758bd5d1b0dce2f1a8">The closed-form solution above has several properties:</div><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80018a38ef9c378c37e3"><li>The coefficient matrix of the linear system is invertible. (This can be proved with the Gershgorin circle theorem; the proof is not pursued here.)</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80db862ac4a60cd57100"><li><em>. 
</em>(This follows from the series expansion of the matrix inverse, because<em></em>, <!-- -->.)</li></ul><ul class="notion-list notion-list-disc notion-block-2613c18ff81c80afa87aeefbb3df5472"><li>For any vector<!-- --> we have<em>, </em>and if<em></em> then<em></em></li></ul><div class="notion-blank notion-block-2613c18ff81c80cd8e71ef0ecb5f0b5e"> </div><div class="notion-text notion-block-2613c18ff81c801492e3c5749f610f7a"><b>3.3.1.2 Numerical solution</b></div><div class="notion-text notion-block-2613c18ff81c800d93e7f2aa078c3ad4">The solution can also be computed iteratively with the following update</div><div class="notion-text notion-block-2613c18ff81c80e5ad13f8d5d7dcfa1d">Starting from an arbitrary initial guess<em>,</em> the iteration above recovers the true state values<!-- -->, that is,</div><div class="notion-text notion-block-2613c18ff81c80e5b692cea1578821c0">Why is the iterate<!-- --> guaranteed to converge to the true values? Proof:</div><div class="notion-text notion-block-2613c18ff81c8024a6cbc12e2d8403e5">Define the error<!-- -->; it suffices to show that it tends to zero<!-- -->.
Substituting the error into the iteration gives</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-2613c18ff81c8049b080e598170917dc" data-id="2613c18ff81c8049b080e598170917dc"><span><div id="2613c18ff81c8049b080e598170917dc" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c8049b080e598170917dc" title="3.4 Bellman equation for the action-value function"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3.4 Bellman equation for the action-value function</span></span></h4><div class="notion-text notion-block-2613c18ff81c80d3ac9fd0c769e8b2af">The action-value function (action value for short) is another very important concept in reinforcement learning. It is defined as</div><div class="notion-text notion-block-2613c18ff81c80eaa9f7d07e6644e02c">It is the expected return obtained when, in state<!-- -->, the agent takes action<!-- --> and then follows policy<!-- -->.</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-2613c18ff81c80d490fdcfaef349a1f0" data-id="2613c18ff81c80d490fdcfaef349a1f0"><span><div id="2613c18ff81c80d490fdcfaef349a1f0" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c80d490fdcfaef349a1f0" title="3.4.1 Relationship between action values and state values"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3.4.1 Relationship between action values and state values</span></span></h4><div class="notion-text 
notion-block-2613c18ff81c80efaf40c6dc3fd725bf">The state value equals the weighted average of all action values in that state, where the weights are given by the policy<!-- -->.</div><h4 class="notion-h notion-h3 notion-h-indent-2 notion-block-2613c18ff81c802e8de8d599c25d4656" data-id="2613c18ff81c802e8de8d599c25d4656"><span><div id="2613c18ff81c802e8de8d599c25d4656" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c802e8de8d599c25d4656" title="3.4.2 Bellman equation for action values"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">3.4.2 Bellman equation for action values</span></span></h4><div class="notion-text notion-block-2613c18ff81c806097a2fb1fde2e540a">From Eq. (8) and Eq. (18), the Bellman equation for action values can be written as</div><h3 class="notion-h notion-h2 notion-h-indent-1 notion-block-2613c18ff81c80269e01e0ff15978c1e" data-id="2613c18ff81c80269e01e0ff15978c1e"><span><div id="2613c18ff81c80269e01e0ff15978c1e" class="notion-header-anchor"></div><a class="notion-hash-link" href="#2613c18ff81c80269e01e0ff15978c1e" title="Summary"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">Summary</span></span></h3><div class="notion-text 
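notion-block-2613c18ff81c805c8012e9310a7bcaf7">This article systematically introduced the core concepts of reinforcement learning, including basic terms such as state, action, policy, reward, and return. It then showed how to model a reinforcement learning problem mathematically as a Markov decision process (MDP), and finally derived the Bellman equations for the state-value and action-value functions.</div><blockquote class="notion-quote notion-block-2613c18ff81c80379c7bedea8cef4f91"><div>If you spot any misunderstandings or typos, please point them out. Thank you very much!</div></blockquote></main></div>]]></content:encoded>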
        </item>
        <item>
            <title><![CDATA[RL Study Notes (005): A Theoretical Analysis of TRPO]]></title>
            <link>http://www.myhz0606.com/article/trpo</link>
            <guid>http://www.myhz0606.com/article/trpo</guid>
            <pubDate>Thu, 19 Mar 2026 16:00:00 GMT</pubDate>
            <description><![CDATA[Starting from the motivation behind TRPO, this article derives the design of the TRPO algorithm and its details step by step.]]></description>
            <content:encoded><![CDATA[<div id="notion-article" class="mx-auto overflow-hidden "><main class="notion light-mode notion-page notion-block-23a3c18ff81c80e596c8ee448a386538"><div class="notion-viewport"></div><div class="notion-collection-page-properties"></div><h3 class="notion-h notion-h2 notion-h-indent-0 notion-block-3293c18ff81c80eca573df44e6f28531" data-id="3293c18ff81c80eca573df44e6f28531"><span><div id="3293c18ff81c80eca573df44e6f28531" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3293c18ff81c80eca573df44e6f28531" title="Background"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">Background</span></span></h3><div class="notion-text notion-block-3293c18ff81c8054bcd4ff3e00cc8080">According to the policy gradient algorithm,</div><div class="notion-text notion-block-3293c18ff81c8012902bc9564bcbe8b2">we can update the policy by gradient ascent:</div><div class="notion-text notion-block-3293c18ff81c804f926ff9e1f7180413">In practice, because the objective is non-convex, choosing <!-- -->(the step size, or learning rate) is difficult.</div><ul class="notion-list notion-list-disc notion-block-3293c18ff81c804caf8df4a419d4079a"><li>If the step size<!-- --> is small, the first-order approximation holds, but training is inefficient.</li></ul><ul class="notion-list notion-list-disc notion-block-3293c18ff81c803aa082e7b2dd9553dc"><li>If the step size<!-- --> is large, the non-convexity of the objective means there is no guarantee that the updated policy is better than the old one. If the updated policy is worse, the data sampled in the next round also gets worse, creating a vicious cycle.</li></ul><div class="notion-text notion-block-3293c18ff81c80e495d7f7934ecb65fd">This leads to the motivation for TRPO: how to make the updated policy better than the old one while keeping training efficient.</div><h3 class="notion-h notion-h2 notion-h-indent-0 notion-block-3293c18ff81c80b282ece4f4d482acef" data-id="3293c18ff81c80b282ece4f4d482acef"><span><div id="3293c18ff81c80b282ece4f4d482acef" 
class="notion-header-anchor"></div><a class="notion-hash-link" href="#3293c18ff81c80b282ece4f4d482acef" title="The TRPO algorithm"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">The TRPO algorithm</span></span></h3><div class="notion-text notion-block-3293c18ff81c80efbca5c401c25e12de">By the performance difference lemma,</div><div class="notion-text notion-block-3293c18ff81c8010972ecd00902c80a9">the improvement of the new policy over the old policy satisfies the following expression (<!-- --> is the new policy corresponding to <!-- -->, and <!-- --> is the old policy corresponding to <!-- -->)</div><div class="notion-text notion-block-3293c18ff81c809eb8bce08207484850">The performance improvement of the new policy <!-- --> over the old policy <!-- --> equals the <b>expected cumulative discounted advantage</b> obtained by &quot;<b>sampling with the new policy</b> and then evaluating with the advantage function of the old policy&quot;.</div><div class="notion-text notion-block-3293c18ff81c8027b5e5eb5bb237c013">This expression has a catch, however: before the parameters are updated, the new policy<!-- --> is unknown, so no data can be sampled from it<!-- -->.</div><div class="notion-text notion-block-3293c18ff81c80ca8566fc26d0bd8bec">A naive way around this problem is rollback.</div><div class="notion-text notion-block-3293c18ff81c807f948fd8d86ebc9a84">Procedure:</div><ol start="1" class="notion-list notion-list-numbered notion-block-3293c18ff81c800ea57ce355a5bb19ae" style="list-style-type:decimal"><li>First update the parameters with the policy gradient.</li></ol><ol start="2" class="notion-list notion-list-numbered notion-block-3293c18ff81c80638d71da99903e28ee" style="list-style-type:decimal"><li>Sample from the environment with the updated policy<!-- --> and compute the performance improvement using Eq. (3).</li></ol><ol start="3" class="notion-list notion-list-numbered notion-block-3293c18ff81c80e7b3a4ffb1ac0b8f4f" style="list-style-type:decimal"><li>If there is no improvement, reject the update (rollback).</li></ol><div class="notion-text 
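notion-block-3293c18ff81c800e965ee8e38c936337">This approach has one obvious drawback: <b>sampling is expensive</b>. Take an autoregressive model as an example: evaluating existing trajectories is only a prefill pass, which is highly parallel, whereas sampling requires serial, token-level generation.</div><div class="notion-text notion-block-3293c18ff81c8049860cc3cb1bccc083">To solve this problem, TRPO introduces two tricks.</div><div class="notion-text notion-block-3293c18ff81c8029bb74de519a11a659">The core assumption of TRPO: <b>if the new policy differs only slightly from the old one</b>, we may assume that their state distributions are also close, that is,</div><blockquote class="notion-quote notion-block-3293c18ff81c80feac4feebcfbb53a3d"><div>Definition of the state distribution:</div></blockquote><div class="notion-text notion-block-3293c18ff81c80e9b65eeaaa890360ce">Substituting this back into Eq. (3) and applying importance sampling, we denote the resulting approximation of<!-- --> by</div><ul class="notion-list notion-list-disc notion-block-3293c18ff81c807c9b77e1838485aa6b"><li>The term<!-- --> in this expression is called the importance ratio.</li></ul><div class="notion-text notion-block-3293c18ff81c809b897bcc5c59400f6e">With this treatment, <b>the performance improvement of the new policy can be evaluated using only data sampled from the old policy</b>.</div><div class="notion-text notion-block-3293c18ff81c8030b038c9a28530611f">In the derivation above, however, we used<!-- --> to approximate<!-- -->, so<!-- --> deviates from the true<!-- -->. Intuitively, the more<!-- --> and<!-- --> differ, the larger the error. We now analyze this error.</div><div class="notion-text notion-block-3293c18ff81c807cbc81d97d6c2c4354">From the definition of the advantage function, one can show that<!-- -->, that is,<!-- -->. Substituting this into the expression above gives</div><div class="notion-text notion-block-3293c18ff81c806a9b39f75c51dfc974">Let</div><div class="notion-text notion-block-3293c18ff81c80a18e3cfd3f35416e2c">To continue the derivation we need the total variation divergence (TV divergence). For two discrete probability distributions<!-- --> and<!-- -->, the TV divergence is defined as:</div><div class="notion-text notion-block-3293c18ff81c80ae9096db2fe0528901">The expression above can therefore be rewritten as:</div><div class="notion-text notion-block-3293c18ff81c80d4bf25cb5556a14724">By the State Distribution Difference Lemma,</div><div class="notion-text notion-block-3293c18ff81c8023a3efd0efe6d00956">Substituting back gives</div><div class="notion-text notion-block-3293c18ff81c8055acf7c6db92e19711">In machine learning the TV divergence is inconvenient to differentiate because it involves absolute values. By Pinsker's inequality,</div><div class="notion-text notion-block-3293c18ff81c80398429f2708b717a40">Introducing the shorthand<!-- --> and substituting back gives</div><blockquote class="notion-quote 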
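notion-block-3293c18ff81c8021a91ae9e6c6c0c2b8"><div>⚠️ Note that the result derived above is larger than that of the original paper by a factor of two, because the original paper uses the inequality<!-- -->, whereas this article uses the more precise form:</div></blockquote><div class="notion-text notion-block-3293c18ff81c80eb893ec4305369b734">Finally, expanding the absolute value</div><div class="notion-text notion-block-3293c18ff81c80a0b57ec435bad9868d">gives a lower bound on the true performance<!-- --></div><div class="notion-text notion-block-3293c18ff81c800980e8ea3dfd5ab4d0">This means we can take maximizing the right-hand side as the objective, which guarantees that<!-- --> is monotonically non-decreasing.</div><div class="notion-text notion-block-3293c18ff81c805f9f1ff97f7c23cfd4">Writing this as an optimization over parameters, let</div><div class="notion-text notion-block-3293c18ff81c80f1860ec656567e8249">The theoretical optimization objective over the parameters is:</div><ul class="notion-list notion-list-disc notion-block-3293c18ff81c806591bdddc1b7ba3b43"><li>: parameters of the old policy</li></ul><ul class="notion-list notion-list-disc notion-block-3293c18ff81c801c9933e7f1dc58a2f0"><li>: parameters of the new policy</li></ul><ul class="notion-list notion-list-disc notion-block-3293c18ff81c809a9017f802f64fa480"><li>: the penalty coefficient</li></ul><div class="notion-text notion-block-3293c18ff81c8002b7a7facac6eb0c45">The theory above looks perfect, but it is not feasible in practice.</div><div class="notion-text notion-block-3293c18ff81c804db02cd592531c7cfc">First, a single training iteration has only a finite number of trajectories, so it is hard to traverse all states to find the maximum<!-- -->.</div><div class="notion-text notion-block-3293c18ff81c80948f4edd894e5b73e4">Second, the penalty coefficient<!-- --> is too large, which makes the gradient of the KL divergence term too large and the model updates overly conservative.</div><div class="notion-text notion-block-3293c18ff81c80bd81e7c14f56e5f68f">So although the theory is elegant, it cannot be used directly in an implementation. The authors therefore introduce two tricks.</div><div class="notion-text notion-block-3293c18ff81c809da4a5eaa669c57a57"><b>Trick 1: Use Lagrangian duality to turn the penalty into a constraint</b></div><div class="notion-text notion-block-3293c18ff81c80869367cf838ca28e89">Since treating the KL divergence as a subtracted penalty term leads to update steps that are too small, we instead pull it out and turn the original problem into a constrained optimization problem.</div><div class="notion-text notion-block-3293c18ff81c80daac70da1945a3c05e"><b>Trick 2: Use the average KL divergence rather than the max</b></div><div class="notion-text notion-block-3293c18ff81c80e5a486f101640b450e">Specifically, obtaining<!-- --> originally requires traversing all states, which is intractable in practice when the state space is large. In addition, the term<!-- --> is very sensitive to noise.</div><div class="notion-text 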
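notion-block-3293c18ff81c8070bb9efe7f699ee197">We therefore adopt a heuristic approximation: instead of requiring<!-- -->, we require<!-- -->, where</div><blockquote class="notion-quote notion-block-3293c18ff81c80398ba8c022b871f78f"><div>This substitution is an engineering compromise that, to some extent, breaks the theoretical monotonicity guarantee. In the authors' words: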
This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence</div></blockquote><div class="notion-text notion-block-3293c18ff81c806e8885f56da1768f31">Putting the discussion together, the final TRPO optimization problem is:</div><div class="notion-text notion-block-3293c18ff81c80e68535f93962aae079">The solution space enclosed by this constraint is called the trust region.</div><h3 class="notion-h notion-h2 notion-h-indent-0 notion-block-3293c18ff81c80aea64be482cefe8c30" data-id="3293c18ff81c80aea64be482cefe8c30"><span><div id="3293c18ff81c80aea64be482cefe8c30" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3293c18ff81c80aea64be482cefe8c30" title="Solving the TRPO problem"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">Solving the TRPO problem</span></span></h3><div class="notion-text notion-block-3293c18ff81c8013b398d163fc241c05">The constrained nonlinear objective in Eq. (19) cannot be optimized directly by gradient descent. TRPO handles it as follows.</div><div class="notion-text notion-block-3293c18ff81c801aad82cacd4ccbb6ca">First, approximate the objective by a first-order Taylor expansion at<!-- --></div><div class="notion-text notion-block-3293c18ff81c807e84a1f2d5ae67d46c">To simplify notation, let</div><div class="notion-text notion-block-3293c18ff81c80dc94f3f0fbe07c8d19">Substituting back gives</div><div class="notion-text notion-block-3293c18ff81c80888011f7e2cde51b31">Next, take a second-order Taylor expansion of the constraint<!-- --> at<!-- --></div><ul class="notion-list notion-list-disc notion-block-3293c18ff81c80d48e75f934d6d93c7a"><li>The matrix in the second-order term is the Hessian.</li></ul><div class="notion-text 
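notion-block-3293c18ff81c800fa573ca92299c8f69">At the expansion point the two distributions in the KL divergence are identical, which also means the KL divergence attains its global minimum, so we obtain</div><div class="notion-text notion-block-3293c18ff81c80709824c2041f939bf1">Substituting back gives</div><div class="notion-text notion-block-3293c18ff81c80aaaca4d1c13247ff98">The optimization problem can now be rewritten as</div><div class="notion-text notion-block-3293c18ff81c80d88135f09fd1ba4cd4">With these steps, a complicated nonlinear optimization problem becomes a classic problem with a linear objective and a quadratic constraint.</div><div class="notion-text notion-block-3293c18ff81c80b49e4cd8f097d8e691">Its closed-form solution can be found with the method of Lagrange multipliers.</div><div class="notion-text notion-block-3293c18ff81c80e8bee6fdade9c872a9">We construct the Lagrangian</div><div class="notion-text notion-block-3293c18ff81c801b83c5c69934c33c01">Take the partial derivatives with respect to<span style="padding:0.5em"></span> and set them to zero</div><div class="notion-text notion-block-3293c18ff81c80c4969fe353d09932be">Solving these jointly gives</div><div class="notion-text notion-block-3293c18ff81c80c792b0c5b75761ca07">Here<!-- --> is the natural gradient direction, and<!-- --> is the natural gradient direction after scaling by<!-- -->.</div><div class="notion-text notion-block-3293c18ff81c806ba91afc60a4318e50">This finally gives the closed-form solution for<!-- --></div><div class="notion-text notion-block-3293c18ff81c80e09535e88a94f75a9a">Using the closed-form solution directly is not feasible in practice. Suppose<!-- --> (for a model this<!-- --> is very large); then<!-- -->, so GPU memory alone may become a bottleneck, let alone computing the inverse. In practice the authors use the conjugate gradient (CG) algorithm to estimate the natural gradient.</div><div class="notion-text notion-block-3293c18ff81c805e987df9a5b249127d">Let us see how to compute it with CG.</div><div class="notion-text notion-block-3293c18ff81c80489cbde117790cdd2c">Our goal is to solve (<!-- -->)</div><div class="notion-text notion-block-3293c18ff81c8082887dea2840b658ca">The conventional approach involves computing the Hessian, which is prohibitively expensive. We therefore transform the problem; clearly, the equation above can be written as</div><div class="notion-text notion-block-3293c18ff81c80228549fe40d23586e3">The following transformation of<!-- --> is the key step. For notational convenience, define:</div><div class="notion-text notion-block-3293c18ff81c80ca8375d49c8e952883">so<!-- --> can be written as</div><div class="notion-text notion-block-3293c18ff81c803289d1f744f114931b">and<!-- -->, which greatly reduces the computational cost (the same technique is used in sliced score matching). This is known as the Hessian-vector product trick.</div><div class="notion-text notion-block-3293c18ff81c80959289d82c17fcf3d1">The problem we finally need to solve is therefore</div><div class="notion-text 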
notion-block-3293c18ff81c806e81eeed09f1fc8d1d">Solving Eq. (34) can be recast as finding the zero of the function<!-- -->, or equivalently as finding the minimizer of the function<!-- -->.</div><div class="notion-text notion-block-3293c18ff81c807cace2f132e27e0c4c">Solving Eq. (34) is therefore equivalent to minimizing a quadratic function</div><div class="notion-text notion-block-3293c18ff81c8062b097f421ea02dfe8">which can be solved with the CG algorithm.</div><div class="notion-text notion-block-3293c18ff81c80178423d29ff5e3d3c6">The CG output is the final solution, that is,</div><div class="notion-text notion-block-3293c18ff81c8002a022c80a942e27e0">Substituting it back into Eq. (39) gives the actual parameter update direction and step size</div><div class="notion-text notion-block-3293c18ff81c8023bd53e5001c762c76">Therefore</div><div class="notion-text notion-block-3293c18ff81c80c1946df956a11b05d4">Here<!-- --> determines the direction of the update and<!-- --> determines its step size; together they determine how the policy parameters are updated.</div><div class="notion-text notion-block-3293c18ff81c802aa15bdd56033c86d1">Finally, because the optimization problem relies on Taylor approximations, the result does not necessarily satisfy the original constraint. In practice the authors additionally use a line search as a check. Specifically:</div><ul class="notion-list notion-list-disc notion-block-3293c18ff81c8088ba70fa64b078c471"><li>Step 0: initialize the step size to the maximum step size.</li></ul><ul class="notion-list notion-list-disc notion-block-3293c18ff81c8060ac00fc4aa9d5a3b8"><li>Step 1: update the policy parameters according to the step size.</li></ul><ul class="notion-list notion-list-disc notion-block-3293c18ff81c804aae68d8b73c9daff7"><li>Step 2: plug the result into the original constrained problem of Eq. (19) and check whether the trust region constraint is satisfied and whether performance improves.</li></ul><ul class="notion-list notion-list-disc notion-block-3293c18ff81c8037a2cdf92e9302276f"><li>Step 3: if both hold, accept the update; otherwise reduce the step size and repeat Steps 1-2.</li></ul><h3 class="notion-h notion-h2 notion-h-indent-0 notion-block-3293c18ff81c801fb9ded4b410d1e363" data-id="3293c18ff81c801fb9ded4b410d1e363"><span><div id="3293c18ff81c801fb9ded4b410d1e363" class="notion-header-anchor"></div><a class="notion-hash-link" href="#3293c18ff81c801fb9ded4b410d1e363" title="Summary"><svg viewBox="0 0 16 16" width="16" height="16"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 
00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a><span class="notion-h-title">Summary</span></span></h3><div class="notion-text notion-block-3293c18ff81c80f58481e2ceed9a1eed">Starting from the motivation behind TRPO, this article derived the design of the TRPO algorithm and its details step by step. If you find anything incorrect, please point it out.</div><div class="notion-blank notion-block-3293c18ff81c8088835ddeadf8034484"> </div></main></div>]]></content:encoded>
        </item>
    </channel>
</rss>