LLMs / RL: Translation and Commentary on "LightSearcher: Efficient DeepSearch via Experiential Memory"

Overview: LightSearcher is a novel reinforcement learning (RL) framework designed to resolve the inherent tension between accuracy and efficiency that large language models face under the DeepSearch paradigm. It introduces a textual experiential memory that learns and distills contrastive experience from successful and failed reasoning trajectories into interpretable policy guidance, and pairs it with an adaptive reward shaping mechanism that penalizes redundant tool calls only when the answer is correct. Experiments show that LightSearcher matches the accuracy of SOTA baselines while markedly reducing search tool calls, inference time, and token consumption, demonstrating strong efficiency and cross-domain generalization and offering a practical route to more efficient, smarter search-augmented AI systems, although its training cost and extension to broader application scenarios still leave room for exploration.

Background and pain points
● Knowledge-boundary limits: deep reasoning models are constrained by their parametric knowledge and struggle to access up-to-date, domain-specific, or fact-intensive information, which limits the depth and factual reliability of their reasoning.
● Accuracy-efficiency trade-off: RL-based DeepSearch systems exhibit a see-saw trade-off between accuracy and efficiency; frequent tool invocations improve accuracy but incur unnecessary computational overhead and degraded efficiency.
● Over-reliance on search tools: models often issue excessive, indiscriminate search calls, retrieving even for queries that their intrinsic parametric knowledge could answer.
● Flawed RL reward design: this over-reliance stems from standard RL reward functions that prioritize answer correctness; to maximize accuracy, models raise tool-call frequency, inflating compute, token consumption, and inference latency.
● Limits of existing remedies: simple efficiency penalties often degrade performance, and approaches built on heuristic prompts or manual annotation lack adaptivity to dynamic queries and are incompatible with RL-driven DeepSearch training.

Proposed solution
● The LightSearcher framework: an efficient RL framework that combines textual experiential memory with an adaptive reward shaping mechanism to resolve the inherent accuracy-efficiency trade-off in DeepSearch.
● Contrastive experiential reasoning: learns from contrastive reasoning trajectories to generate interpretable summaries of successful reasoning patterns, which form the textual experiential memory.
● Adaptive reward shaping: penalizes redundant tool calls only in correct-answer scenarios, so the efficiency penalty is never applied before accuracy is achieved.

Core ideas and steps
(1) The LightSearcher pipeline
● Contrastive experiential reasoning module: dynamically uses contrastive experience summaries learned from past reasoning trajectories during RL optimization, turning implicit experience into explicit, interpretable textual guidance.
● Adaptive reward shaping module: balances accuracy and efficiency by penalizing excessive search tool calls only in correct-answer scenarios.
● Experience-grounded RL training module: during policy rollouts, injects the accumulated experiential memory and a few-shot example into the prompt to strengthen policy optimization.

(2) Contrastive experiential reasoning
● Contrastive trajectory collection: each training iteration collects the reasoning trajectories generated by the current policy and scores them with the multi-objective reward function; trajectories with a reward of 1 are labeled "good" and trajectories scoring below a threshold are labeled "bad".
● Experience generation: each trajectory is annotated with an explicit textual account of its performance (F1 score, reward, and an explanation); an LLM then contrasts the aggregated high- and low-quality trajectories and produces natural-language guidelines, the "experiences", that describe effective reasoning patterns. The experience memory is updated every 5 steps. A minimal sketch of this collect-classify-update loop follows.
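The paper describes this loop in prose only; the snippet below is a rough Python sketch based on that description, not the authors' code. The class and function names (Trajectory, ExperienceMemory, the llm callable) and the threshold values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

GOOD_REWARD = 1.0        # assumption: "good" trajectories score the maximum reward of 1
BAD_THRESHOLD = 0.3      # assumption: trajectories below this reward count as "bad"
UPDATE_INTERVAL = 5      # the paper states the memory is refreshed every 5 steps


@dataclass
class Trajectory:
    query: str
    reasoning: str       # full reasoning trace, including tool calls
    f1: float
    reward: float


class ExperienceMemory:
    """Textual experiential memory distilled from contrastive trajectories."""

    def __init__(self, llm):
        self.llm = llm                       # any text-generation callable: prompt -> str
        self.experiences: List[str] = []

    def maybe_update(self, step: int, trajectories: List[Trajectory]) -> None:
        if step % UPDATE_INTERVAL != 0:
            return
        good = [t for t in trajectories if t.reward >= GOOD_REWARD]
        bad = [t for t in trajectories if t.reward < BAD_THRESHOLD]
        if not good or not bad:
            return                           # nothing to contrast this round
        prompt = self._contrast_prompt(good, bad)
        # The LLM contrasts high- and low-quality traces and returns
        # natural-language guidelines ("experiences"), one per line.
        self.experiences = [line.strip() for line in self.llm(prompt).splitlines() if line.strip()]

    def _contrast_prompt(self, good: List[Trajectory], bad: List[Trajectory]) -> str:
        def render(ts: List[Trajectory]) -> str:
            return "\n".join(f"[F1={t.f1:.2f}, reward={t.reward:.2f}] {t.reasoning}" for t in ts)
        return (
            "Compare the successful and failed reasoning trajectories below and summarize, "
            "as short guidelines, which reasoning and tool-call patterns work.\n"
            f"### Successful trajectories\n{render(good)}\n"
            f"### Failed trajectories\n{render(bad)}\n"
            "### Guidelines:"
        )
```

In practice the llm argument could be the policy model itself or a separate judge model prompted to emit one guideline per line; the paper does not pin this down.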
(3) Adaptive reward shaping
● Penalty mechanism: for queries that have already been answered correctly, the minimum number of tool calls n required for a correct answer across past training trajectories is recorded. When the current trajectory's F1 score reaches the threshold, an exponentially decaying tool-call term Tool(τ) = e^(−λ·max(0, m − n)) is applied, penalizing any calls m beyond n; if the F1 score is below the threshold, the tool-call term is 0.
● Format reward: format compliance is treated as a hard constraint, scored −1 when violated and 0 when satisfied.
● F1 score: used directly as the task-level accuracy measure.
● Total reward: if the format is invalid, the total reward is −1; otherwise it is W_α·F1(τ) + W_β·Tool(τ), where W_α and W_β are hyperparameters weighting accuracy against tool usage. A sketch of this composition follows.
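To make the reward concrete, here is a minimal Python sketch of the two components and their combination as described above; the hyperparameter values, the F1 threshold, and the function names are illustrative assumptions rather than the paper's settings.

```python
import math

# Illustrative hyperparameters; the paper treats these as tunable.
W_ALPHA = 1.0        # weight on accuracy (F1)
W_BETA = 0.2         # weight on the tool-call term
LAMBDA = 0.5         # decay rate of the tool-call term
F1_THRESHOLD = 0.7   # F1 level at which the answer counts as correct


def tool_term(f1: float, m: int, n: int) -> float:
    """Tool(tau) = exp(-lambda * max(0, m - n)), active only once the answer is correct.

    m: tool calls made in the current trajectory.
    n: minimum tool calls that previously sufficed for a correct answer to this query.
    """
    if f1 < F1_THRESHOLD:
        return 0.0
    return math.exp(-LAMBDA * max(0, m - n))


def total_reward(format_ok: bool, f1: float, m: int, n: int) -> float:
    """R(tau) = -1 if the output format is invalid, else W_alpha * F1 + W_beta * Tool(tau)."""
    if not format_ok:
        return -1.0
    return W_ALPHA * f1 + W_BETA * tool_term(f1, m, n)


# Example: a correct answer (F1 = 0.9) that uses 4 calls where 2 previously sufficed
# earns less than one that also makes do with 2 calls.
print(total_reward(True, 0.9, m=4, n=2))  # 0.9 + 0.2 * exp(-1.0) ≈ 0.974
print(total_reward(True, 0.9, m=2, n=2))  # 0.9 + 0.2 * 1.0  = 1.1
```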
(4) Experience-grounded RL training
● GRPO training: the model is trained with the GRPO algorithm.
● Augmented prompt: at each training iteration, all experiences in the memory bank are inserted into the model's input prompt to provide comprehensive guidance, and one high-quality trajectory is randomly selected as a few-shot example.
● Prompt template: the final augmented prompt combines the instruction, the experiential memory, the few-shot example, and the current query; a sketch of this assembly follows the list.
● Optimization objective: GRPO uses this augmented prompt together with the adaptive multi-objective reward to compute advantages and optimize the policy.
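As a rough illustration of the augmented-prompt assembly described above, here is a small Python sketch; the template wording, field order, and function name are assumptions, not the paper's exact template.

```python
import random
from typing import List


def build_augmented_prompt(instruction: str,
                           experiences: List[str],
                           good_trajectories: List[str],
                           query: str) -> str:
    """Assemble the experience-augmented prompt used during GRPO rollouts.

    All accumulated experiences are included; one high-quality trajectory is
    sampled as a few-shot example, following the paper's description.
    """
    experience_block = "\n".join(f"- {e}" for e in experiences) or "- (none yet)"
    few_shot = random.choice(good_trajectories) if good_trajectories else ""
    return (
        f"{instruction}\n\n"
        f"Experience (guidelines distilled from past trajectories):\n{experience_block}\n\n"
        f"Example of a successful trajectory:\n{few_shot}\n\n"
        f"Question: {query}"
    )


# Example usage with placeholder content.
prompt = build_augmented_prompt(
    instruction="Answer the question; call the search tool only when your own knowledge is insufficient.",
    experiences=["Verify retrieved passages before answering.",
                 "Do not search again once the answer is already supported."],
    good_trajectories=["<think>...</think><search>...</search><answer>...</answer>"],
    query="Who directed the film that won Best Picture in 1998?",
)
print(prompt)
```

Consistent with the paper's observation that the experience text matters less as training converges, the experience block could be dropped at inference time to save tokens.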
Advantages
● Substantially higher efficiency: maintains accuracy comparable to the SOTA baseline ReSearch while reducing search tool calls by 39.6%, inference time by 48.6%, and token consumption by 21.2% on average.
● Balanced accuracy and efficiency: the combination of contrastive experiential memory and adaptive reward shaping resolves the accuracy-efficiency trade-off inherent in DeepSearch.
● Interpretable guidance: learning from contrastive reasoning trajectories yields interpretable summaries of successful reasoning patterns that give explicit guidance for optimizing autonomous search tool invocation.
● Strong generalization: performs well on both in-domain and out-of-domain datasets, indicating that the learned experiences capture fundamental reasoning strategies rather than task-specific patterns.
● No experience text needed at inference: as training progresses, the influence of the experience text on inference performance fades, so the model can drop its dependence on it at inference time and avoid the extra computational cost.
● Adaptive retrieval strategy: LightSearcher adapts robustly to diverse queries; correct answers require markedly fewer tool calls than incorrect ones, indicating a more flexible retrieval strategy.

Conclusions and takeaways (lessons and suggestions)
● Conclusion: LightSearcher advances search-augmented reasoning by integrating contrastive experiential memory and adaptive reward shaping into an RL framework, enabling LLMs to invoke tools strategically while balancing accuracy and efficiency. By learning from past reasoning trajectories, the approach fosters more efficient and adaptive search-tool-augmented AI systems and improves resource utilization across tasks.
● Current limitations: LightSearcher is presently suited to controlled reasoning scenarios and requires additional computational resources during RL training, which may challenge broader scalability.
● Future directions: although validated on multi-hop QA tasks, future work includes extending the approach to domains such as code synthesis and strategic planning.
● Lessons and suggestions: the evolution of the experience memory shows the model shifting its focus from format correctness early on, to verifying retrieved content, to finally optimizing retrieval efficiency; this progressive refinement is what makes it possible to balance efficiency and performance.

Contents
Translation of "LightSearcher: Efficient DeepSearch via Experiential Memory"
Abstract
1 Introduction
Figure 1. Illustration of Excessive Search Tool Usage in existing DeepSearch systems across both Easy and Hard queries, leading to degraded efficiency in deep reasoning models.
7 Conclusion

Translation of "LightSearcher: Efficient DeepSearch via Experiential Memory"
URL: https://arxiv.org/abs/2512.06653
Dates: [v1] December 7, 2025; [v2] December 9, 2025; [v3] December 10, 2025
Authors: 百嘉AI团队 (Baijia AI team), Beijing University of Posts and Telecommunications

Abstract
DeepSearch paradigms have become a core enabler for deep reasoning models, allowing them to invoke external search tools to access up-to-date, domain-specific knowledge beyond parametric boundaries, thereby enhancing the depth and factual reliability of reasoning. Building upon this foundation, recent advances in reinforcement learning (RL) have further empowered models to autonomously and strategically control search tool usage, optimizing when and how to query external knowledge sources. Yet, these RL-driven DeepSearch systems often reveal a see-saw trade-off between accuracy and efficiency: frequent tool invocations can improve factual correctness but lead to unnecessary computational overhead and diminished efficiency. To address this challenge, we propose LightSearcher, an efficient RL framework that incorporates textual experiential memory by learning contrastive reasoning trajectories to generate interpretable summaries of successful reasoning patterns. In addition, it employs an adaptive reward shaping mechanism that penalizes redundant tool calls only in correct-answer scenarios. This design effectively balances the inherent accuracy-efficiency trade-off in DeepSearch paradigms. Experiments on four multi-hop QA benchmarks show that LightSearcher maintains accuracy comparable to SOTA baseline ReSearch, while reducing search tool invocations by 39.6%, inference time by 48.6%, and token consumption by 21.2%, demonstrating its superior efficiency.

1 Introduction
Deep reasoning models have showcased remarkable capabilities across a wide range of tasks (measuring; deepseek-r1), yet they are inherently constrained by their parametric knowledge, struggling to access up-to-date information, domain-specific insights, or fact-intensive details critical for comprehensive and reliable responses (medical-research; llm-in-finance). As a core enabler for overcoming this limitation, DeepSearch paradigms have become indispensable in advancing large reasoning models' performance: by enabling models to invoke external search tools, they break through parametric boundaries to integrate external knowledge, thereby substantially enhancing the depth and factual reliability of reasoning.

To fully leverage the potential of DeepSearch, mainstream methodologies have explored Retrieval-Augmented Generation (RAG) techniques for integrating externally retrieved information into the reasoning pipeline (rag; rag-survey; kg-retriever; iter-retgen; emotional-rag; memoryos). Early paradigms relied on supervised learning with manually annotated reasoning chains (ircot; iter-retgen) to guide tool invocation and retrieval. However, these suffer from high annotation costs and poor generalization, as manually crafted chains cannot adapt to the diversity of real-world queries (research). Recent advances in reinforcement learning (RL) have mitigated these limitations by enabling models to autonomously and strategically regulate search tool utilization, while simultaneously learning optimal policies for determining when and how to query external knowledge sources (research; search-r1). Representative methods such as ReSearch (research) and Search-R1 (search-r1) have yielded substantial performance gains on multi-hop question answering benchmarks (nq; hotpotqa; musique; 2wiki).

However, these RL-driven DeepSearch approaches face a critical dilemma: a see-saw trade-off between accuracy and efficiency. As illustrated in Fig. 1, models often exhibit excessive and indiscriminate search tool calls, resorting to retrieval even for queries that can be adequately answered using their intrinsic parametric knowledge. This over-reliance stems from the limitations of standard RL reward function designs, which primarily prioritize answer correctness. To maximize accuracy, models tend to increase tool invocation frequency, leading to unnecessary computational overhead, elevated token consumption, and diminished reasoning efficiency. While recent studies have attempted to mitigate this issue with simple efficiency penalties (otc-po), such approaches often result in performance degradation, as scalar reward optimization fails to fundamentally balance the dual objectives of accuracy and efficiency.

To address this unmet challenge, we propose LightSearcher, an efficient RL framework tailored for DeepSearch paradigms. LightSearcher integrates textual experiential memory by learning contrastive reasoning trajectories, distilling interpretable summaries of successful tool-invocation and reasoning patterns. Furthermore, it incorporates an adaptive reward shaping mechanism that penalizes redundant tool calls exclusively in correct-answer scenarios, avoiding efficiency penalties when accuracy is not yet achieved. By fusing experiential memory guidance with adaptive reward optimization, LightSearcher effectively resolves the inherent accuracy-efficiency trade-off in DeepSearch. Our key contributions are summarized as follows:
• We propose LightSearcher, an efficient RL framework tailored for DeepSearch, which integrates contrastive experiential memory to deliver explicit and interpretable guidance for the optimization of autonomous search tool invocation.
• We design a novel adaptive reward shaping mechanism that dynamically balances accuracy and efficiency, penalizing redundant tool usage only when answers are correct.
• Comprehensive experiments on four multi-hop QA benchmarks demonstrate that LightSearcher reduces search tool invocations by 39.6% while maintaining comparable accuracy to the state-of-the-art baseline ReSearch, verifying its superiority in model efficiency.

Figure 1. Illustration of Excessive Search Tool Usage in existing DeepSearch systems across both Easy and Hard queries, leading to degraded efficiency in deep reasoning models.

7 Conclusion
LightSearcher advances search-enhanced reasoning by integrating contrastive experiential memory and adaptive reward shaping within an RL framework, enabling LLMs to strategically invoke tools while balancing accuracy and efficiency. By enabling models to learn experience from past reasoning trajectories, our method promotes more efficient and adaptive search tool-augmented AI systems, improving resource utilization across diverse tasks. Despite these advancements, LightSearcher is currently suited to controlled reasoning scenarios and demands additional computational resources during RL training, which may present challenges for broader scalability. Additionally, while validated on multi-hop QA tasks, future directions include extending its application to domains like code synthesis and strategic planning.
