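1. School of Electronic Science and Engineering, Nanjing University, Nanjing, Jiangsu 210023, China
2. Nanjing Fengyu Intelligent Information Technology Co., Ltd. (南京风语智能信息技术有限公司), Nanjing, Jiangsu 210000, China
3. School of History, Nanjing University, Nanjing, Jiangsu 210023, China
4. School of Integrated Circuits, Sun Yat-sen University, Shenzhen, Guangdong 518107, China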
Received: 12 May 2025
Accepted: 24 February 2026
Published: 25 February 2026
LU Siyuan, YEERPAN Tuohetiyaer, ZHU Yuye, et al. RGE-Pipeline: Fast Evaluation for LLM-Based Retrieval-Augmented Generation Systems[J]. Acta Electronica Sinica, 2026, 54(02): 750-764. DOI:10.12263/DZXB.20250372
Retrieval-augmented generation (RAG), which couples large language models (LLMs) with retrieval systems, has become a mainstream approach to LLM deployment because it is traceable, interpretable, and cheap to update with new knowledge. Before deployment, however, a RAG system must be rigorously evaluated: developers build a large-scale knowledge base and run comprehensive tests with thousands of queries, which involves intensive retrieval and repeated LLM calls. Evaluation is therefore extremely time-consuming and severely slows the development and iteration of enterprise-level AI systems. To address this bottleneck, we propose RGE-Pipeline (Retriever-Generator-Evaluator Pipeline), a high-throughput, scalable framework for fast evaluation of LLM-based RAG algorithms. We first run preliminary experiments that quantitatively profile typical RAG systems and identify the retriever as the initial bottleneck. Replacing the traditional BM25 algorithm with BM25S reduces the share of time spent on retrieval to below 4%, shifting the bottleneck to the generation and evaluation stages. Building on this, RGE-Pipeline applies system-level optimization at three levels: (1) it decouples the evaluation workflow into three modules (retriever, generator, and evaluator) and introduces a pipeline-parallel architecture that eliminates serial waiting and the overhead of repeated model loading; (2) it provides a fine-grained hardware resource management scheme built on the vLLM inference framework, which supports concurrent deployment of multiple LLM instances on the same set of GPUs; and (3) it constructs a mathematical model that relates the GPU memory allocation ratio between the generator and the evaluator to overall system throughput, and proposes three GPU resource allocation strategies (shared VRAM, full-card allocation, and hybrid partitioning) that maximize throughput by balancing the computational load of the two stages. Experiments on the CRUD-RAG dataset, which covers text continuation, summarization, multi-document question answering, and hallucination modification with 6,400 queries in total, show a substantial speedup. With BM25S fixed as the retriever and Qwen2.5-7B used as both generator and evaluator, the hybrid partitioning scheme cuts total evaluation time to 1.3 hours, a 71.7× speedup over the original serial workflow (approximately 95 hours) and an 8.2× speedup over the model-preloading workflow (approximately 10.8 hours). Extensibility experiments further show that RGE-Pipeline adapts well to knowledge bases of different sizes (18 KB to 60 MB) and to small-scale query sets. In summary, RGE-Pipeline substantially reduces the cost of validating RAG algorithms and offers a reference design for system optimization in multi-LLM parallel inference scenarios.
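To make the pipeline-parallel idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each of the three stages runs as a long-lived worker connected to the next by a bounded queue, so a later stage can start on early queries while an earlier stage is still processing later ones. The functions `retrieve`, `generate`, and `evaluate` are hypothetical stand-ins; in a real system they would wrap a BM25S index and LLM instances that are loaded once and kept resident.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the query stream


def retrieve(query):
    # Hypothetical stand-in for a lexical retrieval call.
    return {"query": query, "context": f"docs for {query}"}


def generate(item):
    # Hypothetical stand-in for a generator LLM call.
    return {**item, "answer": f"answer to {item['query']}"}


def evaluate(item):
    # Hypothetical stand-in for an evaluator LLM call.
    return {**item, "score": len(item["answer"])}


def stage(work, inbox, outbox):
    """Apply `work` to every item from `inbox` until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(queries, maxsize=8):
    """Run retrieve -> generate -> evaluate as three concurrent stages."""
    q_in = queue.Queue(maxsize)   # bounded queues apply back-pressure
    q_ret = queue.Queue(maxsize)
    q_gen = queue.Queue(maxsize)
    q_out = queue.Queue()         # unbounded, drained after the workers finish

    workers = [
        threading.Thread(target=stage, args=(retrieve, q_in, q_ret)),
        threading.Thread(target=stage, args=(generate, q_ret, q_gen)),
        threading.Thread(target=stage, args=(evaluate, q_gen, q_out)),
    ]
    for worker in workers:
        worker.start()

    for query in queries:
        q_in.put(query)           # blocks while the first stage is saturated
    q_in.put(SENTINEL)

    for worker in workers:
        worker.join()

    results = []
    while True:
        item = q_out.get()
        if item is SENTINEL:
            break
        results.append(item)
    return results


if __name__ == "__main__":
    for row in run_pipeline(["q0", "q1", "q2"]):
        print(row["query"], row["score"])
```

Because each stage here is a single worker reading from a FIFO queue, results come out in query order. The sketch only illustrates the overlap between stages; the GPU memory split between generator and evaluator described in the abstract is a separate concern that it does not model.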