

浏览全部资源
扫码关注微信
中国科学技术大学计算机科学与技术学院,安徽合肥 230027
Received:31 May 2025,
Accepted:28 October 2025,
Published:25 December 2025
移动端阅览
詹慧悠, 倪宏秋, 谈海生, 等. 边缘侧检索增强生成系统高效缓存算法研究[J]. 电子学报, 2025, 53(12): 4444-4459.
ZHAN Hui-you, NI Hong-qiu, TAN Hai-sheng, et al. Efficient Caching Algorithm for Retrieval Augmented Generation Systems on Edge Devices[J]. Acta Electronica Sinica, 2025, 53(12): 4444-4459.
詹慧悠, 倪宏秋, 谈海生, 等. 边缘侧检索增强生成系统高效缓存算法研究[J]. 电子学报, 2025, 53(12): 4444-4459. DOI:10.12263/DZXB.20250461
ZHAN Hui-you, NI Hong-qiu, TAN Hai-sheng, et al. Efficient Caching Algorithm for Retrieval Augmented Generation Systems on Edge Devices[J]. Acta Electronica Sinica, 2025, 53(12): 4444-4459. DOI:10.12263/DZXB.20250461
随着大语言模型在边缘智能场景中的部署需求日益增长,检索增强生成(Retrieval-Augmented Generation, RAG)因其可降低模型依赖、提升领域知识覆盖与隐私保障能力而成为边缘侧部署的重要范式.然而,RAG系统在资源受限的边缘设备上仍面临显著挑战:大规模高维嵌入向量索引难以全量载入内存,频繁的缓存置换与低速存储访问导致检索延迟显著上升.当前嵌入向量获取主要依赖磁盘加载、在线生成和内存缓存三种路径,三者在延迟、计算开销与资源占用方面差异显著,缺乏统一高效的调度机制.针对上述挑战,本文提出一种专为资源受限边缘RAG系统定制的高效在线缓存算法——BP-Cache.该算法的核心创新在于引入了多路径访问代价建模与动态分级缓存管理机制,旨在通过细粒度的资源调度解决边缘侧的性能冲突.首先,本文通过分析揭示了边缘环境下嵌入向量访问的两个关键特性:一是向量生成代价呈现显著的长尾分布,大尺寸向量簇的在线计算延迟远超磁盘加载;二是访问模式存在极强的局部性与稀疏性,超过60%的向量在生命周期内仅被单次访问.基于此观测,BP-Cache引入轻量级准入过滤器机制,利用小缓存区作为“观察窗”暂存新到达向量,实现对低价值访问请求的快速旁路,有效遏制了缓存污染问题.同时,算法构建了代价-大小联合评分模型,将向量簇的在线生成计算代价、磁盘I/O读取延迟及其内存占用大小纳入统一评价体系,在缓存空间不足时,动态优先保留单位内存效益最高的向量簇,从而在无需预知未来访问序列的前提下逼近最优离线策略的性能.本文在基于NVIDIA Jetson AGX Orin的真实边缘平台上部署了该系统,并基于BEIR基准中的多个数据集开展了广泛的对比实验.实验结果表明,与EdgeRAG等当前最先进的边缘RAG方案相比,BP-Cache在多组数据集上平均降低检索延迟约29%,提升缓存命中率约21%,并显著优化了系统的长尾延迟表现.进一步的参数敏感性实验证实,该算法在不同缓存容量配置、小缓存配比及向量簇规模下均表现出优异的鲁棒性与适应性.
With the growing demand for deploying large language models (LLMs) in edge intelligence scenarios
retrieval-augmented generation (RAG) has emerged as a pivotal paradigm for edge-side deployment due to its ability to reduce dependency on large models while enhancing domain-specific knowledge coverage and privacy protection. However
RAG systems still face significant challenges on resource-constrained edge devices: large-scale
high-dimensional embedding vector indexes cannot be fully loaded into memory
and frequent cache evictions coupled with slow storage accesses lead to substantially increased retrieval latency. Currently
embedding vectors are primarily acquired through three approaches—disk loading
online generation
and in-memory caching—which differ significantly in terms of latency
computational overhead
and resource consumption
and lack a unified
efficient scheduling mechanism. To address these challenges
this paper proposes BP-Cache
an efficient online caching algorithm specifically tailored for resource-constrained edge RAG systems. The core innovation of BP-Cache lies in its multi-path access cost modeling and dynamic hierarchical cache management mechanism
designed to resolve performance conflicts at the edge through fine-grained resource scheduling. First
we uncover two key characteristics of embedding vector access patterns in edge environments through empirical analysis: (1) the cost of online vector generation exhibits a pronounced long-tail distribution
where the computational latency for large vector clusters far exceeds that of disk loading; and (2) access patterns show strong locality and sparsity
with over 60% of vectors accessed only once throughout their lifetime. Based on these observations
BP-Cache introduces a lightweight admission filtering mechanism that uses a small buffer cache as an “observation window” to temporarily hold newly arriving vectors
enabling rapid bypass of low-value access requests and effectively mitigating cache pollution. Simultaneously
the algorithm constructs a joint cost–size scoring model that integrates the online generation cost
disk I/O latency
and memory footprint of each vector cluster into a unified evaluation framework. When cache capacity is insufficient
BP-Cache dynamically prioritizes retaining vector clusters that deliver the highest utility per unit of memory
thereby approaching the performance of an optimal offline strategy—without requiring any knowledge of future access sequences. We implement and evaluate our system on a real-world edge platform based on the NVIDIA Jetson AGX Orin
conducting extensive experiments across multiple datasets from the BEIR benchmark suite. Results show that
compared to state-of-the-art edge RAG solutions such as EdgeRAG
BP-Cache reduces average retrieval latency by approximately 29% and improves cache hit rate by about 21% across multiple datasets
while significantly optimizing tail latency performance. Further sensitivity analyses confirm that the algorithm demonstrates excellent robustness and adaptability under varying cache capacities
small-buffer ratios
and vector cluster granularities.
GUO S T , LIU J D , YANG Y Y , et al . Energy-efficient dynamic computation offloading and cooperative task scheduling in mobile cloud computing [J ] . IEEE Transactions on Mobile Computing , 2019 , 18 ( 2 ): 319 - 333 .
WANG T , LIANG Y Z , SHEN X W , et al . Edge computing and sensor-cloud: Overview, solutions, and directions [J ] . ACM Computing Surveys , 2023 , 55 ( 13 s): 1 - 37 .
JIN Y Y , ZHONG R X , LONG S Q , et al . Efficient inference for pruned CNN models on mobile devices with holistic sparsity alignment [J ] . IEEE Transactions on Parallel and Distributed Systems , 2024 , 35 ( 11 ): 2208 - 2223 .
FAN W Q , DING Y J , NING L B , et al . A survey on RAG meeting LLMs: Towards retrieval-augmented large language models [C ] // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining . New York : ACM , 2024 : 6491 - 6501 .
LEWIS P , PEREZ E , PIKTUS A , et al . Retrieval-augmented generation for knowledge-intensive NLPtasks [EB/OL ] . ( 2021-04-12 )[ 2025-10-01 ] . https: //arXiv.org/abs/2005.11401 https://arXiv.org/abs/2005.11401 .
YU H , GAN A R , ZHANG K , et al . Evaluation of retrieval-augmented generation: A survey [C ] // CCF Conference on Big Data . Singapore : Springer Nature Singapore , 2024 : 102 - 120 .
NING H S , LI Y F , SHI F F , et al . Heterogeneous edge computing open platforms and tools for Internet of Things [J ] . Future Generation Computer Systems , 2020 , 106 : 67 - 76 .
XU M W , FU Z , MA X , et al . From cloud to edge: A first look at public edge platforms [C ] // Proceedings of the 21st ACM Internet Measurement Conference . New York : ACM , 2021 : 37 - 53 .
KIM C , LEE C . Design of eMMC controller with multiple channels [C ] // 2016 International SoC Design Conference . Piscataway : IEEE , 2016 : 317 - 318 .
XU Q M , SIYAMWALA H , GHOSH M , et al . Performance analysis of NVMe SSDs and their implication on real world databases [C ] // Proceedings of the 8th ACM International Systems and Storage Conference . New York : ACM , 2015 : 2757684 .
SEEMAKHUPT K , LIU S H , KHAN S . EdgeRAG: Online-indexed RAG for edge devices [EB/OL ] . ( 2024-12-31 )[ 2025-10-01 ] . https://arXiv.org/abs/2412.21023 https://arXiv.org/abs/2412.21023 .
SIVIC , ZISSERMAN . Video Google: A text retrieval approach to object matching in videos [C ] // Proceedings of Ninth IEEE International Conference on Computer Vision . Piscataway : IEEE , 2008 : 1470 - 1477 .
SALTON G , WONG A , YANG C S . A vector space model for automatic indexing [J ] . Communications of the ACM , 1975 , 18 ( 11 ): 613 - 620 .
GITHUB . GitHub-facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors [EB/OL ] . ( 2025-10-15 )[ 2025-10-20 ] . https://github.com/huggingface/diffusers?tab=contributing-ov-file https://github.com/huggingface/diffusers?tab=contributing-ov-file .
MALKOV Y A , YASHUNIN D A . Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs [J ] . IEEE Transactions on Pattern Analysis and Machine Intelligence , 2020 , 42 ( 4 ): 824 - 836 .
HAN K , XIAO A , WU E , et al . Transformer in transformer [EB/OL ] . ( 2021-10-26 )[ 2025-10-01 ] . https://arxiv.org/abs/2103. 00112 https://arxiv.org/abs/2103.00112 .
RAFFEL C , SHAZEER N , ROBERTS A , et al . Exploring the limits of transfer learning with a unified text-to-text transformer [J ] . Journal of Machine Learning Research , 2020 , 21 ( 1 ): 5485 - 5551 .
BROWN T B , MANN B , RYDER N , et al . Language models are few-shot learners [C ] // Proceedings of the 34th International Conference on Neural Information Processing Systems . New York : ACM , 2020 : 1877 - 1901 .
WEAVIATE . Weaviate Database [EB/OL ] . [ 2025-10-01 ] . https://weaviate.io/developers/academy/py/vector_index/flat https://weaviate.io/developers/academy/py/vector_index/flat .
ZHANG Y , JIN R , ZHOU Z H . Understanding bag-of-words model: A statistical framework [J ] . International Journal of Machine Learning and Cybernetics , 2010 , 1 ( 1 ): 43 - 52 .
YAN H , DING S , SUEL T . Inverted index compression and query processing with optimized document ordering [C ] // Proceedings of the 18th International Conference on World Wide Web . New York : ACM , 2009 : 401 - 410 .
ROBERTSON S , ZARAGOZA H . The probabilistic relevance framework: BM25 and beyond [J ] . Foundations and Trends in Information Retrieval , 2009 , 3 ( 4 ): 333 - 389 .
KARPUKHIN V , OGUZ B , MIN S , et al . Dense passage retrieval for open-domain question answering [EB/OL ] . ( 2020-09-30 )[ 2025-10-10 ] . https://arxiv.org/abs/2004.04906 https://arxiv.org/abs/2004.04906 .
FAN T Y , WANG J Y , REN X B , et al . MiniRAG: Towards extremely simple retrieval-augmented generation [EB/OL ] . ( 2025-01-26 )[ 2025-10-10 ] . https://arXiv.org/abs/2501.06713 https://arXiv.org/abs/2501.06713 .
LIU G Y , LIU Y Q , WANG J C , et al . Adaptive contextual caching for mobile edge large language model service [EB/OL ] . ( 2025-01-16 )[ 2025-10-02 ] . https://arXiv.org/abs/2501.09383 https://arXiv.org/abs/2501.09383 .
RAY S , PAN R , GU Z , et al . METIS: Fast quality-aware RAG systems with configuration adaptation [C ] // Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles . NewYork : ACM , 2025 : 606 - 622 .
WANG Y C , LIU S , LI Z F , et al . LEANN: A low-storage vector index [EB/OL ] . ( 2025-06-09 )[ 2025-10-10 ] . https://arXiv.org/abs/2506.08276 https://arXiv.org/abs/2506.08276 .
PARK T , LEE G , KIM M S . MobileRAG: A fast, memory-efficient, and energy-efficient method for on-device RAG [EB/OL ] . ( 2025-07-01 )[ 2025-10-10 ] . https://arXiv.org/abs/2507.01079 https://arXiv.org/abs/2507.01079 .
SLEATOR D D , TARJAN R E . Amortized efficiency of list update and paging rules [J ] . Communications of the ACM , 1985 , 28 ( 2 ): 202 - 208 .
LEE D , CHOI J , KIM J H , et al . LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies [J ] . IEEE Transactions on Computers , 2001 , 50 ( 12 ): 1352 - 1361 .
JOHNSON T , SHASHA D . 2 Q: A low overhead high performance buffer management replacement algorithm[C ] // Proceedings of the 20th International Conference on Very Large Data Bases . San Francisco : Morgan Kaufmann Publishers Inc , 1994 : 439 - 450 .
MEGIDDO N , MODHA D S . ARC: A self-tuning, low overhead replacement cache [C ] // 2nd USENIX Conference on File and Storage Technologies (FAST 03) . California : USENIX Association , 2003 : 115 - 130 .
YANG J C , ZHANG Y Z , QIU Z Y , et al . FIFO queues are all you need for cache eviction [C ] // Proceedings of the 29th Symposium on Operating Systems Principles . New York : ACM , 2023 : 130 - 149 .
TAN H S , JIANG S H , HAN Z H , et al . Camul: Online caching on multiple caches with relaying and bypassing [C ] // IEEE INFOCOM 2019 - IEEE Conference on Computer Communications . Piscataway : IEEE , 2019 : 244 - 252 .
HU X Y , RAMADAN E , YE W , et al . Raven: Belady-guided, predictive (deep) learning for in-memory and content caching [C ] // Proceedings of the 18th International Conference on Emerging Networking Experiments and Technologies . New York : ACM , 2022 : 72 - 90 .
ZHOU W B , NIU Z X , XIONG Y Q , et al . 3L-Cache: Low overhead and precise learning-based eviction policy for caches [C ] // 23rd USENIX Conference on File and Storage Technologies (FAST 25) . California : USENIX Association , 2025 : 237 - 254 .
RODRIGUEZ L V , YUSUF F B , LYONS S , et al . Learning cache replacement with CACHEUS [C ] // 19th USENIX Conference on File and Storage Technologies . California : USENIX Association , 2021 : 341 - 354 .
CHEN J Y , SHARMA N , KHAN T , et al . Darwin: Flexible learning-based CDN caching [C ] // Proceedings of the ACM SIGCOMM 2023 Conference . New York : ACM , 2023 : 981 - 999 .
王玉庆 , 杨秋松 , 李明树 . 基于指令流混合模式学习的缓存预取算法 [J ] . 电子学报 , 2023 , 51 ( 2 ): 342 - 354 .
WANG Y Q , YANG Q S , LI M S . A cache prefetching mechanism based on hybrid pattern learning of instruction flow [J ] . Acta Electronica Sinica , 2023 , 51 ( 2 ): 342 - 354 . (in Chinese)
ZHANG Z Y , SHENG Y , ZHOU T Y , et al . H 2 O: Heavy-hitter oracle for efficient generative inference of large language models [EB/OL ] . ( 2023-12-18 )[ 2025-10-01 ] . https://arXiv.org/abs/2306.14048 https://arXiv.org/abs/2306.14048 .
WANG J H , HAN J B , WEI X D , et al . KVCachecache in the wild: Characterizing and optimizing KVCache cache at a large cloud provider [EB/OL ] . ( 2025-07-23 )[ 2025-10-01 ] . https://arXiv.org/abs/2506.02634 https://arXiv.org/abs/2506.02634 .
GAO B , HE Z M , SHARMA P , et al . Cost-efficient large language model serving for multi-turn conversations with Cached Attention [C ] // 2024 USENIX Annual Technical Conference . California : USENIX Association , 2024 : 111 - 126 .
YAO J Y , LI H C , LIU Y H , et al . CacheBlend: Fast large language model serving for RAG with cached knowledge fusion [C ] // Proceedings of the Twentieth European Conference on Computer Systems . New York : ACM , 2025 : 94 - 109 .
GIM I , CHEN G J , LEE SS , et al . Prompt cache: Modular attention reuse for low-latency inference [EB/OL ] . ( 2024-04-25 )[ 2025-10-10 ] . https://arXiv.org/abs/2311.04934 https://arXiv.org/abs/2311.04934 .
LIU G D , LI C W , ZHAO J R , et al . ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression [C ] // 2025 62nd ACM/IEEE Design Automation Conference . Piscataway : IEEE , 2025 : 1 - 7 .
ADAMASZEK A , CZUMAJ A , ENGLERT M , et al . An O (log k )-competitive algorithm for generalized caching [J ] . ACM Transactions on Algorithms , 2018 , 15 ( 1 ): 1 - 18 .
EPSTEIN L , IMREH C , LEVIN A , et al . Online file caching with rejection penalties [J ] . Algorithmica , 2015 , 71 ( 2 ): 279 - 306 .
0
Views
12
下载量
0
CSCD
Publicity Resources
Related Articles
Related Author
Related Institution
京公网安备11010802024621