Efficient Caching Algorithm for Retrieval Augmented Generation Systems on Edge Devices

ZHAN Hui-you; NI Hong-qiu; TAN Hai-sheng; WANG Tian-zhu; LI Xiang-yang

doi:10.12263/DZXB.20250461

您当前的位置：

首页 >

文章列表页 >

Efficient Caching Algorithm for Retrieval Augmented Generation Systems on Edge Devices

PAPERS | 更新时间：2026-04-24

- Efficient Caching Algorithm for Retrieval Augmented Generation Systems on Edge Devices
- ACTA ELECTRONICA SINICA Vol. 53, Issue 12, Pages: 4444-4459(2025)
- 作者机构：
  
  中国科学技术大学计算机科学与技术学院，安徽合肥 230027
- 作者简介：
- 基金信息：
  
  National Natural Science Foundation of China(62132009)
- DOI：10.12263/DZXB.20250461
  CLC： TP303;TP393
- Received：31 May 2025，
  
  Accepted：28 October 2025，
  
  Published：25 December 2025
- 稿件说明：
移动端阅览
詹慧悠, 倪宏秋, 谈海生, 等. 边缘侧检索增强生成系统高效缓存算法研究[J]. 电子学报, 2025, 53(12): 4444-4459.

ZHAN Hui-you, NI Hong-qiu, TAN Hai-sheng, et al. Efficient Caching Algorithm for Retrieval Augmented Generation Systems on Edge Devices[J]. Acta Electronica Sinica, 2025, 53(12): 4444-4459.
詹慧悠, 倪宏秋, 谈海生, 等. 边缘侧检索增强生成系统高效缓存算法研究[J]. 电子学报, 2025, 53(12): 4444-4459. DOI：10.12263/DZXB.20250461

ZHAN Hui-you, NI Hong-qiu, TAN Hai-sheng, et al. Efficient Caching Algorithm for Retrieval Augmented Generation Systems on Edge Devices[J]. Acta Electronica Sinica, 2025, 53(12): 4444-4459. DOI：10.12263/DZXB.20250461

摘要

随着大语言模型在边缘智能场景中的部署需求日益增长，检索增强生成（Retrieval-Augmented Generation， RAG）因其可降低模型依赖、提升领域知识覆盖与隐私保障能力而成为边缘侧部署的重要范式.然而，RAG系统在资源受限的边缘设备上仍面临显著挑战：大规模高维嵌入向量索引难以全量载入内存，频繁的缓存置换与低速存储访问导致检索延迟显著上升.当前嵌入向量获取主要依赖磁盘加载、在线生成和内存缓存三种路径，三者在延迟、计算开销与资源占用方面差异显著，缺乏统一高效的调度机制.针对上述挑战，本文提出一种专为资源受限边缘RAG系统定制的高效在线缓存算法——BP-Cache.该算法的核心创新在于引入了多路径访问代价建模与动态分级缓存管理机制，旨在通过细粒度的资源调度解决边缘侧的性能冲突.首先，本文通过分析揭示了边缘环境下嵌入向量访问的两个关键特性：一是向量生成代价呈现显著的长尾分布，大尺寸向量簇的在线计算延迟远超磁盘加载；二是访问模式存在极强的局部性与稀疏性，超过60%的向量在生命周期内仅被单次访问.基于此观测，BP-Cache引入轻量级准入过滤器机制，利用小缓存区作为“观察窗”暂存新到达向量，实现对低价值访问请求的快速旁路，有效遏制了缓存污染问题.同时，算法构建了代价-大小联合评分模型，将向量簇的在线生成计算代价、磁盘I/O读取延迟及其内存占用大小纳入统一评价体系，在缓存空间不足时，动态优先保留单位内存效益最高的向量簇，从而在无需预知未来访问序列的前提下逼近最优离线策略的性能.本文在基于NVIDIA Jetson AGX Orin的真实边缘平台上部署了该系统，并基于BEIR基准中的多个数据集开展了广泛的对比实验.实验结果表明，与EdgeRAG等当前最先进的边缘RAG方案相比，BP-Cache在多组数据集上平均降低检索延迟约29%，提升缓存命中率约21%，并显著优化了系统的长尾延迟表现.进一步的参数敏感性实验证实，该算法在不同缓存容量配置、小缓存配比及向量簇规模下均表现出优异的鲁棒性与适应性.

Abstract

With the growing demand for deploying large language models (LLMs) in edge intelligence scenarios

retrieval-augmented generation (RAG) has emerged as a pivotal paradigm for edge-side deployment due to its ability to reduce dependency on large models while enhancing domain-specific knowledge coverage and privacy protection. However

RAG systems still face significant challenges on resource-constrained edge devices: large-scale

high-dimensional embedding vector indexes cannot be fully loaded into memory

and frequent cache evictions coupled with slow storage accesses lead to substantially increased retrieval latency. Currently

embedding vectors are primarily acquired through three approaches—disk loading

online generation

and in-memory caching—which differ significantly in terms of latency

computational overhead

and resource consumption

and lack a unified

efficient scheduling mechanism. To address these challenges

this paper proposes BP-Cache

an efficient online caching algorithm specifically tailored for resource-constrained edge RAG systems. The core innovation of BP-Cache lies in its multi-path access cost modeling and dynamic hierarchical cache management mechanism

designed to resolve performance conflicts at the edge through fine-grained resource scheduling. First

we uncover two key characteristics of embedding vector access patterns in edge environments through empirical analysis: (1) the cost of online vector generation exhibits a pronounced long-tail distribution

where the computational latency for large vector clusters far exceeds that of disk loading; and (2) access patterns show strong locality and sparsity

with over 60% of vectors accessed only once throughout their lifetime. Based on these observations

BP-Cache introduces a lightweight admission filtering mechanism that uses a small buffer cache as an “observation window” to temporarily hold newly arriving vectors

enabling rapid bypass of low-value access requests and effectively mitigating cache pollution. Simultaneously

the algorithm constructs a joint cost–size scoring model that integrates the online generation cost

disk I/O latency

and memory footprint of each vector cluster into a unified evaluation framework. When cache capacity is insufficient

BP-Cache dynamically prioritizes retaining vector clusters that deliver the highest utility per unit of memory

thereby approaching the performance of an optimal offline strategy—without requiring any knowledge of future access sequences. We implement and evaluate our system on a real-world edge platform based on the NVIDIA Jetson AGX Orin

conducting extensive experiments across multiple datasets from the BEIR benchmark suite. Results show that

compared to state-of-the-art edge RAG solutions such as EdgeRAG

BP-Cache reduces average retrieval latency by approximately 29% and improves cache hit rate by about 21% across multiple datasets

while significantly optimizing tail latency performance. Further sensitivity analyses confirm that the algorithm demonstrates excellent robustness and adaptability under varying cache capacities

small-buffer ratios

and vector cluster granularities.

关键词

Keywords

references

GUO S T , LIU J D , YANG Y Y , et al . Energy-efficient dynamic computation offloading and cooperative task scheduling in mobile cloud computing [J ] . IEEE Transactions on Mobile Computing , 2019 , 18 ( 2 ): 319 - 333 .

WANG T , LIANG Y Z , SHEN X W , et al . Edge computing and sensor-cloud: Overview, solutions, and directions [J ] . ACM Computing Surveys , 2023 , 55 ( 13 s): 1 - 37 .

JIN Y Y , ZHONG R X , LONG S Q , et al . Efficient inference for pruned CNN models on mobile devices with holistic sparsity alignment [J ] . IEEE Transactions on Parallel and Distributed Systems , 2024 , 35 ( 11 ): 2208 - 2223 .

FAN W Q , DING Y J , NING L B , et al . A survey on RAG meeting LLMs: Towards retrieval-augmented large language models [C ] // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining . New York : ACM , 2024 : 6491 - 6501 .

LEWIS P , PEREZ E , PIKTUS A , et al . Retrieval-augmented generation for knowledge-intensive NLPtasks [EB/OL ] . ( 2021-04-12 )[ 2025-10-01 ] . https: //arXiv.org/abs/2005.11401 https://arXiv.org/abs/2005.11401 .

YU H , GAN A R , ZHANG K , et al . Evaluation of retrieval-augmented generation: A survey [C ] // CCF Conference on Big Data . Singapore : Springer Nature Singapore , 2024 : 102 - 120 .

NING H S , LI Y F , SHI F F , et al . Heterogeneous edge computing open platforms and tools for Internet of Things [J ] . Future Generation Computer Systems , 2020 , 106 : 67 - 76 .

XU M W , FU Z , MA X , et al . From cloud to edge: A first look at public edge platforms [C ] // Proceedings of the 21st ACM Internet Measurement Conference . New York : ACM , 2021 : 37 - 53 .

KIM C , LEE C . Design of eMMC controller with multiple channels [C ] // 2016 International SoC Design Conference . Piscataway : IEEE , 2016 : 317 - 318 .

XU Q M , SIYAMWALA H , GHOSH M , et al . Performance analysis of NVMe SSDs and their implication on real world databases [C ] // Proceedings of the 8th ACM International Systems and Storage Conference . New York : ACM , 2015 : 2757684 .

SEEMAKHUPT K , LIU S H , KHAN S . EdgeRAG: Online-indexed RAG for edge devices [EB/OL ] . ( 2024-12-31 )[ 2025-10-01 ] . https://arXiv.org/abs/2412.21023 https://arXiv.org/abs/2412.21023 .

SIVIC , ZISSERMAN . Video Google: A text retrieval approach to object matching in videos [C ] // Proceedings of Ninth IEEE International Conference on Computer Vision . Piscataway : IEEE , 2008 : 1470 - 1477 .

SALTON G , WONG A , YANG C S . A vector space model for automatic indexing [J ] . Communications of the ACM , 1975 , 18 ( 11 ): 613 - 620 .

GITHUB . GitHub-facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors [EB/OL ] . ( 2025-10-15 )[ 2025-10-20 ] . https://github.com/huggingface/diffusers?tab=contributing-ov-file https://github.com/huggingface/diffusers?tab=contributing-ov-file .

MALKOV Y A , YASHUNIN D A . Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs [J ] . IEEE Transactions on Pattern Analysis and Machine Intelligence , 2020 , 42 ( 4 ): 824 - 836 .

HAN K , XIAO A , WU E , et al . Transformer in transformer [EB/OL ] . ( 2021-10-26 )[ 2025-10-01 ] . https://arxiv.org/abs/2103. 00112 https://arxiv.org/abs/2103.00112 .

RAFFEL C , SHAZEER N , ROBERTS A , et al . Exploring the limits of transfer learning with a unified text-to-text transformer [J ] . Journal of Machine Learning Research , 2020 , 21 ( 1 ): 5485 - 5551 .

BROWN T B , MANN B , RYDER N , et al . Language models are few-shot learners [C ] // Proceedings of the 34th International Conference on Neural Information Processing Systems . New York : ACM , 2020 : 1877 - 1901 .

WEAVIATE . Weaviate Database [EB/OL ] . [ 2025-10-01 ] . https://weaviate.io/developers/academy/py/vector_index/flat https://weaviate.io/developers/academy/py/vector_index/flat .

ZHANG Y , JIN R , ZHOU Z H . Understanding bag-of-words model: A statistical framework [J ] . International Journal of Machine Learning and Cybernetics , 2010 , 1 ( 1 ): 43 - 52 .

YAN H , DING S , SUEL T . Inverted index compression and query processing with optimized document ordering [C ] // Proceedings of the 18th International Conference on World Wide Web . New York : ACM , 2009 : 401 - 410 .

ROBERTSON S , ZARAGOZA H . The probabilistic relevance framework: BM25 and beyond [J ] . Foundations and Trends in Information Retrieval , 2009 , 3 ( 4 ): 333 - 389 .

KARPUKHIN V , OGUZ B , MIN S , et al . Dense passage retrieval for open-domain question answering [EB/OL ] . ( 2020-09-30 )[ 2025-10-10 ] . https://arxiv.org/abs/2004.04906 https://arxiv.org/abs/2004.04906 .

FAN T Y , WANG J Y , REN X B , et al . MiniRAG: Towards extremely simple retrieval-augmented generation [EB/OL ] . ( 2025-01-26 )[ 2025-10-10 ] . https://arXiv.org/abs/2501.06713 https://arXiv.org/abs/2501.06713 .

LIU G Y , LIU Y Q , WANG J C , et al . Adaptive contextual caching for mobile edge large language model service [EB/OL ] . ( 2025-01-16 )[ 2025-10-02 ] . https://arXiv.org/abs/2501.09383 https://arXiv.org/abs/2501.09383 .

RAY S , PAN R , GU Z , et al . METIS: Fast quality-aware RAG systems with configuration adaptation [C ] // Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles . NewYork : ACM , 2025 : 606 - 622 .

WANG Y C , LIU S , LI Z F , et al . LEANN: A low-storage vector index [EB/OL ] . ( 2025-06-09 )[ 2025-10-10 ] . https://arXiv.org/abs/2506.08276 https://arXiv.org/abs/2506.08276 .

PARK T , LEE G , KIM M S . MobileRAG: A fast, memory-efficient, and energy-efficient method for on-device RAG [EB/OL ] . ( 2025-07-01 )[ 2025-10-10 ] . https://arXiv.org/abs/2507.01079 https://arXiv.org/abs/2507.01079 .

SLEATOR D D , TARJAN R E . Amortized efficiency of list update and paging rules [J ] . Communications of the ACM , 1985 , 28 ( 2 ): 202 - 208 .

LEE D , CHOI J , KIM J H , et al . LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies [J ] . IEEE Transactions on Computers , 2001 , 50 ( 12 ): 1352 - 1361 .

JOHNSON T , SHASHA D . 2 Q: A low overhead high performance buffer management replacement algorithm[C ] // Proceedings of the 20th International Conference on Very Large Data Bases . San Francisco : Morgan Kaufmann Publishers Inc , 1994 : 439 - 450 .

MEGIDDO N , MODHA D S . ARC: A self-tuning, low overhead replacement cache [C ] // 2nd USENIX Conference on File and Storage Technologies (FAST 03) . California : USENIX Association , 2003 : 115 - 130 .

YANG J C , ZHANG Y Z , QIU Z Y , et al . FIFO queues are all you need for cache eviction [C ] // Proceedings of the 29th Symposium on Operating Systems Principles . New York : ACM , 2023 : 130 - 149 .

TAN H S , JIANG S H , HAN Z H , et al . Camul: Online caching on multiple caches with relaying and bypassing [C ] // IEEE INFOCOM 2019 - IEEE Conference on Computer Communications . Piscataway : IEEE , 2019 : 244 - 252 .

HU X Y , RAMADAN E , YE W , et al . Raven: Belady-guided, predictive (deep) learning for in-memory and content caching [C ] // Proceedings of the 18th International Conference on Emerging Networking Experiments and Technologies . New York : ACM , 2022 : 72 - 90 .

ZHOU W B , NIU Z X , XIONG Y Q , et al . 3L-Cache: Low overhead and precise learning-based eviction policy for caches [C ] // 23rd USENIX Conference on File and Storage Technologies (FAST 25) . California : USENIX Association , 2025 : 237 - 254 .

RODRIGUEZ L V , YUSUF F B , LYONS S , et al . Learning cache replacement with CACHEUS [C ] // 19th USENIX Conference on File and Storage Technologies . California : USENIX Association , 2021 : 341 - 354 .

CHEN J Y , SHARMA N , KHAN T , et al . Darwin: Flexible learning-based CDN caching [C ] // Proceedings of the ACM SIGCOMM 2023 Conference . New York : ACM , 2023 : 981 - 999 .

王玉庆 , 杨秋松 , 李明树 . 基于指令流混合模式学习的缓存预取算法 [J ] . 电子学报 , 2023 , 51 ( 2 ): 342 - 354 .

WANG Y Q , YANG Q S , LI M S . A cache prefetching mechanism based on hybrid pattern learning of instruction flow [J ] . Acta Electronica Sinica , 2023 , 51 ( 2 ): 342 - 354 . (in Chinese)

ZHANG Z Y , SHENG Y , ZHOU T Y , et al . H 2 O: Heavy-hitter oracle for efficient generative inference of large language models [EB/OL ] . ( 2023-12-18 )[ 2025-10-01 ] . https://arXiv.org/abs/2306.14048 https://arXiv.org/abs/2306.14048 .

WANG J H , HAN J B , WEI X D , et al . KVCachecache in the wild: Characterizing and optimizing KVCache cache at a large cloud provider [EB/OL ] . ( 2025-07-23 )[ 2025-10-01 ] . https://arXiv.org/abs/2506.02634 https://arXiv.org/abs/2506.02634 .

GAO B , HE Z M , SHARMA P , et al . Cost-efficient large language model serving for multi-turn conversations with Cached Attention [C ] // 2024 USENIX Annual Technical Conference . California : USENIX Association , 2024 : 111 - 126 .

YAO J Y , LI H C , LIU Y H , et al . CacheBlend: Fast large language model serving for RAG with cached knowledge fusion [C ] // Proceedings of the Twentieth European Conference on Computer Systems . New York : ACM , 2025 : 94 - 109 .

GIM I , CHEN G J , LEE SS , et al . Prompt cache: Modular attention reuse for low-latency inference [EB/OL ] . ( 2024-04-25 )[ 2025-10-10 ] . https://arXiv.org/abs/2311.04934 https://arXiv.org/abs/2311.04934 .

LIU G D , LI C W , ZHAO J R , et al . ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression [C ] // 2025 62nd ACM/IEEE Design Automation Conference . Piscataway : IEEE , 2025 : 1 - 7 .

ADAMASZEK A , CZUMAJ A , ENGLERT M , et al . An O (log k )-competitive algorithm for generalized caching [J ] . ACM Transactions on Algorithms , 2018 , 15 ( 1 ): 1 - 18 .

EPSTEIN L , IMREH C , LEVIN A , et al . Online file caching with rejection penalties [J ] . Algorithmica , 2015 , 71 ( 2 ): 279 - 306 .

Views

下载量

CSCD

Alert me when the article has been cited

提交

Tools

Publicity Resources

A Lightweight Neural Speech Compression Method for Edge Devices

Low-Cost Federated Learning Based on Lightweight Self-Distillation

PUF-Based Multi-Receiver Anonymous Signcryption Scheme in Edge Computing

Dual-Layer Federated Learning Based Edge Collaborative Computing Mechanism for High Dynamic Internet of Vehicle Businesses

Related Author

LU Yu

FU Yong-jian

DING Dian

PAN Hao

XUE Guang-tao

REN Ju

LIU Song

LUO Yang-yu

Related Institution

Department of Computer Science and Technology, Tsinghua University

School of Computer Science and Engineering, Central South University

School of Computer Science, Shanghai Jiao Tong University

Ministry of Education Key Laboratory of Data and Intelligent System Security

College of Cyber Science, Nankai University

⁰