

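1. College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang 310063
2. College of Computer and Data Science, Fuzhou University, Fuzhou, Fujian 350108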
Received: 29 February 2024
Revised: 20 August 2024
Published: 25 October 2024
LIU Hong-yan, ZHANG Dong, WU Chun-ming. Empowering In-Network Gray Failure Detection with Programmable Switches[J]. Acta Electronica Sinica, 2024, 52(10): 3613-3622. DOI:10.12263/DZXB.20240199
Gray failures are subtle switch malfunctions that have only a slight impact on production networks. However, when these minor malfunctions compound one another or coincide with a new failure, they can paralyze an entire production network. Detecting gray failures is therefore essential to the stability of production networks. Prior methods collect flow records from data plane switches in the control plane and process them there to detect gray failures. However, they fall short for two reasons: (1) buffering and processing massive numbers of flow records incurs high resource overhead, and (2) high detection latency prevents gray failures from being detected in time. Recently, the emergence of programmable switches has provided a promising alternative: network administrators can offload gray failure detection algorithms to line-rate switch ASIC pipelines, enabling low-overhead, low-latency, and high-accuracy in-network gray failure detection. This paper surveys in-network gray failure detection techniques built on programmable switches. First, we describe the concept of gray failures, their prevalence, and their impact on production networks. Second, we analyze and discuss the state of the art and recent progress of gray failure detection techniques built on programmable switches. Third, we detail the principle and workflow of each technique. Fourth, we build a real-world testbed to evaluate the detection metrics of each technique. Finally, we highlight the problems and challenges that existing techniques face.