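1. School of Computer Science, Peking University, Beijing 100871, China
2. Strategic Assessments and Consultation Institute, Academy of Military Sciences, Beijing 100091, China
3. College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China
4. Academy of Military Sciences, Beijing 100091, China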
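WANG Ziyao (汪子尧), male, was born in Beijing in March 2000. He is a Ph.D. candidate at the School of Computer Science, Peking University. His main research interest is artificial intelligence. E-mail: ziyaowang@stu.pku.edu.cn
TIAN Yu (田瑜), female, was born in Wuxi, Jiangsu Province, in May 1992. She is an assistant researcher at the Strategic Assessments and Consultation Institute, Academy of Military Sciences. Her main research interests are computer vision and large language models. E-mail: tianyu10@alumni.nudt.edu.cn
HUANG Junjie (黄俊杰), male, was born in Huaihua, Hunan Province, in September 1990. He is an associate researcher at the College of Computer Science and Technology, National University of Defense Technology. His main research interests include multimodal learning, model-driven deep learning, and visual enhancement and perception. E-mail: jjhuang@nudt.edu.cn
TAN Jie (谭捷), female, was born in Shijiazhuang, Hebei Province, in December 1991. She is an associate researcher at the Academy of Military Sciences. Her main research interests are artificial intelligence and software engineering. E-mail: j.tanjie@outlook.com
YANG Wenjing (杨文婧), female, was born in Changsha, Hunan Province, in October 1988. She is a researcher and doctoral supervisor at the College of Computer Science and Technology, National University of Defense Technology. Her main research interests are data-centric embodied intelligence and spatial computing. E-mail: wenjing.yang@nudt.edu.cn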
Received: 2025-12-07
Accepted: 2026-01-19
Published in print: 2026-01-25
WANG Ziyao, TIAN Yu, HUANG Junjie, et al. Research and Comparative Analysis of SSD Failure Prediction Methods Based on Machine Learning and Rule-Based Reasoning[J]. Acta Electronica Sinica, 2026, 54(01): 115-124. DOI:10.12263/DZXB.20250975
With the rapid evolution of cloud computing, big data, and artificial intelligence applications, data centers continue to grow in scale, and the reliability of their storage systems has become a critical factor in stable operation and service availability. Solid-state drives (SSDs), a key component of data center storage systems, are widely deployed in the core storage layer because of their high throughput, low latency, and low power consumption. Under large-scale, long-term operation, however, SSD failures tend to occur abruptly and follow complex evolution patterns, posing severe challenges to service continuity and data security. To improve the accuracy and practicality of SSD failure prediction, this paper proposes a machine learning prediction method based on classification models and feature engineering, together with a rule-based reasoning prediction method based on an explicit rule engine and dynamic feature compensation. Through multi-stage feature engineering and ensemble learning, the machine learning method achieves a macro-averaged F1 score of 0.968 when the data are complete; however, its "black-box" nature limits its industrial applicability to some extent. The rule-based reasoning method builds an explicit rule engine that fuses multiple algorithms and introduces a dynamic feature compensation mechanism based on SHAP (SHapley Additive exPlanations) values. It attains an accuracy of 0.988 with complete data and still maintains an accuracy of 0.941 in the extreme case of eight missing features, demonstrating strong robustness. Comparative analysis of the experimental results indicates that the machine learning method delivers high prediction accuracy when the data are complete, whereas the rule-based reasoning method is superior in interpretability, real-time performance, and adaptability to missing data. The paper further explores pathways for integrating the two methods, providing theoretical support and practical reference for building next-generation intelligent operation and maintenance systems that combine perceptual capability with transparent reasoning.
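The abstract reports a macro-averaged F1 score for one method and accuracy for the other, and the two metrics are not directly interchangeable. The following is a minimal sketch of how each is computed, not the authors' evaluation code. The function names and the toy labels (0 for a healthy drive, 1 for a failed drive) are assumptions made purely for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 8 healthy drives (0) and 2 failed drives (1).
# The classifier misses one of the two failures.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

On imbalanced data such as this toy set, accuracy is dominated by the majority (healthy) class, while macro-averaged F1 weights the rare failure class equally, so the two numbers can diverge noticeably.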