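1. College of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300
2. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300
3. College of Information Engineering, Yangzhou University, Yangzhou, Jiangsu 225127
4. Key Laboratory of Civil Aviation Internet of Aircraft (民航飞联网重点实验室), Civil Aviation University of China, Tianjin 300300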
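YANG Hong-yu: male, born in December 1969 in Changchun, Jilin. Ph.D., professor and doctoral supervisor at Civil Aviation University of China. His main research interests are network and system security, vulnerability analysis and assessment, and cloud computing and big data security. E-mail: yhyxlx@hotmail.com
WANG Yun-long: male, born in December 1998 in Cangzhou, Hebei. Master's student at Civil Aviation University of China. His main research interests are network information security, software supply chain security, and data security. E-mail: luckyfuture0177@163.com
HU Ze: male, born in July 1989 in Linfen, Shanxi. Ph.D., lecturer at Civil Aviation University of China. His main research interests are natural language processing, artificial intelligence, and information security.
CHENG Xiang: male, born in September 1988 in Urumqi, Xinjiang. Ph.D., experimentalist at Yangzhou University. His main research interests are network and system security, network security situational awareness, federated learning, and edge computing.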
Received: 2024-08-21
Revised: 2025-01-20
Published in print: 2025-04-25
YANG Hong-yu, WANG Yun-long, HU Ze, et al. Binary Code Similarity Detection Method Based on Cross-Modal Coordinated Representation Learning[J]. Acta Electronica Sinica, 2025, 53(04): 1279-1292. DOI:10.12263/DZXB.20240769
Binary code similarity detection (BCSD) can detect security threats inherent in binary files without access to source code, and is widely used in software supply chain security tasks such as software composition analysis and vulnerability discovery. Existing BCSD methods often overlook the actual execution information and local semantic information of programs, which leads to poor semantic representation learning for assembly instructions, high training resource consumption for feature extraction models, and poor similarity detection performance. To address these issues, this paper proposes a binary code similarity detection method based on cross-modal coordinated representation learning (CMRL). First, we extract the semantic correspondence between assembly instruction sequences and programming language fragments to construct a contrastive learning dataset, and propose an assembly code-programming language coordinated representation learning method (APECL) that uses the high-level semantics of source code as supervisory information. Through a contrastive learning task, the feature representations generated by the assembly instruction encoder APECL-Asm and the programming language encoder are aligned in the semantic space, improving how well APECL-Asm learns semantic representations of assembly instructions. Next, we design a graph neural network-based method for generating binary function embedding vectors, in which a semantic structure-aware network fuses the semantic information extracted by APECL-Asm with the actual execution information of the program to generate function embedding vectors. Finally, similarity between binary code is detected by computing the cosine distance between function embedding vectors. Experimental results show that, compared with existing methods, CMRL improves Recall@1 for binary code similarity detection by 8% to 33%. In similarity detection under code obfuscation, the Recall@1 of CMRL degrades less, indicating stronger robustness to interference.
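The pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the symmetric InfoNCE loss below is the CLIP-style formulation, assumed here as one plausible form of the contrastive alignment objective, and the function names, the `temperature` default of 0.07, and the toy vectors are all hypothetical. Ranking by highest cosine similarity is equivalent to ranking by lowest cosine distance (1 minus similarity).

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def symmetric_contrastive_loss(
    asm: List[Vector], src: List[Vector], temperature: float = 0.07
) -> float:
    """CLIP-style symmetric InfoNCE loss (an assumption, not the paper's exact loss).

    asm[i] and src[i] are embeddings of a matched assembly/source pair;
    every other combination in the batch is treated as a negative.
    """
    logits = [[cosine_similarity(a, s) / temperature for s in src] for a in asm]

    def cross_entropy_on_diagonal(rows: List[List[float]]) -> float:
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]  # source-to-assembly direction
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


def recall_at_1(
    queries: List[Vector], pool: List[Vector], ground_truth: List[int]
) -> float:
    """Fraction of queries whose most similar pool entry is the true match."""
    hits = 0
    for query, truth in zip(queries, ground_truth):
        best = max(range(len(pool)), key=lambda j: cosine_similarity(query, pool[j]))
        hits += int(best == truth)
    return hits / len(queries)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for function embedding vectors.
    asm_batch = [[1.0, 0.0], [0.0, 1.0]]
    src_batch = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned loss:", symmetric_contrastive_loss(asm_batch, src_batch))
    print("Recall@1:", recall_at_1([[1.0, 0.1], [0.1, 1.0]], [[0.0, 1.0], [1.0, 0.0]], [1, 0]))
```

In this sketch the loss is small when each assembly embedding is closest to its own source embedding and large when the pairs are mismatched, which is the behaviour a contrastive alignment objective is meant to reward. A real system would compute these quantities on encoder outputs with a tensor library rather than on Python lists.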