

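1. College of Informatics, Huazhong Agricultural University, Wuhan, Hubei 430070
2. Hubei Key Laboratory for High-efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan, Hubei 430068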
Received: 07 May 2024
Revised: 23 September 2024
Published: 25 January 2025
GUO Xi, WANG Pan. Semantic Comparison Model for Binary Programs Based on Statistical Reasoning[J]. Acta Electronica Sinica, 2025, 53(01): 163-181. DOI:10.12263/DZXB.20240408
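In program defect analysis, malicious code discovery, and similar tasks, it is usually necessary to analyze the behavioral similarity of binary programs. Existing syntax-based similarity analysis methods ignore the execution semantics of programs and therefore suffer from low analysis accuracy. Semantics-based similarity analysis methods frequently invoke a constraint solver to compare semantic similarity while generating symbolic logic formulas, which incurs a huge computational overhead. This paper proposes a fuzzy matching method for code similarity analysis based on statistical inference, which starts from the computation of instruction-level similarity and infers the semantic similarity between basic blocks, and then between functions, level by level. First, the binary code is divided according to certain rules into a set of fragments in canonical form, and dynamic programming is used at basic-block granularity to build a storage table of entries with the same execution semantics, yielding an initial semantic mapping of instructions between basic blocks. Next, this mapping is extended to the target function under analysis by neighborhood search, and the execution semantics of the function are extracted in the process. Finally, the results for similar functions are analyzed statistically to compute the similarity of the binary files. In addition, an unsupervised pre-training method is adopted, and the parameters of the pre-trained model are tuned to improve the accuracy of code similarity analysis. Experiments were conducted on 13 mainstream open-source projects across platforms and optimization options. The results show that the analysis accuracy of the proposed method is on average 7.26% higher than that of the baseline tools, and ablation experiments show that the pre-trained model effectively improves the performance of semantic matching for binary programs.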
In program defect analysis and malicious code discovery, it is necessary to analyze the behavioral similarity of binary programs. Current syntax-based similarity analysis methods often ignore the execution semantics of the program, resulting in low analysis accuracy. Semantics-based analysis methods frequently call constraint solvers for semantic similarity comparison while generating symbolic logic formulas, resulting in significant time overhead. This article proposes a fuzzy matching method for code similarity analysis of binary programs based on statistical inference. Starting from the calculation of instruction-level similarity, the semantic similarity between basic blocks and functions is inferred step by step. First, the binary code is divided according to certain rules into a set of fragments with a standardized form, and dynamic programming is used at basic-block granularity to construct a storage table of the longest common subsequence with the same execution semantics, thereby obtaining the initial semantic mapping of instructions between basic blocks. Then, the mapping is extended to the target analysis code through neighborhood search, and the execution semantics of the fragments are learned during this process. Finally, statistical analysis is performed on the results of similar fragments to calculate the similarity of binary codes. During the experiments, an unsupervised pre-training method was used to improve the accuracy of code similarity analysis by tuning the parameters of the pre-trained model. Experiments were conducted on 13 mainstream open-source projects from the perspectives of cross-platform and optimization options. The experimental results show that, compared with the baseline tools, the analysis accuracy of our method improves by an average of 7.26%. Meanwhile, ablation experiments show that the pre-trained model proposed in this paper can effectively improve the semantic matching performance of binary programs.
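The dynamic-programming step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the normalization rules or the semantic-equivalence test, so the operand-kind abstraction in `normalize` is an assumed stand-in, and the names `normalize`, `lcs_mapping`, and `block_similarity` are hypothetical. The sketch only shows how a longest-common-subsequence table over normalized instructions yields an initial instruction mapping between two basic blocks.

```python
from typing import List, Tuple


def normalize(instr: str) -> str:
    """Assumed canonical form: lowercase mnemonic plus the kind of each operand."""
    mnemonic, _, rest = instr.strip().partition(" ")
    kinds = []
    for op in filter(None, (o.strip() for o in rest.split(","))):
        body = op.lstrip("-").lower()
        if "[" in op:
            kinds.append("MEM")   # memory reference such as [rbp-8]
        elif body.isdigit() or body.startswith("0x"):
            kinds.append("IMM")   # immediate constant
        else:
            kinds.append("REG")   # anything else is treated as a register
    return " ".join([mnemonic.lower()] + kinds)


def lcs_mapping(block_a: List[str], block_b: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of instructions matched by the LCS."""
    a = [normalize(i) for i in block_a]
    b = [normalize(i) for i in block_b]
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the matched instruction pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def block_similarity(block_a: List[str], block_b: List[str]) -> float:
    """Fraction of the longer block covered by the LCS mapping (an assumed score)."""
    if not block_a or not block_b:
        return 0.0
    return len(lcs_mapping(block_a, block_b)) / max(len(block_a), len(block_b))
```

In this sketch, two blocks that differ only in register allocation and in one inserted instruction still map onto each other almost completely, which is the kind of seed mapping a later neighborhood search could extend to whole functions.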