
A Binary Code Similarity Analysis Method Based on Code Embedding

Cyber Security and Data Governance, 2023, Issue 3

Xiong Min, Xue Yinxing, Xu Yun

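(1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, Anhui, China; 2. Key Laboratory of High Performance Computing of Anhui Province, Hefei 230026, Anhui, China)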

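Abstract: Code embedding uses neural network models to convert the code representation of a binary function into a vector, and has shown advantages in applications such as vulnerability search. Existing methods represent a function as an assembly instruction sequence, the topology of a control flow graph, or a set of paths, and none of them overcomes the interference caused by control flow graph structures changing across compilation environments. To address this, we design a code representation based on the basic block tree (BBT) and build a corresponding code embedding model, BBTree. First, a binary function is represented as a series of BBTs, and each BBT is processed into an instruction sequence. Second, BBTree uses LSTM and BiGRU to convert the BBT-based code representation into a vector. Finally, the distance between vectors is computed to efficiently measure the similarity of the corresponding functions. In code search, BBTree's average accuracy is 24.8% higher than that of mainstream tools; in vulnerability search, its average recall is 26.1% higher than that of mainstream tools.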

Keywords: code representation; code embedding model; code search; vulnerability search

CLC number: TP311.5
Document code: A
DOI: 10.19358/j.issn.2097-1788.2023.03.010
Citation: Xiong Min, Xue Yinxing, Xu Yun. A binary code similarity analysis method based on code embedding [J]. Cyber Security and Data Governance, 2023, 42(3): 58-67.

A binary code similarity analysis method based on code embedding

Xiong Min 1,2, Xue Yinxing 1, Xu Yun 1,2

(1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China; 2. Key Laboratory of High Performance Computing of Anhui Province, Hefei 230026, China)

Abstract: Code embedding uses neural network models to convert the code representation of a binary function into a vector, and has shown advantages in applications such as vulnerability search. Existing methods represent functions as assembly instruction sequences, as the topology of control flow graphs, or as sets of paths. However, none of them overcomes the interference caused by structural changes in control flow graphs across different compilation environments. To this end, this paper designs a basic block tree (BBT)-based code representation and builds a corresponding code embedding model named BBTree. First, a binary function is represented as a series of BBTs, and each BBT is processed into an instruction sequence. Second, BBTree uses LSTM and BiGRU to convert the BBT-based code representation into a numerical vector. Finally, the distance between vectors is calculated to efficiently measure the similarity of the corresponding functions. In code search, BBTree's average accuracy is 24.8% higher than that of mainstream tools; in vulnerability search, BBTree's average recall is 26.1% higher than that of mainstream tools.

Key words: code representation; code embedding model; code search; vulnerability search

0 Introduction

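Because the source code of commercial programs, legacy programs, and malicious code is not publicly available, binary code similarity analysis of these programs has many security applications, such as plagiarism detection, malware detection, and vulnerability search. Similarity analysis takes known binary code (for example, a disclosed vulnerability) and searches a code repository for semantically similar binary code, thereby detecting potential vulnerabilities and keeping programs secure. Binary code embedding, an emerging similarity analysis technique, uses neural network models to convert the code representation of a binary function into a numerical vector. It not only learns the semantics of the binary code but also allows the similarity of the corresponding functions to be quantified by computing the distance between their vectors.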
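The final step described above, ranking functions by the distance between their embedding vectors, can be illustrated with a minimal sketch. It is not the BBTree implementation: the choice of cosine distance, the helper names `cosine_distance` and `rank_candidates`, and the three-dimensional toy vectors are all assumptions made for illustration, since this excerpt does not state which distance metric or vector dimension the paper uses.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cos(a, b): 0 for identical direction, 1 for orthogonal vectors.

    Cosine distance is an assumed metric here, not one confirmed by the paper.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)


def rank_candidates(
    query: Vector, repository: Dict[str, Vector], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k repository functions closest to the query embedding."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in repository.items()]
    scored.sort(key=lambda item: item[1])  # smallest distance first
    return scored[:top_k]


# Hypothetical embeddings; a real model would produce these from binary functions.
repository = {
    "func_a": [0.9, 0.1, 0.0],
    "func_b": [0.0, 1.0, 0.0],
    "func_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a known vulnerable function
ranking = rank_candidates(query, repository, top_k=2)
```

Because each function is reduced to one fixed-length vector, a search over a repository needs only one distance computation per candidate, which is what makes embedding-based comparison cheap relative to pairwise graph matching.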

For the full text of this article, download: http://www.ihrv.cn/resource/share/2000005257

