License: CC BY 4.0
arXiv:2312.08317v1 [cs.CR] 13 Dec 2023

Prompt Engineering-assisted Malware Dynamic Analysis Using GPT-4

Pei Yan 1,2, Miaohui Wang 1,2, Shunquan Tan 1,2, Jiwu Huang 1,2,*
1 Guangdong Key Laboratory of Intelligent Information Processing
2 Shenzhen Key Laboratory of Media Security
Shenzhen, China
yanpei2022@email.szu.edu.cn, wang.miaohui@gmail.com, tansq@szu.edu.cn, jwhuang@szu.edu.cn
* Corresponding author
Abstract

Dynamic analysis methods effectively identify shelled, wrapped, or obfuscated malware, thereby preventing them from invading computers. As a significant representation of dynamic malware behavior, the API (Application Programming Interface) sequence, comprised of consecutive API calls, has progressively become the dominant feature of dynamic analysis methods. Though there have been numerous deep learning models for malware detection based on API sequences, the quality of API call representations produced by those models is limited. These models cannot generate representations for unknown API calls, which weakens both the detection performance and the generalization. Further, the concept drift phenomenon of API calls is prominent. To tackle these issues, we introduce a prompt engineering-assisted malware dynamic analysis using GPT-4. In this method, GPT-4 is employed to create explanatory text for each API call within the API sequence. Afterward, the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) is used to obtain the representation of the text, from which we derive the representation of the API sequence. Theoretically, this proposed method is capable of generating representations for all API calls, excluding the necessity for dataset training during the generation process. Utilizing the representation, a CNN-based detection model is designed to extract the feature. We adopt five benchmark datasets to validate the performance of the proposed model. The experimental results reveal that the proposed detection algorithm performs better than the state-of-the-art method (TextCNN). Specifically, in cross-database experiments and few-shot learning experiments, the proposed model achieves excellent detection performance and almost a 100% recall rate for malware, verifying its superior generalization performance. The code is available at: github.com/yan-scnu/Prompted_Dynamic_Detection.

Index Terms:
Computer security, malware detection, prompt engineering, large language model

1 Introduction

Figure 1: The comparison between the proposed and the existing representation methods. When placing the suspicious code from the terminal into a sandbox for execution, the sandbox will output the API sequence called by code. We propose to employ GPT-4 for generating explanatory text for each API call within the API sequence. We then embed this explanatory text and concatenate the resulting embedded matrices to achieve representation. By introducing additional prior knowledge, our method enhances representation efficacy, thereby outperforming previous methods.

Malware poses a significant risk to the network and computer security, rendering effective malware detection an essential and challenging task[1, 2, 3]. A dynamic analysis method was proposed to achieve better detection performance. This method involves running the malware in a sandbox and recording its dynamic behavior. The code is judged based on its dynamic behavior, providing a better mechanism to detect obfuscated and packaged malware.

For dynamic analysis methods, the most commonly used dynamic behavior is the API sequence, composed of the API calls made during the code’s runtime process. The encoded API sequence shares certain similarities with encoded natural text [4, 5]. As Deep Learning (DL) technology has yielded promising results in Natural Language Processing (NLP) tasks, many DL-based text classification models have been applied to identifying API sequences, achieving excellent classification performance [6, 7, 8]. With the recent rise of the Large Language Model (LLM) [9, 10, 11], many fields are attempting to integrate large language models to solve problems [12, 13]. Malware analysts are exploring ways to further enhance detection performance using LLMs. For instance, Transformer-based network architectures are being designed for malware classification [14, 15].

However, due to limited training data and the differences between API sequences and encoded text sequences, the detection performance of state-of-the-art (SOTA) methods remains constrained. Moreover, several issues persist in current API-based detection models.

  • Limited representation of malware features. The representation of API calls is refined during the training process. The quality of the training data significantly influences the effectiveness of the representation.

  • Weak generalization performance. The representation, obtained during the training phase, may be susceptible to overfitting on a specific dataset. Consequently, this can lead to inferior performance when validated against other datasets.

  • Sensitive to concept drift. As systems and detection tools iterate, API calls will also be updated accordingly. This can cause the current detection model to lack representation for new API calls. Consequently, it may affect the detection accuracy of future malware.

To address the issues mentioned above, this paper introduces a method for generating representations based on an LLM. This approach employs the LLM to produce explanatory text for each API call. We then perform embedding operations on these texts, and the resulting embeddings serve as the representations of the API calls. With the high-quality explanatory text created by the LLM and the application of pre-trained models, we can directly obtain the representation. This eliminates the need for training with the API sequence dataset, significantly enhancing the efficiency of representation generation.

GPT-4 [16] has demonstrated better performance than other models (e.g., PaLM [17], LLaMA [18], ChatGLM [19]) in various language tasks. Consequently, this study employs GPT-4 to generate explanatory text for API calls. We generate representations based on this explanatory text and concatenate the text representations to obtain the representation of the API sequence, as illustrated in Figure 1. We then design appropriate deep neural networks to learn these representations and classify malware based on them. The primary contributions of this paper are summarized as follows:

  1. We guide GPT-4 to generate explanatory texts for API calls, and these texts serve as a representation of each API call during both training and testing procedures. To the best of our knowledge, this is the first report to apply prompt engineering to dynamic malware analysis.

  2. With the assistance of descriptive explanatory text, the acquisition of API call representations does not require training with datasets. This approach enriches the representation with additional knowledge, thereby improving the quality of the representations and enhancing their generalizability.

  3. Thanks to its substantial training corpus and robust text restatement capabilities, GPT-4 can theoretically generate the representation for all API calls. This not only makes the associations among representations denser, stronger, and more stable, but also helps in coping with data drift phenomena such as API call updates.

2 Related Work

2.1 Malware Dynamic Analysis

The dynamic analysis method analyzes malware by examining its dynamic behavioral features during execution [20]. By executing the executable file in a sandbox (such as Cuckoo (https://github.com/cuckoosandbox/cuckoo) or Cape (https://github.com/mandiant/capa)), the sandbox records its behavioral characteristics, which enables us to analyze the file based on these attributes. In comparison with static analysis methods, which do not require execution during the analysis process, dynamic analysis methods can effectively detect shelled, obfuscated, and packed malware. As such, dynamic analysis is currently one of the primary detection methods.

One of the most crucial dynamic behavioral characteristics is the API sequence [4, 5]. The API sequence reflects the interaction between the code and the operating system, providing an understanding of the malware behaviors, which is essential for designing effective malware defense strategies. Earlier research leveraged statistical learning [21, 22, 23], machine learning [24, 25], and graph methods [26] to examine the API sequence and classify codes. Researchers also attempted to employ static analysis features for hybrid analysis [27, 28]. However, the simplicity of early models and the small amount of training data greatly constrained both the generalization and detection performance.

2.2 NLP-based Methods

With the significant success of deep learning technology in text classification tasks, researchers are attempting to apply text classification methods to malware dynamic analysis, given that the encoded text bears a high resemblance to API sequences.

Pascanu et al. [5] were the first to suggest the use of Recurrent Neural Networks (RNN) and Echo State Networks (ESN) for dynamic malware detection, pioneering the application of NLP models in dynamic analysis. Nonetheless, RNN demonstrated some degree of gradient vanishing and gradient explosion. To mitigate the issues of the RNN model, researchers [6, 7, 8, 29, 30] utilized the LSTM and GRU models to design dynamic detection models building on Pascanu’s work. Apart from RNN-based models, CNN-based [31, 32] models and CNN+RNN combination models [33] have also shown good performance.

Although contemporary methods have achieved solid detection performance, they generally lack robustness and generalization, as verified in Section 4.4 and Section 4.5.

2.3 Transformer-based Methods

The Transformer [9] is a deep learning model based on the attention mechanism, first employed in machine translation tasks. The fundamental structure of the Transformer comprises an encoder and a decoder. The encoder is tasked with understanding the input data, while the decoder is used to generate the output results.

Given the success of the Transformer architecture in the NLP field, researchers have sought to transfer this architecture to malware detection models. The Transformer is adept at handling long sequences and can be trained using parallel computing. Moreover, it calculates the attention between each API call to identify key API calls and categorizes the API sequences based on this attention.
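For reference, the attention computation mentioned above is commonly written as the scaled dot-product attention of the original Transformer [9]. The symbols $Q$, $K$, $V$ (query, key, and value matrices) and $d_k$ (key dimension) come from that formulation and are not defined elsewhere in this paper; in the malware setting, each row would correspond to one API call in the sequence.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]
```

Each row of the softmax output holds the weights that one API call assigns to every other call, which is how key API calls can be identified.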

Numerous LLMs [10, 11, 34, 35, 36, 37] have been proposed, building upon the Transformer model. Some of these models have been applied to malware detection methods. Some studies [38, 39, 40] directly utilize the LLM framework to construct detection models, while others [15, 41] follow the pre-training methods of LLMs, obtaining the pre-training weights of the model through self-supervised tasks and then fine-tuning it on specific datasets.

It is important to note that API sequences and text sequences differ somewhat in their statistical characteristics and vocabulary sets. Furthermore, the local features of the API sequence are typically more significant than its global features. This weak contextual relevance may limit the effectiveness of the Transformer module in malware detection models.

Figure 2: The pipeline of the proposed model is divided into two main modules: Representation Generation and Representation Learning. In the Representation Generation stage, explanatory text for API calls is generated using GPT-4. Following this, BERT is used to generate embeddings for the explanatory text, thereby generating the representation of the API sequence. In the Representation Learning stage, a multi-layer convolutional neural network is utilized to extract and subsequently learn feature information from the representation. A fully connected layer is ultimately used to connect to each malware category.

3 Methodology

In this study, GPT-4 is utilized to produce explanatory text for each API call within the API sequence. Given its training on a large-scale corpus, GPT-4 can rephrase and summarize the knowledge associated with API calls via prompt engineering. The prompt texts can guide GPT-4 to generate high-quality explanatory text. Following this, BERT, a pre-trained language model, is employed to generate representations for this explanatory text, which are then concatenated to represent the entire API sequence. A deep neural network is subsequently deployed to automatically extract and learn features from these representations. Finally, the model is connected to the malware categories through a fully connected layer with a softmax function. The overall architecture of the proposed model is illustrated in Figure 2.

3.1 Representation Generation

To generate a representation of the API sequence, we need to produce the explanatory text for each API call in the sequence. For a more detailed depiction of this process, we define a mapping relationship, $Prompt$, wherein we create a sentence for the description and explanation of each API call. We define an API sequence $s$ with the length of $n$ as $s=[\alpha_{1},\ldots,\alpha_{n}]$, where $\alpha_{i}$ signifies a single API call. Through prompt engineering, each API call generates descriptive text, and we denote this descriptive text as $e_{i}$.

$$[e_{1},\ldots,e_{n}]=[Prompt(\alpha_{1}),\ldots,Prompt(\alpha_{n})]. \tag{1}$$

However, this method consumes a significant amount of computational resources. Consequently, we construct a vocabulary for the API sequence and generate corresponding explanations for each API call within the vocabulary. Subsequently, the explanation text for each API call can be located in the vocabulary, thereby facilitating the reuse of explanation text and significantly reducing computational demands.
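The vocabulary-based reuse described above can be sketched as a simple cache keyed by API name. This is a minimal illustration, not the authors' implementation: `build_explanation_vocab`, `explain_sequence`, and `fake_explain` are hypothetical names, and `fake_explain` stands in for the actual GPT-4 query.

```python
from typing import Callable, Dict, Iterable, List


def build_explanation_vocab(
    sequences: Iterable[List[str]],
    explain: Callable[[str], str],
) -> Dict[str, str]:
    """Generate exactly one explanation per distinct API name."""
    vocab: Dict[str, str] = {}
    for seq in sequences:
        for api in seq:
            if api not in vocab:
                vocab[api] = explain(api)  # one LLM query per new API name
    return vocab


def explain_sequence(seq: List[str], vocab: Dict[str, str]) -> List[str]:
    """Look up the cached explanation e_i for every call alpha_i."""
    return [vocab[api] for api in seq]


# Hypothetical stand-in for the GPT-4 query; it records each invocation.
calls: List[str] = []


def fake_explain(api: str) -> str:
    calls.append(api)
    return f"Explanation of {api}."


sequences = [
    ["NtCreateFile", "NtWriteFile", "NtClose"],
    ["NtCreateFile", "NtClose", "NtClose"],
]
vocab = build_explanation_vocab(sequences, fake_explain)
texts = explain_sequence(sequences[1], vocab)
```

With this scheme the number of LLM queries grows with the vocabulary size rather than with the total number of API calls in the dataset.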

Next, we segment the explanatory text using the $WordPiece$ segmentation method, as shown in Eq. (2).

$$\mathtt{[CLS]},\omega_{1},\ldots,\omega_{m},\mathtt{[SEP]}=WordPiece(e). \tag{2}$$

WordPiece decomposes a word into multiple subwords or characters, proving more effective than the space-based method when handling unknown words, rare words, and complex words. We segment the explanatory text and tokenize it to obtain a sequence of tokens, then adjust the sequences to the same length $m$ by truncating the excess tokens and padding shorter sequences with the special token [PAD]. Finally, we add the [CLS] and [SEP] special tokens at the beginning and end of the sequence, respectively. A mapping relationship $Embed$ is defined in Eq. (3), generating the vectors that represent each token in the sequence.

$$\begin{aligned}\mathbf{e}&=Embed([\mathtt{[CLS]},\omega_{1},\ldots,\omega_{m},\mathtt{[SEP]}])\\&=[v_{CLS},v_{1},\ldots,v_{m},v_{SEP}].\end{aligned} \tag{3}$$

A vector representation, denoted as $v_{j}$ $(1\leq j\leq m)$, is generated for every token $\omega_{j}$ within the token sequence. The pre-trained BERT is utilized to represent each token, and the dimension of its embedding layer is 768, thus $v_{j}\in\mathbb{R}^{768}$. This further creates a representation matrix $\mathbf{e}_{k}\in\mathbb{R}^{(m+2)\times 768}$ $(1\leq k\leq n)$ that corresponds to one API call. Upon obtaining the representation of each API call, a concatenation, denoted as $Concat$, is performed on the representations $\mathbf{e}_{k}$ of the API calls in the API sequence. This process results in the representation tensor $\boldsymbol{E}\in\mathbb{R}^{n\times(m+2)\times 768}$ of the API sequence, as shown in Eq. (4).

$$\boldsymbol{E}=Concat[\mathbf{e}_{1},\ldots,\mathbf{e}_{n}]. \tag{4}$$
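The shape bookkeeping behind Eqs. (2) to (4) can be sketched as follows. This is an illustrative sketch under stated assumptions: `frame_tokens`, `represent_sequence`, and `toy_embed` are hypothetical names, the token lists are assumed to be already split, and `toy_embed` is a deterministic placeholder with a small dimension $d$, not BERT (which would yield $d = 768$).

```python
from typing import Callable, List

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def frame_tokens(tokens: List[str], m: int) -> List[str]:
    """Truncate or pad to m tokens, then add [CLS] and [SEP] (length m + 2)."""
    body = tokens[:m] + [PAD] * max(0, m - len(tokens))
    return [CLS] + body + [SEP]


def toy_embed(token: str, d: int = 4) -> List[float]:
    """Placeholder embedding, NOT BERT: a deterministic d-dimensional vector."""
    return [float((len(token) + k) % 7) for k in range(d)]


def represent_sequence(
    token_lists: List[List[str]],
    m: int,
    embed: Callable[[str], List[float]],
) -> List[List[List[float]]]:
    """Stack one (m + 2) x d matrix per API call into an n x (m + 2) x d tensor."""
    return [[embed(tok) for tok in frame_tokens(toks, m)] for toks in token_lists]


# Two API calls (n = 2) whose explanations have 3 and 6 tokens, with m = 4.
token_lists = [
    ["opens", "a", "file"],
    ["closes", "a", "handle", "to", "an", "object"],
]
E = represent_sequence(token_lists, m=4, embed=toy_embed)
shape = (len(E), len(E[0]), len(E[0][0]))
```

The padding is placed before [SEP] because the text above describes padding first and adding the special tokens afterwards; a real tokenizer library may order these differently.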

3.2 Representation Learning

The input for previous model methods is a two-dimensional API sequence representation matrix, but the proposed generated representation belongs to a three-dimensional tensor. Consequently, there is a need to design a network architecture capable of accepting a three-dimensional tensor as input, and learning from the representation.

First, to adjust the representation, a depth-wise convolution is performed. The obtained representation is derived from the representation of natural text, which differs somewhat from the API sequence representation. Each embedded channel corresponds to a representation matrix, with each element in the representation matrix having a contextual correlation among the surrounding elements. Specifically, the vertical contextual correlation stems from the explanation text, and the horizontal contextual correlation comes from API sequences. The design of a module for representation adjustments and capturing semantic information is therefore necessary. The trained module can improve the adjustment of the natural text representation for better reflection of API calls and can also capture semantic association information among the surrounding elements.

Considering that the representation of each dimension reflects specific characteristics of the data and that the correlation of representations across dimensions is not strong, we employ the depth-wise (per-layer) convolution to adjust each dimension’s representation independently. Additionally, this convolution can capture the correlation and exceptional features of local data.

Unlike natural text sequences, API sequences exhibit significant local features. Therefore, after adjusting the representation, two-dimensional convolution blocks with varying kernel sizes are used to generate respective feature maps. Max pooling and batch normalization operations are then performed on the feature maps. The max pooling operation can select the maximum value from each feature map, allowing the model to capture the important features in each map. It should be considered that the dimensions of the feature map have different statistical distributions, resulting in significant variations in max pooling results. To standardize the results of max pooling, a batch normalization layer is utilized. Finally, the results are concatenated and each classification category is connected through a fully connected layer containing a softmax function.
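The architecture described in this subsection can be sketched in PyTorch as below. This is a hedged sketch, not the authors' network: the class name `RepresentationCNN`, the kernel sizes, the number of filters, the 3x3 depth-wise kernel, and the ReLU activation are all assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn


class RepresentationCNN(nn.Module):
    """Depth-wise adjustment, multi-kernel 2D convolutions, max pooling, batch norm, FC."""

    def __init__(self, embed_dim=768, num_filters=64, kernel_sizes=(3, 5, 7), num_classes=2):
        super().__init__()
        # groups=embed_dim makes the convolution depth-wise: one filter per channel.
        self.adjust = nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1, groups=embed_dim)
        # One 2D convolution block per kernel size (sizes are assumed values).
        self.convs = nn.ModuleList(
            [nn.Conv2d(embed_dim, num_filters, kernel_size=k, padding=k // 2) for k in kernel_sizes]
        )
        # Batch normalization standardizes the max-pooled features of each block.
        self.norms = nn.ModuleList([nn.BatchNorm1d(num_filters) for _ in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        # x: (batch, n, m + 2, embed_dim) -> (batch, embed_dim, m + 2, n)
        x = x.permute(0, 3, 2, 1)
        x = self.adjust(x)
        pooled = []
        for conv, norm in zip(self.convs, self.norms):
            fmap = torch.relu(conv(x))            # ReLU is an assumption
            peak = torch.amax(fmap, dim=(2, 3))   # global max over each feature map
            pooled.append(norm(peak))
        logits = self.fc(torch.cat(pooled, dim=1))
        return torch.softmax(logits, dim=1)


# Small dimensions for a quick shape check: n = 10 calls, m + 2 = 6 tokens.
model = RepresentationCNN(embed_dim=16, num_filters=8, num_classes=4).eval()
with torch.no_grad():
    probs = model(torch.randn(2, 10, 6, 16))
```

After the permutation, the vertical axis of each channel runs over the explanation tokens and the horizontal axis over the API sequence, matching the two kinds of contextual correlation described above.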

Figure 3: Heat map illustration of the cosine similarity among API call representations produced using different representation techniques, using Aliyun as the training dataset. The representation correlations produced by the TextCNN (a) and BiLSTM (b) models contain many zero values, while the representation correlations derived via the proposed method (c) are more closely related.

3.3 Analysis of Representation Quality

A representation method maps discrete API calls to fixed-size continuous vectors. This mapping facilitates the calculation of the correlation between API calls through these vectors. For example, vectors corresponding to API calls with similar implications are closer in vector space. Therefore, representation vectors, once trained with datasets or obtained by other methods, are expected to learn and mirror the semantic associations between API calls. This enhances the vectors’ ability to deliver a higher-quality representation of API calls. By learning from these high-quality representations, subsequent models can further improve their learning capacity. The quality of the semantic associations in the API call representation therefore greatly influences detection performance.

To evaluate the semantic relationships of API calls under different models, we calculate the cosine similarity of the API call representations produced by TextCNN, BiLSTM, and the proposed model. The API call representations of TextCNN and BiLSTM are vectors, whereas the API call representation of the proposed method is a matrix; in both cases, the two representations being compared are denoted by A and B. The corresponding similarity calculation formula is provided in Eq. (5). Since there is no negative correlation between the API call representations in the proposed method, we use the absolute value of the cosine similarity as the measure of representation similarity, which better showcases the differences in representation effects between our method and the previous methods.

$$Cosine(\mathbf{A},\mathbf{B})=\frac{\left|\sum_{j=1}^{n}(\mathbf{A}_{ij}*\mathbf{B}_{ij})\right|}{\sqrt{\sum_{j=1}^{n}(\mathbf{A}_{ij}^{2})}*\sqrt{\sum_{j=1}^{n}(\mathbf{B}_{ij}^{2})}}. \tag{5}$$
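A minimal sketch of the absolute cosine similarity in Eq. (5) is given below. The function names `abs_cosine` and `flatten` are hypothetical, and flattening a matrix representation into a single vector before comparison is an assumption made for illustration; the paper does not spell out how the matrix case is reduced.

```python
import math
from typing import List, Sequence


def abs_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Absolute value of the cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero representation has no defined direction
    return abs(dot) / (norm_a * norm_b)


def flatten(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Assumed reduction for matrix representations: row-major flattening."""
    return [v for row in matrix for v in row]


opposite = abs_cosine([1.0, 0.0], [-1.0, 0.0])      # anti-parallel vectors
orthogonal = abs_cosine([1.0, 0.0], [0.0, 1.0])     # unrelated vectors
same_matrix = abs_cosine(flatten([[1.0, 2.0], [3.0, 4.0]]),
                         flatten([[2.0, 4.0], [6.0, 8.0]]))
```

Taking the absolute value maps anti-parallel vectors to the same score as parallel ones, so the measure reflects the strength of the association rather than its sign.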

The heat map of the API call representation associations obtained on the Aliyun dataset is depicted in Figure 3. For the API call representations generated by TextCNN and BiLSTM, approximately 18% of API calls have almost no correlation with other API calls. One reason for this is that the amount of training data is limited, hence the representations of some API calls cannot be learned. Additionally, after dividing the entire dataset into training and testing datasets, about 15% of the API call types are absent from the training dataset. Consequently, some correlations among API call representations are not learned during the training process, which results in poor generalization performance.

In the proposed method, API call representations are derived from the explanatory text that GPT-4 generates in response to the prompt. So even if certain API calls are absent from the training dataset, the proposed method can generate a representation for these calls and calculate semantic correlations with other API calls. This method is able to generate representations for all API calls, facilitating the calculation of similarity between any two API calls. Hence, the similarity matrix generated by this method is denser and contains more information than those of the previous two methods.

The quality of representation is also a key criterion for evaluating representation generation methods. In the case of API calls, if API calls have similar meanings, then their cosine similarity will be higher. To measure the difference in representational quality between the proposed method and the previous methods, we carry out an analysis of two cases.

Figure 4: Comparison of the output content generated by the GPT-4 model using both the direct prompt method and the designed prompt method. The red text represents the prompt text, while the blue text represents the content generated by GPT-4. The comments on this generated content are indicated in black text. In the answer of the direct prompt method, text highlighted in italics marks problematic content. The specific issues related to this answer are outlined in the comments to the right of the highlighted content.

Case 1: wide and narrow characters. Consider a pair of API calls, HttpSendRequestW and HttpSendRequestA, as an example. The only difference between their names is the final letter. These API calls are two versions of the same function: many functions in the Windows API have two versions, one for Unicode and one for ANSI. The version ending with “A” deals with narrow characters (ANSI), while the version ending with “W” handles wide characters (Unicode). As such, the meanings of these two API calls are virtually identical, and their cosine similarity should be close to one. Previous models learned semantic associations among API calls during the training procedure. However, they struggled to learn correlations under conditions of low data quality or a low occurrence of specific API calls in the datasets. Our method generates explanatory texts for the API calls, so API calls with similar meanings receive highly similar explanatory texts.

Case 2: semantic chain analysis. Li et al. [33] proposed a semantic chain method. This method generates four attributes of an API call based on the API call name. These attributes are $action$, $object$, $class$, and $category$, collectively forming the semantic chain of the API call. If two API calls share the same semantic chain, their meanings are similar and their cosine similarity approaches one. The case where all four attributes are identical corresponds to Case 1. Considering the strong distinguishing capacity of the $object$ attribute, we examine a pair of API calls that share identical $action$, $class$, and $category$ attributes. The proposed method effectively captures the relationships between API calls with similar semantic chains, a task that is challenging for prior representation methods.

Detailed illustrations of the aforementioned two cases, as well as a comparison between the performance of the proposed model and mainstream models on these cases, are provided in Section 5.1.

To further assess the representation quality of the model, we train and test the models with two different datasets, a process known as cross-database experiments. When the similarity of the API call vocabularies is high, the representation formed by the training dataset must adapt to the representation of the testing dataset, a process referred to as Representation Adaptation. Conversely, when the vocabulary similarity is low, many API calls in the testing dataset would not have been encountered during training. This lack of significant feature representation poses challenges when evaluating on the testing dataset. We treat the API call vocabularies of the training and testing datasets as two minimally overlapping domains and apply the domain knowledge learned from the training dataset to the testing domain in a process termed Domain Adaptation. Domain adaptation places greater demands on generalization capabilities than representation adaptation, making it a considerable challenge for dynamic analysis models. Detailed accounts of the corresponding experiments and analyses are provided in Section 4.4 and Section 4.5.
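The distinction between the two settings rests on how much the API call vocabularies of two datasets overlap. The sketch below quantifies this with the Jaccard index and with the fraction of test-set API types unseen during training. Both measures, and the function names, are assumptions chosen for illustration; the text does not state which similarity measure was actually used.

```python
from typing import Iterable, List, Set


def vocabulary(sequences: Iterable[List[str]]) -> Set[str]:
    """Collect the distinct API names that occur in a dataset."""
    return {api for seq in sequences for api in seq}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Vocabulary similarity as |a & b| / |a | b| (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def unseen_fraction(train_vocab: Set[str], test_vocab: Set[str]) -> float:
    """Share of API types in the test vocabulary never seen in training."""
    return len(test_vocab - train_vocab) / len(test_vocab) if test_vocab else 0.0


train = [["NtCreateFile", "NtClose"], ["NtReadFile"]]
test = [["NtClose", "RegOpenKeyExW"]]
train_vocab, test_vocab = vocabulary(train), vocabulary(test)
similarity = jaccard(train_vocab, test_vocab)
unseen = unseen_fraction(train_vocab, test_vocab)
```

A high similarity would place a dataset pair in the representation adaptation setting, and a low one in the domain adaptation setting; any concrete threshold would be a further assumption.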

3.4 Design of Prompt Texts

Once a model reaches a certain scale, its performance significantly improves, demonstrating strong capabilities such as language comprehension, generation, and logical reasoning. Therefore, designing prompt texts that enable these large-scale models to exhibit such powerful capabilities is worth exploring. Wei et al. [42] suggested an enhanced strategy for generating prompt text, Chain of Thought (CoT). By providing auxiliary prompts for intermediate reasoning steps, CoT allows large models to tackle more complex problems. In this paper, the representation is created by GPT-4. Even though GPT-4 is not required to execute complex reasoning, it must paraphrase its learned knowledge while taking specific requirements into account. In this process, the prompt text directly influences the quality of the generated content. Consequently, we follow the rules below when designing the prompt text, and the comprehensive design process of the prompt text is depicted in Figure 5.

  • Identity Transformation. Yang et al.[43] demonstrated that assigning a specific identity and operating environment to GPT-4 improves its expression and reasoning, and thereby yields higher-quality generated content. Therefore, we ask GPT-4 to act as an experienced software security analyst capable of carrying out the malware analysis task with high quality.

  • Restricted Rules. We explicitly instruct GPT-4 not to generate redundant content such as “XXX is a Windows API sequence”. Such a statement applies to every API call and therefore cannot be used to differentiate them. Moreover, GPT-4 is required to present the generated content as natural text, without adding special symbols (e.g., “\n”, “\t”) or using unusual formats (e.g., Markdown format).

  • Length Limitation. We apply WordPiece tokenization to the generated text and bring the token sequence to a fixed length so that the representations produced from each text have the same shape. Thus, we explicitly require that the text created by GPT-4 be restricted to 100 words. Overly long text substantially increases the computation time and memory required.
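The three rules above can be sketched as a small prompt builder. The prompt wording below is an illustrative assumption and not the authors’ actual prompt text (their design is shown in Figure 5); `build_prompt` and its parameters are hypothetical names introduced only for this sketch.

```python
def build_prompt(api_name: str, max_words: int = 100) -> str:
    """Assemble a prompt that follows the three design rules (hypothetical wording)."""
    # Rule 1, identity transformation: assign GPT-4 an expert role.
    identity = (
        "You are an experienced software security analyst "
        "performing malware analysis."
    )
    # Rule 2, restricted rules: forbid redundant statements and special formatting.
    restrictions = (
        "Do not state that the name is a Windows API call. "
        "Answer in plain natural-language text without special symbols "
        "such as newlines or tabs, and without Markdown formatting."
    )
    # Rule 3, length limitation: cap the answer so it fits a fixed token length.
    length = f"Limit the answer to {max_words} words."
    task = f"Explain the behavior of the API call {api_name}."
    return " ".join([identity, restrictions, length, task])


prompt = build_prompt("RegQueryValueExW")
print(prompt)
```

Keeping each rule in its own string makes it easy to compare a direct prompt (only the task sentence) with the designed prompt, which is the comparison the paper reports.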

Refer to caption
Figure 5: The design process entails the creation of a prompt text, which guides GPT-4 in generating explanatory text of a higher quality.

Finally, we input the direct prompt text and the designed prompt text into GPT-4 separately. The comparison of the generated content is illustrated in Figure 4. Clearly, when GPT-4 is guided by the designed prompt text, the generated content is of superior quality, which ultimately improves the detection performance.

4 Experiments of Detection Models

Five benchmark datasets are employed to evaluate the performance of the proposed model. Two high-performance models (TextCNN and BiLSTM) are selected for further analysis based on their detection accuracy. To assess the generalization performance of the proposed model, the five datasets are divided into two groups according to the similarity of their API call vocabularies. Representation adaptation experiments train and test within the same group, while domain adaptation experiments train and test across different groups.

4.1 Experiment Settings

Implementation Details. All experiments in this paper are carried out on Ubuntu 20.04, using an RTX 4090 GPU and 24 GB of memory. Python 3.9 and PyTorch 2.0 are used to build the experimental model. Considering the limited GPU memory, the truncation length of the API sequence is set to 100, and the length of the embedded token sequence of the explanatory text is set to 102 (including the initial [CLS] and final [SEP] tokens). The batch size is set to 8 and the learning rate to 0.001, with the Adam optimizer.
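The two fixed lengths stated above can be illustrated with a minimal sketch. The lengths 100 and 102 come from the text; keeping the first 100 API calls, the `[PAD]` padding token, and the helper names are assumptions made for illustration, since the paper does not specify them here.

```python
from typing import List

MAX_API_CALLS = 100   # truncation length of the API sequence (from the text)
MAX_TOKENS = 102      # 100 text tokens plus [CLS] and [SEP] (from the text)


def fit_api_sequence(api_calls: List[str], max_len: int = MAX_API_CALLS) -> List[str]:
    """Keep at most max_len API calls from the start of the sequence (assumed policy)."""
    return api_calls[:max_len]


def fit_token_sequence(tokens: List[str], max_len: int = MAX_TOKENS) -> List[str]:
    """Wrap tokens in [CLS]/[SEP], then truncate or pad to exactly max_len."""
    body = tokens[: max_len - 2]                      # leave room for the two special tokens
    wrapped = ["[CLS]"] + body + ["[SEP]"]
    padding = ["[PAD]"] * (max_len - len(wrapped))    # padding token is an assumption
    return wrapped + padding


short = fit_token_sequence(["opens", "a", "registry", "key"])
long = fit_token_sequence(["tok"] * 500)
print(len(short), len(long))
```

In practice a WordPiece tokenizer would produce the tokens; the sketch only shows how every explanatory text ends up with the same length.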

Compared Models. The models used for comparison can be mainly grouped into several categories: RNN-based networks [30, 43, 8, 44], CNN-based networks [31], CNN+RNN-based networks [45, 33, 46], and Transformer-based networks [14, 15].

Datasets. To validate the effectiveness of the proposed model, it is trained or tested using five benchmark datasets of malware dynamic API call sequences: Aliyun [47], Catak [8], GraphMal [48], VirusShare [49], and VirusSample [49]. Based on the similarity of their respective API call vocabularies, the five datasets are divided into two sets:

$$\mathcal{D}_{base} = \{Aliyun, Catak, GraphMal\},$$
$$\mathcal{D}_{large} = \{VirusSample, VirusShare\}.$$

The vocabulary sizes of the datasets in $\mathcal{D}_{large}$ are significantly larger than those of the datasets in $\mathcal{D}_{base}$. Besides, $\mathcal{D}_{large}$ is more complex and contains abnormal contents. The descriptive statistics of these five datasets are displayed in Table I.

With regard to the Aliyun dataset, the proportion of malicious sequences is quite low, which could impede the recall of unseen malicious sequences. As a solution, the Aliyun and Catak datasets are merged to produce an expanded Aliyun+Catak dataset. The benign sequences of the Aliyun dataset form the benign class of this composite dataset, while the malicious sequences of all categories in the Aliyun dataset, together with all sequences from the Catak dataset, form the malicious class. Increasing the proportion of malicious sequences in the dataset is expected to enhance the model’s recall rate for unknown malicious sequences.
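The merging rule described above can be sketched as follows. The data layout (an API sequence paired with a category string), the category names in the toy data, and the function names are assumptions for illustration; only the rule itself (Aliyun benign stays benign, every other sequence becomes malicious) is taken from the text.

```python
from typing import List, Tuple

Sample = Tuple[List[str], int]  # (API sequence, label); 0 = benign, 1 = malicious


def merge_aliyun_catak(aliyun: List[Tuple[List[str], str]],
                       catak: List[List[str]]) -> List[Sample]:
    """Build the binary Aliyun+Catak dataset from the two sources."""
    merged: List[Sample] = []
    for sequence, category in aliyun:
        # Every non-benign Aliyun category collapses into one malicious class.
        merged.append((sequence, 0 if category == "benign" else 1))
    # Catak contains only malware, so all of its sequences are malicious.
    merged.extend((sequence, 1) for sequence in catak)
    return merged


def malicious_ratio(samples: List[Sample]) -> float:
    """Proportion of samples labelled malicious."""
    return sum(label for _, label in samples) / len(samples)


# Toy data; the category name "trojan" is a placeholder.
aliyun_toy = [(["NtOpenFile"], "benign"), (["WSASend"], "trojan"), (["NtClose"], "benign")]
catak_toy = [["LdrUnloadDll"], ["WSASendTo"]]
merged = merge_aliyun_catak(aliyun_toy, catak_toy)
print(malicious_ratio(merged))
```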

TABLE I: Statistics of the 5 benchmark datasets

Dataset | Proportion of benign | Proportion of malicious | Samples Amount | Vocabulary Size of API Call | Category Distribution
Aliyun | 64.15% | 35.85% | 13887 | 301 | 1 kind of benign and 7 kinds of malware
Catak | 0% | 100% | 7107 | 281 | 8 kinds of malware
Aliyun+Catak | 23.71% | 76.29% | 20994 | 304 | 1 kind of benign and 1 kind of malware
GraphMal | 2.46% | 97.54% | 43876 | 304 | 1 kind of benign and 1 kind of malware
VirusSample | 0% | 100% | 9795 | 7964 | all are malwares
VirusShare | 0% | 100% | 14616 | 23229 | all are malwares

4.2 Statistical Properties Analysis of Datasets

For a more intuitive understanding of each dataset, the statistical characteristics of the datasets used in the experiments are listed in Table I. We construct an API call vocabulary for each dataset and calculate the similarity between each pair of vocabularies. Intersection over Union (IoU) is adopted as the similarity measure, as defined in Eq. (6).

$$IoU(\mathcal{S}_{1}, \mathcal{S}_{2}) = \frac{|\mathcal{S}_{1} \cap \mathcal{S}_{2}|}{|\mathcal{S}_{1} \cup \mathcal{S}_{2}|}, \qquad (6)$$

where $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ denote the API call vocabularies of two different datasets. A higher IoU value indicates greater similarity between the two API call vocabularies.
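Eq. (6) translates directly into set operations. The sketch below is a minimal illustration; the helper names and the toy vocabularies are hypothetical, and returning 0 for two empty vocabularies is a convention chosen here rather than something the paper specifies.

```python
from typing import Iterable, List, Set


def build_vocabulary(sequences: Iterable[List[str]]) -> Set[str]:
    """Collect the distinct API calls that occur in a dataset."""
    return {api for seq in sequences for api in seq}


def iou(vocab_a: Set[str], vocab_b: Set[str]) -> float:
    """Intersection over Union of two API call vocabularies, as in Eq. (6)."""
    union = vocab_a | vocab_b
    if not union:
        return 0.0   # convention chosen here for two empty vocabularies
    return len(vocab_a & vocab_b) / len(union)


# Toy datasets: 2 shared API calls out of 5 distinct ones.
dataset_a = [["NtOpenFile", "NtClose"], ["WSASend", "NtOpenFile"]]
dataset_b = [["NtOpenFile", "WSASend"], ["WSASendTo", "LdrUnloadDll"]]
similarity = iou(build_vocabulary(dataset_a), build_vocabulary(dataset_b))
print(similarity)
```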

Refer to caption
Figure 6: The heatmap represents the Intersection over Union (IoU) values of API call vocabularies contained in different datasets.

The IoU values for the API call vocabulary of each dataset are depicted in Figure 6. There is a significant level of similarity among the Aliyun, Catak, and GraphMal datasets, whereas these three datasets exhibit markedly low similarity to the VirusSample and VirusShare datasets. This disparity is attributed to the different methods of dynamic feature extraction. Aliyun, Catak, and GraphMal predominantly record high-level API calls. By contrast, the VirusSample and VirusShare datasets have a more complex structure: they include not only high-level API calls but also many low-level API calls, some of which are anomalous. As a result, their API call vocabularies are considerably larger than those of the datasets in $\mathcal{D}_{base}$.

4.3 Performance of the Proposed Model

TABLE II: Comparison of the Performance of Different Detection Models.

Method | Type | Aliyun Multi(ACC) | Aliyun Binary(ACC) | Aliyun AUC | Catak Multi(ACC) | GraphMal Binary(ACC) | GraphMal AUC
BiLSTM[30] | RNN-based | 82.65% | 93.38% | 0.9813 | 49.51% | 99.38% | 0.9887
BiGRU[29] | RNN-based | 81.43% | 93.52% | 0.9825 | 49.65% | 99.45% | 0.9766
CatakNet[8] | RNN-based | 82.22% | 93.45% | 0.9791 | 49.09% | 98.81% | 0.9733
ZhangNet[44] | RNN-based | 77.75% | 89.85% | 0.9512 | 40.79% | 97.54% | /
Kolosnjaji[45] | CNN+RNN-based | 81.57% | 93.38% | 0.9751 | 45.15% | 99.32% | 0.9895
LiNet[33] | CNN+RNN-based | 79.12% | 93.74% | 0.9522 | 48.10% | 97.54% | 0.6010
Mal-ASSF[46] | CNN+RNN-based | 82.36% | 93.81% | 0.9815 | 48.66% | 98.98% | 0.9390
TextCNN[31] | CNN-based | 83.44% | 94.53% | 0.9847 | 47.96% | 99.36% | 0.9936
Transformer[14] | Transformer-based | 75.95% | 91.07% | 0.9643 | 37.83% | 98.59% | 0.9216
MalBert[15] | Transformer-based | 77.83% | 89.99% | 0.9579 | 38.82% | 97.49% | 0.5003
Embed3D+CNN | CNN-based | 82.29% | 94.53% | 0.9848 | 52.32% | 98.97% | 0.9905
Ours | CNN-based | 85.89% | 95.61% | 0.9923 | 62.03% | 99.45% | 0.9976

We compare the performance of the proposed model with SOTA models on three datasets (Aliyun, Catak, and GraphMal). To validate the effectiveness of the proposed representation generation method, we carry out an ablation study. Keeping the representation learning module unaltered, we employ an embedding layer to create the representation matrix of the API sequence, then duplicate this matrix to construct the representation tensor. This is designed to match the shape ($\in \mathbb{R}^{100*102*768}$) of the representation produced by the proposed method. This variant is denoted as Embed3D+CNN.
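The duplication step of the Embed3D+CNN ablation can be pictured with plain nested lists. This is only a sketch under assumptions: it supposes that the embedding matrix has shape 100 x 768 and is repeated 102 times along a new middle axis, which is one way to reach the stated 100*102*768 shape; the paper does not spell out the axis. An array library would normally be used instead of nested lists, and the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]          # [seq_len][dim]
Tensor3D = List[List[List[float]]]  # [seq_len][copies][dim]


def duplicate_to_tensor(matrix: Matrix, copies: int) -> Tensor3D:
    """Repeat each API-call embedding `copies` times along a new middle axis."""
    return [[list(row) for _ in range(copies)] for row in matrix]


def shape(tensor: Tensor3D) -> Tuple[int, int, int]:
    """Return (seq_len, copies, dim) of a non-empty nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Placeholder embedding matrix with the sizes stated in the text.
seq_len, copies, dim = 100, 102, 768
embedding_matrix = [[0.0] * dim for _ in range(seq_len)]
tensor = duplicate_to_tensor(embedding_matrix, copies)
print(shape(tensor))
```

Because every copy along the middle axis is identical, the tensor carries no more information than the original matrix, which is what makes it a fair shape-matched baseline for the text-derived representation.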

As shown in Table II, the proposed model demonstrates improved detection performance on all three datasets in comparison to SOTA methods. The integration of additional external knowledge during training enhances the performance of the model, although the extent of this improvement is not substantial. The method essentially uses a textual representation as the API call representation, which leaves a small discrepancy between the two types of representation. Although the convolutional neural network (CNN) adjusts the representation to bridge this gap, the outcome is limited by training on a finite dataset. Moreover, key API calls appear repeatedly across multiple training iterations, which allows other models to learn their representations and consequently achieve commendable detection results.

4.4 Representation Adaptation

In this experiment, the divergence in API call vocabulary between the training and testing sets is small, meaning that most API calls in the testing set already exist in the training set.

The focus of this experiment is to verify the generalization performance of the model, with the results displayed in Table V. The representation generated by the proposed method is of higher quality, with denser associations and enhanced stability. As a consequence, the detection performance of our method surpasses that of the others in cross-database experiments, affirming the generalization performance of our model. As for the other models, although their representation associations contain many zero values, the key associations have been learned from the training set; thus, they also exhibit a certain degree of detection ability and generalization. In the experiment where Aliyun is used as the training set and GraphMal as the test set, Aliyun contains fewer malware samples, leading to a lower recall rate of malware. Consequently, we introduce malware samples from the Catak dataset and use the combined Aliyun+Catak dataset for training. When the trained model is validated on GraphMal, both the recall rate of malware and the overall accuracy improve significantly.

4.5 Domain Adaptation

In this experiment, there is a significant divergence in API call vocabulary between the training set and the testing set. As a result, most API calls in the testing set have not been encountered during the training process. Numerous API calls present in $\mathcal{D}_{large}$ do not exist in $\mathcal{D}_{base}$. This disparity introduces significant complications for cross-database experiments.

However, the proposed method can generate explanatory text and the corresponding representation for an API call that was not encountered during the training process. This capability significantly reduces the impact of missing representations on the prediction performance of the model. Despite being trained on $\mathcal{D}_{base}$, the proposed method exhibits high detection performance on $\mathcal{D}_{large}$. The recall rate of malware is nearly 100%.

By contrast, when trained on $\mathcal{D}_{base}$, the other models display virtually no prediction ability on $\mathcal{D}_{large}$, except for those trained on the GraphMal dataset. The GraphMal dataset consists of about 98% malware samples, which biases predictions towards the malware category during testing. This bias is why the other methods achieve such a high recall rate in that setting.
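Two quantities help in reading the domain adaptation results: the recall on an all-malware test set and the share of test-set API calls never seen in training. The sketch below uses toy labels and vocabularies and hypothetical helper names; it shows that a model biased to always predict malware trivially reaches 100% recall on such a test set, which is the effect described for models trained on GraphMal.

```python
from typing import List, Set


def recall(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """Recall = TP / (TP + FN) for the positive (malware) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def unseen_ratio(train_vocab: Set[str], test_vocab: Set[str]) -> float:
    """Share of test-set API calls that never appear in the training vocabulary."""
    if not test_vocab:
        return 0.0
    return len(test_vocab - train_vocab) / len(test_vocab)


# An all-malware test set, as in VirusSample and VirusShare.
y_true = [1, 1, 1, 1]
always_malware = [1, 1, 1, 1]   # a biased model that never predicts benign
mixed_model = [1, 0, 1, 0]      # a model that misses half of the malware

print(recall(y_true, always_malware), recall(y_true, mixed_model))
print(unseen_ratio({"NtOpenFile", "NtClose"},
                   {"NtOpenFile", "WSASend", "WSASendTo", "LdrUnloadDll"}))
```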

5 Experiments of Representation Quality

Using the semantic chain similarity of API calls as a reference, we examine the representation quality of various models. Additionally, we investigate the performance of generating representations using API calls from differing sources. Ultimately, we assess the efficacy of the proposed model in addressing the phenomenon of concept drift.

5.1 Comparison of Representation Quality

TABLE III: Comparison of Different Models’ Representation Quality for API calls that share the same meaning but handle different character types.

API Call 1 | API Call 2 | Ours | TextCNN | BiLSTM
RegQueryValueExW | RegQueryValueExA | 0.8514 | -0.1253 | -0.1455
WSASocketW | WSASocketA | 0.7055 | -0.8257 | -0.4443
SetWindowsHookExW | SetWindowsHookExA | 0.9478 | 0.1864 | -0.0022
DeleteUrlCacheEntryW | DeleteUrlCacheEntryA | 0.8507 | 0 | 0
HttpOpenRequestW | HttpOpenRequestA | 0.7520 | 0 | 0

TABLE IV: Comparison of the Representation Quality of Different Models. This pertains to instances where the API calls possess identical semantic chains.

API Call 1 | API Call 2 | Semantic Chain: action | Semantic Chain: class | Semantic Chain: category | Ours | TextCNN | BiLSTM
NtUnloadDriver | LdrUnloadDll | Update | system | Unload | 0.7904 | 0 | 0
NtOpenDirectoryObject | NtOpenFile | Update | file | Open | 0.8733 | 0.0544 | -0.1711
WSASendTo | WSASend | Update | network | Send | 0.7568 | 0 | 0

In Section 3.3, we propose two case studies to measure the representation quality. The first is the wide-character and narrow-character case, in which the wide and narrow versions of the same function have different API names, yet their meanings are remarkably similar. The second is the semantic chain association case: if the semantic association chains of two API calls are the same, their meanings are regarded as similar as well. Some example API calls are analyzed with the two case studies, and the results are shown in Table III and Table IV, respectively.

Our proposed method generates denser representations and captures the associations between API calls more completely. Therefore, in both case studies, it accurately reflects the association between API call representations. By contrast, the representations of TextCNN and BiLSTM have to be obtained via dataset training, so the API call associations are constrained by the quality of the training datasets. As illustrated in Figure 7, approximately 85% of the association degrees in the trained API call representations fall within [-0.25, 0.25]. This limitation stems from the poor quality of the dataset; as a result, these methods have difficulty capturing a wider range of API call associations.
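The statistic discussed above can be sketched in a few lines. The sketch assumes that the association between two API calls is measured by the cosine similarity of their representation vectors, and it counts distinct pairs rather than every element of the flattened matrix; both choices, the toy vectors, and the helper names are assumptions made for illustration.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; an all-zero vector is treated as having no association."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # convention used here for a missing (all-zero) representation
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def weak_association_share(reps: Dict[str, List[float]], threshold: float = 0.25) -> float:
    """Fraction of distinct API-call pairs whose similarity lies in [-threshold, threshold]."""
    names = list(reps)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    weak = sum(1 for a, b in pairs if abs(cosine(reps[a], reps[b])) <= threshold)
    return weak / len(pairs) if pairs else 0.0


# Toy 2-D representations; real representations are far higher-dimensional.
toy = {
    "RegQueryValueExW": [1.0, 0.0],
    "RegQueryValueExA": [1.0, 0.0],
    "WSASend": [0.0, 1.0],
}
print(cosine(toy["RegQueryValueExW"], toy["RegQueryValueExA"]))
print(weak_association_share(toy))
```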

Refer to caption
Figure 7: The statistical distribution of API call association similarity. The API call association matrix (Figure 3) is first flattened, and the elements of the flattened matrix are arranged in ascending order. When the association lies within [-0.25, 0.25], we infer that there is virtually no association between the two API calls. For the TextCNN and BiLSTM models, approximately 85% of the API call representations lack any association.
TABLE V: Comparison of Model Performance in Representation Adaptation Experiments

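Training | Testing | Model | Precision | Recall | ACC
Aliyun | Catak | Ours | / | 51.48% | /
Aliyun | Catak | TextCNN | / | 50.44% | /
Aliyun | Catak | BiLSTM | / | 50.96% | /
Aliyun | GraphMal | Ours | 99.40% | 30.58% | 32.10%
Aliyun | GraphMal | TextCNN | 99.21% | 25.30% | 26.94%
Aliyun | GraphMal | BiLSTM | 98.75% | 17.59% | 19.40%
Aliyun+Catak | GraphMal | Ours | 98.31% | 62.07% | 61.96%
Aliyun+Catak | GraphMal | TextCNN | 98.24% | 54.09% | 54.27%
Aliyun+Catak | GraphMal | BiLSTM | 98.77% | 56.26% | 56.65%
GraphMal | Aliyun | Ours | 73.71% | 91.02% | 72.79%
GraphMal | Aliyun | TextCNN | 68.94% | 90.93% | 67.91%
GraphMal | Aliyun | BiLSTM | 69.52% | 94.49% | 69.89%
GraphMal | Catak | Ours | / | 99.97% | /
GraphMal | Catak | TextCNN | / | 85.87% | /
GraphMal | Catak | BiLSTM | / | 77.59% | /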

TABLE VI: Comparison of Model Performance in Domain Adaptation Experiments

Training | Testing | Recall (Ours) | Recall (TextCNN) | Recall (BiLSTM)
GraphMal | VirusSample | 99.15% | 98.97% | 94.97%
Aliyun+Catak | VirusSample | 100% | 17.75% | 51.07%
Aliyun | VirusSample | 99.04% | 24.19% | 54.98%
GraphMal | VirusShare | 99.90% | 99.88% | 99.97%
Aliyun+Catak | VirusShare | 100% | 37.70% | 72.94%
Aliyun | VirusShare | 94.53% | 24.99% | 29.19%

Refer to caption
Figure 8: The results of few-shot fine-tuning experiments for different models. (a), (b) and (c) are trained on the Aliyun, GraphMal, and Aliyun+Catak datasets, respectively, and tested on the GraphMal, Aliyun, and GraphMal datasets, respectively.

5.2 Comparison of Few-shot Learning

A straightforward approach to few-shot learning involves fine-tuning a model on a support set, with this model being based on one that has already undergone training. The model is then evaluated through a query set[50, 51, 52]. However, in practical applications, due to the significant data drift phenomenon within malware samples, the model trained in the present may not yield satisfactory predictive results on future malware.

In the few-shot learning experiment, different datasets are utilized for training and testing. Additionally, within these testing datasets, a limited number of samples are employed to fine-tune the trained model. As illustrated in Figure 8, the proposed model converges more quickly and shows superior fine-tuning in comparison to both TextCNN and BiLSTM.

Although the representation produced by the proposed method is frozen, it has both high quality and excellent generalization. Consequently, only the parameters of the subsequent modules need to be adjusted to adapt quickly to the new dataset distribution. For TextCNN and BiLSTM, by contrast, the parameters of the representation layer must also be adjusted when new samples are encountered. However, because of the limited sample size, the adjusted representation is of low quality, which impairs their fine-tuning performance.

5.3 Comparison of Explanatory Text Acquisition

To measure the quality of the explanatory text generated by our proposed method, we employ two explanatory text acquisition methods for comparison. The Document Retrieval (DR) method looks up API calls and collects their meanings from the Windows API reference manual published by the Office Training Center in China (http://www.office-cn.net/t/api/index.html?web.htm). This reference manual includes explanations and parameter interpretations for common API calls. The Internet Search (IS) method manually searches for API calls on the internet to obtain explanations or introductions. The results of these comparisons are presented in Table VII.

TABLE VII: Performance comparison of different explanatory text acquisition methods. Missing Rate is the proportion of API calls for which no explanatory text could be obtained, relative to the total number of API calls. ACC denotes the accuracy validated on the Aliyun dataset.

Method | Missing Rate ↓ | Length | ACC ↑
IS | 40.79% | 12.86 | 81.79%
DR | 78.62% | 244.6 | 83.15%
Proposed | 0% | 93.7 | 85.89%

For certain API calls, the explanatory text cannot be obtained using the DR and IS methods, resulting in a higher missing rate than that of the proposed method. Additionally, the explanatory text from the DR method is too long, while the IS method produces overly short text, leading to moderate detection results. The proposed method, however, can generate explanatory texts for all API calls. Owing to the extensive knowledge stored in GPT-4, the explanatory texts are of high quality, and the word count is effectively controlled, leading to improved detection performance. The proposed method also dramatically lowers the manpower and time costs of explanatory text retrieval and eliminates the need for text preprocessing operations.
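The Missing Rate in Table VII follows directly from its definition. The sketch below also computes an average text length, assuming that Length is counted in words, which the table does not state; the toy texts and the helper names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def missing_rate(api_calls: Iterable[str], texts: Dict[str, Optional[str]]) -> float:
    """Proportion of API calls for which no explanatory text was obtained."""
    calls = list(api_calls)
    if not calls:
        return 0.0
    missing = sum(1 for api in calls if not texts.get(api))
    return missing / len(calls)


def mean_length(texts: Dict[str, Optional[str]]) -> float:
    """Average word count over the API calls that do have a text."""
    lengths = [len(t.split()) for t in texts.values() if t]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Toy vocabulary and retrieved texts; None or "" marks a failed lookup.
vocabulary = ["NtOpenFile", "WSASend", "LdrUnloadDll", "NtClose"]
retrieved = {
    "NtOpenFile": "opens a file",
    "WSASend": None,
    "LdrUnloadDll": "",
    "NtClose": "closes an open object handle",
}
print(missing_rate(vocabulary, retrieved), mean_length(retrieved))
```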

The relationship between the length of the explanatory text and the model’s performance is explored in Figure 9. If the explanatory text is too short, it may not adequately describe the API calls. Conversely, once the text is long enough, accuracy saturates: further increases in text length do not improve detection performance but consume more computation time and memory (Figure 10). Overly long texts may also introduce redundant information that degrades the model’s detection performance. It is worth noting that changes in the length of the explanatory text do not cause significant fluctuations in detection performance. Therefore, the length of the explanatory text is not a sensitive parameter.
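A back-of-the-envelope estimate shows why the text length matters for storage. The sketch uses the representation shape given earlier (API sequence length x token length x 768) and assumes 32-bit floating-point values; it covers only the input representation, not the FLOPs of the network, and the function names are hypothetical.

```python
def representation_elements(seq_len: int, text_len: int, hidden: int = 768) -> int:
    """Number of values in one sample's representation tensor."""
    return seq_len * text_len * hidden


def representation_bytes(seq_len: int, text_len: int, hidden: int = 768,
                         bytes_per_value: int = 4, batch_size: int = 1) -> int:
    """Storage for a batch, assuming float32 values (4 bytes each)."""
    return representation_elements(seq_len, text_len, hidden) * bytes_per_value * batch_size


# Settings from the paper: 100 API calls, 102 tokens, batch size 8.
base = representation_bytes(100, 102, batch_size=8)
# Doubling the token length doubles the storage of the input representation.
doubled = representation_bytes(100, 204, batch_size=8)
print(base / 2**20, doubled / base)
```

The linear growth in both the sequence length and the text length is consistent with the trend that Figure 10 reports for memory use.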

TABLE VIII: For identical API calls, explanations of both the earlier version and the current version are provided by GPT-4.

API Call: GetVersionEx
Earlier Version: In early versions of Windows, the ‘GetVersionEx‘ function could be used to obtain detailed operating system version information, including the major version number, minor version number, build number, platform ID, and additional version information (returned via other members of the ‘OSVERSIONINFOEX‘ structure).
Current Version: However, starting with Windows 8.1 and Windows Server 2012 R2, the behavior of the ‘GetVersionEx‘ function changed. If an application that calls ‘GetVersionEx‘ does not have a manifest declaring its compatibility with Windows 8.1 or higher, then the function will return version information for the highest version of Windows with which the application is compatible, rather than the actual version of the operating system on which it is running. This is because Microsoft wants to encourage developers to program for features, not for operating system versions.

API Call: CreateWindowEx
Earlier Version: In earlier versions, it was primarily used to create a window with specified styles, name, position, and size.
Current Version: However, over time, the functionality of the CreateWindowEx function has been expanded and it now includes more parameters and options to support more complex window creation requirements. The CreateWindowEx function has added some new parameters, such as an extended window style parameter (dwExStyle). This parameter allows developers to set some advanced window styles, such as transparent windows, tool windows, and windows with shadows. This means that the modern CreateWindowEx function offers more flexibility and developers can use it to create more complex windows.

API Call: CreateProcess
Earlier Version: In Windows XP and earlier versions, the CreateProcess function directly creates a new process from the specified command-line argument. This function does not check whether the executable to be created contains a manifest. A manifest is an XML file that describes one or more assemblies, including name, version number, public key token, etc.
Current Version: However, starting from Windows Vista, the behavior of the CreateProcess function has changed. Now, when you call the CreateProcess function, it first checks whether the specified executable file has a manifest. If there is a manifest, CreateProcess uses the information in the manifest to create a new process. This may result in a different behavior of the CreateProcess function in newer Windows versions if the executable file contains a manifest.

Refer to caption
Figure 9: The impact of explanatory text length on the model’s detection performance.
Refer to caption
Figure 10: The correlation among FLOPs, API sequence length and explanatory text length. As the length of the API sequence and explanatory text increases, FLOPs markedly rise, necessitating more GPU memory space for storage, along with an increase in computation time.

5.4 Analysis of Concept Drift Alleviation

The phenomenon of data distribution evolution over time, which impacts the detection performance of models, is known as concept drift [53, 54]. An effective method to address this situation is incremental learning. By learning from new data, detection models can recognize evolving data distributions and enhance their capacity to detect novel samples. This phenomenon is particularly noticeable in the realm of malware API call behavior. Changes, such as the introduction of new API calls and updates to existing API calls, can influence the model’s detection performance.

In Section 4.5, $\mathcal{D}_{large}$ introduces some new API calls compared to $\mathcal{D}_{base}$. With the assistance of GPT-4, our method can generate representations for unknown or newly introduced API calls and achieve an excellent recall rate of malware, thereby demonstrating the ability to handle concept drift to a certain degree. Besides, it can also monitor the latest interpretations of API calls. Given that the knowledge obtained through GPT-4 is continually updated, this is advantageous for ongoing learning to manage the concept drift phenomenon. Furthermore, it allows the tracking of API call explanations over specific periods, with some examples shown in Table VIII.
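One way to picture how newly introduced API calls are handled is a lookup that generates an explanation only when an API call has not been seen before. The sketch below is a hypothetical illustration: `explanation_for` and `stub_generator` are invented names, the cached text is toy data, and the stub stands in for a GPT-4 query without making any network request.

```python
from typing import Callable, Dict


def explanation_for(api_name: str,
                    cache: Dict[str, str],
                    generate: Callable[[str], str]) -> str:
    """Return a cached explanation, generating one only for unseen API calls."""
    if api_name not in cache:
        cache[api_name] = generate(api_name)
    return cache[api_name]


def stub_generator(api_name: str) -> str:
    # Placeholder standing in for a GPT-4 query; it makes no network request.
    return f"Placeholder explanation for {api_name}."


# Toy cache holding one previously generated explanation.
cache = {"NtOpenFile": "Opens an existing file, directory, or device."}
known = explanation_for("NtOpenFile", cache, stub_generator)
new = explanation_for("WSASendTo", cache, stub_generator)
print(known)
print(new)
```

Refreshing the cache periodically would be one way to track changing explanations of the same API call over time, in the spirit of the examples in Table VIII.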

6 Conclusion

This paper proposes a training-free representation generation method based on GPT-4 prompts. We first design the prompt text to guide GPT-4 to generate the explanatory text of each API call, then apply the pre-trained BERT to generate the representation of each explanatory text, and finally construct a CNN-based module to learn from the representation, thereby achieving excellent detection performance. The generation of this representation does not rely on training with malware datasets and, theoretically, it can produce a representation for any API call. Consequently, this method can effectively address the issues of weak generalization and concept drift. The detection performance of the proposed model, particularly its generalization capacity, improves upon that of SOTA models.

In future work, we aim to collect more datasets of malware representation and analyze them. Through this, we strive to provide a solid foundation for the creation of a large-scale model specifically designed for malware representation and detection.

References

  • [1] N. Guizani and A. Ghafoor, “A network function virtualization system for detecting malware in large iot based networks,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 6, pp. 1218–1228, 2020.
  • [2] A. Amira, A. Derhab, E. B. Karbab, and O. Nouali, “A survey of malware analysis using community detection algorithms,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–29, 2023.
  • [3] M. Gopinath and S. C. Sethuraman, “A comprehensive survey on deep learning based malware detection techniques,” Computer Science Review, vol. 47, p. 100529, 2023.
  • [4] D. Uppal, R. Sinha, V. Mehra, and V. Jain, “Malware detection and classification based on extraction of api sequences,” in 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI).   IEEE, 2014, pp. 2337–2342.
  • [5] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, “Malware classification with recurrent networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 1916–1920.
  • [6] B. Athiwaratkun and J. W. Stokes, “Malware classification with LSTM and GRU language models and a character-level CNN,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 2482–2486.
  • [7] S. Maniath, A. Ashok, P. Poornachandran, V. Sujadevi, P. S. AU, and S. Jan, “Deep learning LSTM based ransomware detection,” in 2017 Recent Developments in Control, Automation & Power Engineering (RDCAPE).   IEEE, 2017, pp. 442–446.
  • [8] F. O. Catak, A. F. Yazı, O. Elezaj, and J. Ahmed, “Deep learning based sequential model for malware analysis using windows exe API calls,” PeerJ Computer Science, vol. 6, p. e285, 2020.
  • [9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [11] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [13] H. Yao, J. Lou, K. Ren, and Z. Qin, “Promptcare: Prompt copyright protection by watermark injection and verification,” in IEEE Symposium on Security and Privacy (S&P).   IEEE, 2024.
  • [14] F. Demirkıran, A. Çayır, U. Ünal, and H. Dağ, “An ensemble of pre-trained transformer models for imbalanced multiclass malware classification,” Computers & Security, vol. 121, p. 102846, 2022.
  • [15] Z. Xu, X. Fang, and G. Yang, “Malbert: A novel pre-training method for malware detection,” Computers & Security, vol. 111, p. 102458, 2021.
  • [16] OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [17] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
  • [18] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [19] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “GLM: General language model pretraining with autoregressive blank infilling,” in the 60th Annual Meeting of the Association for Computational Linguistics, 2022, pp. 320–335.
  • [20] M. Alazab, S. Venkataraman, and P. Watters, “Towards understanding malware behaviour by the extraction of API calls,” in 2010 Second Cybercrime and Trustworthy Computing Workshop. IEEE, 2010, pp. 52–59.
  • [21] S. Gupta, H. Sharma, and S. Kaur, “Malware characterization using Windows API call sequences,” in Security, Privacy, and Applied Cryptography Engineering: 6th International Conference. Springer, 2016, pp. 271–280.
  • [22] C. Ravi and R. Manoharan, “Malware detection using Windows API sequence and machine learning,” International Journal of Computer Applications, vol. 43, no. 17, pp. 12–16, 2012.
  • [23] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on API call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, p. 659101, 2015.
  • [24] A. Sami, B. Yadegari, H. Rahimi, N. Peiravian, S. Hashemi, and A. Hamze, “Malware detection based on mining API calls,” in the 2010 ACM Symposium on Applied Computing, 2010, pp. 1020–1025.
  • [25] A. Pektaş and T. Acarman, “Malware classification based on API calls and behaviour analysis,” IET Information Security, vol. 12, no. 2, pp. 107–117, 2018.
  • [26] B. Anderson, D. Quist, J. Neil, C. Storlie, and T. Lane, “Graph-based malware detection using dynamic analysis,” Journal in Computer Virology, vol. 7, pp. 247–258, 2011.
  • [27] P. Shijo and A. Salim, “Integrated static and dynamic analysis for malware detection,” Procedia Computer Science, vol. 46, pp. 804–811, 2015.
  • [28] R. Islam, R. Tian, L. M. Batten, and S. Versteeg, “Classification of malware based on integrated static and dynamic features,” Journal of Network and Computer Applications, vol. 36, no. 2, pp. 646–656, 2013.
  • [29] L. Yuan, Z. Zeng, Y. Lu, X. Ou, and T. Feng, “A character-level BiGRU-attention for phishing classification,” in Information and Communications Security: 21st International Conference, ICICS 2019. Springer, 2020, pp. 746–762.
  • [30] D. Dang, F. Di Troia, and M. Stamp, “Malware classification using long short-term memory models,” arXiv preprint arXiv:2103.02746, 2021.
  • [31] B. Qin, Y. Wang, and C. Ma, “API call based ransomware dynamic detection approach using TextCNN,” in 2020 International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE). IEEE, 2020, pp. 162–166.
  • [32] Y. Kim, “Convolutional neural networks for sentence classification,” in the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014. ACL, 2014, pp. 1746–1751.
  • [33] C. Li, Q. Lv, N. Li, Y. Wang, D. Sun, and Y. Qiao, “A novel deep framework for dynamic malware detection based on API sequence intrinsic features,” Computers & Security, vol. 116, p. 102686, 2022.
  • [34] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, 2020.
  • [35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
  • [36] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in the 57th Conference of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 2978–2988.
  • [37] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 5754–5764.
  • [38] A. Rahali and M. A. Akhloufi, “MalBERT: Malware detection using bidirectional encoder representations from transformers,” in 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2021, pp. 3226–3231.
  • [39] D. Demirci, C. Acarturk et al., “Static malware detection using stacked BiLSTM and GPT-2,” IEEE Access, vol. 10, pp. 58 488–58 502, 2022.
  • [40] A. Rahali and M. A. Akhloufi, “MalBERTv2: Code aware BERT-based model for malware identification,” Big Data and Cognitive Computing, vol. 7, no. 2, p. 60, 2023.
  • [41] M. A. Ferrag, M. Ndhlovu, N. Tihanyi, L. C. Cordeiro, M. Debbah, and T. Lestable, “Revolutionizing cyber threat detection with large language models,” arXiv preprint arXiv:2306.14263, 2023.
  • [42] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
  • [43] Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, “The dawn of LMMs: Preliminary explorations with GPT-4V(ision),” arXiv preprint arXiv:2309.17421, 2023.
  • [44] Z. Zhang, P. Qi, and W. Wang, “Dynamic malware analysis with feature engineering and feature learning,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, vol. 34, no. 01, 2020, pp. 1210–1217.
  • [45] B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, “Deep learning for classification of malware system call sequences,” in Advances in Artificial Intelligence: 29th Australasian Joint Conference. Springer International Publishing, 2016, pp. 137–149.
  • [46] S. Zhang, J. Wu, M. Zhang, and W. Yang, “Dynamic malware analysis based on API sequence semantic fusion,” Applied Sciences, vol. 13, no. 11, p. 6526, 2023.
  • [47] Alibaba Cloud, “Alibaba cloud malware detection based on behaviors,” 2018. [Online]. Available: https://tianchi.aliyun.com/getStart/information.htm?raceId=231694 (accessed 11 November 2018).
  • [48] A. Oliveira and R. Sassi, “Behavioral malware detection using deep graph convolutional neural networks,” TechRxiv, preprint, 2019.
  • [49] khas ccip, “API sequences malware datasets,” 2021, 2023-10. [Online]. Available: https://github.com/khas-ccip/api_sequences_malware_datasets
  • [50] Y. Chai, L. Du, J. Qiu, L. Yin, and Z. Tian, “Dynamic prototype network based on sample adaptation for few-shot malware detection,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 5, pp. 4754–4766, 2022.
  • [51] K. Tran, H. Sato, and M. Kubo, “Mannware: A malware classification approach with a few samples using a memory augmented neural network,” Information, vol. 11, no. 1, p. 51, 2020.
  • [52] P. Wang, Z. Tang, and J. Wang, “A novel few-shot malware classification approach for unknown family recognition with multi-prototype modeling,” Computers & Security, vol. 106, p. 102273, 2021.
  • [53] N. Lu, G. Zhang, and J. Lu, “Concept drift detection via competence models,” Artificial Intelligence, vol. 209, pp. 11–28, 2014.
  • [54] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under concept drift: A review,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 12, pp. 2346–2363, 2018.