
An Attempt toward Authorship Analysis of Obfuscated .NET Binaries.

1 INTRODUCTION

Authorship attribution is the task of assigning programs of unknown origin to their correct authors. It has been studied extensively for source code and binary files, but little work has addressed authorship attribution of obfuscated code or binaries. Since software developers commonly protect their code and binaries against reverse engineering with various obfuscation techniques, it is rare to encounter a clean binary in a malware analysis case, or even in a commercial program. These techniques can also be used to compress binaries and reduce their compiled footprint. Code obfuscation is a program transformation technique that makes the transformed program harder to understand and reverse engineer while preserving its functionality.

To carry out accurate authorship attribution, previous code or binary samples of the intended programmers must be collected and analyzed to identify unique patterns for each programmer. These patterns may include the author's code stylometry or features extracted from the author's binaries.

Using the extracted features, a classifier is typically trained and tested to evaluate its performance. Authorship attribution has several applications, including plagiarism detection, assisting with legal disputes over the authorship of documents and programs in a court of law, identification of cybercriminals after a cyber-attack (e.g., a malware infection), and other forensic problems.

2 BACKGROUND

Microsoft .NET provides a rich and convenient framework known as the Class Library, which offers tested, reusable code that programmers can call from their applications. At the heart of the .NET platform is the Common Language Runtime (CLR), the execution engine that handles running applications. As shown in Figure 2, high-level source code written in one of the supported languages (such as C#, VB.NET, Visual C++.NET, or F#) is first compiled to an intermediate assembly code known as the Common Intermediate Language (CIL). This design makes routines written in one language accessible to other languages, so programmers can focus on creating applications in their preferred language. Finally, the Just-in-Time (JIT) compiler translates the CIL code to native code compatible with the underlying operating system for execution. The JIT compiler typically considers factors such as the operating system and hardware configuration to produce an optimal executable for each platform.

The procedure above is known as the managed code execution model, but the .NET Framework also supports unmanaged code execution, in which the high-level code is compiled directly to native code.

Compiling a .NET project produces an assembly that includes intermediate language (CIL) instructions, managed resources, and metadata describing the types, methods, properties, fields, and events in the assembly. The metadata and the high-level nature of CIL instructions make it possible to understand the assembly structure and the method instructions, and to decompile them back to high-level source code. In many cases the generated source code looks similar to the original source fed to the compiler: it lacks formatting and comments, but it contains all the types and routines [18]. This information can greatly simplify a reverse engineering task.

To analyze packed binaries or malware, a good understanding of the obfuscation techniques is a necessity. Packers and obfuscators are designed to make reverse engineering of a protected file more challenging.

Typically, the first step in analyzing an executable file is to extract the program's machine-language (assembly) code; the executable section of the binary is of particular interest. For a normal binary, compilers usually place the executable code in a .text section that is clearly marked as the only executable section of a typical PE file. For a binary that is not packed or obfuscated, the code bytes can also be extracted from the program's memory image at any point during its execution, since the code does not change at runtime and the PE file format indicates which sections of the program are executable [27].

In the case of a protected binary, code extraction is more difficult because the binary may create and overwrite code at run-time.

2.1 Obfuscation Techniques

As mentioned before, the MSIL and metadata in assemblies provide enough information to recover the original code quickly. To mitigate this problem, programmers often employ obfuscation techniques that transform a .NET assembly (without affecting its functionality) so that it becomes difficult or impossible for a software cracker to reverse engineer it. This section briefly explains the most important obfuscation techniques that current tools [1] use to protect .NET binaries.

2.1.1 Name Obfuscation

Name obfuscation is the most basic method and is used by most .NET obfuscator tools. It renames namespaces, class names, method signatures, and fields, as well as method implementations and string values in an assembly. Name obfuscation makes the decompiled source harder to understand, but the overall flow of the code is not obscured. The new names may consist of numbers, characters from non-Latin scripts, or unprintable or invisible characters, and the same name may be reused multiple times in a scope via overloading (Figure 3). While proper names are technically not required to execute the assembly, the resulting assembly may be unverifiable [18].
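
To illustrate the idea outside .NET, here is a minimal Python sketch of the before/after effect (the identifiers are invented for illustration):

    # Original: identifier names document intent.
    def average_score(scores):
        return sum(scores) / len(scores)

    # After name obfuscation: same behavior, meaningless identifiers.
    def a(b):
        return sum(b) / len(b)

    assert average_score([1, 2, 3]) == a([1, 2, 3])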

2.1.2 Control Flow Obfuscation

Many advanced decompilers can reconstruct the code, including the exact structure of loops, if-else statements, and method calls, which makes it very easy to reverse engineer the executable. Control flow obfuscation changes the execution path of a process. Many .NET obfuscators replace the CIL instructions produced by the compiler with numerous "goto" commands and other instructions that may not decompile into valid source code. This process may affect the runtime performance of a method [18], and the final result is code that is semantically equivalent to the original but contains no clues as to how the code was originally written [29].
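
One common realization of this idea is control-flow flattening, in which structured code is rewritten as a state-machine dispatch loop. A minimal Python sketch of the transformation (not the output of any particular .NET obfuscator):

    # Original, structured version.
    def sum_to(n):
        total = 0
        for i in range(n):
            total += i
        return total

    # Flattened version: the loop structure is hidden behind a state
    # variable and a dispatch loop, yet behavior is identical.
    def sum_to_flat(n):
        state, i, total = 0, 0, 0
        while True:
            if state == 0:          # initialize
                i, total, state = 0, 0, 1
            elif state == 1:        # loop test
                state = 2 if i < n else 3
            elif state == 2:        # loop body
                total, i, state = total + i, i + 1, 1
            else:                   # exit
                return total

    assert sum_to(10) == sum_to_flat(10)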

2.1.3 String Encryption

In a managed .NET binary, all strings are identifiable and readable. Literal strings often contain sensitive information such as login information, passwords, SQL queries, error messages displayed to the user, and algorithm parameters, and these strings can be traced back to the code that uses them. String encryption works by modifying or encrypting all strings in the assembly and restoring their original values at runtime. The algorithm that decodes the data is always included in the obfuscated assembly. This process may affect the runtime performance of the program, either once at startup or at every string usage [18].
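
As a language-neutral sketch of the mechanism (a toy XOR scheme, not the cipher of any specific obfuscator):

    KEY = 0x5A

    def _decode(blob):
        # Runtime decoder shipped inside the protected program: XOR each
        # byte with the key to restore the original literal.
        return bytes(b ^ KEY for b in blob).decode("utf-8")

    # An obfuscator would embed only the pre-encoded bytes; they are
    # computed inline here just to keep the demo self-contained.
    _ENC = bytes(c ^ KEY for c in b"SELECT * FROM users")

    def build_query():
        return _decode(_ENC)

    print(build_query())  # prints: SELECT * FROM users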

2.1.4 Method Call Hiding/Redirection

Because of the way CIL instructions work, references to external types and methods remain visible and are unaffected by name obfuscation and control flow obfuscation. Even without meaningful names, the fact that a method uses certain framework classes, such as I/O, networking, or cryptography, can draw attention to it. Calls to such telltale methods can be redirected through a generated method that merely wraps the original call. This wrapper method can be renamed, so the called method's name no longer appears in the obfuscated method body [18] (Figure 4). This provides very strong obfuscation and makes it hard to determine when, where, and how such methods are used. The Just-In-Time (JIT) compiler often inlines such short wrapper methods, so runtime performance is not affected.
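
The following Python sketch mirrors the pattern: the call site references only an opaquely named proxy that forwards to the real API (the names here are invented for illustration):

    import hashlib

    # Generated proxy: after renaming, nothing at the call site reveals
    # that cryptographic hashing is in use.
    def _p1(data):
        return hashlib.sha256(data).hexdigest()

    def process(record):
        # The obfuscated body calls only the opaque wrapper, never hashlib.
        return _p1(record)

    print(process(b"hello"))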

2.1.5 Resource Encryption

A .NET product may include resources such as icons, sounds, and images. Several commercial (1) and free (2) tools exist that can extract resources from a .NET binary, and such resources often contain sensitive or copyrighted information. Resource encryption encrypts all such resources so that they cannot be extracted from the assembly.

2.1.6 ILDASM Protection

ILDASM (the Microsoft IL Disassembler) is a free tool that disassembles any .NET assembly into MSIL (Microsoft Intermediate Language). ILDASM protection modifies the assembly in such a way that ILDASM refuses to disassemble it. The protection is based on injecting a piece of code into the assembly that causes ILDASM to crash and be unable to handle the binary.

2.1.7 Tamper Detection

Tamper detection inserts code that verifies the application's integrity at runtime. If tampering is detected, the application can shut itself down, invoke random crashes (to disguise the fact that the crash was the result of a tamper check), or perform any other custom action [29]. Tamper-proofing technology mainly encrypts string and binary resources to protect them from unauthorized copying, modification, spoofing, and malicious code injection.
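
A minimal sketch of the integrity-check idea, assuming the expected digest is embedded at protection time (the constant below is a placeholder):

    import hashlib
    import sys

    # Placeholder: the protector would embed the real digest of the
    # shipped binary here at protection time.
    EXPECTED_DIGEST = "0000...placeholder..."

    def verify_integrity(path):
        # Hash the program file on disk and compare with the embedded value.
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        return actual == EXPECTED_DIGEST

    if not verify_integrity(sys.argv[0]):
        sys.exit(1)  # or crash randomly later, to disguise the check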

2.1.8 Anti-debugging Obfuscation

This technique uses several heuristic tests to detect whether the binary is running under a debugger or tracer. If a debugger is detected, an exception is generally thrown and the program terminates, providing a strong defense against crackers and hackers trying to debug or trace the binary for various purposes [28].
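
One simple heuristic of this kind, sketched in Python rather than .NET: check whether a trace hook is installed and abort if so.

    import sys

    def ensure_not_traced():
        # sys.gettrace() returns a non-None hook when a tracer or debugger
        # (e.g., pdb, coverage tools) is attached to this thread.
        if sys.gettrace() is not None:
            raise RuntimeError("debugger detected")

    ensure_not_traced()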

2.1.9 Code and Data Virtualization

Code virtualization translates the .NET bytecode into an entirely unrecognizable, randomly generated op-code sequence that still functions perfectly at runtime. It changes the CIL code into virtual op-codes that are only understood by a secure virtual machine. As opposed to protecting CIL code through encryption, where the encrypted code must be decrypted back into CIL before the CLR can execute it, code virtualization uses a virtual machine that directly processes the protected code in the form of a virtual machine language. Code virtualization is by far the strongest protection method available today because it implements a one-way code transformation: the code is never translated back to its original form; instead, the virtual machine emulates the original code's behavior. Code virtualization can significantly degrade performance and makes debugging very difficult [18].
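
A toy Python illustration of the concept (the three-instruction virtual ISA below is invented): the "protected" logic exists only as custom op-codes that a bundled interpreter executes.

    # Invented virtual ISA: (op, operand) pairs that only this bundled
    # interpreter understands; no CIL/native form of the logic remains.
    PUSH, ADD, PRINT = 0, 1, 2

    def run(program):
        stack = []
        for op, arg in program:
            if op == PUSH:
                stack.append(arg)
            elif op == ADD:
                stack.append(stack.pop() + stack.pop())
            elif op == PRINT:
                print(stack.pop())

    # Virtualized equivalent of: print(2 + 3)
    run([(PUSH, 2), (PUSH, 3), (ADD, None), (PRINT, None)])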

2.1.10 Other methods of protecting binaries

There are several other techniques that programmers may use to protect their code; Roundy and Miller [27] surveyed the binary-level obfuscation techniques in use. For instance, malware is known to use code packing to protect itself from static analysis and reverse engineering. Malware writers apply packing tools to create a packed version of a given regular binary. A packed binary carries a payload of encrypted or compressed original code, which an unpacker routine restores into the program's address space at run-time.

In general, once a packed program is executed, the packed code is loaded into memory. The original entry point (OEP) is the first instruction of the original program, the point from which its execution would normally begin. In a packed executable, the entry point instead leads to the start of the unpacking routine, because the original code cannot run until it has been unpacked; when the unpacking routine finishes, execution jumps to the first instruction of the original program. For example, UPX (Ultimate Packer for eXecutables) (3) is a free, open-source executable packer that mainly compresses executables rather than obfuscating them; authors may use it so that their program loads into memory faster thanks to the smaller file size. As shown in Figure 5, the UPX packer consolidates all sections of a PE file into a single section called UPX1.

This section also contains the unpacking routine. There are two other sections: UPX0, an uninitialized space into which the unpacking routine in UPX1 decompresses the packed code at run-time, and UPX2, which holds data and the import table.

3 RELATED WORKS

I could not find any published work on the authorship attribution of obfuscated .NET binaries. Most earlier work on software authorship attribution relies mainly on features obtained from the source code. Attributing authorship of source code with unknown authors has been studied extensively for several decades, but less so for binaries. In prominent studies such as [14] and [2], the authors used machine learning techniques to correlate syntax-based features with authorship and identify the author of program binaries. In [3], the authors analyzed the effects of compiler optimization (at three levels), removal of symbol information, and basic binary obfuscation methods (such as instruction replacement and control flow graph obfuscation) on several features obtained mainly from disassembling and decompiling the executable binaries (e.g., token n-grams and features derived from the abstract syntax tree).

The rest of this section briefly reviews earlier studies on authorship attribution of source code and binaries.

On November 2, 1988, many computers on the Internet were infected with a worm program. Spafford [4] later analyzed the reverse-engineered malware: the coding style and methods used in the program were manually examined, and conclusions were drawn about the author's abilities and intent. Following this experience, Spafford and Weeber [5] suggested that software forensics could feasibly analyze the remnants of software after a malware attack and identify its author. They investigated two cases where code remnants might be examined: executable code and source code. They argued that executable code, even if optimized, still contains many features, such as data structures, algorithms, and compiler and system information, that may help identify the author.

In a recent survey [6], previous attempts at attributing authorship of normal source code are categorized by two attributes: the source code metrics used for classification, either strings of n tokens/bytes (n-grams) or object-oriented metrics such as the number of classes, interfaces, etc.; and the classification technique that exploits those features, either information retrieval ranking or machine learning. The results of existing studies, however, are not directly comparable, as they all use different test beds and evaluation methodologies, making it difficult to assess which approach is superior.

In studies such as [7], [8], and [9], the authors used features such as programming layout metrics (e.g., use of white space, placement of brackets, number of lines, proportion of uppercase characters, and the number of instances of a particular feature per line of code), programming style metrics (e.g., average comment length and average variable-name length), and programming structure metrics (such as average function length and usage of common data structures). Moreover, in [10] and [11], the authors used features such as the source code's cyclomatic complexity and Halstead's metrics for software forensics.

In [9], the authors used a combination of 56 object-oriented metrics and other structural features for source code authorship attribution of Java programs. Some of the most contributive metrics in their study include: the percentage of open braces ({) that are the last character in a line, the average indentation in white spaces after open braces, the average indentation in tabs after open braces, the percentage of condition lines where the statement is on the same line as the condition, the average white space to the left of operators, and so on.

Burrows et al. [12] used source code n-grams as their feature set. They applied an information retrieval model to determine the author of a document, treating it as a search engine query against a corpus with known authors. Their approach produces a list of all documents in the corpus ranked by estimated similarity to the query document. Ideally, every other corpus document by the query's author would appear at the top of the ranked list, followed by all other documents.

In [13], the authors created a new feature set that reflects a programmer's coding style, derived mainly from abstract syntax trees (AST). They argue that the AST (generated from the source code using a fuzzy parser) represents the structure of a program and is obfuscation resistant, and can therefore be used to identify the author of obfuscated code. By extracting layout and lexical features from source code and combining them with features derived from the AST, such as the average node depth and the nodes' TF-IDF (term frequency-inverse document frequency), they claimed 99% classification accuracy with 36 authors, each with ten files.

In another study [14], the authors first extracted op-code idioms (all possible sequences of one to three op-codes), graphlets (three-node subgraphs of the control flow graph), supergraphlets, call graphlets, byte-level n-grams, and library calls for each binary as their feature set. They then selected the subset of features that correlate with programmer style, and an SVM classifier was trained and tested on these stylistic features. For programs with unknown authors, k-means clustering was employed to group programs by stylistic similarity; to increase the clustering accuracy, the stylistic features were chosen according to the previous classification experiment. Matching byte-level n-grams and finding occurrence patterns in them has been explored in a few studies, such as [15] and [16].

The rest of this paper is structured as follows:

* Sections 4.1-4.3 explain the sample collection and data preparation methods.

* Sections 4.4-4.6 describe tests for evaluating the obfuscator tool used in the experiment and for identifying the obfuscation-resistant features.

* Section 5 concludes the study and offers some suggestions for extending this research.

4 THE EXPERIMENT

This study has six phases:

4.1 Sample collection

The .NET platform and the C# programming language are widely used by programmers for rapid application development.

We collected a total of 54 C# source codes written by ten different programmers who participated in the Google Code Jam 2013 competition [17]. The obtained source codes comprise submitted solutions to 13 different programming challenges.

4.2 Generating binary files for each source code

For each submission, a separate C# project was created in Visual Studio 2015. The necessary header files and other resources (as defined in the source code) were then added to the solution. Finally, the solutions were compiled, and the generated binaries were saved and labeled according to their corresponding programmer and programming challenge.

4.3 Applying different obfuscation techniques

Obfuscation techniques mangle the program flow, variable types, class and method names, etc., mainly to convert a program into an equivalent one that is much harder to reverse engineer. The advantage of using an obfuscation technique is that the resulting program still runs on standard hardware without any changes to the executing machine. There are several open-source and commercial obfuscators for .NET assemblies; a comparative list is available in [1]. In this step, I generated nine different obfuscated versions of each standard binary using the ConfuserEx obfuscator [19], a free, open-source obfuscator/protector for .NET assemblies. The obfuscation techniques (provided by ConfuserEx) used in the experiment include:

* Control flow obfuscation: significantly alters the normal flow of a program by converting the code inside methods into spaghetti code, which retains the code's functionality but makes it extremely difficult for humans and decompilers to follow the program logic. Most decompilers cannot reverse engineer spaghetti code back to source code.

* Variable/symbol renaming: renames all classes, methods, and fields to short names, mainly unprintable characters or complex sequences, in an attempt to foil decompiled output. This protection also obfuscates symbol names, so the decompiled source code can be neither compiled nor read.

* Anti-memory dumping: memory dumping is the process of taking a snapshot of the executable, that is, capturing the state of the executable at a particular point in time. This technique makes memory dumping considerably harder, mainly through the use of encryption technologies.

* Anti-debugging: This protection prevents the assembly from being debugged or profiled.

* Anti ILDASM: suppresses disassembly by tools such as ILDASM (the Microsoft Intermediate Language disassembler).

* String encryption: This obfuscation method makes it difficult for a software cracker to understand the logic of the code, as he cannot identify the text of messages or other used strings, making it much harder to determine where to patch the code.

* Resource encryption: This feature improves software protection by compressing and encrypting the used resources. At runtime, when required the resources are automatically decompressed and decrypted [31].

* Invalid Meta Data Insertion: This protection adds invalid metadata to modules to prevent disassembler/decompiler from opening them.

* Method reference hiding (reference proxy protection): encodes and hides references to types, methods, and fields by adding an indirection method that acts as a proxy.

4.4 Evaluating the performance of ConfuserEx obfuscator

I investigated how well ConfuserEx obfuscated the binaries. My observations revealed that the file structure (as shown in Table 1), the control flow graphs, the number of op-codes found, and the disassembled source code of the obfuscated binaries differ significantly from the original binaries, which indicates that ConfuserEx protected and transformed the files thoroughly.

Figure 6 shows the control flow graph of one of the sample binaries, while Figure 7 depicts the control flow graph of the same binary after applying CFG obfuscation with the ConfuserEx tool.

4.5 Feature extraction from original and obfuscated binaries

In this step, I compared the values of several features of the binaries before and after applying the different obfuscation techniques. The features considered in this experiment include: op-code frequencies, op-code n-grams, PE header information, and variables derived from the program's control flow graph (such as the number of nodes, edges, branches, terminal nodes, and isolated nodes, and the maximum, minimum, and average node degree).

4.6 Identifying the obfuscation resistant features

To determine the author of a given binary, I was looking for features that fulfill the following criteria:

1. For each author, the candidate feature's value should remain the same after applying the various obfuscation techniques.

2. Statistically, the value of the candidate feature should significantly differ between the individual authors.

Any feature which satisfies the criteria mentioned above can be used for later author classification tests.

After calculating and comparing the features above, my observations showed that some op-code frequencies and op-code n-grams are obfuscation resistant and their values differ between authors. Hence, these feature sets can represent an author's programming style even after obfuscation.

4.6.1 Effectiveness of the Op-code frequencies for authorship attribution of the binary files

This section describes the procedure for extracting op-codes, calculating their frequencies, and exploring their usefulness for authorship attribution of binaries. Op-codes have been used for authorship attribution of binaries in previous studies such as [14]. The initial assumption is that differences between the op-code frequencies of the generated binaries form a distinctive data set for each programmer. As mentioned before, I disassembled each binary to obtain the assembly code containing the op-codes, and then computed the frequency of each op-code for each document.

After defining the hypotheses, I used Fisher's Least Significant Difference (LSD) test and a two-way Analysis of Variance (ANOVA) with replication to infer statistically whether op-code frequencies can be used to discriminate between the samples of different authors. In short, the following steps outline the method:

Step 1: Disassembling the binaries to retrieve the machine (assembly) code.

Step 2: Extracting op-codes and calculating their frequencies for each sample.

Step 3: Creating a database of the calculated frequencies for all sample files.

Step 4: Removing op-codes with frequencies below 0.0001% to decrease noise in the dataset.

Step 5: Performing Fisher's LSD test to infer whether op-code frequencies are obfuscation resistant, that is, whether the rates of some op-codes remain the same after applying the various obfuscation techniques to all samples of a particular author.

Step 6: Performing a multiple ANOVA test to see whether op-code frequencies differ statistically significantly between samples of different authors.

Step 7: Applying attribute selection algorithms (if necessary) to select efficient op-codes and further reduce noise in the dataset, and then repeating steps 3 to 6.

First, I loaded the samples into the IDA disassembler (1). IDA translates binary content and generates assembly language source code from machine-executable files; it supports a variety of executable formats (including .NET assemblies) for different processors and operating systems.

After loading each sample into IDA, I extracted the op-code statistics by running InstructionCounter, an IDA plugin written in Python. For each binary, one text file was generated containing the frequency of each op-code in that binary. These text files were then imported into an MS-Excel spreadsheet and matched against the complete x86 instruction set listing available at [20], totaling 794 op-codes. After removing infrequent op-codes (those with frequencies below 0.0001%), a total of 131 op-codes remained for further exploration.
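
A sketch of this aggregation and filtering step in Python with pandas (the per-binary file format shown in the comment is an assumption; the plugin's actual output layout may differ):

    import glob
    import pandas as pd

    # Assumed output format of the InstructionCounter plugin: one
    # "mnemonic count" pair per line, e.g. "mov 120".
    def load_counts(path):
        counts = {}
        with open(path) as f:
            for line in f:
                op, n = line.split()
                counts[op] = int(n)
        return counts

    rows = {f: load_counts(f) for f in glob.glob("stats/*.txt")}
    df = pd.DataFrame(rows).T.fillna(0)  # binaries x op-codes, counts

    # Convert counts to per-binary frequencies (in percent), then drop
    # op-codes whose frequency never reaches the 0.0001% threshold.
    freq = df.div(df.sum(axis=1), axis=0) * 100
    freq = freq.loc[:, (freq >= 0.0001).any(axis=0)]
    print(freq.shape)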

Fisher's LSD Test Results

To see whether op-code frequencies are obfuscation resistant, I considered the op-code frequencies of a standard binary and of its nine obfuscated versions.

In IBM SPSS, a 10x131 contingency table was designed (rows represent the standard binary and its obfuscated versions; columns represent the op-code frequencies). I then ran the two-step LSD test: first, an ANOVA test is performed; if it is significant at level alpha, all pairwise t-tests are carried out, and if the ANOVA result is not significant, the procedure terminates. Table 2 and Table 3 summarize the LSD test results. In Table 3, the number 1 represents the standard binary, and 2-10 indicate its obfuscated versions. As the tables show, the LSD significance values are large (far from significant); therefore, I could anticipate that the frequencies of some op-codes remain the same even after applying obfuscation methods.
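
The two-step LSD procedure can be sketched with SciPy (toy data here; the real 10x131 table was analyzed in SPSS):

    from itertools import combinations
    from scipy import stats

    def fisher_lsd(groups, alpha=0.05):
        # Step 1: omnibus one-way ANOVA across all groups.
        _, p = stats.f_oneway(*groups)
        if p >= alpha:
            return []  # not significant: the procedure stops here
        # Step 2: unadjusted pairwise t-tests (the "LSD" step).
        hits = []
        for (i, a), (j, b) in combinations(enumerate(groups), 2):
            _, pt = stats.ttest_ind(a, b)
            if pt < alpha:
                hits.append((i, j, pt))
        return hits

    # Toy data: the third group clearly differs from the first two.
    print(fisher_lsd([[1.0, 1.1, 0.9], [1.0, 1.2, 1.1], [3.0, 3.1, 2.9]]))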

Multiple ANOVA test

Using IBM SPSS, a 540x131 contingency table was created (rows represent samples of the different authors; columns represent the op-code frequencies). I defined the null and alternative hypotheses as follows:

H0: Statistically, there is no significant difference between the op-code frequencies across the samples of the various authors (comprising 13 different programming categories).

HA: Statistically, there are associations between the op-codes and the corresponding authors that can help to differentiate programmers. The MANOVA test results are listed in Appendix A.

The MANOVA test results indicate that the frequencies of several op-codes differ significantly across authors, which rejects the null hypothesis and supports the alternative hypothesis. For instance, the subsequent Tukey HSD and Fisher's LSD tests for the op-code "conv.i" (Appendix A) show how the mean frequency of this op-code differs across the ten authors.

Training and testing a classifier

After confirming the usefulness of op-code frequencies as an obfuscation-resistant feature, I proceeded to train and test several classifiers.

We examined the data with classifiers including decision trees, Support Vector Machines (SVM), naive Bayes, lazy classifiers such as k-NN, and artificial neural network-based classifiers. The Random Forest classifier combined with the bagging technique yielded the best accuracy, about 92% (see Table 4).

Random Forests [21] is an ensemble learning method for classification based on the more general idea of random decision forests. A Random Forest classifier constructs many decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. Bootstrap aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed mainly to improve the stability and accuracy of decision-tree-based statistical classifiers; by using bagging, I could further reduce variance and avoid over-fitting. After obtaining the classification results, I compared the observed accuracy with the baseline Zero-R classifier (classification solely by chance). Zero-R is the most straightforward method: it depends only on the target labels, ignores all predictors, and simply predicts the majority class by counting the number of instances belonging to each category. While Zero-R has no predictive power, it is suitable for establishing a baseline performance as a benchmark for other classification methods.
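
A scikit-learn sketch of this setup (synthetic data stands in for the 540x131 op-code frequency matrix; the `estimator` keyword requires scikit-learn 1.2 or later):

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Stand-in for the op-code frequency matrix (10 authors, 131 op-codes).
    X, y = make_classification(n_samples=540, n_features=131,
                               n_informative=30, n_classes=10, random_state=0)

    # Bagged random forests, mirroring the RF + bagging combination.
    rf_bag = BaggingClassifier(
        estimator=RandomForestClassifier(n_estimators=100, random_state=0),
        n_estimators=10, random_state=0)
    print("RF + bagging:", cross_val_score(rf_bag, X, y, cv=10).mean())

    # Zero-R baseline: always predicts the majority class.
    zero_r = DummyClassifier(strategy="most_frequent")
    print("Zero-R:", cross_val_score(zero_r, X, y, cv=10).mean())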

Our research shows that the Zero-R classifier achieved only about 14% accuracy; this comparison clearly demonstrates the effectiveness of the proposed method. The classification results using the Random Forest classifier with the bagging technique are available in Appendix B. According to the confusion matrix (Appendix B), most of the binaries were labeled correctly, but in the case of one author (Author2), the classifier identified only about half of the binaries.

4.6.2 Use of op-code n-grams for authorship attribution

An op-code n-gram is a contiguous sequence of n op-codes from a given machine code (assembly) file. An assembly file contains many other elements, such as segment names, addresses, operands, and comments; op-code n-grams have been used for authorship attribution in previous studies such as [22] and [14]. In this research, I considered only the op-codes, because they represent the machine instructions that characterize a binary, and we wrote a program to extract the op-codes from the assembly files and strip the other elements, such as comments. For the experiment, I considered a minimum of two and a maximum of five consecutive op-codes; that is, I generated all possible op-code 2-grams, 3-grams, 4-grams, and 5-grams. The complete procedure can be viewed as the following sequence (a code sketch follows the list):

I. Extracting op-code sequences from the disassembled binaries and saving the results in text files.

II. Passing the generated text files to the RapidMiner data mining software for further processing.

III. Tokenizing the op-codes.

IV. Generating op-code n-grams.

V. Calculating op-code n-gram TF-IDF (term frequency-inverse document frequency) values, op-code n-gram frequencies, and an op-code n-gram binary occurrence matrix.

VI. Training and testing various classifiers with feature sets generated in the previous step.
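
Steps III-V can be reproduced outside RapidMiner with scikit-learn; a minimal sketch, assuming each extracted text file is a whitespace-separated stream of op-code mnemonics:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # Each "document" is the op-code stream of one binary, i.e. the
    # contents of a text file produced in step I.
    docs = ["push mov call pop ret", "push push call ret"]

    # Steps III-V: tokenize op-codes and build 2- to 5-gram TF-IDF vectors.
    tfidf = TfidfVectorizer(token_pattern=r"\S+", ngram_range=(2, 5))
    X = tfidf.fit_transform(docs)
    print(X.shape)

    # Binary occurrence matrix: 1 if an n-gram occurs in a binary at all.
    occ = CountVectorizer(token_pattern=r"\S+", ngram_range=(2, 5), binary=True)
    print(occ.fit_transform(docs).toarray())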

Our experiments show that the op-code 2-gram TF-IDF values are the best inputs for the classifiers. Typically, the TF-IDF weight is composed of two terms [23]: the first computes the normalized term frequency (TF), i.e., the number of times a term appears in a document divided by the total number of terms in that document; the second is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears. We calculated TF-IDF for op-codes as follows:

TF(t) = (number of times op-code t appears in an assembly file) / (total number of op-codes in that file)

IDF(t) = log( (total number of binaries) / (number of binaries containing op-code t) )

Using the TF-IDF values, the naive Bayes classifier (a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features) yielded the best accuracy, and the k-Nearest Neighbors (k-NN) classifier [24] was the second best. k-NN is a non-parametric method used for classification: the input consists of the k closest training examples in the feature space, and the output is a class membership. An object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors [25]. In the experiment I set k = 3 and used cosine similarity as the distance measure. I examined several other classifiers, such as various decision tree algorithms and SVM, but they could not produce better results.
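
A corresponding scikit-learn sketch (synthetic data stands in for the op-code 2-gram TF-IDF matrix; the exact naive Bayes variant used in the original RapidMiner experiment is not specified, so Gaussian naive Bayes is used here):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in for the op-code 2-gram TF-IDF matrix (10 authors).
    X, y = make_classification(n_samples=540, n_features=200,
                               n_informative=40, n_classes=10, random_state=0)

    classifiers = [
        ("Naive Bayes", GaussianNB()),
        # k = 3 with cosine similarity as the distance measure, as in the text.
        ("3-NN", KNeighborsClassifier(n_neighbors=3, metric="cosine")),
    ]
    for name, clf in classifiers:
        print(name, cross_val_score(clf, X, y, cv=10).mean())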

Table 5 summarizes the naive Bayes classifier results (more detailed results, including the confusion matrix, are available in Appendix C).

4.6.3 Investigating the API function calls

An API function in the Windows operating system belongs to the Windows API (WinAPI), Microsoft's core set of application programming interfaces. The name Windows API collectively refers to several platform implementations that are often referred to by their own names (for instance, the Win32 API). Almost all Windows programs need API calls to run correctly.

In this research, I used the Cuckoo sandbox (1), a free and open-source toolkit provided by the Cuckoo Foundation, to trace the WinAPI calls of the binaries. The toolkit executes and analyzes multiple binaries automatically and collects comprehensive information, including traces of the Win32 API calls invoked by all processes spawned by a sample, outlining what the binary does while running inside an isolated virtual Windows operating system.

For this phase, I installed the Cuckoo sandbox on Ubuntu Linux as the host operating system and used Windows XP (with the .NET libraries installed) in a VirtualBox environment as the guest operating system. Figure 8 shows how Cuckoo works.
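
For reference, samples can be submitted to a running Cuckoo instance through its REST API; a minimal sketch, assuming the API server runs at its default address (the endpoint names follow Cuckoo's documentation):

    import requests

    # Assumes Cuckoo's REST API server is running at its default address.
    CUCKOO_API = "http://localhost:8090"

    def submit(path):
        # POST the sample; Cuckoo queues an analysis task and returns its id.
        with open(path, "rb") as f:
            r = requests.post(CUCKOO_API + "/tasks/create/file",
                              files={"file": (path, f)})
        r.raise_for_status()
        return r.json()["task_id"]  # fetch /tasks/report/<id> when finished

    print(submit("sample.exe"))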

After collecting the API calls and comparing them across authors, my observations showed no significant difference in API calls among the authors. In other words, most of the binaries in my dataset (collected from the Google Code Jam competition) called more or less the same limited set of API functions, typically for I/O operations, file creation, and arithmetic calculations. This outcome was not surprising given the nature of Google Code Jam contests.

In short, for my dataset, API function calls could not fulfill the second eligibility criterion defined above and hence cannot be used for author classification.

4.6.4 Examination of features derived from the binaries' Control Flow Graph (CFG)

Many studies have shown that graph theory and complex networks can represent the structure and behavior of programs [26]. A CFG is a graph representation of all possible paths that might be traversed during program execution. Every node in the CFG signifies a block of code or a function without any jumps, and directed edges typically represent jumps in the control flow.

Using IDA Pro, I created the CFGs of the binaries before and after applying the different obfuscation techniques, extracted various features from the resulting normal and transformed CFGs, and finally compared the feature values to see whether they are obfuscation resistant. The extracted features fall into three categories, based on information about nodes, edges, and subgraphs, which can reflect the complexity of a graph. Table 6 summarizes the features extracted from the CFG.
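
Given a CFG exported as an edge list, the Table 6 features can be computed with NetworkX; a minimal sketch on a toy graph:

    import networkx as nx

    # Toy CFG: nodes are basic blocks, directed edges are jumps.
    g = nx.DiGraph([(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)])
    g.add_node(5)  # an isolated node

    features = {
        "#Nodes": g.number_of_nodes(),
        "#Edges": g.number_of_edges(),
        "#Terminal Nodes": sum(1 for n in g
                               if g.out_degree(n) == 0 and g.in_degree(n) > 0),
        "#Isolated Nodes": sum(1 for n in g if g.degree(n) == 0),
        "Average Node Degree": g.number_of_edges() / g.number_of_nodes(),
        "Max Node Degree": max(dict(g.out_degree()).values()),
        "#Sub-Graphs": nx.number_weakly_connected_components(g),
    }
    print(features)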

From the experimental results, I inferred that none of the features above is obfuscation resistant: their values changed significantly after the obfuscation techniques were applied. I repeated the experiment for all authors and obtained the same results. In brief, features derived from the CFG could not fulfill the first eligibility criterion and are therefore not suitable for further classification.

5 CONCLUSION AND FUTURE DIRECTIONS

Obfuscation methods make code hard to understand and binaries hard to reverse engineer. Programmers mostly employ obfuscators to hide their programming stylometry and to conceal a program's internal logic, mainly to protect their code or binaries from tampering and detection. The use of obfuscators can complicate both the reverse engineering and the authorship attribution of binaries.

In this research, I examined several binary-level features to see whether they are obfuscation resistant. At least for the dataset used in this experiment, I could identify a few features, such as op-code frequencies and op-code n-grams, that are obfuscation resistant, and my assessments revealed that unique patterns for each programmer can be extracted from them. The primary outcome of this study is that the use of packers and obfuscators still leaves footprints in binary files that can be used to identify the authors.

Since my dataset was limited to source codes obtained from the Google Code Jam contests, I believe this experiment must be repeated with different (and larger) datasets.

The study may be extended further by applying sequential pattern recognition algorithms, such as the Hidden Markov Model (HMM), to the same features or to other features obtainable from the .NET CIL, the CFG, etc., in order to find unique patterns for individual programmers. For each author, frequent op-codes (and API calls) and their association rules could be explored as well.

6 ACKNOWLEDGEMENT

This research was supported by Canadian Institute for Cyber Security, University of New Brunswick (during my post-doctoral fellowship in 2015-2016). I would like to thank researchers and faculty members who provided insight and expertise that greatly assisted the research, although they may not agree with all of the interpretations/conclusions of this paper.

7 REFERENCES

[1] "List of obfuscators for .NET,". [Online]. Available: http://www.theswamp.org/index.php?topic=41844.0 (accessed September 22, 2017).

[2] S. Alrabaee, N. Saleem, S. Preda, L. Wang, and M. Debbabi, "OBA2: An onion approach to binary code authorship attribution," Digital Investigation, vol. 11, 2014.

[3] A. Caliskan-Islam, F. Yamaguchi, E. Dauber, R. Harang, K. Rieck, R. Greenstadt, and A. Narayanan, "When coding style survives compilation: De-anonymizing programmers from executable binaries," arXiv preprint arXiv:1512.08546, 2015.

[4] E. H. Spafford, "The Internet worm program: An analysis," ACM SIGCOMM Computer Communication Review, pp. 17-57, 1989.

[5] E. H. Spafford and S. A. Weeber, "Software forensics: Can we track code to its authors?," Computers & Security, pp. 585-595, 1993.

[6] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, "Comparing techniques for authorship attribution of source code," Software: Practice and Experience, Wiley, pp. 1-32, 2014.

[7] I. Krsul and E. H. Spafford, "Authorship analysis: Identifying the author of a program," Computers & Security, pp. 233-257, 1997.

[8] S. G. MacDonell, A. R. Gray, G. MacLennan, and P. J. Sallis, "Software forensics for discriminating between program authors using case-based reasoning, feedforward neural networks and multiple discriminant analysis," in 6th International Conference on Neural Information Processing, 1999.

[9] H. Ding and M. H. Samadzadeh, "Extraction of Java program fingerprints for software authorship identification," Journal of Systems and Software, pp. 49-57, 2004.

[10] P. Sallis, A. Aakjaer, and S. MacDonell, "Software forensics: old methods for a new science," in International Conference on Software Engineering: Education and Practice, 1996.

[11] L. M. Ottenstein, "Quantitative estimates of debugging requirements," IEEE Transactions on Software Engineering, pp. 504-514, 1979.

[12] S. Burrows and S. M. M. Tahaghoghi, "Source code authorship attribution using n-grams," in Proceedings of the Twelfth Australasian Document Computing Symposium, Melbourne, Australia, RMIT University, 2007.

[13] A. Caliskan-Islam, R. Harang and A. Liu, "Deanonymizing Programmers via Code Stylometry, The Role of Stylometry in Privacy," in Source code and cross-domain authorship attribution, Hamburg, Germany, 2014.

[14] N. Rosenblum, X. Zhu, and B. P. Miller, "Who wrote this code? Identifying the authors of program binaries," in Computer Security - ESORICS 2011: 16th European Symposium on Research in Computer Security, Leuven, Belgium, Springer, 2011, pp. 172-189.

[15] J. Kothari, M. Shevertalov, E. Stehle, and S. Mancoridis, "A probabilistic approach to source code authorship identification," in Fourth International Conference on Information Technology, 2007.

[16] G. Frantzeskou, E. Stamatatos, and S. Gritzalis, "Supporting the cybercrime investigation process: effective discrimination of source code authors based on byte-level information," in E-business and Telecommunication Networks, Springer, 2005.

[17] "C#," 1 10 2015. [Online]. Available: http://www.gohero.net/jam/13/languages/C%23.

[18] "Comparison of .NET obfuscators," 06 09 2017. [Online]. Available: http://www.wow.com/wiki/List_of_obfuscators_for_.NET (accessed September 22, 2017).

[19] YCK1509, "ConfuserEx, open source .net obfuscator," 04 11 2015. [Online]. Available: http://yck1509.github.io/ConfuserEx/.

[20] "x86 instruction listings," 04 11 2015. [Online]. Available: https://en.wikipedia.org/wiki/X86_instruction_listings. (accessed September 22, 2017).

[21] T. K. Ho, "Random Decision Forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 1995.

[22] S. Hanna, L. Huang, E. Wu, S. Li, C. Chen and D. Song, "Juxtapp: a scalable system for detecting code reuse among android applications," DIMVA'12 Proceedings of the 9th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 62-81, 2012.

[23] "Tf-idf: A Single-Page Tutorial," 3 12 2015. [Online]. Available: http://www.tfidf.com/ (accessed September 22, 2017)..

[24] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, pp. 175-185, 1992.

[25] M. Tripathy and D. Shrivastava, "Designing and implementation of an efficient fingerprint recognition system using minutia feature and KNN classifier," International Journal of Science, Engineering and Computer Technology, vol. 5, no. 6, p. 166, 2015.

[26] A. Chatzigeorgiou, N. Tsantalis, and G. Stephanides, "Application of graph theory to OO software engineering," in International Workshop on Interdisciplinary Software Engineering Research, Shanghai, 2006.

[27] K. A. Roundy and B. P. Miller, "Binary-code obfuscations in prevalent packer tools," ACM Computing Surveys (CSUR), 2013. [Online]. Available: http://ftp.cs.wisc.edu/paradyn/papers/Roundy12Packers.pdf

[28] Crypto Obfuscator For .net - Obfuscator With Code, http://www.ssware.com/cryptoobfuscator/features.htm (accessed September 22, 2017).

[29] ".net Obfuscator Features - Preemptive Solutions."https://www.preemptive.com/products/dotfuscator/features (accessed September 22, 2017)..

[30] Cfg - Wikipedia, https://en.wikipedia.org/wiki/CFG (accessed September 22, 2017).

[31] .net Reactor Online Help - Welcome To .net Reactor, http://www.eziriz.com/help/source/main_panel.html (accessed September 22, 2017)

Kamran Morovati (kamran.morovati@snb.ca)

Cyber Security Analyst - SNB, Government of New Brunswick

(1) Example taken from http://www.ssware.com/cryptoobfuscator/features.htm

(1) http://www.red-gate.com/products/dotnet-development/reflector/

(2) http://ilspy.net/

(3) http://upx.sourceforge.net/

(1) Image credits to "The Rootkit Arsenal" by Bill Blunden

(1) https://www.cuckoosandbox.org/

Appendix A: Tukey HSD and Fisher's LSD test results.

Tukey HSD test result for Author 1 (dependent variable "conv.i")

Author (I)  Author (J)  Mean diff (I-J)  Std. error  Sig    Lower Bound  Upper Bound

Author 1    Author2     -.000878         .0034923    1.000  -.011974     .010218
            Author3     -.000895         .0030244    1.000  -.010504     .008715
            Author4     -.000485         .0030799    1.000  -.010271     .009301
            Author5     -.000529         .0031236    1.000  -.010453     .009396
            Author6     -.000649         .0030799    1.000  -.010435     .009137
            Author7     -.001015         .0030265    1.000  -.010631     .008601
            Author8     -.000624         .0031236    1.000  -.010549     .009300
            Author9     -.000729         .0030483    1.000  -.010414     .008957
            Author10    -.000800         .0030799    1.000  -.010585     .008986

LSD test results for Author 1 ("conv.i")

Author (I)  Author (J)  Mean diff (I-J)  Std. error  Sig   Lower Bound  Upper Bound

Author 1    Author2     -.000878         .0034923    .802  -.007739     .005982
            Author3     -.000895         .0030244    .768  -.006836     .005047
            Author4     -.000485         .0030799    .875  -.006535     .005565
            Author5     -.000529         .0031236    .866  -.006665     .005607
            Author6     -.000649         .0030799    .833  -.006699     .005401
            Author7     -.001015         .0030265    .737  -.006961     .004930
            Author8     -.000624         .0031236    .842  -.006761     .005512
            Author9     -.000729         .0030483    .811  -.006717     .005260
            Author10    -.000800         .0030799    .795  -.006850     .005251


Appendix B: Confusion Matrix and Classification results using Random Forests classifier with Bagging technique
Confusion Matrix

a  b  c   d   e   f   g   h   i   j   <---classified as

9  0   0   0   1   0   0   0   0   0  a=Author 1
0  9   0   1   0   0   0  10   0   0  b=Author2
0  0  70   0   0   0   0   0   0  10  c=Author3
0  0   0  60   0   0   0   0   0   0  d=Author4
0  0   0   0  50   0   0   0   0   0  e=Author5
0  0   0   0   0  59   0   0   1   0  f=Author6
0  0   0   0   0   0  79   0   0   0  g=Author7
0  9   0   0   0   0   0  41   0   0  h=Author8
0  0   2   0   0   0   0   0  68   0  i=Author9
0  0   8   0   0   0   0   0   0  52  j=Author10

Detailed Accuracy by Class

               TP Rate  FP Rate  Precision  Recall  F-measure  MCC    ROC Area  PRC Area  Class

               0.9      0        1          0.9     0.947      0.948  1         1         Author1
               0.45     0.017    0.5        0.45    0.474      0.455  0.978     0.474     Author2
               0.875    0.022    0.875      0.875   0.875      0.853  0.99      0.957     Author3
               1        0.002    0.984      1       0.992      0.991  1         0.999     Author4
               1        0.002    0.98       1       0.99       0.989  1         0.999     Author5
               0.983    0        1          0.983   0.992      0.991  1         1         Author6
               1        0        1          1       1          1      1         0.998     Author7
               0.82     0.02     0.804      0.82    0.812      0.792  0.989     0.882     Author8
               0.971    0.002    0.986      0.971   0.978      0.975  0.999     0.997     Author9
               0.867    0.021    0.839      0.867   0.852      0.834  0.988     0.905     Author10
Weighted Avg.  0.922    0.009    0.921      0.922   0.921      0.913  0.995     0.952


Appendix C: Confusion Matrix, Class Precision and Class Recall values for Naive Bayes Classifier

                true     true     true     true     true     true     true     true     true     true      class
                Author1  Author2  Author3  Author4  Author5  Author6  Author7  Author8  Author9  Author10  precision

pred. Author1    9        0        0        0        0        0        0        0        0        0       100.00%
pred. Author2    0       19        0        0        0        0        0       10        0        0        65.52%
pred. Author3    0        0       71        0        0        0        0        0        0        1        98.61%
pred. Author4    0        0        0       60        0        0        0        0        0        0       100.00%
pred. Author5    1        0        0        0       50        0        0        0        0        0        98.04%
pred. Author6    0        0        0        0        0       60        0        0        0        0       100.00%
pred. Author7    0        0        0        0        0        0       79        0        0        0       100.00%
pred. Author8    0        1        1        0        0        0        0       40        0        1        88.89%
pred. Author9    0        0        0        0        0        0        0        0       70        0        93.02%
pred. Author10   0        0        8        0        0        0        0        0        0       58        87.88%
class recall    90.00%   95.00%   88.75%   100.00%  100.00%  100.00%  100.00%  80.00%   100.00%  96.67%

Table 1: Comparing the standard and obfuscated binaries to test the
performance of the ConfuserEx obfuscator

Obfuscation applied       Original   Obfuscated  Op-codes    Op-codes
to the normal binary      file size  file size   (original)  (obfuscated)

Control flow obfuscation  6 KB       44 KB       202         6596
Reference proxy           6 KB       43 KB       202         6484
Anti-debug                6 KB       50 KB       202         8369
Variable protection       6 KB       44 KB       202         6520
Invalid metadata          -          -           -           -
Anti-dump                 6 KB       50 KB       202         8369
Anti ILDASM               6 KB       43 KB       202         6482
Name protection           6 KB       44 KB       202         6583
Resource protection       6 KB       44 KB       202         6578
Table 2: ANOVA test preceding the LSD test

                Sum of Squares  DF    Mean Square  F     Sig
Between Groups  0.001           9     0.000        0.01  1.000
Within Groups   8.800           1300  .007
Total           8.801           1309

Table 3: Fisher's LSD test results (multiple comparisons, dependent variable)

(I) ID  (J) ID  Mean Difference  Std. Error  Sig.  95% CI       95% CI
                (I-J)                              Lower Bound  Upper Bound

1       2       .0019595         .0101660    .847  -.017984     .0219
1       3       .0020931         .0101660    .837  -.017850     .0220
1       4       .0020084         .0101660    .843  -.017935     .0219
1       5       .0024641         .0101660    .809  -.017479     .0224
1       6       .0026634         .0101660    .793  -.017280     .0226
1       7       .0020122         .0101660    .843  -.017931     .0219
1       8       .0019779         .0101660    .846  -.017966     .0219
1       9       .0022145         .0101660    .828  -.017729     .0221
1       10      .0020863         .0101660    .837  -.017857     .0220

Table 4: RF + Bagging Classification Summary

Correctly Classified Instances      497 (92.21%)
Incorrectly Classified Instances     42 (7.79%)
Kappa statistic                       0.9118
Mean absolute error                   0.053
Root mean squared error               0.1284
Relative absolute error              29.9741%
Root relative squared error          43.2081%
Coverage of cases (0.95 level)      100%
Mean rel. region size (0.95 level)   42.5046%
Total Number of Instances           539

Table 5: Naive Bayes Classification Summary

Number of input op-code n-grams  54726
Accuracy                         95.73%
Kappa                             0.952
Avg. Class Recall                95.42%
Avg. Class Precision             93.02%

Table 6: Features extracted from the CFG

Feature              Description

#Nodes               Number of nodes in the graph
#Terminal Nodes      Number of nodes with no outgoing edges
#Isolated Nodes      Number of nodes with no incoming or outgoing edges
#Edges               Number of edges in the graph
Average Node Degree  Average number of outgoing edges
Max Node Degree      Maximum number of outgoing edges
#Sub-Graphs          Number of subgraphs in the CFG