
Detection of malicious patterns using effective sequence mining algorithm.


Data mining is the process of extracting useful information from large quantities of data stored in a database. Data mining techniques are increasingly applied to malware detection as information technology develops rapidly. Malware, short for malicious software, is used to disrupt computer operations, collect sensitive information, gain access to private computer systems and display unwanted advertising. It continues to grow in volume and evolve in complexity. Malicious attacks now affect more and more organizations, and the number of websites spreading malware is increasing at an alarming rate. Once malicious software enters a system, it scans for vulnerabilities in the operating system, performs unintended actions and slows down the system. Categories of malware include Trojans, rootkits, worms and spyware. Detection techniques used to identify malware include signature based detection [2], heuristic based detection and specification based detection.

One of the most recognizable signs of malware is a change in how the user's computer behaves. Malware may degrade system performance, cause start-up problems, close the browser unexpectedly or redirect it to unwanted pages, display pop-up windows even when the browser is not open, and add extra toolbars to the browser.

Malware detection methods are classified into different classes based on the type of malware and the techniques used for analysis and detection. The damage that malware causes to systems includes:

Loss of data:

Many viruses and Trojans try to delete files or wipe the hard disk once the system is infected. Even when the infection is detected, the affected files may have to be deleted.

Theft of account:

Many types of malware include key logger functions that are mainly designed to steal account passwords from their targets.


Malware can establish control over the user's computer, and hackers build networks of such computers to use their processing power for tasks like cracking file passwords or sending out bulk emails.

Financial losses:

A hacker can obtain credit card or bank account details through a key logger and use the information to drain the account or run up charges.

The remainder of this paper is organized as follows: Section 2 describes the background. Section 3 explains the proposed system. Section 4 presents the analysis results and Section 5 gives the conclusion.


In recent years, almost everything has changed in the field of malware and malware analysis. Malware has moved from being written as a proof of concept against some security model, to being created for financial gain, to being built to damage infrastructure. There are three main techniques for malware detection: signature based, heuristic based and specification based.

2.1 Signature Based Detection:

Signature based detection [2] relies on pattern or string matching. It maintains a database of signatures and detects malware by matching patterns against that database. Most antivirus tools are based on signature based detection to protect legitimate users from attacks. The main disadvantage of this method is that it fails to recognize new or unseen malicious executables and is therefore not effective against zero-day attacks.
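The matching step described above can be illustrated with a minimal sketch. The signature database, the byte patterns and the names in it are hypothetical and serve only to show why a slightly altered sample escapes detection.

```python
# Hypothetical signature database: byte patterns mapped to illustrative names.
SIGNATURES = {
    b"\xde\xad\xbe\xef": "Example.Pattern.A",
    b"EVIL_PAYLOAD": "Example.Pattern.B",
}


def scan_bytes(data, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for pattern, name in signatures.items() if pattern in data)


known = b"header" + b"EVIL_PAYLOAD" + b"trailer"
variant = b"header" + b"EVIL-PAYLOAD" + b"trailer"  # one byte changed

print(scan_bytes(known))    # the known sample matches a stored signature
print(scan_bytes(variant))  # the altered sample matches nothing
```

The second call returns an empty list, which is the weakness noted above: a sample that differs from every stored signature goes undetected.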

2.2 Heuristic Based Detection:

It is also called a proactive technique. The main goal of this technique is to analyze the behaviour of known or unknown malware. Behavioural parameters include the source and destination address of the malware, the types of attachments and other measurable statistical features. The advantage of this technique is that it can detect new forms of virus that have not yet been discovered.

2.3 Specification Based Detection:

This technique is similar to anomaly based detection. Specification based detection depends on a program specification that describes the intended behaviour of security critical programs. It monitors the execution of a program and detects deviations of its behaviour from the given specification, rather than detecting the occurrence of specific attack patterns.
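As a rough illustration of monitoring against a specification, the sketch below encodes a specification as a table of allowed event transitions and reports the first event that deviates from it. The specification, the event names and the function name are assumptions made for this example; real specifications are far richer.

```python
# Hypothetical specification: for each state, the set of events allowed next.
SPEC = {
    "start": {"open"},
    "open": {"read", "close"},
    "read": {"read", "close"},
    "close": set(),
}


def first_deviation(trace, spec=SPEC, state="start"):
    """Return (index, event) of the first event the spec forbids, or None."""
    for index, event in enumerate(trace):
        if event not in spec.get(state, set()):
            return index, event
        state = event  # in this toy model the last event is the current state
    return None


print(first_deviation(["open", "read", "read", "close"]))     # conforms
print(first_deviation(["open", "read", "connect", "close"]))  # deviates
```

No attack pattern is stored anywhere in this sketch; anything outside the specified behaviour is flagged.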

3. Proposed System:

The proposed system describes an effective method to detect malware in a system. Unknown malware is found with the help of a sequence mining algorithm, and a support vector machine classifier is used to reduce the execution time. The detection system is built from the major components described below.

3.1 System Architecture:

The system architecture describes the proposed framework for malware detection. It includes four major components: dataset acquisition, an instruction sequence extractor, a malicious sequential pattern miner, and the SVM (support vector machine) and Random Forest classifiers used to predict malware.

3.1.1 Dataset Acquisition:

In this module, the dataset is uploaded; it contains a collection of malware attributes. A malware database is a repository containing malicious and benign files [1]. The dataset is in ARFF format and consists of browsing history with 71 attributes such as number of redirections, upload logs and history. The key uses of a malware database are to store the measurement data, manage a searchable index, and make the data available to other applications for analysis and interpretation.
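For readers unfamiliar with the ARFF layout, the following sketch reads a very small ARFF document into attribute names and rows. The attribute names and values are hypothetical stand-ins (the actual dataset has 71 attributes), and the parser handles only unquoted names and plain comma-separated rows.

```python
def parse_arff(text):
    """Parse a minimal ARFF document into (attribute names, list of row dicts).

    Only unquoted attribute names and comma-separated data rows are handled.
    """
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # skip blanks and ARFF comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append(dict(zip(attributes, values)))
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


# Hypothetical three-attribute excerpt, for illustration only.
SAMPLE = """\
@relation browsing_history
@attribute redirections numeric
@attribute upload_logs numeric
@attribute class {benign,malicious}
@data
5,20,malicious
0,1,benign
"""

attributes, rows = parse_arff(SAMPLE)
print(attributes)
print(rows[0]["class"])
```

In practice the Weka tool reads ARFF files directly; the sketch only shows how the header declares the attributes and the data section lists one instance per line.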

3.1.2 Instruction Sequence Extractor:

MSPMD [2] first extracts instructions from the training samples and transforms them into a group of 32-bit global IDs based on their lexicographical order. Then, a subset of instructions is selected using the newly proposed MIE (Malicious Instruction Extraction) algorithm, followed by the guiding match method, which generates an instruction sequence for each training sample. Instruction sequences are extracted from the PE (Portable Executable) files as the preliminary features, based on which the malicious sequential patterns are mined in the next step. The extracted instruction sequences can indicate potential malicious patterns at the micro level. In addition, such features can be easily extracted and used to generate signatures for traditional malware detection systems.
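The first step, mapping instructions to global IDs by lexicographical order, can be sketched as follows. The sample instruction lists are invented, starting the IDs at 0 is an assumption, and the MIE selection and guiding match steps are not reproduced here.

```python
def build_instruction_ids(samples):
    """Assign each distinct instruction an integer ID in lexicographical order."""
    vocabulary = sorted({ins for sample in samples for ins in sample})
    if len(vocabulary) >= 2 ** 32:
        raise ValueError("vocabulary does not fit in 32-bit IDs")
    return {ins: idx for idx, ins in enumerate(vocabulary)}


def encode(sample, ids):
    """Translate one instruction sequence into its sequence of IDs."""
    return [ids[ins] for ins in sample]


# Hypothetical disassembled samples (mnemonics only).
samples = [["push", "mov", "call", "mov"], ["xor", "mov", "jmp"]]
ids = build_instruction_ids(samples)

print(ids)
print(encode(samples[0], ids))  # -> [3, 2, 0, 2]
print(encode(samples[1], ids))  # -> [4, 2, 1]
```

Because the IDs are derived from the whole training set, every sample is encoded against the same vocabulary, which is what makes the later pattern mining comparable across files.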

3.1.3 Malicious Sequential Pattern Miner:

In this component, the MSPE (Malicious Sequential Pattern Extraction) algorithm [2] is applied to mine discriminating malicious sequential patterns from the instruction sequences. MSPE is objective-oriented: it learns patterns with a strong ability to distinguish malware from benign files. A filtering criterion in MSPE removes redundant patterns during mining in order to reduce processing time and search space. This strategy greatly enhances the efficiency of the algorithm.

MSPE Algorithm:

Step 1. Scan S_M and compute the support and confidence for each item to generate the length-1 sequential patterns, denoted L_1.

Step 2. Set the pattern length n = 2.

Step 3. Generate the new set of candidates C_n by applying self-join and prune operations to the sequential patterns found in the (n-1)th pass:

1. Self-join operation: join L_(n-1) with itself to generate C_n based on the following criterion: if l_1 and l_2 are sequential patterns in L_(n-1), and l_1 with its first item removed equals l_2 with its last item removed, join l_2 to l_1.

2. Prune operation: remove a candidate from C_n if one of its length-(n-1) subsequences is not a sequential pattern found in L_(n-1).

Step 4. Scan C_n and collect the support and confidence for each candidate c in C_n to find the new set of sequential patterns L_n, where c' denotes the length-(n-1) subsequences of c.

Step 5. Set n = n + 1.

Step 6. Repeat Steps 3-5 until no sequential pattern is found in a pass or no candidate sequence is generated.

Step 7. Collect the malicious sequential patterns from the resulting sequential patterns, keeping those that satisfy the definition of a malicious sequential pattern.
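The steps above follow the familiar level-wise (Apriori-style) scheme, which can be sketched as below. This is an illustration under stated assumptions, not the MSPE implementation of [2]: support is taken as the fraction of malicious sequences containing a pattern, confidence as the share of containing sequences that are malicious, patterns may match with gaps, and the final filter keeps patterns whose confidence reaches a threshold. The toy sequences and thresholds are invented.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support_confidence(pattern, malicious, benign):
    """Assumed definitions: support over malicious set, confidence vs. benign."""
    m = sum(contains(s, pattern) for s in malicious)
    b = sum(contains(s, pattern) for s in benign)
    support = m / len(malicious)
    confidence = m / (m + b) if (m + b) else 0.0
    return support, confidence


def mine_patterns(malicious, benign, min_sup=0.5, min_conf=0.8):
    def frequent(p):
        return support_confidence(p, malicious, benign)[0] >= min_sup

    # Step 1: length-1 patterns.
    items = sorted({x for seq in malicious for x in seq})
    level = [(x,) for x in items if frequent((x,))]
    found = list(level)
    # Steps 2-6: grow patterns one item at a time.
    while level:
        prev = set(level)
        # Self-join: l1 without its first item equals l2 without its last item.
        candidates = {a + (b[-1],) for a in level for b in level
                      if a[1:] == b[:-1]}
        # Prune: every length-(n-1) subsequence must already be a pattern.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}
        level = sorted(c for c in candidates if frequent(c))
        found.extend(level)
    # Step 7: keep only the discriminating (malicious) patterns.
    return [p for p in found
            if support_confidence(p, malicious, benign)[1] >= min_conf]


# Toy instruction-ID sequences, for illustration only.
malicious = [[1, 2, 3], [1, 3, 2, 3], [1, 2, 4, 3]]
benign = [[2, 3, 4], [4, 2, 1]]
print(mine_patterns(malicious, benign, min_sup=1.0, min_conf=0.8))
# -> [(1, 2), (1, 3), (1, 2, 3)]
```

In this toy run the single items 1, 2 and 3 are frequent in the malicious set but also appear in benign sequences, so only the ordered combinations survive the final filter.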

3.1.4 SVM Classifier:

Support vector machines (SVMs) are a set of supervised learning methods used for classification. SVM is a learning procedure based on statistical learning theory and is one of the best-known machine learning techniques used in data mining. It is mainly used for binary classification. It applies a kernel function to the input data, and a final summation with an activation function gives the classification result. The SVM gives good accuracy, runs efficiently on large databases and builds a predictive model quickly, so the output can be predicted from the given training set. It uses a subset of the training samples (called support vectors) in the decision function, so it is also memory efficient. It remains effective in cases where the number of dimensions is greater than the number of samples.
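The role of the kernel, the summation and the support vectors can be made concrete with a small sketch of the decision function of an already trained SVM. The support vectors, coefficients and bias below are hypothetical values chosen by hand, not a model trained on the dataset of this paper, and training itself is not shown.

```python
def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, dual_coefs, bias, kernel=linear_kernel):
    """f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b, over support vectors only."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


def classify(x, support_vectors, dual_coefs, bias, kernel=linear_kernel):
    """The sign of the decision value gives the predicted class."""
    value = decision_value(x, support_vectors, dual_coefs, bias, kernel)
    return "malicious" if value >= 0 else "benign"


# Hypothetical trained model: one benign and one malicious support vector.
SUPPORT_VECTORS = [(1.0, 1.0), (3.0, 3.0)]
DUAL_COEFS = [-0.25, 0.25]   # alpha_i * y_i for each support vector
BIAS = -2.0

print(decision_value((4.0, 3.0), SUPPORT_VECTORS, DUAL_COEFS, BIAS))  # 1.5
print(classify((4.0, 3.0), SUPPORT_VECTORS, DUAL_COEFS, BIAS))        # malicious
print(classify((0.0, 1.0), SUPPORT_VECTORS, DUAL_COEFS, BIAS))        # benign
```

Only the two support vectors are needed at prediction time, which is the memory efficiency mentioned above; swapping `linear_kernel` for a non-linear kernel changes the shape of the decision boundary without changing the rest of the code.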

3.1.5 Random Forest Classifier:

A Random Forest consists of a collection or ensemble of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable.

Advantages of Random Forest:

* It is a highly accurate classifier.

* It runs efficiently on large databases.

* It can handle thousands of input variables without variable deletion.

* It has an effective method for estimating missing data.
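How an ensemble of simple tree predictors yields a class membership can be sketched with a majority vote. The three hand-written trees, their thresholds and the feature names other than the number of redirections are hypothetical; a real Random Forest learns its trees from bootstrap samples and random feature subsets, which is not shown here.

```python
from collections import Counter


# Hypothetical tree predictors, each a small hand-written decision rule.
def tree_a(x):
    return "malicious" if x["redirections"] > 3 else "benign"


def tree_b(x):
    if x["upload_logs"] > 10 and x["redirections"] > 1:
        return "malicious"
    return "benign"


def tree_c(x):
    return "malicious" if x["history_entries"] < 5 else "benign"


def forest_predict(x, trees):
    """Each tree votes for a class; the most common vote is the prediction."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


TREES = [tree_a, tree_b, tree_c]
quiet = {"redirections": 5, "upload_logs": 2, "history_entries": 40}
noisy = {"redirections": 5, "upload_logs": 20, "history_entries": 40}

print(forest_predict(quiet, TREES))  # one of three trees votes malicious
print(forest_predict(noisy, TREES))  # two of three trees vote malicious
```

An odd number of trees avoids ties in the two-class case; individual trees may be wrong on a given instance while the vote is still correct.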


In this work, the objective is to detect malware, so the MSPE algorithm looks for patterns that support this specific objective. Unlike other instruction features, these discriminating patterns capture the notable differences between malicious and benign files.

4.1 Performance of SVM Classifier:

SVM classification is performed on the given dataset, with the instances classified using the Weka tool. The SVM algorithm correctly classifies 95% of the instances and incorrectly classifies 4%, and the true positive rate and false positive rate are reported.

The detailed accuracy reported by the Weka tool uses the following measures:

True positive (TP): the number of malicious executables correctly classified

True negative (TN): the number of benign executables correctly classified

False positive (FP): the number of benign executables classified as malicious code

False negative (FN): the number of malicious executables classified as benign code

False positive rate = FP / (FP + TN)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Detection rate = TP / (TP + FN)
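The three measures can be computed directly from the four counts, as in the short sketch below. The counts used in the example are invented for illustration and are not the results of this paper.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the measures defined above from the confusion-matrix counts."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "detection_rate": tp / (tp + fn),
    }


# Illustrative counts only: 50 malicious and 50 benign test files.
metrics = evaluation_metrics(tp=48, tn=47, fp=3, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that the accuracy numerator is the sum TP + TN taken as a whole before dividing by the total number of instances.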

4.2 Performances of Random Forest Classifier:

Random Forest classification builds an ensemble of trees for the class attribute. The detailed accuracy of the classifier shows 98% correctly classified instances and 2% incorrectly classified instances.

The model build times of the SVM classifier and the J48 classifier are compared using the sample files in the Weka tool.


In this paper, we used an instruction sequence extractor within a data mining based detection framework called malicious sequential pattern mining, with SVM and Random Forest classifiers for classification. The instruction sequence extractor extracts features from a given sample file, and the extracted features are mined using the MSPE algorithm to discover files of a malicious nature.

This paper addresses only the detection of malware. In future work we will continue to develop the framework by combining strategies such as data reduction in order to enhance classification efficiency.


[1.] Narouei, M., M. Ahmadi, G. Giacinto, H. Takabi and A. Sami, 2015. DLL Miner: Structural mining for malware detection. Security and Communication Networks, 8

[2.] Yujie Fan, Yanfang Ye, Lifei Chen, 2016. Malicious sequential pattern mining for automatic malware detection. Expert Systems with Applications, pp: 16-25.

[3.] Yanfang Ye, Tao Li, Yong Chen, Qingshan Jiang, 2010. Automatic malware categorization using cluster ensemble. In Proceedings of the 16th international conference on knowledge discovery and data mining, pp: 95-104.

[4.] Kent Griffin, Scott Schneider, Xin Hu, Tzi-cker Chiueh. Automatic generation of string signatures for malware detection. In Proceedings of the 12th international symposium on recent advances in intrusion detection, pp: 101-120.

[5.] Babak Bashari Rad, Maslin Masrom, Suhaimi Ibrahim. Opcodes histogram for classifying metamorphic portable executables malware. In Proceedings of international conference on e-learning and e-technologies in education, pp: 209-213.

[6.] Mansour Ahmadi, Dmitry Ulyanov, Stanislav Semenov, Mikhail Trofimov, 2016. Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. arXiv:1511.04317v2 [cs.CR].

[7.] Tobias Wuchner, Martin Ochoa, Alexander Pretschner, 2014. Malware detection with quantitative dataflow graphs. In Proceedings of the 9th ACM symposium on information, computer and communications security, pp: 271-282.

[8.] Mansour Ahmadi, A. Sami, H. Rahimi and B. Yadegari, 2013. Malware detection by behavioural sequential patterns. Computer Fraud & Security, pp: 11-19.

[9.] Santos, I., F. Brezo, J. Nieves, Y.K. Penya, B. Sanz, C. Laorden and P.G. Bringas, 2010. Opcode-sequence-based malware detection. In Proc. 2nd Int. Symp. Eng. Secure Software and Syst. (ESSoS), Pisa, Italy, LNCS 5965, pp: 35-43.

[10.] Sekar, R., M. Bendre, D. Dhurjati and P. Bollineni, 2001. A fast automaton-based method for detecting anomalous program behaviors. In Proc. 2001 IEEE Symp. Security and Privacy, R. Needham and M. Abadi, Eds., IEEE Comput. Soc., Los Alamitos, CA, USA, pp: 144-155.

[11.] Bilar, D., 2007. Opcodes as predictor for malware. Int. J. Electron. Security Digital Forensics, 1(2): 156-168; Bilar, D. Call graph properties of executables and generative mechanisms. AI Commun., Special Issue on Network Anal. in Natural Sci. and Eng., 20(4): 231-243.

[12.] Schultz, M.G., E. Eskin, E. Zadok and S.J. Stolfo, 2001. Data mining methods for detection of new malicious executables. In Proceedings of the IEEE symposium on security and privacy: 36: 38-49.

[13.] Schultz, M.G., E. Eskin, E. Zadok, S.J. Stolfo, 2001. Data mining methods for detection of new malicious executables. In Proceedings of the IEEE symposium on security and privacy: 36: 38-49.

[14.] Cesare, S., Y. Xiang, 2011. Malware variant detection using similarity search over sets of control flow graphs. In: TrustCom, pp: 181-9.

[15.] Song, F., T. Touili, 2012a. Efficient malware detection using model checking. In: Giannakopoulou D, Méry D, editors. FM: Formal Methods. Vol. 7436 of Lecture Notes in Computer Science. Berlin Heidelberg: Springer; pp: 418-33.

[16.] Borello, J-M., L. Me, 2008. Code obfuscation techniques for metamorphic viruses. J Comput Virol 4(3): 211-20.

[17.] Bruschi, D., L. Martignoni, M. Monga, 2006. Detecting self-mutating malware using control-flow graph matching. In: DIMVA. Berlin, Heidelberg: Springer-Verlag; pp: 129-43.

[18.] Canfora, G., A. Iannaccone, C. Visaggio, 2014. Static analysis for the detection of metamorphic computer viruses using repeated instructions counting heuristics. J Comput Virol Hacking Tech, 10(1): 11-27.

[19.] Qiao, Y., Y. Yang, J. He, C. Tang and Z. Liu, 2014. CBM: Free, automatic malware analysis framework using API call sequences. In Knowledge engineering and management, pp: 225-236.

[20.] Anju, S.S., P. Harmya, N. Jagadeesh, R. Darsana, 2010. Malware detection using assembly code and control flow graph optimization. In: A2CWiC, 2010. New York, NY, USA: ACM; 65:1-65:4.

[21.] Alam, S., R.N. Horspool, I. Traore, 2013. MAIL: malware analysis intermediate language - a step towards automating and optimizing Information and Networks. SIN'13. New York, NY, USA: ACM SIGSAC.

(1) Mohana priya. M and (2) Reenadevi. R

(1) PG Scholar, Computer Science and Engineering, Sona College of Technology.

(2) Assistant Professor, Computer Science and Engineering, Sona College of Technology.

Received 28 January 2017; Accepted 22 May 2017; Available online 28 May 2017

Address For Correspondence:

Mohana priya. M, PG Scholar, Computer Science and Engineering, Sona College of Technology. E-mail:

Caption: Fig. 1: Detection of malware architecture

Caption: Fig. 2: SVM classifier result

Caption: Fig. 4 : Random Forest classifier result

Table 1: Evaluation of algorithm

Algorithm   Execution time (s)   Accuracy (%)

SVM         4.72                 95
RF          0.27                 98

Fig. 3: Accuracy of SVM and Random Forest classifier

Classifier   Accuracy (%)

ANN          97
SVM          95
RF           98

Note: Table made from bar graph.
COPYRIGHT 2017 American-Eurasian Network for Scientific Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2017 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Priya M., Mohana; Reenadevi, R.
Publication:Advances in Natural and Applied Sciences
Date:May 1, 2017