Implementation and Analysis of Combined Machine Learning Method for Intrusion Detection System.
1 IntroductionIn the last few decades, information and computer technology with interconnected internet services has become increasingly important in everyday human life. Various ways of network security have been researched and developed to protect information assets and computer network infrastructure. One technique that is often used in network security is Intrusion Detection System (IDS) [1]. Currently, detection mechanism in IDS is divided into two types, i.e., misuse and anomaly detection [2]. Misuse detection is a technique of detecting attacks based on a familiar signature pattern. Whereas anomaly detection is a technique to detect attacks based on anomaly conditions that occur in the network compared to the normal state of the network conditions that have been determined [3]. One technique used in anomaly detection is the use of Machine Learning method, in recognizing network behavior.
The most commonly used conventional IDS today is signature-based (misuse detection). This type of IDS requires much human intervention ranging from identifying attacks, creating signature attacks, and storing these signatures into the database so that it can be used to detect such attacks in the future. As more and more types of new attacks appear on the internet the heavier the human task to keep the system always to be able to detect new types of attacks, which must always update the database with new types of attacks that appear. This results in zero-day conditions, where a new type of attack cannot be detected and infiltrated into the internal network and result in damage to the system [3]. This incident can occur because the new attack signature has not been stored in the database so that new attacks coming into the network are not recognized.
The tasks that human must perform in maintaining the novelty of this signatures database can be achieved more efficiently by applying machine-learning approach. With the machine-learning approach, we only need to provide a set of network data with various conditions suitable for the processed system with a machine-learning algorithm so that found a model that can be used to recognize normal and abnormal network traffic. Furthermore, this machine-learning model that has been built can be used to detect attacks more efficiently and reliably, without having to update the attack database often.
Several studies were conducted to develop machine learning based IDS. Most of the studies were conducted using publicly available datasets such as KDDCUP or NSL-KDD dataset. The use of such datasets for IDS development can result in good machine learning models, but often cannot be implemented on real networks. Some features of these datasets cannot be simply extracted from real network traffic. In this study, we performed an analysis to select features of the KDD dataset that were most likely to be used in developing machine-learning based IDS.
Although there have been many academic studies that reveal the potential of anomaly detection approach better in detecting new types of attacks, until now the widely used commercial/industrial IDS products are signature based (misuse detection). There are still some obstacles in the implementation of anomaly detection technique with machine learning method into the real network, that is a wrong selection of data features and inappropriate method usage [4].
Increasing number of attacks that threaten data security on internet encourage the use of encryption in data communications on the internet. As more and more human needs are turning to digital, a large number of digital services and applications use encryption as the primary method of securing data communications on the internet [5]. Available data show that encrypted Internet traffic increased by 90% per year [6]. NSS Labs predicts that by 2019, 75% of internet traffic will be encrypted [7].
More and more internet services are heavily dependent on encryption mechanisms to ensure their safety. Even encryption technology is also used by Internet crime actors to avoid detection and to secure their malicious activities. The large use of encryption technology in data communications on the internet has resulted in increasingly difficult network security monitoring work. The attack detection system can no longer assume that all data packets on the network can be extracted and investigated easily to detect anomaly within the network traffic [8,9].
Much research has been conducted to develop machine learning based IDS. Most of the researchers were conducted using a generally available dataset of KDDCUP dataset. The use of such datasets for IDS development can result in good machine-learning models with high accuracy, but often cannot be implemented on real networks. Some features of this KDD dataset cannot be easily extracted from actual network traffic. In this study, we analyze to select features of the KDD dataset that are most likely to be used in developing machine learning based IDS and can be implemented in real networks.
The paper is structured as follows: Works related to our research are reviewed in Section 2. Section 3 introduces the methodology of this research in developing machine-learning model based IDS. Experiment and analysis are presented in Section 4. Section 5, concludes this paper.
2 Related Works
In recent year researchers have carried out many researches to develop anomaly-based IDS system using machine learning approach. In contrast to misuse-based intrusion detection that checks for signatures contained in the network packet header, anomaly-based IDS extracts the network data packets to obtain network features/attributes that can be used to detect attacks using machine learning approach. Several machine learning algorithms have been proposed by many researchers. To develop a good machine learning method, some techniques and mechanism need to be implemented in a combination, such as the use of feature and model selection as well as parameter tuning to get the optimal result.
A new method that combines several methods for detecting attacks is proposed by J. Lekha, and Padmavathi Ganapathi. [10] In their research they used improved CART technique for misuse detection and extreme learning machine (ELM) algorithm for anomaly detection. In the misuse detection model, the traffic pattern is classified into the known and unknown attack. Anomaly detection model classifies the not known attack as normal data set and unknown attack to improve the performance of normal traffic behavior. From the experimental results, the method offered by using the NSL-KDD dataset shows that the hybrid intrusion detection method proposed can improve performance in detecting attacks regarding training time, testing time, false positive ratio and detection ratio. The proposed method detects the known attacks and unknown attacks with a ratio of 99.8 % and 52% respectively.
Mohammed A. Ambusaidi et al. proposed the use of Flexible Mutual Information (FMI) based algorithms to perform feature selection capable of handling linear and non-linear dependent features. The selection features are then applied to the Least Square Support Vector Machine (LSSVM) based attack detection method. To test the proposed technique several different datasets namely KDD Cup 99, NSL-KDD and Kyoto 2006+ dataset is used. From the experiment results obtained the best accuracy results using KDDCUP 99 dataset with 21 features selected [11].
Chaouki Khammassi, Saoussen Krichen applies the wrapper approach to selecting the best feature portion of the genetic algorithm as a search strategy and logistic regression as a learning algorithm. The selection features were then processed using the three decision tree classifiers, i.e., C4.5, Random Forest (RF), and Naive Bayes Tree (NBTree), and the dataset used was 10% KDD99 datasets and the UNSW-NB15 dataset. The experimental results obtained were 99.90%, 99.8%, and 0.105% false alarm rate (FAR) with an 18 features subset on the KDD99 dataset. The results obtained for UNSW-NB15 provide the lowest FAR with 6.39% with a subset of 20 features [12].
M.R. Gauthama Raman et al. proposed the use of Hypergraph based Genetic Algorithm (HG-GA) for parameter setting and feature selection in the Support Vector Machine (SVM). The HG-GA-based algorithm is used to accelerate the search for optimal global solutions and used weighted objective functions to maintain a trade-off between maximizing detection rates and minimizing false alarm rates, along with the optimal number of features. To evaluate the proposed method, the NSL-KDD dataset is used, by comparing the use of all 41 features dataset and features subset obtained from the HG-GA method. The experimental results show that the HG-GA SVM method gives better accuracy and less processing time [13].
In a study conducted by Akashdeep, Ishfaq Manzoor, Neeraj Kumar, in the early stages of feature selection by ranking feature based on information gain and correlation. Furthermore, to do the classification used method of a neural network using back-propagation learning. At the time of this learning process used five different subsets of the KDD99 dataset. The subset of KDD'99 datasets is grouped based on information gain and correlation. The experimental results indicate the selection of features to reduce the number of features used in building the model gives results with better accuracy and performance of the classifier [14].
In their research Sumaiya Thaseen Ikram, Aswani Kumar Cherukuri used SVM method to build IDS based machine learning. In their research, they used SVM multiclass to detect intrusion. To get the best result in SVM multiclass, the parameter of C and gamma on RBF kernel function were tuned to get optimal SVM model. For feature selection, important features are ranked according to a set of rules based on performance using chi-squared analysis, and the best feature subset was selected for use in model development. Development and testing of models conducted using the NSL-KDD dataset. The experimental results show the multiclass SVM with the best-selected features and the proper setting of C and gamma parameters gives the most optimum results [15].
Prashant Kushwaha, Himanshu Buckchash, and Balasubramanian Raman perform development of machine learning based IDS by using KDD'99 dataset. To select the best feature, filter based techniques feature selection was carried out namely: correlation, gain ratio and Mutual Information. Some machine learning methods are tested in the learning process to choose the best model. The result SVM has shown the best performance among all other classifiers. Mutual Information based filter proved out to be more effective in comparison with other features selection techniques [16].
3 Proposed Method
3.1 Methodology
Our methodology in developing machine learning models for IDS is briefly illustrated in Figure 1. We also conduct simulation as the proof of concept of our methodology. In this simulation, we implement our model in the real network. The simulation process of our work is illustrated in Figur 2.
The steps in model development are as follows:
1. Data preprocessing: transform, rescale, and normalize the NSL-KDD dataset
2. Feature selection: rank the feature score and select the most relevant features, and remove unselected features from the dataset.
3. Model selection: tune the appropriate parameters of the model using an NSL-KDD dataset with selected features.
4. Model evaluation: estimate the performance of the model using NSL-KDD dataset and 10%KDDCUP dataset.
5. Save the final model obtained from model development in the file to be implemented on IDS.
The steps in Intrusion Detection Simulation are as follows:
1. Capture/sniff network traffic and save to file in pcap format.
2. Extract pcap file to obtain the network features based on KDD dataset format.
3. Detect intrusion from the pcap with KDD dataset format using the final model from the model development.
4. Classify the network traffic as normal or abnormal then display and save the result.
3.2 Dataset
The most publicly available and comprehensive dataset that is widely used in research and development of machine learning based IDS is the KDDCUP'99 dataset. The KDDCUP'99 dataset developed by MIT Lincoln Labs provides a standard dataset generated from simulations in military network environments and by encompassing various intrusions [17]. The dataset also provides network features that can be used to build and evaluate the machine learning based attack detection model.
The classification of the attacks types provided in the KDDCUP'99 dataset is first grouped into two types of connections, i.e., normal and attack connection. Furthermore, the attack connections are grouped again in 4 types of attack, that is:
* DoS (Denial of Service): a type of attack aimed at shutting down network services by flooding networks with certain data packets.
* Probe: is a type of attack intended to conduct surveillance and look for weakness information from a particular network address.
* R2L (Remote to Local): a type of attack that is performed to access a particular network address remotely illegally.
* U2R (User to Root): the type of attack that is performed to escalate user privileges to higher privileges such as superuser.
From 4 types of these attacks, each is divided into several sub-types of attacks. The complete attack classification is summarized in Table 1.
The data features provided in the KDDCUP'99 dataset consists of 4 types of features as given in Table 2-5.
In 2009 Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorban reviewed the KDDCUP'99 dataset, and found some weaknesses in the KDDCUP'99 dataset, including many redundant data that can influence the outcome of machine learning to become biased [18]. To fix the weaknesses found in the KDDCUP'99 dataset, then the NSL-KDD dataset is offered. The data structure and classification of attacks in the NSL-KDD dataset remain the same as KDDCUP'99, with structural improvements and eliminating duplicate records in the dataset.
In our research, we decided to use the NSL-KDD dataset with the consideration that the data is still representative to simulate the real network and it has been improved so that it can produce better non-biased learning models due to duplicate data. NSL-KDD data consists of training data and test data, each of which is stored in 2 separate files. For testing purposes in this research, we also prepare test data taken from 10% dataset KDDCUP'99.
NSL-KDD data that has been separated between the training data and data testing is then performed preprocessing data which includes transformation, scaling, standardization, and normalization. Training data is used during model development, including features selection and model selection. While data testing is used to evaluate learning outcomes of each model. The model learning outcomes will also be evaluated using test data taken from the 10% dataset of KDDCUP'99.
For the proof of concept of our proposed method, we also set up a dataset that is captured from a real network. The data packet captured from the real network is stored in the file in pcap format. Furthermore, this pcap file is extracted to get the appropriate features and stored in the format of the KDD dataset structure format. An overview of the network packet capture system topology as presented in Figure 3.
To simulate attack data packets in this simulation system, we used some attack tools as follows:
DoS : HOIC, Slowloris Scanner : Uniscan, nmap Bruteforce: Burp Suite, Intruder
The real-time capture dataset in KDD format is then used to test the model already developed to detect whether it is normal or attack network traffic.
To perform real-time network data extraction as well as from pcap files to network features that correspond to the KDD structure we follow the explanation of the KDD dataset [16] and explanation of Lee & Stolfo in their work [18]. Based on the available references we can only extract the features subsets: basic features (1-9), time based traffic features (23-31) and connection based traffic features (23-31), while for content features (10-22) we cannot extract it because there is no documentation explaining how to determine the value in the content features. In the explanations described in KDD [17] and Lee & Stolfo [19], the content features are determined by the domain expert knowledge without explaining how to determine the value. From the results of our observations we also found that the value of the content features is related to certain systems/applications in the simulation performed by MIT Lincoln Labs so that these content features are not suitable to be generally applied to other systems/applications. From the development of the internet, network system today also found that most of the network traffic on the internet is currently encrypted, making it difficult to get the content features. [20]
3.3 Data Preprocessing
The learning algorithm in machine-learning has a close bond with certain data types and structures so that its performance is strongly influenced by the data available. Most machine-learning methods are developed based on certain assumptions regarding the type and structure of the data being processed. To get the best results in building classification model, data preparation is required to meet the criteria required by the machine learning method that we use. Using inappropriate data with the machine-learning method will result in a poor model that cannot provide the correct predictions [21].
The data preprocessing in this research include transformation, scaling, standardization, and normalization. Data transformation is process to change the original data type into the data type required by the machine learning method. Scaling is carried out to adjust the value of all data features so that it is on the same scale. Standardization of data features is performed to adjust the feature value so that it follows the Gaussian distribution of having a mean value of 0 and standard deviation 1, which is particularly useful for machine learning methods which assume that the processed data follow the Gaussian distribution.
3.4 Feature Selection
Feature selection is the process of selecting features in our data that contribute to the prediction or output variables we expect. The data are available for the classification model sometimes not necessarily appropriate to the implementation in real work. For these reasons, we need to select the features to determine the most relevant features and by the implementation in a real application.
There are several methods to perform feature selection. In general, Isabelle Guyon and Andre Elisseeff [22] categorize as follows:
1. Filter Methods that is selecting features using statistical measurements. Each feature is scored based on statistical calculations; then the threshold is specified to decide the features to be included in model formation.
2. Wrapper Methods, which is looking for the most optimal combination of features by evaluating some combination of features and calculating the score based on the model accuracy.
3. Embedded Methods, which is choosing features when building a model, for example using regularization algorithm.
4. Ad-hoc feature selection, which is choosing features based on domain expert knowledge.
The main objective of this research is to get the best implementable model. Our main focus is designing systems by implementing machine learning in IDS. One important job is to determine the structure and content of the data most likely to be applied to the IDS in the real network. In determining the structure and content of this data one of the jobs is to choose the features that best suit the conditions in the field and can be applied at the time of implementation.
To determine the structure and content of this most appropriate data in the feature selection stage we used some feature selection methods and compare the result, then to determine the final result of the feature selection using domain expert knowledge related to the implementation of machine learning model on IDS. Feature selection method is conducted by calculating the score of each feature with wrapper approach method and univariate feature selection. Then the features ware ranked by its score, and features with a high score were selected. The main goal in this phase is to select features which provide high quality and practical training data for the selected classifier algorithm. Then, the classifier is trained with the new data with selected features subset, and finally, the intrusion detection model is built.
Feature selection with wrapper approach involves evaluation of features through checking the accuracy of the models learned from different subsets of features. In this method, we implement a meta estimator that fits some subsets and compute features of importance. The feature subset, which leads to the best model of learning, is selected. The feature selection with the wrapper approach resulted from the fact that the combination of the features and the characteristics of the main features of the study. [23]
Univariate feature selection works by selecting the best features based on univariate statistical tests, in this work we use chi-square. Univariate feature selection determines the strength of the relationship between the feature and the response variable by examining each feature individually. Chi-square is used for assessing two kinds of comparing: tests of independence and tests of goodness of fit. In feature selection, a test of independence is assessed by chi-square and estimate whether the class is independent of a feature [24].
3.5 Model Selection
In the model selection, some learning algorithms were observed, and the model gives the best accuracy results were selected. To get the best results from each model of the learning algorithm, we optimize the parameters of each model before the learning process. In this work, we tuned the parameters with grid search and cross-validation method. The grid search method is to look for combinations of parameters that provide the best learning outcomes, by trying different combinations of parameters in the learning algorithm. Cross-validation is the process of training learners using one set of data and testing it using a different set. In our research we used 5-fold cross-validation, i.e., the dataset is divided into five subsets: 4 subsets for learning and one subset for testing [25].
In the selection model, the learning process is conducted using four different methods, i.e., Naive Bayes, Neural network, K-NN, and SVM. Each of these methods was evaluated using training dataset that has been prepared to conduct learning and the features subset that have been selected in the features selection process. The models obtained from learning process then were tested using testing dataset, i.e., NSL-KDD test dataset and 10% and 20% of 10% KDDCUP dataset. The results of accuracy and timing of each model were compared and analyzed to determine the best model to be used in the next process.
3.6 Model Evaluation
To evaluate the model, we measured the performance of the classifier in classifying the data correctly. Generally, performance measurement of a classification model is conducted by using confusion matrix. The confusion matrix is a table that records the number of original class label data and predicted class label. By comparing the amount of the original class label data and the predicted class label, we can get the number of correct predictions and the number of false predictions.
From the number data of correct predictions and false predictions are then used to calculate the accuracy, error rate, etc. of the classification, using commonly used formulas, one of which is from [26], as follows:
True Positives (TP) - Is a correctly predicted positive value, that is the value of the actual class, and the predicted class value is equally positive.
True Negatives (TN) - Is a correctly predicted negative value, that is the value of the actual class, and the predicted class value are equally negative.
False Positive (FP) - Is an incorrectly predicted positive value, that is a class value that is positive, but the value of the prediction class is negative.
False Negative (FN) - Is an incorrectly predicted negative value, that is a class value that is negative, but the value of the prediction class is positive.
Accuracy - Accuracy is the most intuitive measure of performance, and this is just a precisely predicted observation ratio for total observation.
Accuracy = TP + TN / TP + FP + FN + TN
Precision - Precision is a predictably positive predictor ratio to total positive observation predictions.
Precision = TP / TP + FP
Recall (Sensitivity) - Recall is a well-predicted positive observation ratio for all observations in the actual class - yes.
Recall = TP / TP + FN
F1 Score - The F1 score is the weighted average of Precision and Recall. Therefore, this score takes false positives and false negatives into account.
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
4 Experiment and Analysis
In this research, we divided the work into 2 phases: Model Development Phase and Simulation Phase. In Model Development Phase we experimented to produce machine learning model which will be implemented in real IDS. In Simulation Phase, we conducted a simulation to test the machine learning model that has been developed in a system that detects attacks from the real network.
4.1 Experiment Setup
In running the experiment in this research, we utilized computer facilities for data processing with the following specifications:
Processor: 2,5 GHz Intel Core i5 Memory: 8 GB 1600 MHz DDR3 Operating System: Ubuntu 17.10
For data processing with machine learning, we used Python programming language with machine learning tools Scikit-learn [27].
4.2 Model Development Phase
The model development phase was carried out to create a machine learning model that will be used for detection of attacks on real networks. The model development phase includes the process of data preparation, features selection, model selection, and model evaluation.
Data used in this phase were training data from NSL-KDD and test data from NSL-KDD and 10% and 20% of 10%KDDCUP'99. The description of the dataset is given in Table 6.
In the features selection process, we performed two methods of feature importance with wrapper approach and univariate chi-square. The results of each feature selection method are as presented in Table 7. In this table the shaded cells are features with lower score, and may be removed.
From the result of some feature selection methods, we choose the features that produce the best model and enable to be implemented in real IDS. As the final decision of features selection, we use domain expert knowledge consideration related to its implementation in the real network. Based on the description of KDDCUP'99 [17] and NSL-KDD [28] it is known that content features subset (number 10-22) from Table 3, the data is obtained based on expert knowledge domain. From our analysis and observation, we get that the data on this features subset was obtained from certain applications and cannot be generalized. Furthermore, the trend of current internet applications that data on these features is encrypted it is difficult to get these features data from real network packets. Based on these considerations, in this study, we remove the content feature (number 10-22). Further, we compared the learning accuracy and performance of the model with 41 features and with 28 features.
Based on features score ranking, then 19 best features of the wrapper approach and 21 best features of univariate chi-square were tested using the SVM classifier. The tests with the SVM classifier were performed to compare the accuracy results with all 41 features dataset, 21 features subset, and 19 features subset. In this test, we also used a subset of 28 features that are the result of feature reduction of a total of 41 KDD features without content features (10-22) based on domain expert knowledge. The accuracy results from each features subset can be seen in Table 8.
From the accuracy of each features subset in Table 8, it can be seen that a subset with 28 features provides the best accuracy results, thus in the selection of our model we use the 28 features subset.
In the model selection stage, we use a dataset with 28 features (without content features). Some of the learning algorithms we evaluated are SVM, Naive Bayes, KNN, and Neural Network. We conducted learning the process for classification of two-class (normal and attack) and multiclass (normal, DoS, Probe, R2L, and U2R). The result of prediction for each learning algorithm for classification of two-class presented in Table 9, and for classification of multiclass presented in Table 10. Comparison of accuracy each learning algorithm for classification of two-class presented in Figure 4, and for classification of multiclass presented in Figure 5.
From the learning process with some learning algorithms as presented in Figure 4 and Figure 5, KNN gives the best accuracy result but requires very long learning and prediction process as shown in Table 9 and Table 10. SVM algorithm gives excellent and stable results in training and test accuracy with fast learning and prediction time.
Considering the purpose of model development in this research to be implemented in the real network, we required a classifier model which gives good accuracy result and fast processing time. We decided to use SVM for the next process of building the classification model to be implemented for IDS in a real network.
4.3 Simulation Phase
In the simulation phase, we built the SVM classification model with the NSL-KDD dataset. Furthermore, the results of model development for both classifications of two-class and multiclass were tested using the NSL-KDD test dataset, and 10% and 20% of 10% KDDCUP'99 dataset. The final model of this development result was then used to detect intrusion from pcap data captured from the real network as described in section 4.2 of this paper. The prediction accuracy of SVM classification model can be seen in Table 9 and Table 10. The result of prediction and processing time with test data and intrusion detection from pcap data with the classification of 2 classes is presented in Table 11, and classification with multiclass is presented in Table 12. The example view of intrusion detection results from real networks is provided in Figure 6.
From the experimental results using two-class and multiclass SVM classification model with new dataset without involving content features (feature 10-22) show very good result. In the learning process two-class obtained 99.9% accuracy with a processing time of 8 seconds 20 milliseconds, and the results in the multiclass classification accuracy of 99.9% with a processing time of 9 seconds 80 milliseconds. The prediction results in NSL-KDD test data for classification of two-class got 82,6% accuracy with process time 1 second 28 milliseconds, and the result of classification of multiclass accuracy 83,7% with process time 1 second 25 milliseconds. Predicted results in 10% KDD10% test data for classification of two-class obtained 98.8% accuracy with process time 3 seconds 24 milliseconds and results in multiclass classification accuracy 88.3% with process time 3 seconds 26 milliseconds. Predicted results in 20% of KDD10% test data for two-class classification obtained 98.8% accuracy with a processing time of 6 seconds 93 milliseconds and results in multiclass classification accuracy 89.4% with a processing time of 6 seconds 98 milliseconds.
5 Conclusions
We have successfully developed an intrusion detection model using machine learning approach which applies to a real network. Removing the content features (feature 10-22) from the structure of the KDD dataset results in an attack detection classification model with high accuracy. The experiment without involving the content features of KDD dataset gives the best classification accuracy. From our experiments on the feature selection, it is shown that most of the content features have low scores so that they can be eliminated from the dataset. The developed model implemented on IDS works well in detecting attacks in a real network.
The results of the experiment on several classification models show that SVM provides the best performance results with average accuracy on the test data of 93.4% for classification of two-class, and 86.8% for multiclass classification. Processing time on the test dataset for classification of two-class and multiclass is quite short. Experiments on real networks, the IDS that have been developed the ability to detect attacks with 6.839.023 network traffic within 7 minutes, 49 seconds, 30 milliseconds.
6 Acknowledgement
This article's publication is supported by the United States Agency for International Development (USAID) through the Sustainable Higher Education Research Alliance (SHERA) Program for Universitas Indonesia's Scientific Modeling, Application, Research, and Training for City-centered Innovation and Technology (SMART CITY) Project, Grant #AID-497-A-1600004, Sub Grant #IIE-00000078-UI-1.
References
[1] Zayed Al Haddad, Mostafa Hanoune, and Abdelaziz Mamouni, "A Collaborative Network Intrusion Detection System (C-NIDS) in Cloud Computing," International Journal of Communication Networks and Information Security (IJCNIS) vol. Vol. 8, No. 3, December 2016, pp. 130-135, 2016.
[2] A. L. Buczak and E. Guven, "A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection," in IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176, Second quarter 2016.
[3] Monowar Hussain Bhuyan, D K Bhattacharyya and J K Kalita "Survey on Incremental Approaches for Network Anomaly Detection" International Journal of Communication Networks and Information Security (IJCNIS) vol. Vol. 3, No. 3, December 2011, pp. 226-239, 2011.
[4] M. Tavallaee, N. Stakhanova and A. A. Ghorbani, "Toward Credible Evaluation of Anomaly-Based Intrusion-Detection Methods," in IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 5, pp. 516-524, Sept. 2010.
[5] Amiruddin, Anak Agung Putri Ratna, and Riri Fitri Sari " New Key Generation and Encryption Algorithms for Privacy Preservation in Mobile Ad Hoc Networks" International Journal of Communication Networks and Information Security (IJCNIS) vol. Vol. 9, No. 3, December 2017, pp. 376-385, 2017.
[6] Trustworthy Internet Movement (www.trustworthyinternet.org).
[7] https://www.nsslabs.com/company/news/press-releases/nss-labs-predicts-75-of-web-traffic-will-be-encrypted-by-2019/
[8] Blake Anderson, David McGrew, "Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity", Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada--August 13 - 17, pp. 1723-1732, 2017.
[9] R. Koch, M. Golling and G. D. Rodosek, "Behavior-based intrusion detection in encrypted environments," in IEEE Communications Magazine, vol. 52, no. 7, pp. 124-131, July 2014.
[10] J. Lekha, and Padmavathi Ganapathi, " Detection of Illegal Traffic Pattern using Hybrid Improved CART and Multiple Extreme Learning Machine Approach," International Journal of Communication Networks and Information Security (IJCNIS) vol. Vol. 9, No. 2, August 2017, pp. 164-171, 2017.
[11] Mohammed A. Ambusaidi, Xiangjian He, Priyadarsi Nanda, and Zhiyuan Tan, " Building an Intrusion Detection System Using a Filter-Based Feature Selection Algorithm," IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 10, October 2016.
[12] Chaouki Khammassi, Saoussen Krichen, " A GA-LR wrapper approach for feature selection in network intrusion detection," computers & security 70 (2017) 255-277. Available: http://dx.doi.org/10.1016/j.cose.2017.06.005
[13] M.R. Gauthama Raman, Nivethitha Somu, Kannan Kirthivasan, Ramiro Liscano, V.S. Shankar Sriram, "An efficient intrusion detection system based on hypergraph - a Genetic algorithm for parameter optimization and feature selection in support vector machine," Knowledge-Based Systems 134 (2017) 1-12. Available: https://doi.org/10.10167j.knosys.2017.07.005
[14] Akashdeep, Ishfaq Manzoor, Neeraj Kumar, "A feature reduced intrusion detection system using ANN classifier," Expert Systems With Applications 88 (2017) 249-257 Available: http://dx.doi.org/10.1016/j.eswa.2017.07.005
[15] Sumaiya Thaseen Ikram, Aswani Kumar Cherukuri, "Intrusion detection model using a fusion of chi-square feature selection and multiclass SVM," Journal of King Saud University - Computer and Information Sciences (2017) 29, 462-472. Available: http://dx.doi.org/10.1016/j.jksuci.2015.12.004
[16] Prashant Kushwaha, Himanshu Buckchash, and Balasubramanian Raman, "Anomaly-Based Intrusion Detection Using Filter Based Feature Selection on KDD-CUP 99," Proceeding of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-8, 2017.
[17] KDD Cup 1999 Data http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[18] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, "A Detailed Analysis of the KDD CUP 99 Data Set," The second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.
[19] Lee and Stolfo, "A framework for constructing features and models for intrusion detection systems", ACM Transaction on Information and System Security, vol. 3, issue 4, pp. 227-261, 2000.
[20] Jason Deign, "The encryption that protects your online data can also hide malware", https://newsroom.cisco.com/feature-content?type=webcontent&articleId=1853370.
[21] Malley B., Ramazzotti D., Wu J.T., "(2016) Data Preprocessing," Secondary Analysis of Electronic Health Records. Springer, Cham, pp 115-141. Available: https://doi.org/10.1007/978-3-319-43742-2_12
[22] Isabelle Guyon, Andre Elisseeff "An Introduction to Variables and Feature Selection" Journal of Machine Learning Research 3 (2003) 1157-1182.
[23] Han Lu, Mihaela Cocea, Weili Din, "Decision tree learning based feature evaluation and selection for image classification," Proceeding of The 2017 International Conference on Machine Learning and Cybernetics (ICMLC).
[24] Nachirat Rachburee and Wattana Punlumjeak, "A Comparison of Feature Selection Approach Between Greedy, IG-ratio, Chi-square, and mRMR in Educational Mining, Proceeding of The 2015 7th International Conference on Information Technology and Electrical Engineering (ICITEE), Chiang Mai, Thailand.
[25] Jun Lin, Jing Zhang, "A Fast Parameters Selection Method of Support Vector Machine Based on Coarse Grid Search and Pattern Search", Proceeding of 2013 Fourth Global Congress on Intelligent Systems.
[26] Powers, David M W, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation," Journal of Machine Learning Technologies. 2 (1): 37-63.
[27] Pedregosa et al., "Scikit-learn: Machine Learning in Python," JMLR 12, pp. 2825-2830, 2011.
[28] NSL-KDD Dataset, http://www.unb.ca/cic/datasets/nsl.html
Bisyron Wahyudi (1), Kalamullah Ramli (2), and Hendri Murfi (3)
(1,2) Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Indonesia (3) Department of Mathematics, Faculty of Science, Universitas Indonesia, Indonesia
Table 1 Network Traffic Classification Two-Class normal abnormal/attack Multiclass (5 Labels) normal DoS Probe R2L U2R Multiclass (23 Labels) normal smurf satan warezclient buffer overflow neptune ipsweep guess passwd rootkit back portsweep warezmaster loadmodule teardrop nmap imap perl pod ftp write land multihop phf Table 2 Basic features of individual TCP connections No Feature Name Description Type 1 duration length (number of seconds) continuous of the connection 2 protocol_type type of the protocol, e.g., tcp, discrete udp, etc. 3 service network service on the discrete destination, e.g., http, telnet, etc. 4 src_bytes number of data bytes from continuous source to destination 5 dst_bytes number of data bytes from continuous destination to source 6 flag normal or error status of the discrete connection 7 land 1 if connection is from/to the discrete same host/port; 0 otherwise 8 wrong_fragment number of 'wrong' fragments continuous 9 urgent number of urgent packets continuous Table 3 Content features within a connection suggested by domain expert knowledge No Feature Name Description Type 10 hot number of 'hot' indicators continuous 11 num_failed_logins number of failed login continuous attempts 12 logged_in 1 if successfully logged in; discrete 0 otherwise 13 num_compromised number of compromised continuous conditions 14 root_shell 1 if root shell is obtained; discrete 0 otherwise 15 su_attempted 1 if 'su root' command discrete attempted; 0 otherwise 16 num_root number of 'root' accesses continuous 17 num_file_creations number of file creation continuous operations 18 num_shells number of shell prompts continuous 19 num_access_files number of operations on continuous access control files 20 num_outbound_cmds number of outbound continuous commands in an ftp session 21 is_hot_login 1 if the login belongs to discrete the 'hot' list; 0 otherwise 22 is_guest_login 1 if the login is a 'guest' discrete login; 0 otherwise Table 4 Traffic features computed using a two-second time window No Feature Name Description Type 23 count number of connections to continuous the same host as the current connection in the past two seconds 24 serror_rate % of connections that have continuous 'SYN' errors 25 rerror_rate % of connections that have continuous 'REJ' errors 26 same_srv_rate % of connections to the continuous same service 27 diff_srv_rate % of connections to continuous different services 28 srv_count number of connections to continuous the same service as the current connection in the past two seconds 29 srv_serror_rate % of connections that have continuous 'SYN' errors 30 srv_serror_rate % of connections that have continuous 'REJ' errors 31 srv_diff_ho st_rate % of connections to continuous different hosts Table 5 Traffic features computed using the previous 100 connections No Feature Name Description Type 32 dst_host_count count of destination host continuous 33 dst_host_srv_count count of continuous destination host service 34 dst_host_same_srv_rate same service rate continuous for destination host 45 dst_host_diff_srv_rate difference continuous service rate for destination host 36 dst_host_same_src_port_rate same source port continuous rate for destination host 37 dst host srv diff host rate difference host continuous rate for destination host service 38 dst_host_serror_rate % 'SYN' errors continuous of destination host 39 dst_host_srv_serror_rate % 'SYN' errors continuous of destination host service 40 dst_host_rerror_rate % 'REJ' continuous errors of destination host 41 dst_host_srv_rerror_rate % 'REJ' continuous errors of destination host service Table 6 Dataset Description Data Total Normal Dos Probe R2L U2R NSL Train 125.972 67.342 45.927 11.656 995 52 NSL Test 18.793 9.711 5.740 1.106 2.199 37 10% KDD10% 49.402 9.751 39.139 388 123 1 20% KDD10% 98.804 19.577 78.196 797 228 6 Table 7 Feature Selection Score Univariate Chi Features Square Wrapper duration 2.826519543 63,54787209 protocol_type 0.172328957 295,6548746 service 8.243793956 121,9684166 flag 5.465393293 1910,788543 src_bytes 1089.543299 175,9347821 dst_bytes 7126.314318 117,271535 land 0.00065122 175,6223816 wrong_fragment 0.226732233 96,22986459 urgent 0.000181718 43,11841727 hot 0.040297298 83,83387513 num_failed_logins 0.000297401 26,57231803 logged_in 3.625973858 679,0578386 num_compromised 0.195937472 38,5219464 root_shell 0.005176937 65,36687351 su_attempted 0.01173053 93,63296562 num_root 0.457813131 15,44384933 num_file_creations 0.100395265 13,60867399 num_shells 0.001347134 56,37072523 num_access_files 0.040903397 29,66709078 num_outbound_cmds 0 0 is_host_login 8.70631E-05 43,81715626 is_guest_login 0.019253195 105,9430942 count 652.5616721 82,83794029 srv_count 0.000272025 53,6925844 serror_rate 329.0606084 755,9977392 srv_serror_rate 317.0746204 101,812899 rerror_rate 55.81489236 70,47880838 srv_rerror_rate 42.61256533 159,1656237 same_srv_rate 208.264141 1480,058033 diff_srv_rate 26.63040191 169,1683489 srv_diff_host_rate 14.15434894 91,17000547 dst_host_count 95.74258043 437,2705228 dst_host_srv_count 696.8826698 638,6550968 dst_host_same_srv_rate 234.4820854 100,187664 dst_host_diff_srv_rate 31.97842615 77,14903087 dst_host_same_src_port_rate 6.927567558 255,3532438 dst_host_srv_diff_host_rate 1.456325149 129,8679041 dst_host_serror_rate 372.2546649 293,9741698 dst_host_srv_serror_rate 381.5193858 416,1960159 dst_host_rerror_rate 63.54755808 323,118674 dst_host_srv_rerror_rate 68.66941117 111,8726018 Table 8 Selected Feature Model Accuracy Features Precision Recall F1-score AUC Accuracy Wrapper (19) 0.998157 0.997663 0.99791 0.99803 0.998055 Univariate (21) 0.996335 0.996964 0.99665 0.996886 0.99688 28 0.998498 0.998363 0.998431 0.998528 0.998539 41 0.998357 0.998284 0.998071 0.998144 0.998135 Table 9 Two-Class Model Performance Method Data Test Precision Recall F1-score Accuracy SVM NSL Test 0,837 0,793 0,815 0,826 KDD10% 0,997 0,988 0,993 0,988 20KDD10% 0,997 0,989 0,993 0,988 Naive Bayes NSL Test 0,928 0,703 0,800 0,830 KDD10% 0,885 0,284 0,430 0,396 20KDD10% 0,885 0,284 0,430 0,396 KNN NSL Test 0,959 0,803 0,874 0,888 KDD10% 0,998 0,992 0,995 0,992 20KDD10% 0,998 0,992 0,995 0,992 Neural Network NSL Test 0,951 0,749 0,838 0,860 KDD10% 0,996 0,993 0,994 0,990 20KDD10% 0,995 0,993 0,994 0,990 Table 10 Multiclass Model Performance Method Data Test Precision Recall F1-score Accuracy SVM NSL Test 0,837 0,837 0,837 0,837 KDD10% 0,883 0,883 0,883 0,883 20KDD10% 0,884 0,884 0,884 0,884 Naive Bayes NSL Test 0,830 0,830 0,830 0,830 KDD10% 0,350 0,350 0,350 0,350 20KDD10% 0,351 0,351 0,351 0,351 KNN NSL Test 0,882 0,882 0,882 0,882 KDD10% 0,989 0,989 0,989 0,989 20KDD10% 0,990 0,990 0,990 0,990 Neural Network NSL Test 0,826 0,826 0,826 0,826 KDD10% 0,205 0,205 0,205 0,205 20KDD10% 0,207 0,207 0,207 0,207 Table 11 Two-Class Data SVM Prediction Result Data Total Normal Attack Time NSL X_train 125.972 67.350 58.622 0:00:08.80 NSL X_test 18.793 10.188 8.605 0:00:01.21 KDD10% 10% 49.402 10.090 39.312 0:00:03.77 KDD10% 20% 98.804 20.195 78.609 0:00:06.86 PCAP Data 6.839.023 5.063.475 1.775.548 0:07:49.30 Tablea 12 Multiclass Data SVM Prediction Result Data Total Normal Dos NSL X_train 125.972 67.344 45.936 NSL X_test 18.793 10.575 5.476 KDD10% 10% 49.402 15.075 33.684 KDD10% 20% 98.804 30.233 67.300 PCAP Data 6.839.023 2.854.841 2.053.314 Data Probe R2L U2R Time NSL X_train 11.661 989 42 0:00:09.09 NSL X_test 1.778 963 1 0:00:01.03 KDD10% 10% 399 244 - 0:00:03.65 KDD10% 20% 827 444 - 0:00:07.12 PCAP Data 1.860.330 66.291 4.247 0:08:29.50
![]() ![]() ![]() ![]() | |
Author: | Wahyudi, Bisyron; Ramli, Kalamullah; Murfi, Hendri |
---|---|
Publication: | International Journal of Communication Networks and Information Security (IJCNIS) |
Article Type: | Report |
Date: | Aug 1, 2018 |
Words: | 7562 |
Previous Article: | Performance Analysis of User Speed Impact on IEEE 802.11ah Standard affected by Doppler Effect. |
Next Article: | Performance Analysis in Wireless Powered D2D-Aided Non-Orthogonal Multiple Access Networks. |
Topics: |