Research on the network security based on PrefixSpan algorithm in data mining.
Over the past decades, information technology has transformed the world. Information plays an increasingly important role in production and daily life, people rely more and more on products based on information technology, and the information industry built on that technology has become a pillar of the world economy (Evgeniou, 2005; Bach, 2009; Aloysius, 2013). As the scale and capacity of the Internet have grown rapidly, equipment and management have become more complex, and open networks have accumulated many hidden security risks. Information security problems have become widespread and have worsened in recent years, posing a large economic threat to countries around the world, so information security has become an increasingly important research topic (George, 2012; Hegde, 2012; Goel, 2015). Figure 1 shows statistics on the forms that information security incidents take.
Among the techniques for strengthening information security, data mining is the most important. A data mining algorithm can be described precisely as a finite set of rules: it takes data as input and produces output in the form of a model or pattern (Jia-xin, 2012; Kaur, 2014; Hijawi, 2015). Most data mining algorithms use one or a few objective functions together with some search method to find structure in the data, or to establish relationships based on the distance between points, or small regions, in the data space (Kumar, 2013; Li, 2014; Pereira, 2015). By mining method, data mining algorithms can be divided into those with a teacher and those without, that is, supervised learning and unsupervised learning.
As the core component of data mining, the basic methods can be summarized in the following three parts.
(1) Bayesian classification. Bayesian classification algorithms are a family of classification algorithms based on probability and statistics, such as the naive Bayes (NB) algorithm. The algorithm uses Bayes' theorem to predict the probability that an unknown sample belongs to each category and chooses the most likely category as the sample's final category. Because the method relies on a strong independence assumption that often does not hold in practice, classification accuracy can drop.
(2) Artificial neural network. Artificial neural networks are a powerful tool for handling nonlinearity and uncertainty, but they currently have many limitations. First, the black-box internal representation of knowledge means that a network cannot make use of prior experience when learning, and training easily falls into a local minimum. Second, an artificial neural network essentially uses a static network to handle a continuous-time control problem of a dynamic system. This inevitably raises the problem of determining the model order, and the size and complexity of the network grow quickly with that order (Parimala, 2012; Song, 2013; Schweizer, 2015). The generalization ability of global approximation is limited by the large number of local minima and by slow learning, while local approximation is severely limited by storage capacity and real-time performance.
(3) Classification based on association rules. Association rule mining is the process of discovering interesting associations between item sets in large amounts of data. It is an important topic in data mining and is now widely applied in many fields.
Existing association rule mining algorithms can be roughly divided into search algorithms, hierarchical algorithms and parallel algorithms (Wang, 2013; Wang, 2015). A parallel algorithm is a collection of processes that can execute at the same time and that interact and coordinate to solve a problem. The goal is to minimize the time complexity of the algorithm. To achieve this, as many computing tasks as possible are made to run independently at each moment so that the total number of computational steps is reduced, or the work done in each time step is increased so that the overall time complexity falls, at the cost of an appropriate increase in space complexity.
Based on the literature and the issues reviewed above, this paper studies network security based on the PrefixSpan algorithm in data mining. The following sections discuss the approach in detail.
2. The data clustering and mining algorithms
2.1. The overview of the data clustering and the functionality
With the spread of the Internet and e-commerce, the amount of information has grown rapidly. The Internet is full of a rich variety of information, which is both an opportunity and a challenge: finding the information we need in this vast ocean becomes more difficult. Collaborative filtering is widely regarded as the most popular and effective recommendation technique, and its results are often better than those of other recommendation methods. However, it also has defects, such as data sparseness, limited scalability and the cold start problem, all of which stem from its internal working mechanism.
To reduce the complexity of the online computation in a recommendation system, solve the scalability problem and meet the system's real-time requirements, data dimension reduction is generally performed in the data preprocessing phase. There are currently two main families of dimension reduction methods: feature selection and feature reconstruction. Feature selection discards the features that contribute little to classification in order to achieve a certain degree of dimension reduction, while feature reconstruction usually derives a smaller feature set from the original feature set. Both approaches lose some information while reducing the dimension, although the resulting feature space can retain most of the global information in the original features. This paper promotes the use of clustering for dimension reduction, which aims to reduce the computational load and improve the scalability of the system without losing data information, and thus to achieve real-time recommendation.
Each of these clustering methods has advantages and disadvantages. For example, hierarchical clustering can decompose a data set into several levels of clusters but cannot automatically determine an appropriate number of categories. Two-step clustering can produce discriminant information for different numbers of clusters, together with the final cluster frequencies and descriptive statistics, but it is difficult to obtain an intuitive clustering result from it. The following subsections introduce our methodology, referred to as dynamic clustering, which combines two basic components.
2.2. The principles of fuzzy clustering
A growing hierarchical self-organizing map network can grow in the horizontal direction and also in the vertical direction. In horizontal growth, the self-organizing map increases the size of the network in a symmetrical manner, so that no neuron has to represent too many input vectors and become inaccurate. In vertical growth, the lowest layer of the self-organizing map network is checked periodically to see whether it fully covers the input data; if it does not, the network grows vertically.
Divergence measures the degree of difference between distributions. A consistency-constraint regularization term is introduced into the penalized likelihood function to adjust the estimates of the model parameters, so that sample pairs under a positive constraint have largely similar posterior distributions and sample pairs under a negative constraint have clearly different posterior distributions. Constrained clustering is thus realized by penalizing constraint violations. As a parametric statistical model, the probability model can be expressed as follows.
$$p(x_i \mid \Theta) = \sum_{k=1}^{K} \pi_k\, p_k(x_i \mid \theta_k) \tag{1}$$
By unfolding the batch-process data matrix into time series and averaging the covariance matrices of the batches, static and dynamic characteristics can be handled at the same time. However, the extended data array is restricted by the choice of delay length and characterizes the dynamic behavior of only a local time period. Formula 2 gives the density of each component.
$$p_k(x_i \mid \theta_k) = \frac{1}{(2\pi)^{D/2}\,\lvert \Sigma_k \rvert^{1/2}} \exp\!\left(-\frac{1}{2}\,(x_i - \xi_k)^{T}\,\Sigma_k^{-1}\,(x_i - \xi_k)\right) \tag{2}$$
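As an illustration only, formulas 1 and 2 can be evaluated numerically with a few lines of Python. The sketch below assumes spherical components (each covariance matrix is a variance times the identity), which the paper does not state; the function names and parameters are hypothetical.

```python
import math

def gaussian_density(x, mean, var):
    """Formula (2) under the assumption Sigma_k = var * I (spherical covariance)."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2.0 * math.pi * var) ** (d / 2.0)  # (2*pi)^(D/2) * |Sigma_k|^(1/2)
    return math.exp(-0.5 * sq_dist / var) / norm

def mixture_density(x, weights, means, variances):
    """Formula (1): sum over components of pi_k * p_k(x | theta_k)."""
    return sum(w * gaussian_density(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Two one-dimensional components with equal mixing weights (illustrative values).
density = mixture_density([0.0], [0.5, 0.5], [[0.0], [4.0]], [1.0, 1.0])
```

The mixing weights are expected to be non-negative and to sum to one; the sketch does not check this.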
Such monitoring methods can capture dynamic and nonlinear behavior, but they cannot effectively detect weak faults when the initial conditions fluctuate, because the weak fault is submerged under the fluctuation of the initial conditions. The function Q is defined as the difference between the actual number of connections within the clusters and the number of connections expected under random wiring; it is used to describe the cluster structure of a network quantitatively, as shown in expression 3.
$$Q = \sum_{s=1}^{K} \left[ \frac{m_s}{m} - \left( \frac{d_s}{2m} \right)^{2} \right] \tag{3}$$
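A minimal sketch of how expression 3 might be computed for an undirected network is given below. It assumes the usual reading of the symbols, which the paper does not spell out: m is the total number of edges, m_s the number of edges inside cluster s, and d_s the sum of the degrees of the nodes in cluster s. The function name and the data layout are hypothetical.

```python
def modularity(edges, clusters):
    """Expression (3) for an undirected graph.

    edges: list of (u, v) pairs; clusters: dict mapping node -> cluster id.
    Assumes m = number of edges, m_s = edges inside cluster s,
    d_s = total degree of the nodes in cluster s.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    inside = {}
    degree = {}
    for u, v in edges:
        cu, cv = clusters[u], clusters[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(s, 0) / m - (d / (2 * m)) ** 2
               for s, d in degree.items())

# Two triangles joined by a single edge, each triangle taken as one cluster.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(edges, clusters)
```

Densely connected clusters with few links between them give a larger Q than a split that cuts through a dense group.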
We convert vector data sets with a variety of structures into networks. Through observation, we found that with appropriate parameter values the network has the same type of cluster structure as the original data set: nodes within a cluster are densely interconnected, while nodes in different clusters are sparsely connected, so clustering the network yields a clustering of the original data set. How well the model captures the nonlinear characteristics of the process depends on the choice of kernel function, so that choice is critical. Commonly used kernel functions are summarized in the following equations.
$$k_{\mathrm{polynomial}}(x, y) = \langle x, y \rangle^{d} \tag{4}$$
$$k_{\mathrm{sigmoid}}(x, y) = \tanh\left(\beta_0 \langle x, y \rangle + \beta_1\right) \tag{5}$$
$$k_{\mathrm{Gaussian}}(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{C}\right) \tag{6}$$
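The three kernels in formulas 4 to 6 are simple to state in code. The following sketch is illustrative; the default parameter values are arbitrary assumptions and the function names are not taken from the paper.

```python
import math

def dot(x, y):
    """Inner product <x, y> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2):
    """Formula (4): <x, y> raised to the power d."""
    return dot(x, y) ** d

def sigmoid_kernel(x, y, beta0=1.0, beta1=0.0):
    """Formula (5): tanh(beta0 * <x, y> + beta1)."""
    return math.tanh(beta0 * dot(x, y) + beta1)

def gaussian_kernel(x, y, c=1.0):
    """Formula (6): exp(-||x - y||^2 / C)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / c)
```

The Gaussian kernel equals 1 for identical inputs and decays with distance, which is one reason it is a common default when little is known about the data.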
The algorithm uses structural similarity as the division criterion. At each step it removes from the network the edge with the smallest structural similarity, and when the algorithm ends a hierarchical clustering of the network is obtained. However, each time an edge is removed, the structural similarity of every edge that shares a vertex with the removed edge changes. We found that if this change is taken into account, that is, if the structural similarity of all affected edges is recalculated after each deletion, the clustering quality of the algorithm improves greatly. The subject's main task is to collect as much useful information as possible and submit it to the central console. This places demands on the computing power of the centralized controller, so the best-performing computer available should be chosen to host the central console. Figure 2 shows the related demonstration.
[FIGURE 2 OMITTED]
2.3. Support vector machine based rough clustering
The support vector machine is a practical realization of the structural risk minimization principle from statistical learning theory. It shows good generalization ability in small-sample, nonlinear and high-dimensional pattern recognition and regression estimation problems, and it does not suffer from local optima. The performance of a prediction model depends largely on the training samples used. In many cases it is difficult to obtain all the typical samples at once, so incremental learning is often needed: after training on the existing samples is complete, new samples are added and trained incrementally. We first map the training data set nonlinearly into a high-dimensional feature space, which transforms the nonlinear function estimation problem into a linear function estimation problem in that feature space, written as follows.
$$f(x_i) = w \cdot \psi(x_i) + b \tag{7}$$
The optimal support vector regression function is the one that satisfies the structural risk minimization principle, that is, the one that minimizes the following functional.
$$J = \frac{1}{2}\lVert w \rVert^{2} + C\,R_{\mathrm{emp}}[f] \tag{8}$$
The clustering-based incremental learning algorithm for support vector machines takes full advantage of earlier training results when training on new data, which significantly reduces subsequent training time. Minimizing the risk functional above is equivalent to solving the optimization problem below.
$$\min J = \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{l} \left(\gamma_i^{-} + \gamma_i^{+}\right) \tag{9}$$
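To make the objective in formula 9 concrete, the sketch below evaluates it for a given weight vector. It rests on assumptions that the paper does not state: the slack variables are taken to be the usual epsilon-insensitive deviations above and below the prediction, and the feature map is the identity so that the prediction is linear in the input. The function name and the parameter `eps` are hypothetical.

```python
def svr_objective(w, b, xs, ys, c=1.0, eps=0.1):
    """Value of formula (9) for fixed w and b.

    Assumes psi(x) = x and epsilon-insensitive slacks:
    gamma_plus  = max(0, y - f(x) - eps)
    gamma_minus = max(0, f(x) - y - eps)
    """
    total_slack = 0.0
    for x, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        gamma_plus = max(0.0, y - pred - eps)
        gamma_minus = max(0.0, pred - y - eps)
        total_slack += gamma_plus + gamma_minus
    return 0.5 * sum(wi * wi for wi in w) + c * total_slack

# One sample is fitted exactly, the other is under-predicted by 1.0.
value = svr_objective([1.0], 0.0, [[1.0], [2.0]], [1.0, 3.0], c=2.0, eps=0.5)
```

A larger C penalizes deviations outside the tolerance band more heavily, while the first term keeps the weight vector small.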
The nearest neighbor clustering algorithm groups the samples into several subsets, and a support vector regression model is then built on each subset; the support vectors serve as the standard training samples. When a new sample arrives, it is first assigned to the appropriate subset, giving the cluster of standard and new samples closest to it. To distinguish the importance of the training samples, each one is given a suitable initial weight, as shown in formula 10.
$$\min W(a, a^{*}) = \min \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (a_i^{*} - a_i)(a_j^{*} - a_j)\,K(x_i, x_j) - \sum_{i=1}^{l} (a_i^{*} - a_i)\,y_i \tag{10}$$
Similar samples drive the model's forecast, so it is natural to give greater weight to the standard support vector samples that are closer to a new sample. A basic granule corresponds to an equivalence class in rough set theory, and the clustering operation essentially defines an equivalence relation between samples. Any two sample points in the same class are considered equivalent: they can be regarded as having similar properties and as indistinguishable at the current threshold scale. Accordingly, the algorithm takes each data object as a center, calculates the similarity between objects, and then partitions the objects into initial equivalence classes according to a similarity threshold. Clustering based on rough sets tends to produce too many small categories of little significance, so the initial clustering usually needs further processing to reach a more suitable granularity. In this way a suitably coarse granularity can be obtained while the clustering accuracy of the algorithm is maintained.
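The threshold-based partition described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: similarity is taken to be 1 / (1 + Euclidean distance), and objects are merged transitively with a union-find structure so that the result is a true equivalence relation. The function name is hypothetical.

```python
def threshold_clusters(points, threshold):
    """Group point indices whose pairwise similarity reaches the threshold.

    Similarity is assumed to be 1 / (1 + Euclidean distance); merging is
    transitive, so the groups form equivalence classes.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the class representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def similarity(p, q):
        dist = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        return 1.0 / (1.0 + dist)

    for i in range(n):
        for j in range(i + 1, n):
            if similarity(points[i], points[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

classes = threshold_clusters([[0.0], [0.5], [1.0], [10.0]], threshold=0.6)
```

Raising the threshold produces more and smaller classes, which matches the remark that the initial result often needs to be coarsened afterwards.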
3. The proposed methodology
3.1. PrefixSpan algorithm
PrefixSpan adopts a divide-and-conquer strategy. It first finds all frequent items and then creates a projected database for each frequent item. Each prefix has its own projected database, which is mined recursively. PrefixSpan builds patterns by prefix growth: each frequent prefix is extended with frequent items found in its suffixes, which avoids generating candidate sequences. The steps of the algorithm are as follows: (1) scan the sequence database once to obtain all n frequent items, that is, the set of all frequent sequences of length 1; (2) divide the complete set of frequent sequences into n subsets, each consisting of the frequent sequences with a given prefix; (3) construct the corresponding projected database for each prefix and mine it recursively to find that subset of frequent sequences.
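The three steps above can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the paper: every element of a sequence is assumed to be a single item (no itemsets), support is the absolute number of sequences that contain a pattern, and a projected database is stored as (sequence index, start offset) pairs instead of copied suffixes.

```python
def prefixspan(sequences, min_support):
    """Return (pattern, support) pairs for sequences of single items."""
    results = []

    def mine(prefix, projected):
        # Step 1 on the projected database: count each item once per sequence.
        counts = {}
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            support = counts[item]
            if support < min_support:
                continue
            # Step 2: each frequent item defines the subset of patterns
            # that share the grown prefix.
            pattern = prefix + [item]
            results.append((pattern, support))
            # Step 3: project on the suffix after the first occurrence
            # of the item, then mine recursively.
            next_projected = []
            for sid, start in projected:
                seq = sequences[sid]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        next_projected.append((sid, pos + 1))
                        break
            mine(pattern, next_projected)

    mine([], [(i, 0) for i in range(len(sequences))])
    return results

patterns = prefixspan([list("abc"), list("acb"), list("ab")], min_support=2)
```

Storing offsets instead of copying suffixes is a common way to keep projection cheap; it is a design choice of this sketch and is not described in the paper.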
When constructing projected databases, the prefixes already handled in the sequence database are checked so that the same frequent prefix is not projected repeatedly. This reduces the number of projections and saves the time that would otherwise be spent rebuilding projected databases and mining patterns in them. In addition, if the number of sequences in a projected database is smaller than the minimum support, no item in it can reach the minimum support. The IPMSP algorithm therefore compares the number of sequences in each projected database with the minimum support before scanning it for frequent items, and abandons any projected database with fewer sequences than the minimum support. Exiting early in this way avoids scanning projected databases that cannot contain frequent sequences and improves the efficiency of the algorithm, as shown in Figure 3.
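The early-exit test described in this paragraph can be illustrated with a short sketch. It assumes sequences of single items and an absolute minimum support; the helper names `project` and `worth_mining` are hypothetical and are not taken from the paper.

```python
def project(database, item):
    """Suffixes that follow the first occurrence of item; empty suffixes are dropped."""
    projected = []
    for seq in database:
        if item in seq:
            suffix = seq[seq.index(item) + 1:]
            if suffix:
                projected.append(suffix)
    return projected

def worth_mining(projected, min_support):
    """A projected database with fewer sequences than the minimum support
    cannot contain a frequent item, so it can be skipped without scanning."""
    return len(projected) >= min_support

database = [list("abc"), list("ab"), list("a")]
after_a = project(database, "a")   # two non-empty suffixes remain
after_b = project(database, "b")   # only one suffix remains
```

With a minimum support of 2, the database projected on "a" is still mined, while the one projected on "b" is abandoned before any item counting takes place.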
3.2. Intrusion detection system (IDS)
To meet the needs of current networks, an intrusion detection algorithm must be efficient, must discover new attacks in time, and should require as few input parameters as possible, each as easy to obtain and analyze as possible. To overcome the limitations of relying on a single algorithm and to integrate the advantages of several, this paper designs a new clustering algorithm. The framework has the following main goals. (1) Data sharing: a standard data format allows all kinds of data in the IDS to be transferred and shared between different systems. (2) Improved interoperability: a set of standard development interfaces and support tools allows components to be developed independently. (3) Shared IDS components: a component of one IDS can be used by other IDS components.
[FIGURE 3 OMITTED]
Based on this architecture, we construct the proposed IDS from the following aspects.
1. Expert system. An expert system detects intrusions using rules that describe the characteristics of intrusion behavior. These rules are the knowledge of the system: the effectiveness of the expert system depends on the completeness of its knowledge base, which in turn depends on the completeness and timeliness of the audit records.
2. Keystroke monitor. This is a simple intrusion detection method that detects intrusion behavior by analyzing patterns in the user's keystroke sequences. It can be used for host-based intrusion detection.
3. Prediction model. Predictive pattern generation is also an anomaly detection method. It is based on the assumption that the sequence of audit events is not random but follows an identifiable pattern. Compared with purely statistical methods, it adds analysis of the order of events and the relationships between them, so it can detect abnormal events that statistical methods cannot detect.
Whatever its type, a network attack is built on the underlying network protocols and follows the protocol standards, and its packets must flow through the Internet backbone. Monitoring traffic on the Internet backbone is therefore both necessary and meaningful: it allows the origin and type of network attacks to be analyzed so that further measures can be taken. A protocol exchange usually consists of a number of data packets, and characteristics can be extracted from the packets exchanged during a connection. A characteristic "statement" is extracted from each packet; the set of "statements" constitutes the characteristics of a given protocol, and the feature sets of all protocols together make up the overall feature library.
Here we define the parameters needed for the system structure. When two computers on a network communicate, a data message may contain one or more different pieces of information of the kinds described above. Each such piece of information is called a word, denoted W. The word is treated as a basic concept of the application layer protocol and as the basic unit of application layer protocol analysis. Communication protocols running at the application layer use words to recognize the different kinds of communication information. The feature can then be defined as in formula 11.
$$W = (W_n, W_l, W_i, W_b, W_m) \tag{11}$$
A distance-based hierarchical clustering algorithm is first used to find N initial points for clustering. The results of this first clustering serve as the initial cluster centers for the fuzzy C-means method, which then performs dynamic clustering, adjusting the cluster centers repeatedly until the clustering satisfies the given conditions. Because the mining algorithm is relatively independent of the intrusion detection system and does not depend on specific data or a specific system, an intrusion detection system based on data mining places very low requirements on its data source. The detection algorithm has the following three main steps.
1. First, establish a global "session" hash table keyed by the hash of the protocol five-tuple. Each node of the hash table stores the session information and the corresponding protocol number.
2. A TCP segment is divided into the TCP header and the TCP data section. The source and destination IP addresses have already been extracted from the IP header. From the TCP header, two equally important pieces of information are extracted: the source port and the destination port. The TCP data section carries the complete content of a message as a network byte stream and is the main object of protocol detection. This article mainly analyzes the header information and extracts the five-tuple.
3. When the characteristic match yields a score greater than 0, search the hash table to check whether a node for that five-tuple exists. If it does not exist, create a new node. If it already exists, update the node's score and check whether the score exceeds the threshold; if it does, report the node's session information and the corresponding protocol score.
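The three steps above can be sketched with an ordinary Python dictionary standing in for the session hash table. The sketch is illustrative: the five-tuple is assumed to be (source IP, destination IP, source port, destination port, transport protocol number), and the function name, node layout and threshold value are hypothetical rather than taken from the paper.

```python
def update_session(table, five_tuple, app_protocol, score, threshold):
    """Record a protocol match for a session and report it once the
    accumulated score exceeds the threshold.

    Returns (five_tuple, app_protocol, score) when a report is due, else None.
    """
    if score <= 0:
        # Step 3 applies only to matches with a positive score.
        return None
    node = table.get(five_tuple)
    if node is None:
        # Unknown session: create a new node and stop.
        table[five_tuple] = {"protocol": app_protocol, "score": score}
        return None
    # Known session: update the score, then compare with the threshold.
    node["score"] += score
    if node["score"] > threshold:
        return (five_tuple, node["protocol"], node["score"])
    return None

sessions = {}
key = ("10.0.0.1", "10.0.0.2", 1234, 80, 6)  # assumed five-tuple layout
first = update_session(sessions, key, "http", 3, threshold=5)
second = update_session(sessions, key, "http", 3, threshold=5)
```

Python tuples are hashable, so the five-tuple can be used directly as the dictionary key without computing a separate hash value.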
The sequential pattern mining algorithm proposed in this paper is implemented in Java. All of the algorithm code is written in Java and compiled in the Eclipse 3.1 environment. Figure 4 shows the performance of the PrefixSpan algorithm. The upper curve is PrefixSpan and the lower curve is the other algorithm. As the figure shows, when the minimum support is very small, the running time of PrefixSpan is much less than that of the other algorithm, and the difference between the two is obvious. As the minimum support increases, the gap between the two gradually narrows. This is because, when the minimum support is large, the number of sequential patterns is very limited and their length is very short, so the characteristics of the algorithm are hard to bring out. Figures 5 and 6 show further tests.
[FIGURE 4 OMITTED]
[FIGURE 5 OMITTED]
[FIGURE 6 OMITTED]
In this paper, we conduct a theoretical analysis of network security based on the PrefixSpan algorithm in data mining. Intrusion detection collects and analyzes information at several key points in a network or computer system to find out whether there are violations of the security policy or signs of attack, and reports unauthorized access or anomalies in a timely manner. As a supplement to the firewall, intrusion detection can accurately identify intrusions, respond to them immediately and shut down services in time, or even cut off the link, and it can identify attacks that originate from the local network segment, from other network segments or from external networks. We integrate data mining algorithms to implement the IDS. The experimental results reflect the effectiveness of the methodology.
Aloysius, G., & Binu, D. (2013). An approach to products placement in supermarkets using PrefixSpan algorithm. Journal of King Saud University-Computer and Information Sciences, 25(1), 77-87.
Bach, F. R. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in neural information processing systems, 105-112.
Evgeniou, T., Micchelli, C. A., & Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6, 615-637.
George, A., & Binu, D. (2012). DRL-Prefixspan: A novel pattern growth algorithm for discovering downturn, revision and launch (DRL) sequential patterns. Open Computer Science, 2(4), 426-439.
Goel, A., & Mallick, B. (2015). Customer Purchasing Behavior using Sequential Pattern Mining Technique. International Journal of Computer Applications, 119, 54-60.
Hegde, G., & Govilkar, S. (2012). Study on Parallel Algorithm for Sequential Pattern Mining. Proc Spie, 2, 657-670.
Hijawi, H. M., & Saheb, M. H. (2015). Sequence Pattern Mining in Data Streams. Computer and Information Science, 8(3), 64-68.
Jia-xin, L. I. U. (2012). An Interactive Sequential Patterns Mining Algorithm Based on Frequent Sequence Tree. Computer Technology and Development, 5, 10-17.
Kaur, D., & Kaur, R. (2014). Minimizing the repeated database scan using an efficient frequent pattern mining algorithm in web usage mining. Int J Res Advent Technol, 2(6), 2321-9637.
Kumar, A., & Kumar, V. (2013). MISP (Modified IncSpan+): Incremental Mining of Sequential Patterns. International Journal of Computer Applications, 65(8), 40-45.
Li, J., Hao, H., & Hao, F. (2014). The Prefix Span Algorithm Research of Synthetic Decision Support System Based on Internet of Things. In Computational Intelligence and Design (ISCID), 2014 Seventh International Symposium on, 1, 174-176.
Pereira, C., & Ferreira, C. (2015). Identification of IT Value Management Practices and Resources in COBIT 5. RISTI--Revista Iberica de Sistemas e Tecnologias de Informacao, (15), 17-33.
Parimala, M., & Sathiyabama, S. (2012). SPMLS: An Efficient Sequential Pattern Mining Algorithm with candidate Generation and Frequency Testing. International Journal on Computer Science and Engineering, 4(4), 590-601.
Schweizer, D., Zehnder, M., Wache, H., Witschel, H. F., Zanatta, D., & Rodriguez, M. (2015). Using Consumer Behavior Data to Reduce Energy Consumption in Smart Homes: Applying Machine Learning to Save Energy without Lowering Comfort of Inhabitants. 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 1123-1129.
Song, W., & Yang, K. (2013). Discovering Sequential Patterns with Various Time Constraints. Journal of Computational Information Systems, 9(15), 6047-6054.
Wang, W. N., Li, T. S., Zhang, H. Y., & Chen, Q. F. (2013). Projection position-based sequential pattern mining algorithm. Applied Mechanics and Materials, 239, 1298-1302.
Wang, J., Wang, H., Zhou, Y., & McDonald, N. (2015, October). Multiple kernel multivariate performance learning using cutting plane algorithm. Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference, 1870-1875.
Wang, S. G., Wang, F. X., & Song, Y. (2013). Chinese Comparative Sentences Identification Method Based on Sequential Patterns. Journal of Shanxi University (Natural Science Edition), 2, 8-12.
Hao Zhang, Zhanxiang Ye *
Information Technology Department, Wenzhou Vocational & Technical College, Wenzhou, China
Figure 1--The statistical data of the information security forms (table made from pie chart)
Non-cyber: 25% (16,879)
Scans/probes attempted access: 19% (12,652)
Policy violation: 17%
Equipment: 13% (8,798)
Malicious code: 11% (7,413)
Social engineering: 6% (3,967)
Suspicious network activity: 3% (2,198)
Investigation: 2% (1,035)
Unauthorized access: 2%
Improper usage: 1% (760)
Phishing: 0.3% (194)
Denial of service: 0.1% (85)
Author: Zhang, Hao; Ye, Zhanxiang
Publication: RISTI (Revista Iberica de Sistemas e Tecnologias de Informacao)
Date: Aug 1, 2016