# Customer segmentation in private banking sector using machine learning techniques.

1. IntroductionProfitable business development strategies always begin with good customer segmentation. This enables economic entities to identify specific characteristics of their products and services demanders, and elaborate efficient business plans. In the banking sector the process of customer segmentation has become a useful tool in gaining more customers, but also in extracting a higher value from the existing ones.

Market segmentation is a set of concepts and models that guides management thinking and leads to new profitable product/service offerings. The general assumptions are that customers with resembling characteristics have similar needs. Therefore, there is high interest in identifying homogeneous groups of existing customers or potential ones whose demand for distinct configured products sustains a promising market opportunity.

Experience has proven that an effective segmentation process eventually brings high benefits. However, before this happens, there are some important aspects that pre-condition the achievement of accurate models and often these relate to: variable selection and statistical methods chosen to perform the segmentation.

2. Binary classification methods

Data classification becomes each and every day an important problem, as the amount of information that is generated and needs to be classified becomes larger. Traditional statistical methods such as Discriminant Analysis and Logistic Models have been replaced in many cases by non-parametric techniques like Neural Networks and Support Vector Machines (henceforth SVMs). Thus, in recent years, machine learning has become one of the most promising alternatives to solve binary classification problems. SVMs have already proven good results in fields like medical diagnosis (Noble 2006), character recognition (Ahmad et al. 2004), and credit scoring (Auria, Moro 2008).

The present paper analyzes the predictive power of a private-banking segmentation model using both Neural Networks and SVMs.

2.1. Artificial Neural Networks

Artificial Neural Networks (ANNs) are computational methods inspired by the way in which the human brain functions. They are non-linear statistical data modeling techniques in which interconnected elements process the information simultaneously, having the ability to adapt and learn from past examples. Artificial Neural Networks have application in various fields such as: pattern recognition; medical diagnosis (Ultsch et al. (1995) used the capacity of neural network to diagnose acidosis diseases; Zhou et al. (2001) proposed an Artificial Neural Network ensemble in the process of lung cancer diagnosis; Kiyan and Yildirim (2003) used ANNs for breast cancer diagnosis); credit risk models (Amir (2001) developed a Neural Network bankruptcy prediction model; Baesens et al. (2003) used Neural Network rule extraction for credit risk evaluation); market prices changes (Yoon and Swales (1993) analyzed the forecasting power of Neural Networks for stock prices) etc.

First introduced by McCulloch and Pitts (1943), Neural Networks have passed through different stages over the following decades. One important development was made by Rosenblatt (1958) on alphanumerical character recognition when he presented the perceptron. Later, in their paper "Perceptron", Minsky and Papert (1969) managed to put to a doubt the modeling capacity of Neural Networks in solving linearly non-separable problems. No sooner than 15 years later, have Neural Networks started to regain strength through a number of papers outlining the capacity of multi-layered networks to solve non-linear problems.

Multilayer perceptrons (henceforth also referred to as MLPs) have three types of layers, namely one input layer, one or more hidden layers, and one output layer. The input layer handles the initial data features, while the output layer depends on the type of answer of the network. Knerr et al. (1992) have shown that one hidden layer usually manages to solve very well classification problems, but in extreme cases the number of hidden layers can go up to three (a higher number of hidden layers can lead to over-fitted models). However, there is no recipe in respect with the number of hidden neurons and most of the times, this is selected based on different trials and error comparisons.

Depending on the architecture of the MLPs, we distinguish feed-forward (Fig. 1) and feed-back networks (recurrent). The former permits the propagation of information only in one direction and never between neurons from the same layer, while the latter are more complex and signal can be processed several times within the same layer without moving to the next layer.

Back-propagation

The most popular and comprehensible type of feed-forward network is the "back-propagation" which was first proposed by Rumelhart et al. (1986). The most common back-propagation neural network uses the steepest descent (gradient descent) algorithm. According to this method, weights are adjusted in the direction corresponding to the negative gradient of the error surface. Nevertheless, the gradient does not indicate the global minimum, inducing the risk of getting blocked in a local minimum. This direction can be guided by two adjustable parameters such as learning rate and momentum rate. The first one is a parameter that controls the step size when weights are iteratively adjusted. Small learning rates generate slow convergence but impose some computational costs, while high values converge faster to the real solution, but might also overstep it or go into the wrong direction. The latter parameter, momentum rate, refers to the fact that previous changes in the weights should impact the current direction of movement in the weight space. However, the optimal values for these parameters depend on the type of problem one wants to solve and are obtained by making several tests.

Usually, the training process of a "back-propagation" algorithm involves a number of steps aimed to decrease the validation sample error.

Initially, one has to define the architecture of the neural network and select the appropriate input variables. Afterwards, the weights (w) and threshold values ([theta]) are initialized on a random basis. The training data and the output matrix are afterwards introduced (x and T respectively) in the model.

Then, for every neural node, one has to perform the output values. This is done both for the hidden layer (resulting thus the output vector H) and output layer (with output vector Y):

[u.sub.k] = [summation] [w.sub.ki][x.sub.i] - [[theta].sub.k], (1)

[H.sub.k] = f([u.sub.k]) (2)

for the hidden layer and:

[u.sub.j] = [summation] [w.sub.jk] [H.sub.k] - [[theta].sub.j], (4)

[Y.sub.j] = f([u.sub.j]) (4)

for the output layer,

where [x.sub.i] is the value of input variable i, [w.sub.ki] is the weight giving the intensity of the signal induced by input variable i to the hidden neuron k, [[theta].sub.k] is the threshold value, [u.sub.k] is the input for hidden unit k, [H.sub.k] is the output of the hidden unit k, f([u.sub.k]) is the activation function for the hidden unit k, [w.sub.jk] is the link from hidden neuron k to output layer j, [[theta].sub.j] is the threshold value, [u.sub.j] is the input for output unit j, [Y.sub.j] is the output for the final layer, and f([u.sub.j]) is the activation function for the output unit j.

Based on the outputs, the sum of squares is calculated:

SOS = [summation] [([Y.sub.j] - [T.sub.j]).sup.2], (5)

where SOS is the statistic sum of squares, performed as a deviation of the estimated output [Y.sub.j] from the real output value [T.sub.j].

The distances between the output and hidden layer are computed as follows:

[[delta].sub.j] = ([T.sub.j] - [Y.sub.j]) x f'([u.sub.j]) (6)

and

[[delta].sub.k] = ([[summation].sub.j][8.sub.[delta]][w.sub.jk]) x f'([u.sub.k]), (7)

where [[delta].sub.j] is the distance reached in the output layer j, and [[delta].sub.k] is the distance determined for the hidden layer k.

Then, the weights (w) and threshold values ([theta]) are modified accordingly in each type of layer:

[DELTA][w.sub.jk](n) = [eta][[delta].sub.j][H.sub.k] + [alpha][DELTA][w.sub.jk] (n - 1), (8)

[DELTA][[theta].sub.j](n) = -[eta][[delta].sub.j] + [alpha][DELTA][[theta].sub.j] (n - 1) (9)

for the output layer, and:

[DELTA][w.sub.ki](n) = [eta][[delta].sub.k][x.sub.i] + [alpha][DELTA][w.sub.ki](n - 1), (10)

[DELTA][[theta].sub.k](n) = -[eta][[delta].sub.k] + [alpha][DELTA][[theta].sub.k](n - 1) (11)

for the hidden layer,

where [DELTA][w.sub.jk](n) is the modification of the weight from hidden neuron k to output neuron j for the current step n, [DELTA][w.sub.jk] (n - 1) is the modification from the previous step n - 1, [DELTA][[theta].sub.j] (n) is the current modification for the threshold value, [DELTA][[theta].sub.j] (n -1) is the previous modification of the threshold, [alpha] is the momentum coefficient according to which past modifications in the weights should impact the current evolution of the weights, and [eta] is the learning rate which controls the size of the step when weights are iteratively adjusted. Analogous, equations (10) and (11) describe the adjustments for the weights used in the hidden layers.

Then, new w and [theta] are computed for the output layer:

[w.sub.jk](n) = [w.sub.jk](n - 1) + [DELTA][w.sub.jk], (12)

[[theta].sub.j](n) = [[theta].sub.j](n - 1) + [DELTA][[theta].sub.j] (13)

and for the hidden layer:

[w.sub.ki](n) = [w.sub.ki](n - 1) + [DELTA][w.sub.ki], (14)

[[theta].sub.k](n) = [[theta].sub.k](n - 1) + [DELTA][[theta].sub.k], (15)

where [w.sub.jk](n) is the adjusted weight for step n, giving the intensity that the value of hidden neuron k has on output neuron j and which is determined based on the value of the weight from the previous step ([w.sub.jk](n - 1)) and the adjustment calculated in equation (8). The new threshold value, [[theta].sub.j](n), is performed on a similar basis, adding the modification computed in equation (9) to the threshold value from the previous step, [[theta].sub.j](n - 1). For the hidden layer, the process is likewise.

This process, from equations (1) to (15), is repeated until the sought criterion in respect with error is met.

2.2. Support Vector machine

Support Vector machine (SVM) represents a concept, originating from the field of statistics and computer science, that describes a set of methods for data analysis and pattern recognition. The original idea and initial development of the method belongs to Vladimir Vapnik, and was introduced in 1963, and developed later in the 90's by Vapnik and his co-workers (Boser et al. 1992; Cortes, Vapnik 1995; Vapnik 1995). The standard SVM uses an input dataset and for each sample performs a binary classification. This behavior puts SVM method in the non-probabilistic binary linear classifier group. The initiator of the SVM model, Vapnik, proved its performance on a number of problems, most notable at that time being handwriting recognition.

The prime aspects related to SVMs are presented by Ayodele (2010: 25) (1): attributes which are predictor variables; features that represent transformed attributes that define the hyper-plane; vectors which are sets of features that describe one case; Support Vectors representing the vectors near the hyper-plane that have the role of bounding the hyper-plane (Fig. 2).

A Support Vector machine is a mathematical algorithm that maximizes the function which defines a particular set of data. As Noble (2006) claims, the main idea behind the SVM, relates to four basic concepts which will be further used in defining the model:

- Separating hyper-plane, a line which acts as a separator between the two classes. There can be more than one separating hyper-plane;

- Maximum margin hyper-plane, the hyper-plane that provides the highest distance between the separating hyper-plane and the nearest expression vectors;

- Soft margin, that allows wrongly classified cases by the separating hyper-plane;

- Kernel function, which transposes the data from a low-dimensional space into another one of a higher dimension.

However, projecting the data into very high-dimensional space, just so that the soft margin decreases, is not always suitable due to the "curse of dimensionality"; i.e. when the number of variables increases, the number of solutions rises at a higher speed and this makes the classification more difficult and the over-fitting phenomenon appears. When the model is over-fitted it is very likely that it will not perform well on an "out-of-sample" data.

Mathematical model

If a new object (individual) j should be classified into one of two classes (class1 or class2) following a linear SVM score, it would look like this:

[S.sub.j] = [x.sup.T.sub.j] w + b, (16)

where [x.sub.j] = ([x.sub.j1], [x.sub.j2], ..., [x.sub.jd]) is a vector with d variables; [x.sub.jk] is the value of variable k for object j, k = 1, ..., d; w is the vector containing the weights of the d variables, and b is a constant.

In order to develop such a model, a SVM learns on a training sample the values of parameters w and b. Geometrically, this means identifying a hyper-plane that performs as a separator between the two classes based on a certain criterion that is induced by margin maximization. This margin represents the distance between the hyper-planes which limit each class (in a perfect model no object is misclassified). By maximizing the margin, we identify the function that best separates objects in class1 from objects in class2. These two margin boundaries are: [x.sup.T]w + b = 1 and [x.sup.T]w + b = -1, and thus, the margin becomes 2/[parallel]w[parallel], where [parallel]w[parallel] is the norm of vector w.

When not all objects are correctly classified, the margin is said to be soft. Consider that [[epsilon].sub.i] is a variable for misclassifications ([[epsilon].sub.i] [greater than or equal to] 0). When [[epsilon].sub.i] = 0, then all objects are correctly classified. Otherwise, it means there are misclassifications, which creates the need of finding a criterion for minimizing these training errors and further determine the parameters w and b.

Using the constraint of having no object laying within the margin, except some classification errors, the following condition is defined:

[y.sub.i]([x.sup.T.sub.i]w + b) [greater than or equal to] 1 - [[epsilon].sub.i], i = 1, ..., n, (17)

where [y.sub.i] is the variable indicating the class of an object (-1 or 1).

Following, this becomes an optimization problem for the calculation of w and b.

[min.sub.w] 1/2 [[parallel]w[parallel].sup.2] + C [[summation].sup.n.sub.i][[epsilon].sub.i] (18)

with constraints (17) and [[epsilon].sub.i] [greater than or equal to] 0.

Relation (18) first maximizes the margin [2/[parallel]w[parallel]] by minimizing 1/2 [[parallel]w[parallel].sup.2], where [[parallel]w[parallel].sup.2] comes from the second term representing the sum of misclassification errors [[[epsilon].sub.i]/[parallel]w[parallel]] multiplied with parameter C. This leads to a convex quadratic problem, a maximization of the margin width with minimizing errors.

Parameter C weights classification errors and induces the generalization capacity of a SVM. The higher is C, the higher is the weight given to misclassifications, the lower the generalization of the machine is. Low generalization means that the machine may work well on the training set, but perform bad on new samples.

Bad generalization may be a result of over-fitting on the training sample, for example, in the case that this sample shows some untypical data structure. In order to reduce the risk of over-fitting a SVM on the training sample, one has to choose a low C. The smaller C is, the wider the margin is, and thus more and larger in-sample classification errors are permitted.

The above optimization problem can be solved using Lagrange function:

L(w, b, [epsilon], [alpha], v) = 1/2[w.sup.T]w + C[[summation].sup.n.sub.i][[epsilon].sub.i] + [[summation].sup.n.sub.i] [[alpha].sub.i] {[y.sub.i]([w.sup.T][x.sub.i] + b) - 1 + [[epsilon].sub.i]} - [[summation].sup.n.sub.(i = 1)][v.sub.i][[epsilon].sub.i], (19)

where [[alpha].sub.i] [greater than or equal to] 0 are the Lagrange multipliers for the constraint (17) and [v.sub.i] [greater than or equal to] 0 are Lagrange multipliers for constraint [[epsilon].sub.i] [greater than or equal to] 0.

By applying Kuhn-Tucker Theorem, the solution to this problem is the saddle-point of the Lagrangian, minimized in respect with w, b, and [[epsilon].sub.i], and at the same time maximized in respect with [alpha] and v. This can be reduced to a convex quadratic programming problem in [[alpha].sub.i]:

w = [[summation].sup.n.sub.(i = 1)][y.sub.i][[alpha].sub.i][x.sub.i], (20)

b = 1/2([x.sup.T.sub.+1] + [x.sup.T.sub.-1]) x w, (21)

where [x.sup.T.sub.+1] and [x.sup.T.sub.-1] are any two Support Vectors belonging to two different classes, which lie on the margin boundaries.

Those cases where [[alpha].sub.i] [not equal to] 0, are called Support Vectors and they are relevant for determining w. These are on the margin boundaries or within the margin for non-perfectly separable data. As Downs et al. (2001) say, the time taken for a Support Vector classifier to compute the class of a new pattern is proportional to the number of Support Vectors. Therefore, if that number is large, classification speed is slow.

Introducing (20) into (16), we obtain the score [S.sub.j] as a relation of the scalar product of the variables of object j to be classified, and the variables of the Support Vectors in the training sample, of [[alpha].sub.i], and of [y.sub.i]. By comparing [S.sub.j] with a benchmark value, we are able to estimate if object j has to be classified in class1 or class2:

[S.sub.j] = [[summation].sup.n.sub.(i = 1)][y.sub.i][[alpha].sub.i] <[x.sub.i], [x.sub.j]> + b. (22)

Kernel function

When the SVM is non-linear, the score of an object is determined by replacing the scalar product of variables with a kernel function K([x.sub.i], [x.sub.j]). Thus, equation (22) becomes:

[S.sub.j] = [[summation].sup.n.sub.(i = 1)][y.sub.i][[alpha].sub.i]K([x.sub.i], [x.sub.j]) + b. (23)

The performance of a SVM is highly related to the kernel function (Amari and Wu 1999). Nevertheless, there is no theory regarding the method selection of the kernel and in most cases this is done by making several trials and error comparisons. Some of the most popular types of kernel functions are: polynomial, sigmoid and RBF (radial basis function).

Kernels are symmetric, semi-positive definite functions which satisfy Mercer Theorem (2). If this theorem is satisfied, this ensures that there exists a (possibly) non-linear map [PHI] from the input space into some feature space, such that its inner product equals the kernel. The non-linear transformation [PHI] is only implicitly defined through the use of a kernel, since it only appears as an inner product.

K([x.sub.i], [x.sub.j]) = ([PHI]([x.sub.i]), [PHI]([x.sub.j])). (24)

The classification problem is solved this way: the input space is transformed by [PHI] into a feature space of a higher dimension, where it is easier to find a separating hyper-plane. Thus, the Kernel can solve the problem that data are non-linearly separable, by implicitly mapping them into a feature space, in which the linear threshold can be used. Using a kernel is equivalent to solving a linear SVM in a new higher-dimensional feature space. The non-linear SVM score is therefore a linear combination, but with new variables, which are derived through a kernel transformation of the prior variables. The score function does not have a compact functional form, depending on a transformation of the variables, which we do not know, since it is only implicitly defined. The solution of the constrained optimization problem for non-linear SVM is given by:

w = [[summation].sup.n.sub.(i = 1)][y.sub.i][[alpha].sub.i][PHI]([x.sub.i]), (25)

b = -1/2([[summation].sup.n.sub.(i = 1)][y.sub.i][[alpha].sub.i]K([x.sub.i], [x.sub.+1]) + [[summation].sup.n.sub.(i = 1)][y.sub.i][[alpha].sub.i]K([x.sub.i], [x.sub.-1])). (26)

According to (23) and (26), we do not need to know the form of the function [PHI] for the score calculation. For the calculation of the score (22), the input variables are used as a product. Thus, only the kernel function is needed in equation (26). As a consequence, [PHI] and w are not required for the solution of a non-linear SVM.

The Gaussian kernel is by far one of the most versatile Kernels. It is a radial basis function kernel, and is the preferred Kernel when we don't know much about the data we are trying to model:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (27)

The adjustable parameter sigma plays a major role in the performance of the kernel, and should be carefully adjusted to the problem at hand. If it is overestimated, the exponential will act almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in training data.

3. Models comparison

A SVM model using a sigmoid kernel function (3) is the same as a two-layer, perceptron neural network. Support Vector Machines are related to classical multilayer perceptron Neural Networks. Using a kernel function, SVM's are an alternative training method for polynomial, radial basis function and multi-layer perceptron classifiers in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem, as in standard neural network training.

Nevertheless, a significant advantage of SVMs is that while ANNs can suffer from multiple local minima, the solution to a SVM is global and unique. Other advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. The reason for which SVMs often outperform ANNs in practice is that they deal with the biggest problem of Neural Networks, i.e. SVMs are less prone to over-fitting. SVMs offer good results for out-of-sample generalizations, for properly set C and gamma parameters (when using Gaussian kernels). This makes them suitable to training sets that are biased at some degree.

As disadvantages, Auria and Moro (2008) mention their lack of transparency of the results, because scores for particular observations cannot be easily obtained. SVMs do not offer scores in the form of simple parametric functions, because the contribution of each parameter is not constant for all observation points. Other disadvantages identified when comparing SVMs with Neural Networks point to SVMs being only binary classifiers. This limits their usage, and imposes the creation of more complex models to solve multi-class problems for a set of observations.

4. Data sample and variables selection

The database consists of 2,783 observations representing active cardholders at an important commercial bank from Romania. Based on the transactions performed over the previous six months as related to 30th June 2011, we want to identify those customers that bring benefit to the bank through certain habits, so as to target them with other specific products (deposits, credit cards etc.). This means, we are interested in distinguishing the "affluent" customers from those with "mass" characteristics based on some specific behavior. For the classification model an a-posteriori status must be assigned to all customers and the used indicator was the average monthly sum of transactions. Those clients overcoming the 4,500 RON threshold were classified as "affluent", whereas the others are "mass" clients. On this basis, 336 individuals were labeled "affluent" and the remaining ones, "mass" customers.

Variable selection is a very important process when performing segmentation. Practically, different combinations of these characteristics will stay at the basis of the classification process and thus it is highly important to have relevant information. Frank, Massy and Wind (1972) defined two categories of variables, i.e. general descriptive customer characteristics and characteristics related to customer behavior, while Kotler (1998) presented four distinct classes of variables: geographical, demographical, psycho-graphical and behavioral.

For our data sample, information regarding customer transaction behavior within the past six months was processed on the basis of the available transaction history. The information regarding general characteristics of the customers was however very limited and eventually only one variable of this type was selected (4).

Since neither Neural Networks nor SVMs impose any assumptions and restrictions regarding the quality of the data, variable selection was performed on statistical basis but also on expert judgment. Based on this, 31 variables (out of which 30 continuous and one categorical) resulted, and these are presented in Table 1.

Fig. 3 presents a quadratic 3D representation of the customer status ("affluent"/" mass"") and two predictive variables, i.e. "Number of cards" and "Average monthly number of transactions". It is visible that the higher the number of cards and also the average monthly number of transactions are, the higher the chances are that the customer is "affluent" type.

5. Model development and results

5.1. Neural Networks

In the process of training a neural network a validation sample is required. This is because in order to avoid an over-fitted model which would eventually lead to a small error on development sample, the progress can also be checked against an independent set (validation sample). This will make the training process stop when the validation error stops decreasing. Thus, for building the neural network model, the data was split as follows: 60% for the training sample; 20% for the validation set; 20% for an out-of-sample testing process.

For the development process it was selected a classification Multilayer Perceptron using a back-propagation algorithm. This aims at reaching the minimum of the error function (sum of squares) in weight space, by using the method of gradient descent. This imposes the need of an activation function and we further chose the logistic function for both, hidden and output layers. After several tests on the momentum coefficient and learning rate a final value of 0.1 was assigned to both parameters because these two brought the best results in terms of detection rates on the test sample.

Because Neural Networks with one hidden layer perform well in most of the cases, for this customer segmentation research we selected one hidden layer as well. For this model we tested the performance evolution, as the number of hidden neurons varied between 8 (5) and 32 (the latter number represents the number of input neurons). The number of cycles per training process was set to a maximum of 200, and for each trial with different number of hidden neurons, 5 networks were trained, in the end resulting 125 networks. The first ten best results in terms of detection rates are summarized in Table 2.

The model classification was done by considering the best results on the test set ("out of sample" dataset), then on the validation set and finally on the training set, in this particular order. The best results were obtained after 191 cycles in case of MLP 32-8-2 with 8 hidden neurons, with a total detection rate on the test sample of 93.885%.

However, the detection rate for "affluent" customers is rather low, only 41% of the "affluent" customers being detected by the model (see Table 3).

5.2. SVM

The initial data sample was randomly split as follows: 80% for the development process (training) and 20% for the testing process ("out-of-sample" data). In case of SVM there is no need for a validation sample.

For the model development a Gaussian RBF Kernel function previously presented was chosen. This requires selecting the appropriate gamma (8) value which controls the threshold that separates the "affluent" from "mass"" customers. A higher gamma gives a higher separation between the two classes.

In order to choose the appropriate SVM model in terms of capacity (9) (C) and gamma, several trials were performed. To treat all continuous variables on the same gauge, these were scaled between 0 and 1. The maximum number of iterations was set to 1,000 and the stopping error at 0.001.

Initially, gamma was set to 0.05 and the capacity (C) was increased until the accuracy on the training or the test sample started to decrease. When C = 35, the model reached a maximum detection rate on both samples. Having chosen the capacity (C), gamma parameter was afterwards stressed in order to see the effect on the accuracy values. The model that brought the highest benefit in terms of detection was SVM_9, having gamma = 0.17 and C = 35. This model led to an overall detection rate of 98.832% on the training set and 97.127% on the test set (all results are presented in Table 4).

Even if SVM_8 has less bounded vectors, it is obvious that the detection on the test sample is smaller and this induces the feeling of an over-fitted model with very good performance on the development sample, but with low generalization capacity on out-of-sample data. Therefore, it is always indicated to select the model based on its accuracy on the test sample, and in this case, SVM_9 satisfies best this necessity.

Going further into details regarding the accuracy of the selected model (SVM_9), we observe that the accuracy on the "affluent" customers is somewhat smaller (83%) than the overall detection rate, but this can still be considered a good detection percentage for this class (Table 5).

6. Conclusions

Both machine learning techniques performed well in the segmentation process. However, even if the overall detection rate on the test sample differs from one approach to the other only by 2.34pps (10), SVM model using RBF kernel function clearly outruns the "affluent" detection of the MLP using gradient descent algorithm. This induces the need of using more advanced algorithms, like conjugate gradient or Broyden-Fletcher-Goldfarb-Shanno (BFGS), which can solve one of the major problems of gradient descent, i.e. the local minimum aspect.

Caption: Fig. 1. Feed-forward network

Caption: Fig. 2. Support Vector representation

Caption: Fig. 3. Status against two predictors - 3D representation

doi: 10.3846/16111699.2012.749807

Received 25 September 2012; accepted 12 November 2012

References

Ahmad, A. R.; Viard-Gaudin, C.; Khalid, M.; Yusof, R. 2004. Online handwriting recognition using Support Vector machine, volume A. IEEE, 311-314.

Amari, S.; Wu, S. 1999. Improving Support Vector machine classifiers by modifying kernel functions, Neural Networks 12: 783-789. http://dx.doi.org/10.1016/S0893-6080(99)00032-5

Amir, F. A. 2001. Bankruptcy prediction for credit risk using neural networks: a survey and new results, IEEE Transactions on Neural Netwoks 12(4): 929-935. http://dx.doi.org/10.1109/72.935101

Antkowiak, M. 2006. Artificial Neural Networks vs. Support Vector machines for skin diseases recognition. Master thesis in Computer Science, Umea University, Sweden.

Auria, L.; Moro, R. A. 2008. Support Vector Machines (SVM) as a technique for solvency analysis. German Institute for Economic Research, Berlin.

Ayodele, T. O. 2010. Types of machine learning algorithms, in Yagang Zhang (Ed.). New advances in machine learning. University of Portsmouth, United Kingdom, 19-46.

Baesens, B.; Setiono, R.; Mues, C.; Vanthienen, J. 2003. Using neural network rule extraction and decision tables for credit-risk evaluation, Management Science 49(3): 312-329. http://dx.doi.org/10.1287/mnsc.49.3.312.12739

Boser, B. E.; Guyon, I. M.; Vapnik, V. N. 1992. A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory. Pittsburgh, Pennsylvania: ACM Press. http://dx.doi.org/10.1145/130385.130401

Cortes, C.; Vapnik, V. 1995. Support-vector networks, Machine Learning 20(3): 273-297. http://dx.doi.org/10.1007/BF00994018

Downs, T.; Gates, K. E.; Masters, A. 2001. Exact simplification of Support Vector solutions. Journal of Machine Learning Research 12: 293-297.

Frank, R. E.; Massy, W. F.; Wind, Y. 1972. Market segmentation. Englewood Cliffs: Prentice-Hall.

Kiyan, T.; Yildirim, T. 2003. Breast cancer diagnosis using statistical neural networks, in International XII, Turkish Symposium on Artificial Intelligence and Neural Networks, University Besiktas, Istanbul, Turkey, 2003.

Knerr, S.; Personnaz, L.; Dreyfus, G. 1992. Handwritten digit recognition by neural networks with single-layer training, IEEE Transactions on Neural Networks 3(6): 962-968. http://dx.doi.org/10.1109/72.165597

Kotler, P. 1998. Marketing management. Ljubljana: Slovenska knjiga. 832 p.

McCulloch, W.; Pitts, W. 1943. A logical calculus of ideas immanent in neural activity, Bulletin of Mathematical Biophysics 5: 115-133. http://dx.doi.org/10.1007/BF02478259

Minsky, M.; Papert, S. 1969. Perceptrons. Cambridge: MIT Press.

Noble, W. S. 2006. What is a Support Vector machine?, Nature Biotechnology 24(12): 1565-1567. http://dx.doi.org/10.1038/nbt1206-1565

Rosenblatt, F. 1958. The perceptron: a probabilistic model for information storage and organization in the brain, Psychological Review 65(6): 386-408. http://dx.doi.org/10.1037/h0042519

Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. 1986. Learning representation by back-propagation errors, Nature 323(9): 533-536. http://dx.doi.org/10.1038/323533a0

Ultsch, A.; Korus, D.; Kleine, T. O. 1995. Integration of Neural Networks and knowledge-based systems in medicine. Hans Meerwein-Stra[beta]le / Lahnberge, Marburg.

Vapnik, V. 1995. The nature of statistical learning theory. New York: Springer-Verlag. http://dx.doi.org/10.1007/978-1-4757-2440-0

Yoon, Y.; Swales, G. 1993. Predicting stock price performance: a neural network approach, in Neural Networks in finance and investing. Chicago, IL: Probus Publishing Company, 329-339.

Zhou, Z. H.; Jiang, Y.; Yang, Y. B.; Chen, S. F. 2001. Lung cancer cell identification based on artificial neural network ensembles. Nanjing University. Nanjing, China.

Ion Smeureanu (1), Gheorghe Ruxanda (2), Laura Maria Badea (3)

Faculty Cybernetics, Statistics and Economic Informatics, The Bucharest University of Economic Studies, 6 Piata Romana, 1st district, 010374 Bucharest, Romania

E-mails: (1) smeurean@ase.ro; (2) gheorghe.ruxanda@csie.ase.ro; (3) laura.maria.badea@gmail.com (corresponding author)

(1) Taiwo Oladipupo Ayodele, from University of Portsmouth from United Kingdom presented this in chapter 3 "Types of machine Learning Algorithms", from the book edited by Yagang Zhang "New Advances in Machine Learning", 2010.

(2) Let K: [R.sup.n] x [R.sup.n] [right arrow] R be given. Then for the K to be a valid (Mercer) kernel, it is necessary and sufficient that for any {[x.sup.(1)],[x.sup.(2)], ..., [x.sup.(m)]}, (m < [infinity]), the corresponding kernel matrix is symmetric positive semi-define.

(3) K([x.sub.i], [x.sub.j]) = tanh([alpha][x.sup.T.sub.i][x.sub.j] + r).

(4) Customer domicile (capital/province).

(5) Antkowiak (2006) proposes the following method when selecting the number of hidden neurons: [N.sub.h] = [square root of ([N.sub.i][N.sub.o])], where [N.sub.h] is the no. of hidden neurons, [N.sub.i] is the number of input neurons, and [N.sub.o] is the number of output neurons. Based on the configuration of the available data, the number of hidden neurons determined after the above formula would be 8.

(6) Gradient descent.

(7) Sum of Squares.

(8) gamma = 1/2[[sigma].sup.2].

(9) This is the weight given to misclassifications. The higher is C, the lower is the generalization capacity and therefore the over-fitting phenomenon appears.

(10) Percentage points.

Ion SMEUREANU has graduated the Faculty of Planning and Economic Cybernetics in 1980, as leader. He holds a PhD diploma in "Economic Cybernetics" from 1992 and has a remarkable didactic activity since 1984, when he joined the staff of Bucharest Academy of Economic Studies. Currently, he is a full Professor of Economic Informatics within the Department of Economic Informatics and the dean of the Faculty of Cybernetics, Statistics and Economic Informatics from the Bucharest University of Economic Studies. He is the author of more than 16 books and an impressive number of articles on economic modeling and computer applications. He was also project director or member in many important research projects. He was awarded the Nicolae Georgescu-Roegen diploma, the award for the entire research activity offered by the Romanian Statistics Society, General Romanian Economist Association Excellence Diploma and many others.

Gheorghe RUXANDA is a PhD in Economic Cybernetics, Editor-in-chief of ISI Thompson Reuters Journal "Economic Computation and Economic Cybernetics Studies and Research" and Director of Doctoral School of Economic Cybernetics and Statistics. He is full Professor and PhD Adviser within the Department of Economic Informatics and Cybernetics, The Bucharest Academy of Economic Studies. He graduated from the Faculty of Economic Cybernetics, Statistics and Informatics, Academy of Economic Studies, Bucharest (1975) where he also earned his Doctor's Degree (1994). Had numerous research visits in Columbia University - School of Business, New York, USA (1999), Southern Methodist University (SMU), Faculty of Computer Science and Engineering, Dallas, Texas, USA (1999), Ecole Normale Superieure, Paris, France (2000), Reading University, England (2002), North Carolina University, Chapel Hill, USA (2002). He is full professor of Multidimensional Data Analysis (Doctoral School), Data Mining and Multidimensional Data Analysis (Master Studies), Modeling and Neural Calculation (Master Studies), Econometrics and Data Analysis (Undergraduate Studies). Fields of Scientific Competence: evaluation, measurement, quantification, analysis and prediction in the economic field; econometrics and statistical-mathematical modeling in the economic-financial field; multidimensional statistics and multidimensional data analysis; pattern recognition, learning machines and neural networks; risk analysis and uncertainty in economics; development of software instruments for economic-mathematical modeling. Scientific research activity: over 35 years of scientific research in both theory and practice of quantitative economy and in coordinating research projects; 50 scientific papers presented at national and international scientific sessions and symposia; 64 scientific research projects with national and international financing; 77 scientific papers published in prestigious national and international journals in the field of economic cybernetics, econometrics, multidimensional data analysis, microeconomics, scientific informatics, out of which eleven papers being published in ISI - Thompson Reuters journals; 18 manuals and university courses in the field of econometrics, multidimensional data analysis, microeconomics, scientific informatics; 31 studies of national public interest developed within the scientific research projects.

Laura Maria BADEA is a PhD candidate in Economic Cybernetics at the Bucharest Academy of Economic Studies, has an MA in Corporate Finance (2010) and graduated the Faculty of Finance, Insurance, Banking and Stock Exchange from Bucharest Academy of Economic Studies (2008). Fields of scientific interest: machine learning and other modeling techniques used for classification matters in economic and financial domains, with a focus on artificial neural networks. Scientific research activity: one published article in ISI Thompson Reuters Journal.

Table 1. List of predictors Variables 1 maximum transaction value within a month 2 maximum transaction value within one single day 3 maximum no. of transactions within one single day over the past six months 4 maximum no. of transactions within one month 5 maximum transaction value at other than ATM 6 maximum transaction value 7 average monthly no. of transactions 8 maximum transaction value at ATM 9 proportion of POS transactions within overall transactions by no. 10 proportion of ATM transactions within overall transactions by no. 11 proportion of transactions made in other county than the domicile within overall transactions by no. 12 proportion of transactions in other county than domicile within total transactions by no. 13 minimum transaction value at other than ATM 14 no. of cards 15 proportion of foreign countries transactions within overall transactions by no. 16 minimum transaction value at ATM 17 sum of transactions performed in hotels 18 no. of transactions in hotels 19 customer domicile (capital/province)--binary categorical variable 20 sum of transactions at jewelry/watches/expensive glass stores 21 no. of transactions at jewelry/watches/expensive glass stores 22 sum of e-banking transactions 23 proportion of e-banking transactions within overall transactions by no. 24 no. of transactions in book stores 25 no. of transactions at expensive restaurants 26 no. of transactions for non-low-cost airlines flights. 27 proportion of Bucharest transactions within total transactions by no. 28 proportion of other types of transactions within overall transactions by no. 29 minimum transaction value 30 proportion of ATM transactions within overall transactions by value 31 proportion of Retailers transactions within overall transactions by no. Table 2. Trials for MLP selection Network name Training Validation Test perf. Training perf. % perf. % % algorithm MLP 32-8-2 93.297 91.007 93.885 GD6 191 MLP 32-8-2 93.058 91.007 93.885 GD 193 MLP 32-9-2 92.879 90.647 93.705 GD 177 MLP 32-10-2 92.819 90.647 93.345 GD 171 MLP 32-8-2 92.280 90.288 92.806 GD 114 MLP 32-12-2 92.220 90.288 92.806 GD 119 MLP 32-12-2 92.160 90.288 92.806 GD 127 MLP 32-13-2 92.160 90.288 92.806 GD 125 MLP 32-9-2 92.220 90.288 92.626 GD 123 MLP 32-19-2 92.160 90.288 92.626 GD 145 Network name Error Hidden Output function activation activation MLP 32-8-2 SOS7 Logistic Logistic MLP 32-8-2 SOS Logistic Logistic MLP 32-9-2 SOS Logistic Logistic MLP 32-10-2 SOS Logistic Logistic MLP 32-8-2 SOS Logistic Logistic MLP 32-12-2 SOS Logistic Logistic MLP 32-12-2 SOS Logistic Logistic MLP 32-13-2 SOS Logistic Logistic MLP 32-9-2 SOS Logistic Logistic MLP 32-19-2 SOS Logistic Logistic Table 3. Confusion matrix for Test sample using MLP 32-8-2 Predicted mass affluent Observed mass 100% 0% affluent 59% 41% Table 4. Trials for SVM selection Trial C gamma Support Support Bounded Train Vectors (0) Vectors (1) perf. (%) SVM_1 10 0.05 138 140 256 97.35 SVM_2 20 0.05 117 124 216 97.889 SVM_3 30 0.05 109 116 192 98.113 SVM_4 40 0.05 104 110 177 98.068 SVM_5 35 0.05 105 113 218 98.158 SVM_6 35 0.1 93 106 147 98.428 SVM_7 35 0.15 86 111 123 98.518 SVM_8 35 0.2 87 104 111 98.832 SVM_9 35 0.17 86 103 119 98.832 Trial Test perf. (%) SVM_1 96.23 SVM_2 96.589 SVM_3 96.768 SVM_4 96.768 SVM_5 96.768 SVM_6 96.948 SVM_7 97.127 SVM_8 96.948 SVM_9 97.127 Table 5. Confusion Matrix for Test sample using SVM_9 Predicted mass affluent Observed mass 99% 1% affluent 17% 83%

----------

Please note: Illustration(s) are not available due to copyright restrictions.

Printer friendly Cite/link Email Feedback | |

Author: | Smeureanu, Ion; Ruxanda, Gheorghe; Badea, Laura Maria |
---|---|

Publication: | Journal of Business Economics and Management |

Article Type: | Report |

Geographic Code: | 1USA |

Date: | Nov 1, 2013 |

Words: | 7110 |

Previous Article: | Relationship between retention factors and affective organisational commitment among knowledge workers in Malaysia. |

Next Article: | How to invest in Belgian shares by multimoora optimization. |

Topics: |