# A Unified Factors Analysis Framework for Discriminative Feature Extraction and Object Recognition

1. Introduction

Feature extraction and selection are critical and challenging components of pattern recognition [1, 2]. Some basic principles can guide the design of a feature extractor; however, the design work is essentially an experimental science that depends on the field and the problem being addressed. The use of dimensionality reduction algorithms in supervised and unsupervised learning tasks has attracted considerable attention in recent years. The relative simplicity and effectiveness of the linear algorithms Principal Component Analysis (PCA) [3] and Linear Discriminant Analysis (LDA) [3, 4] have made these two algorithms very attractive. Locality Preserving Projections (LPP) [5] and Discriminant Canonical Correlations (DCC) [6] are other linear algorithms that have been proposed for dimensionality reduction. Three other recently developed algorithms--termed ISOMAP [7], LLE [8], and Laplacian Eigenmap [9]--can be used for nonlinear dimensionality reduction on a dataset that lies on or around a lower-dimensional manifold. Additionally, in order to extend the linear dimensionality reduction algorithms to nonlinear ones, the kernel trick [10] has been applied to perform linear operations on the high- or infinite-dimensional features produced by a kernel mapping function. Moreover, a number of algorithms [11-14] have recently been proposed to carry out dimensionality reduction on objects encoded as matrices or tensors of arbitrary order.

All the aforementioned algorithms have been designed according to specific intuitions. Solutions are given through the optimization of both intuitive and pragmatic objectives, with the aim of extracting low-dimensional, simplified, and robust features from the high-dimensional, redundant, disturbed, and distorted original data. Content and style factors can be seen as the two components of an object [15]. In pattern recognition, the information on which recognition is based is defined as the content factor, and the information that disturbs recognition is defined as the style factor. Two examples are given below. The first concerns face recognition. A frontal face with neutral expression, no occlusion, and normalized illumination represents the content factor of a face image, whereas variations in pose, illumination, and expression represent the style factor, as shown in Figure 1. Interference between content and style factors makes it necessary to separate them in order to achieve accurate classification. The second example concerns speech recognition. In recognizing the text of speech, the style factor includes the timbre, tone, and emotion of the speaker. These elements should be regarded as background noise and minimized, as they provide no useful information about the spoken text. The situation is reversed for speaker recognition, where the favorable element is the timbre information, which is then the content factor, and the influence of the spoken text, now the style factor, should be minimized. Thus the primary objective of linear, nonlinear, manifold, and tensor decomposition methods is to suppress the style factors of the object.

In this paper, we present a general framework called factor analysis, along with its linearization, kernelization, and tensorization, which offers a unified view for understanding and explaining many popular dimensionality reduction algorithms such as those mentioned above. The main objective of factor analysis is to find the essential low-dimensional features by simultaneously minimizing the style factors and maximizing the content factors of the object through projection. There are two vital steps in the factor analysis framework. The first is to construct a factor-separating objective function, which includes constructing the partition and the weight matrix. The second is to design the space mapping function. Using the factor analysis framework as a platform, we develop two novel dimensionality reduction algorithms, FA-LDA and FA-LPP. FA-LDA overcomes a limitation of LDA, namely that its intraclass and interclass scatter matrices are not optimal, by setting proper weights. FA-LPP overcomes a limitation of LPP by improving the partition design so that the projected objects not only preserve locality similarity but also carry abundant discriminative information.

2. Factors Analysis: A General Framework for Feature Extraction

2.1. Factors Analysis. Content and style factors can be seen as two independent elements that influence the object and determine the observation [15], as mentioned before. In pattern recognition, the information that contributes to recognition is defined as the content factor, and the information that hinders it as the style factor.

For a general classification problem, the sample set is represented as X = [[x.sub.1], [x.sub.2], ..., [x.sub.N]], [x.sub.i] [member of] [R.sup.m], where N denotes the number of samples and m the feature dimension. In practice, the feature dimension m is usually very high and includes redundant and interfering information. In order to alleviate the curse of dimensionality, it is necessary to transform the original data X from the high-dimensional space to a low-dimensional space, represented as Y = [[y.sub.1], [y.sub.2], ..., [y.sub.N]], [y.sub.i] [member of] [R.sup.m']. The essence of dimensionality reduction is to find a mapping function F : x [right arrow] y that transforms x [member of] [R.sup.m] into an appropriate low-dimensional representation y [member of] [R.sup.m'], where typically m [much greater than] m':

y = F(x). (1)

Depending on the case, the mapping function can be explicit or implicit, and linear or nonlinear. A dimensionality reduction algorithm based on factor analysis, different from traditional methods, is proposed in this paper. Its objective is to find a low-dimensional dataset Y from the original dataset X in which the style factor is suppressed as much as possible. The objective function is represented as follows:

J = arg max [[trace ([F'.sub.c])]/[trace ([F'.sub.s])]], s.t. trace ([F'.sub.s]) = d,

[F'.sub.c] = [C-1.summation over (i=1)][C.summation over (j=i+1)][p.sub.i][p.sub.j][W.sup.c.sub.ij]([[mu]'.sub.i] - [[mu]'.sub.j])[([[mu]'.sub.i] - [[mu]'.sub.j]).sup.T],

[F'.sub.s] = [C.summation over (i=1)][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)][W.sup.s.sub.jk]([y.sup.(i).sub.j] - [y.sup.(i).sub.k])[([y.sup.(i).sub.j] - [y.sup.(i).sub.k]).sup.T], (2)

where [F'.sub.c], [F'.sub.s] are the content and style variance matrices after mapping (factors separating), respectively, C is the total number of partitions, [N.sub.i] is the sample number of the ith partition, [p.sub.i] is the prior probability of the ith partition (including the ith class for supervised methods and the ith local manifold space for unsupervised methods, generally named as the ith partition), [[mu]'.sub.i] is the mean vector of the ith partition after factors separating, [y.sup.(i).sub.j], [y.sup.(i).sub.k] are the jth and kth sample vectors of the ith partition, respectively, after factors separating, [W.sup.c.sub.ij], [W.sup.s.sub.jk] are weight matrices of content variance and style variance, respectively, and d is the constraint constant, which is different in different cases. Factor analysis criteria make the content variance of the objects between partitions as high as possible and the style variance of the objects within a partition as low as possible. In order to eliminate the style factor of the objects maximally, the style factor should be minimized and the content factor should be maximized after mapping transformation.

2.2. General Framework for Feature Extraction. There are two vital steps for feature extraction based on the factor analysis framework: one is the design of the factor-separating objective function, mainly including the design of the partition and the weight matrix; the other is the design of the space mapping function. The partition is designed from the class information of the samples in supervised methods, or from local neighborhoods of the samples in unsupervised or semisupervised methods. The weight matrix is designed according to classification performance in order to find the optimal distribution scheme, one biased toward weakly separated samples without sacrificing strongly separated ones. The space mapping function transforms the samples from the input space to the feature space. Thus, research on the factor analysis framework can proceed along two lines: the first is to propose different space mapping transformations aimed at different applications, and the second is to find the design scheme of the factor-separating objective function that best matches the pattern recognition task.

2.2.1. Space Mapping Transformation. There are three categories of space mapping transformation algorithms, which are linear space, kernel space, and tensor space mapping transformation, respectively.

(a) Linear Space Mapping Transformation. The relationship of datasets X and Y is supposed to be linear in linear space mapping transformation; namely, [y.sub.i] = [[psi].sup.T][x.sub.i]. Thus the objective function (see (2)) is converted to

J([psi]) = arg max [[trace ([[psi].sup.T][F.sub.c][psi])]/[trace ([[psi].sup.T][F.sub.s][psi])]], (3)

where [[mu].sup.0.sub.i] is the mean vector of the ith partition in original space, [x.sup.(i).sub.j], [x.sup.(i).sub.k] are the jth and kth sample vectors of the ith partition samples in original space, and [F.sub.c], [F.sub.s] are the content and style variance matrices in original space, respectively, defined as follows:

[F.sub.c] = [C-1.summation over (i=1)][C.summation over (j=i+1)][p.sub.i][p.sub.j][W.sup.c.sub.ij]([[mu].sup.0.sub.i] - [[mu].sup.0.sub.j])[([[mu].sup.0.sub.i] - [[mu].sup.0.sub.j]).sup.T],

[F.sub.s] = [C.summation over (i=1)][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)][W.sup.s.sub.jk]([x.sup.(i).sub.j] - [x.sup.(i).sub.k])[([x.sup.(i).sub.j] - [x.sup.(i).sub.k]).sup.T]. (4)
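The construction of [F.sub.c] and [F.sub.s] above can be sketched in NumPy. The helper name, the weight-function interface, and the use of empirical class frequencies as priors are our own assumptions for illustration, not part of the original formulation:

```python
import numpy as np

def factor_matrices(X, labels, wc, ws):
    """Sketch of the content/style variance matrices of (4).

    X  : (N, m) sample matrix; labels : (N,) partition indices.
    wc : weight function wc(mu_i, mu_j) for the content term.
    ws : weight function ws(x_j, x_k) for the style term.
    Priors p_i are taken as empirical partition frequencies (an assumption).
    """
    classes = np.unique(labels)
    N, m = X.shape
    means = {c: X[labels == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(labels == c) for c in classes}

    # content variance: weighted outer products of partition-mean differences
    Fc = np.zeros((m, m))
    for a, ci in enumerate(classes[:-1]):
        for cj in classes[a + 1:]:
            d = (means[ci] - means[cj])[:, None]
            Fc += priors[ci] * priors[cj] * wc(means[ci], means[cj]) * (d @ d.T)

    # style variance: weighted outer products of within-partition sample pairs
    Fs = np.zeros((m, m))
    for c in classes:
        Xi = X[labels == c]
        for j in range(len(Xi) - 1):
            for k in range(j + 1, len(Xi)):
                d = (Xi[j] - Xi[k])[:, None]
                Fs += ws(Xi[j], Xi[k]) * (d @ d.T)
    return Fc, Fs
```

With all weights set to 1, this reduces to the unweighted special cases discussed in Section 2.3.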

(b) Kernel Space Mapping Transformation. The direct way to extend methods from linear projections to nonlinear cases is to utilize the kernel trick. The intuition of the kernel methods is to map the data from the original input space to a higher-dimensional Hilbert space (kernel space) as [phi] : x [right arrow] H. The content and style variance matrices [F.sub.c], [F.sub.s] in kernel space H are defined as follows:

[F.sub.c] = [C-1.summation over (i=1)][C.summation over (j=i+1)][p.sub.i][p.sub.j][W.sup.c.sub.ij]([[mu].sup.[phi].sub.i] - [[mu].sup.[phi].sub.j])[([[mu].sup.[phi].sub.i] - [[mu].sup.[phi].sub.j]).sup.T],

[F.sub.s] = [C.summation over (i=1)][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)][W.sup.s.sub.jk]([phi]([x.sup.(i).sub.j]) - [phi]([x.sup.(i).sub.k]))[([phi]([x.sup.(i).sub.j]) - [phi]([x.sup.(i).sub.k])).sup.T], (5)

where [[mu].sup.[phi].sub.i] is the mean vector of the ith partition samples in kernel space H and [phi]([x.sup.(i).sub.j]) is the kernel mapping of [x.sup.(i).sub.j].
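The essence of the kernel trick is that inner products in the feature space H can be computed through the kernel function alone, so (5) never requires forming [phi] explicitly. The toy check below (our own illustration, not from the paper) uses the explicit feature map of the homogeneous degree-2 polynomial kernel on 2-dimensional inputs:

```python
import numpy as np

def phi(x):
    # explicit feature map whose inner products reproduce k(x, z) = (x . z)^2
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    # the corresponding kernel, evaluated directly in input space
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# inner product in feature space equals the kernel value in input space
lhs = float(np.dot(phi(x), phi(z)))
rhs = poly_kernel(x, z)
```

Because every term of (5) can be written through such inner products, the kernelized objective is computable even when H is infinite-dimensional (e.g. for the Gaussian kernel).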

(c) Tensor Space Mapping Transformation. Tensorization is another extension of linear methods, namely a multilinear extension. First, some terminology [16] on tensor operations is reviewed before introducing the tensor space mapping transformation.

A tensor is an extension of vectors and matrices. An r-order tensor A over [R.sup.m] represents [m.sup.r] components in m-dimensional space, where every component is a function of the coordinates, changing linearly with the coordinates according to certain rules. r is the order of the tensor. When r = 1, A reduces to a vector, and when r = 2, A reduces to a matrix. Expanding a tensor facilitates its matrix representation. The k-order expansion (unfolding) of tensor A is denoted [A.sub.(k)], whose columns are the k-mode fibers of A. The k-mode product between a tensor A and a matrix U is defined as B = A [x.sub.k] U, or [B.sub.(k)] = U[A.sub.(k)] in terms of tensor expansions.

Reform each original high-dimensional sample as a tensor [X.sub.i]. Corresponding low-dimensional tensors, as in (6), are obtained from the high-dimensional tensors through projection (a multiple linear projection), similar to the linear methods:

[y.sub.i] = [X.sub.i][x.sub.1][[psi].sup.1][x.sub.2][[psi].sup.2] ... [x.sub.r][[psi].sup.r]. (6)

Then the objective function (see (2)) is converted to

J([[psi].sup.1], ..., [[psi].sup.r]) = arg max [[[C-1.summation over (i=1)][C.summation over (j=i+1)][p.sub.i][p.sub.j][W.sup.c.sub.ij][[parallel]([M.sub.i] - [M.sub.j])[x.sub.1][[psi].sup.1] ... [x.sub.r][[psi].sup.r][parallel].sup.2]]/[[C.summation over (i=1)][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)][W.sup.s.sub.jk][[parallel]([X.sup.(i).sub.j] - [X.sup.(i).sub.k])[x.sub.1][[psi].sup.1] ... [x.sub.r][[psi].sup.r][parallel].sup.2]]], (7)

where [M.sub.i] is the mean tensor of the ith partition.
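The k-mode product used in (6) can be implemented directly through the k-mode unfolding, [B.sub.(k)] = U[A.sub.(k)]. The sketch below uses 0-based mode indices (our convention) and checks against the order-2 case, where X [x.sub.1] U [x.sub.2] V reduces to the matrix product U X [V.sup.T]:

```python
import numpy as np

def unfold(A, k):
    # k-mode unfolding: mode-k fibers become the columns of a matrix
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1)

def mode_product(A, U, k):
    # k-mode product B = A x_k U, computed via B_(k) = U A_(k)
    out_shape = [U.shape[0]] + [s for i, s in enumerate(A.shape) if i != k]
    B = (U @ unfold(A, k)).reshape(out_shape)
    return np.moveaxis(B, 0, k)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))   # an order-2 "tensor"
U = rng.standard_normal((2, 4))
V = rng.standard_normal((3, 5))
# multilinear projection of X, as in (6)
Y = mode_product(mode_product(X, U, 0), V, 1)
```

For higher-order samples (e.g. image patches as order-3 tensors), the same successive mode products implement the multilinear projection of (6).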

2.2.2. The Design of Factor Separating Objective Function. The design of factor separating objective function is mainly about separating the content and style factors of objects, including the design of partitions and weight matrix.

Two schemes for partition design are currently in use: one is based on supervised algorithms, such as LDA, in which the samples are partitioned according to their class information; the other is based on unsupervised algorithms, such as LPP, in which the samples are partitioned according to their k-nearest neighborhoods.

Various weighting schemes have been proposed for different cases. They can be classified into four categories according to the weighting algorithm for the between-class scatter matrix (the content variance matrix [F.sub.c]) in supervised face recognition methods based on weighted Linear Discriminant Analysis [17].

(1) Euclidean Distance Algorithm [17]. Consider

W(i, j) = [1/[(ed).sup.2]], (8)

where ed is the Euclidean distance between the mean vectors of two classes.
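In code, this weight is simply the inverse squared distance between the two class mean vectors, a direct transcription of (8):

```python
import numpy as np

def euclidean_weight(mu_i, mu_j):
    # w(i, j) = 1 / ed^2, where ed is the Euclidean distance between class means
    ed = np.linalg.norm(np.asarray(mu_i, float) - np.asarray(mu_j, float))
    return 1.0 / ed ** 2
```

Class pairs that are far apart thus receive small weights, de-emphasizing pairs that are already easy to separate.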

(2) Estimating Bayesian Error Rate Algorithm [18]. Consider

w(i, j) = [1/2[[DELTA].sup.2.sub.ij]] erf ([[[DELTA].sub.ij]/2[square root of 2]]), (9)

where [[DELTA].sub.ij] is the Mahalanobis distance between the mean vectors of two classes and erf(x) = (2/[square root of [pi]]) [[integral].sup.x.sub.0] [e.sup.-[t.sup.2]] dt is the error function.
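A direct transcription of (9), using the standard error function from the Python math module; the Mahalanobis distance Δ is assumed to be precomputed:

```python
import math

def bayes_error_weight(delta):
    # w(i, j) = 1 / (2 delta^2) * erf(delta / (2 * sqrt(2)))
    # delta: Mahalanobis distance between two class mean vectors
    return 1.0 / (2.0 * delta ** 2) * math.erf(delta / (2.0 * math.sqrt(2.0)))
```

Like the Euclidean scheme, this weight decreases monotonically with distance, but it is derived from an approximation of the pairwise Bayes error rather than from geometry alone.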

(3) Confusing Information Algorithm [19]. Consider

[mathematical expression not reproducible], (10)

where [E.sub.ij] is the number of samples which are misclassified into class j but belong to class i actually and [n.sub.i] is the sample number of the ith class.

(4) Grey Correlation Analysis Algorithm Proposed in This Paper. Consider

w(i, j) = [1/n][n.summation over (k=1)][gamma]([[mu].sub.i](k), [[mu].sub.j](k)), (11)

where n is the vector dimensionality and [[mu].sub.i](k), [[mu].sub.j](k) are the kth features in the mean vectors of the ith and jth partitions. [gamma]([[mu].sub.i](k), [[mu].sub.j](k)) is the grey correlation coefficient of the kth feature in the mean vectors of the ith and jth partitions, defined as follows:

[gamma]([[mu].sub.i](k), [[mu].sub.j](k)) = [[[DELTA].sub.min] + [rho][[DELTA].sub.max]]/[[[DELTA].sub.ij](k) + [rho][[DELTA].sub.max]], [[DELTA].sub.ij](k) = [absolute value of [[mu].sub.i](k) - [[mu].sub.j](k)], (12)

where [[DELTA].sub.min] and [[DELTA].sub.max] are the minimum and maximum of [[DELTA].sub.ij](k) over k, and [rho] [member of] (0, 1) is the distinguishing coefficient.
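A sketch of this weighting scheme, assuming the standard Deng grey relational coefficient with distinguishing coefficient ρ = 0.5 (the function name and the guard for identical mean vectors are our own assumptions):

```python
import numpy as np

def grey_correlation_weight(mu_i, mu_j, rho=0.5):
    # Assumed reading of (11)-(12): Deng's grey relational coefficient,
    # averaged over features; rho is the distinguishing coefficient.
    mu_i = np.asarray(mu_i, float)
    mu_j = np.asarray(mu_j, float)
    delta = np.abs(mu_i - mu_j)            # per-feature absolute deviation
    d_min, d_max = delta.min(), delta.max()
    if d_max == 0.0:                       # identical mean vectors
        return 1.0
    gamma = (d_min + rho * d_max) / (delta + rho * d_max)
    return float(gamma.mean())
```

The resulting weight lies in (0, 1], and is larger for mean vectors whose feature-wise profiles are more similar, i.e. for more confusable partition pairs.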

The weighting scheme for the style variance matrix [F.sub.s] has received little attention. Following the weighting scheme of the LPP method, a new weighting scheme for [F.sub.s] based on heat kernel similarity is proposed in this paper:

[W.sup.s.sub.jk] = exp (-[[parallel][x.sub.j] - [x.sub.k][parallel].sup.2]/t). (13)

How to distribute the weights effectively and reasonably deserves deeper research. The optimal weighting scheme never abandons any object that might be classified correctly and never wastes any weight value. In other words, it helps all the objects that need help as much as possible, so as to achieve a win-win situation.
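The proposed heat kernel weighting for [F.sub.s] (our reading of (13), with the tuning parameter t assumed) can be sketched as:

```python
import numpy as np

def heat_kernel_weight(x_j, x_k, t=1.0):
    # proposed style weight: similarity decays with squared distance;
    # t is a tuning parameter (default value assumed here)
    d2 = float(np.sum((np.asarray(x_j, float) - np.asarray(x_k, float)) ** 2))
    return float(np.exp(-d2 / t))
```

The weight equals 1 for coincident samples and decays smoothly toward 0 as the within-partition distance grows, which is the same similarity notion LPP uses on its neighborhood graph.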

2.3. Unifying Various Algorithms of Feature Extraction. The relationship between several classical subspace dimensionality reduction algorithms and the factor analysis framework is discussed in this section. In fact, most commonly used subspace algorithms can be represented and explained by the factor analysis framework. After unifying current subspace algorithms in the framework, it becomes clear that these algorithms differ only in the basis of factor construction and the constraints of factor separation, though their motivations and objectives vary. Next, we introduce the relationships between factor analysis and some typical subspace-based dimensionality reduction methods.

2.3.1. The Relationship with LDA. The objective of LDA is to find the optimal projection direction for classification, which maximizes the ratio of the interclass scatter matrix to the intraclass scatter matrix:

J([psi]) = arg max [[trace ([[psi].sup.T][S.sub.b][psi])]/[trace ([[psi].sup.T][S.sub.w][psi])]], (14)

[S.sub.b] = [C.summation over (i=1)][p.sub.i]([[mu].sup.0.sub.i] - [[mu].sup.0])[([[mu].sup.0.sub.i] - [[mu].sup.0]).sup.T], (15)

[S.sub.w] = [C.summation over (i=1)][p.sub.i][[N.sub.i].summation over (j=1)]([x.sup.0.sub.j] - [[mu].sup.0.sub.i])[([x.sup.0.sub.j] - [[mu].sup.0.sub.i]).sup.T], (16)

where C is the number of classes, [p.sub.i] is the prior probability of class i, [[mu].sup.0.sub.i] is the mean vector of class i before projection, [[mu].sup.0] is the overall mean vector before projection, and [N.sub.i] is the number of samples in class i. Referring to (3) and (4), set all entries of the weight matrices of [F.sub.c] and [F.sub.s] to 1; then

[F.sub.c] = [C-1.summation over (i=1)][C.summation over (j=i+1)][p.sub.i][p.sub.j]([[mu].sup.0.sub.i] - [[mu].sup.0.sub.j])[([[mu].sup.0.sub.i] - [[mu].sup.0.sub.j]).sup.T],

[F.sub.s] = [C.summation over (i=1)][[p.sub.i]/[N.sub.i]][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)]([x.sup.(i).sub.j] - [x.sup.(i).sub.k])[([x.sup.(i).sub.j] - [x.sup.(i).sub.k]).sup.T]. (17)

After further conversion,

[F.sub.c] = [C.summation over (i=1)][p.sub.i]([[mu].sup.0.sub.i] - [[mu].sup.0])[([[mu].sup.0.sub.i] - [[mu].sup.0]).sup.T] = [S.sub.b],

[F.sub.s] = [C.summation over (i=1)][p.sub.i][[N.sub.i].summation over (j=1)]([x.sup.(i).sub.j] - [[mu].sup.0.sub.i])[([x.sup.(i).sub.j] - [[mu].sup.0.sub.i]).sup.T] = [S.sub.w]. (18)

Substitute (17) into (14); then

J([psi]) = arg max [[trace ([[psi].sup.T][S.sub.b][psi])]/[trace ([[psi].sup.T][S.sub.w][psi])]] = [[trace ([[psi].sup.T][F.sub.c][psi])]/[trace ([[psi].sup.T][F.sub.s][psi])]]. (19)

Thus, LDA is a particular case of factor analysis framework when the entire weight matrix is 1 and the basis of partition is class information. The content factor is the accumulation of mean variance among each class and the style factor is the variance of each sample in the same class.
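This equivalence can be checked numerically: with all weights equal to 1, empirical priors [p.sub.i] = [N.sub.i]/N, and the pairwise scaling of (17), the pairwise sums reproduce [S.sub.b] and [S.sub.w] exactly. A small verification sketch with random data (our own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = np.repeat([0, 1, 2], 10)
classes = np.unique(y)
mu = X.mean(axis=0)
means = {c: X[y == c].mean(axis=0) for c in classes}
p = {c: np.mean(y == c) for c in classes}

# classical definitions, as in (15)-(16)
Sb = sum(p[c] * np.outer(means[c] - mu, means[c] - mu) for c in classes)
Sw = sum(p[c] * sum(np.outer(x - means[c], x - means[c]) for x in X[y == c])
         for c in classes)

# pairwise (factor analysis) forms with all weights equal to 1
Fc = np.zeros((4, 4))
for a, ci in enumerate(classes[:-1]):
    for cj in classes[a + 1:]:
        d = means[ci] - means[cj]
        Fc += p[ci] * p[cj] * np.outer(d, d)

Fs = np.zeros((4, 4))
for c in classes:
    Xi = X[y == c]
    Ni = len(Xi)
    for j in range(Ni - 1):
        for k in range(j + 1, Ni):
            d = Xi[j] - Xi[k]
            Fs += (p[c] / Ni) * np.outer(d, d)
```

Here Fc coincides with Sb because the sample mean equals the prior-weighted mean of the class means, and Fs coincides with Sw by the standard pairwise-scatter identity within each class.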

2.3.2. The Relationship with MFA [20] and DLA [21]. In order to remove the assumption of LDA that the samples of each class follow a Gaussian distribution, MFA redefines the within-class and between-class scatter matrices of LDA, and the objective function is as follows:

J([psi]) = arg max [[[summation].sub.(i,j)[member of][P.sub.[k.sub.2]](c,c')][[parallel][[psi].sup.T][x.sub.i] - [[psi].sup.T][x.sub.j][parallel].sup.2]]/[[[summation].sub.c][[summation].sub.i][[summation].sub.j[member of][N.sub.[k.sub.1]](i)][[parallel][[psi].sup.T][x.sub.i] - [[psi].sup.T][x.sub.j][parallel].sup.2]]], (20)

where c is the class, [P.sub.[k.sub.2]](c, c') is the set of [k.sub.2]-nearest data pairs between two different classes c and c', and [N.sub.[k.sub.1]](i) indicates the index set of the [k.sub.1]-nearest neighbors of the sample i within the cth class.

It can be concluded that MFA is also a particular case of the factor analysis framework, except that it considers both the class information and the nearest neighborhoods of the samples in the partition design. The numerator of (20) can be rewritten as follows:

[mathematical expression not reproducible], (21)

where [[mu].sup.0.sub.i], [[mu].sup.0.sub.j] are the mean vectors of the [k.sub.2]-nearest pairs between the ith and jth classes. Therefore, the interclass separability can be seen as the content factor in the factor analysis framework. Though the class information and nearest neighborhoods are considered in the design of the interclass separability, the weight matrix is ignored.

Similarly, the denominator of (20) can be rewritten as follows:

[mathematical expression not reproducible]. (22)

Therefore, the intraclass compactness of MFA can be seen as the style factor in the factor analysis framework. It is computed from the [k.sub.1]-nearest-neighbor relationships of the samples within the same class, again disregarding the design of the weight matrix.

For DLA, the design criterion of part optimization is that the distance of the nearest neighbor image is minimized within class and maximized among classes. The objective function of DLA is equal to (20), and thus DLA is a particular case of factor analysis framework also.

2.3.3. The Relationship with LPP. Referring to (3), let [y.sub.i] = [[psi].sup.T][x.sub.i]. Assume that all the prior probabilities [p.sub.i] are identical; then the content and style variance matrices can be represented as follows:

[mathematical expression not reproducible], (23)

where [F.sub.s] can be represented as follows:

[F.sub.s] = [C.summation over (i=1)][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)][W.sup.s.sub.jk][[parallel][x.sup.(i).sub.j] - [x.sup.(i).sub.k][parallel].sup.2]. (24)

Partition the samples according to k-nearest neighbors; then (24) can be written as

[F.sub.s] = [N.summation over (i)][N.summation over (j)][[parallel][x.sub.i] - [x.sub.j][parallel].sup.2][w.sub.ij], (25)

where W [member of] [R.sup.NxN] is the relation matrix weighted by the heat kernels: w(i,j) = exp(-[[parallel][x.sub.i] - [x.sub.j][parallel].sup.2]/t) if [x.sub.i] is one of the k-nearest neighbors of [x.sub.j] or [x.sub.j] is one of the k-nearest neighbors of [x.sub.i]; otherwise it is 0, and t is a tuning parameter.

Thus (3) can be represented as follows:

J([psi]) = arg min [[psi].sup.T]XL[X.sup.T][psi], s.t. [[psi].sup.T]XD[X.sup.T][psi] = 1, (26)

where D is a diagonal matrix whose elements are the column (or row, since W is symmetric) sums of W, [D.sub.ii] = [[summation].sub.j][W.sub.ij], and L = D - W is the graph Laplacian. It can be seen that LPP is also a particular case of the factor analysis framework. The style factor is the weighted accumulation of sample variance within the same neighborhood.
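The graph quantities of LPP can be assembled directly. The sketch below (toy data, our own) builds the heat kernel affinity W on a symmetrized k-nearest-neighbor graph, the degree matrix D, and the Laplacian L = D - W, and checks the standard identity that the weighted pairwise scatter of a 1-dimensional projection equals 2yᵀLy:

```python
import numpy as np

def lpp_graph(X, k=3, t=1.0):
    # heat-kernel affinity on a symmetrized k-NN graph, as in LPP
    N = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(d2[i])[1:k + 1]    # skip the point itself
        W[i, nn] = np.exp(-d2[i, nn] / t)
    W = np.maximum(W, W.T)                 # symmetrize ("or" neighbor rule)
    D = np.diag(W.sum(axis=1))
    return W, D, D - W

rng = np.random.default_rng(2)
X = rng.standard_normal((12, 3))
W, D, L = lpp_graph(X)
y = X @ rng.standard_normal(3)             # a 1-d linear projection psi^T x
lhs = sum((y[i] - y[j]) ** 2 * W[i, j] for i in range(12) for j in range(12))
rhs = 2.0 * y @ L @ y
```

This identity is exactly what turns the pairwise style variance of (25) into the quadratic form of (26).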

2.3.4. The Relationship with PCA. PCA is a commonly used algorithm for dimensionality reduction in pattern recognition, the objective of which is to find the mapping space that best represents the original data in the minimum mean square sense. The objective function for the dataset X = [{[x.sub.i]}.sup.n.sub.i=1] [member of] [R.sup.m] is

J([psi]) = arg [max.sub.[parallel][psi][parallel]=1] [[psi].sup.T][SIGMA][psi], (27)

where [SIGMA] is the covariance matrix:

[SIGMA] = [n.summation over (i=1)]([x.sub.i] - [mu])[([x.sub.i] - [mu]).sup.T], (28)

where [mu] is the mean of all the samples. Reformulating [SIGMA] with different indices gives

[SIGMA] = [n.summation over (i=1)]([x.sub.i] - [1/n][n.summation over (j=1)][x.sub.j])[([x.sub.i] - [1/n][n.summation over (j=1)][x.sub.j]).sup.T]. (29)

Change the summation order in (29); then

[SIGMA] = [n.summation over (i=1)][x.sub.i][x.sup.T.sub.i] - [1/n][n.summation over (i=1)][n.summation over (j=1)][x.sub.i][x.sup.T.sub.j]. (30)

[SIGMA] = [1/2n][n.summation over (i=1)][n.summation over (j=1)]([x.sub.i] - [x.sub.j])[([x.sub.i] - [x.sub.j]).sup.T]. (31)

Set [W.sup.s.sub.ij] to be an n x n matrix in which each element is equal to 1/2n; then style variance matrix [F.sub.s] can be represented as follows:

[F.sub.s] = [n.summation over (i=1)][n.summation over (j=1)][W.sup.s.sub.ij]([x.sub.i] - [x.sub.j])[([x.sub.i] - [x.sub.j]).sup.T] = [1/2n][n.summation over (i=1)][n.summation over (j=1)]([x.sub.i] - [x.sub.j])[([x.sub.i] - [x.sub.j]).sup.T]. (32)

Treat each sample as an independent partition; then

[F.sub.s] = [1/2n][n.summation over (i=1)][n.summation over (j=1)]([x.sub.i] - [x.sub.j])[([x.sub.i] - [x.sub.j]).sup.T] = [SIGMA]. (33)

Thus (3) can be represented as follows:

J([psi]) = arg [max.sub.[parallel][psi][parallel]=1] [[psi].sup.T][F.sub.s][psi]. (34)

It can be seen that PCA is also a particular case of factor analysis framework. The style factor of PCA is the accumulation of all the sample variance.
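The key step of this derivation, that the all-pairs form with every weight equal to 1/2n reproduces the scatter matrix Σ, is easy to confirm numerically (toy data, our own sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 3))
n = len(X)
mu = X.mean(axis=0)

# scatter matrix as in (28) (unnormalized form)
Sigma = sum(np.outer(x - mu, x - mu) for x in X)

# all-pairs form as in (31)/(33), with every weight equal to 1/(2n)
Fs = sum(np.outer(X[i] - X[j], X[i] - X[j])
         for i in range(n) for j in range(n)) / (2 * n)
```

The agreement follows from expanding the pairwise outer products and cancelling the cross terms, exactly as in the passage from (29) to (31).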

2.4. Related Works and Discussions. Subspace-based feature extraction methods have received considerable attention. Unifying the current subspace analysis methods makes it convenient to comprehend their essence and to compare their differences. Various frameworks have been proposed in order to mine deeply the commonality and individuality of current feature extraction methods and to provide a new platform for proposing new methods. In 2007, Yan et al. proposed the Graph Embedding (GE) framework [20], which describes most current subspace methods and shows that linear subspace methods differ in their graph structures, while linear and kernel methods differ in how the embedding is performed. In 2009, [21] proposed a Patch Alignment (PA) framework motivated by dimensionality reduction. In the same year, a universal learning framework of supervised subspace [22] was proposed, which is based on the fundamental concept of discriminant analysis and evaluates the classification capacity of features through the ratio between the intraclass compactness and interclass separability after projection. It proves that the LDA, LPP, and NPE algorithms are all particular cases of that framework, and that the essence of these algorithms, whether supervised or unsupervised, is to minimize the intraclass compactness while maximizing the interclass separability; they differ only in how sample classes are handled.

We propose the factor analysis framework based on a thorough study of subspace methods and existing unified frameworks. The factor analysis framework is superior to other frameworks in the following aspects:

(1) The classification meaning is clearer and the design is more flexible. The content and style factors in factor analysis vary with the application, so various design schemes for the two factors can be proposed by researchers. The interclass separability and intraclass compactness in the universal learning framework of supervised subspace proposed in [22] are exactly the content and style factors of the factor analysis framework. The part optimization design in [21] is the process of factor separation. Meanwhile, the factor analysis criterion provides a flexible weight allocation scheme for research.

(2) The factor analysis framework accommodates the feature extraction and dimensionality reduction methods in common use, including linear, nonlinear, and tensor decomposition methods, as well as both supervised and unsupervised feature extraction methods.

(3) The factor analysis framework is designed for classification, so superior feature extraction methods can be designed through this platform. The FA-LDA algorithm presented in Section 3.1 and FA-LPP algorithm presented in Section 3.2 are both improved feature extraction methods proposed using factor analysis framework.

3. FA-LDA and FA-LPP

3.1. FA-LDA: Weight Design Based on Factor Analysis. In addition to encompassing most popular dimensionality reduction algorithms, the proposed framework can also be used as a general platform for designing new ones. With this framework, we develop a new dimensionality reduction algorithm that avoids a limitation of traditional Linear Discriminant Analysis, namely its suboptimality.

Research on LDA has mainly focused on solving the Fisher criterion and on small sample problems, but little on the Fisher criterion itself. In fact, the Fisher criterion is suboptimal for classification problems with K (K > 2) classes. Marco Loog was one of the earliest researchers to point out this problem and proposed an improved weighting algorithm based on a Bayesian error rate function [18]. Meanwhile, Li et al. proposed an improved weighting algorithm based on Euclidean distance in the same year [17]. After that, fractional-step LDA was proposed in [23] to solve the suboptimality of the Fisher criterion. Fractional-step LDA is effective but time consuming since it is an iterative algorithm. In this section, we improve the Fisher criterion based on factor analysis in order to solve its suboptimality.

In fact, the content factor [F.sub.c] and style factor [F.sub.s] in the factor analysis criterion are identical to the interclass and intraclass scatter matrices [S.sub.b], [S.sub.w] when every weight equals 1. When a pair of classes lies far apart, the covariance ([[mu].sub.i] - [[mu].sub.j])[([[mu].sub.i] - [[mu].sub.j]).sup.T] is large and totally dominates [S.sub.b], so that the final projection matrix overemphasizes the classes that are already well separated and makes less use of the pairs of classes (i', j') that lie near each other. However, it is the pairs of classes near each other that deserve more attention.

According to the above discussion, the weight should be reduced for the class pairs far away from each other while it should be increased for the class pairs near each other. [F.sub.c] is obtained from weighting [S.sub.b]:

[F.sub.c] = [C-1.summation over (i=1)][C.summation over (j=i+1)][p.sub.i][p.sub.j]w(i, j)([[mu].sub.i] - [[mu].sub.j]) [([[mu].sub.i] - [[mu].sub.j]).sup.T], (35)

where

w(i, j) = [1/n][n.summation over (k=1)][gamma]([[mu].sub.i](k), [[mu].sub.j](k)), (36)

where n is the vector dimensionality, [[mu].sub.i](k), [[mu].sub.j](k) are the kth feature in the mean vector of the ith and jth class, respectively, and [gamma]([[mu].sub.i](k), [[mu].sub.j](k)) is the correlation coefficient of the kth feature in the mean vector of the ith and jth class, defined as follows:

[gamma]([[mu].sub.i](k), [[mu].sub.j](k)) = [[[DELTA].sub.min] + [rho][[DELTA].sub.max]]/[[[DELTA].sub.ij](k) + [rho][[DELTA].sub.max]], [[DELTA].sub.ij](k) = [absolute value of [[mu].sub.i](k) - [[mu].sub.j](k)], (37)

where [[DELTA].sub.min] and [[DELTA].sub.max] are the minimum and maximum of [[DELTA].sub.ij](k) over k, and [rho] [member of] (0, 1) is the distinguishing coefficient.

Similarly, the objective of the factor analysis criterion is to remove the style factor as much as possible. Thus, the contribution to the eigenvalue problem from sample pairs that are near each other within the same class should be reduced, since such pairs are likely to suffer little interference from the style factor. Conversely, the contribution from sample pairs that are far apart within the same class should be increased, since at least one sample of such a pair certainly contains the style factor, and the optimization function of the criterion should emphasize such pairs. [F.sub.s] is obtained by weighting [S.sub.w]:

[F.sub.s] = [C.summation over (i=1)][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)]w(j,k) ([x.sup.(i).sub.j] - [x.sup.(i).sub.k])[([x.sup.(i).sub.j] - [x.sup.(i).sub.k]).sup.T], (38)

where the weighting function w(j, k) is defined as follows:

w(j, k) = exp (-[[parallel][x.sub.j] - [x.sub.k][parallel].sup.2]/2[t.sup.2]), (39)

where t is an empirical constant. The larger the within-class distance, the heavier the corresponding weight in [S.sup.-1.sub.w], and vice versa.

Given the weighted [F.sub.c], [F.sub.s], FA-LDA is defined as follows:

J([psi]) = arg max [[trace ([[psi].sup.T][F.sub.c][psi])]/[trace ([[psi].sup.T][F.sub.s][psi])]]. (40)
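Given the weighted [F.sub.c] and [F.sub.s], the projection can be obtained from the generalized eigenproblem [F.sub.c][psi] = [lambda][F.sub.s][psi], e.g. via the leading eigenvectors of [F.sup.-1.sub.s][F.sub.c], as in classical LDA. The sketch below uses simple stand-in weight functions; the function names, the regularization term, and the weights themselves are our own assumptions for illustration:

```python
import numpy as np

def fa_lda(X, y, wc, ws, dim, reg=1e-6):
    # FA-LDA sketch: weighted pairwise between/within scatter matrices,
    # then the leading eigenvectors of Fs^{-1} Fc give the projection.
    classes = np.unique(y)
    m = X.shape[1]
    p = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}

    Fc = np.zeros((m, m))
    for a, ci in enumerate(classes[:-1]):
        for cj in classes[a + 1:]:
            d = means[ci] - means[cj]
            Fc += p[ci] * p[cj] * wc(means[ci], means[cj]) * np.outer(d, d)

    Fs = np.zeros((m, m))
    for c in classes:
        Xi = X[y == c]
        for j in range(len(Xi) - 1):
            for k in range(j + 1, len(Xi)):
                d = Xi[j] - Xi[k]
                Fs += ws(Xi[j], Xi[k]) * np.outer(d, d)

    Fs += reg * np.eye(m)                      # regularize for invertibility
    vals, vecs = np.linalg.eig(np.linalg.solve(Fs, Fc))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:dim]]

rng = np.random.default_rng(4)
X = np.vstack([rng.standard_normal((15, 3)) + off
               for off in ([0, 0, 0], [4, 0, 0], [0, 4, 0])])
y = np.repeat([0, 1, 2], 15)
t = 1.0
Psi = fa_lda(X, y,
             wc=lambda a, b: 1.0 / np.sum((a - b) ** 2),  # stand-in content weight
             ws=lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * t ** 2)),
             dim=2)
```

The stand-in content weight mimics the inverse-distance idea of (8); the grey correlation weight of (36) could be substituted without changing the rest of the procedure.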

3.2. FA-LPP. In the LPP algorithm, the style variance within the same local manifold is minimized after projection; namely, the local manifold is preserved in the low-dimensional space. We improve the LPP algorithm based on the factor analysis criterion so that, after projection, the style variance within the local manifold of the same class is minimized and the content variance between the local manifolds of different classes is maximized. Thus, the style factor of FA-LPP is

[F.sub.s] = [C.summation over (i=1)][[N.sub.i]-1.summation over (j=1)][[N.sub.i].summation over (k=j+1)][w.sup.s.sub.jk] [[parallel][x.sup.(i).sub.j] - [x.sup.(i).sub.k][parallel].sup.2], (41)

where weight matrix [w.sup.s.sub.jk] is

[w.sup.s.sub.jk] = exp (-[[parallel][x.sup.(i).sub.j] - [x.sup.(i).sub.k][parallel].sup.2]/t) if [x.sup.(i).sub.j] is among the k-nearest neighbors of [x.sup.(i).sub.k] or [x.sup.(i).sub.k] is among the k-nearest neighbors of [x.sup.(i).sub.j]; otherwise [w.sup.s.sub.jk] = 0. (42)

The content factor of FA-LPP is

[F.sub.c] = [C-1.summation over (i=1)][C.summation over (j=i+1)]w(i, j)([[mu].sub.i] - [[mu].sub.j]) [([[mu].sub.i] - [[mu].sub.j]).sup.T]. (43)

[[mu].sub.i], [[mu].sub.j] are the mean values of the k-nearest neighbors of the ith and jth classes, respectively, and w(i, j) is computed by (36). Given the definition of [F.sub.c], [F.sub.s], FA-LPP can be defined as follows:

J([psi]) = arg max [[trace ([[psi].sup.T][F.sub.c][psi])]/[trace ([[psi].sup.T][F.sub.s][psi])]]. (44)

4. Experiments

4.1. FA-LDA. To evaluate the proposed FA-LDA algorithm, we compare it with the LDA algorithm on artificial data and on a real-world face dataset. During the testing phases, the nearest neighbor (NN) rule was used for classification. Note that [24] proposed a probabilistic graphical model framework to mimic the process of generating a face image. However, it focuses on addressing the obstacles of small sample sets, occlusion, and illumination variations in one-sample face identification, so we do not compare our algorithm with [24].

Experiments on Artificial Data. The artificial data includes 3 subsets with 3, 4, and 5 classes, respectively, and the prior probability of each class is identical. Each class contains 60 samples with 9 dimensions. In each subset there is one class that is easy to discriminate, so its within-class covariance matrix plays a dominant role compared to the other classes. Meanwhile, the other classes are assumed to obey normal distributions with identical within-class covariance matrices. In the experiments, 10 samples per class are selected as the training set and the other 50 samples are equally divided into 10 groups as test sets. The experiments on each subset are conducted 10 times, and each experiment is repeated 10 times. The average classification rate and variance of the two algorithms are listed in Table 1. It can be seen that the classification performance of the LDA algorithm is not good on this artificial data, which indicates the limitation of LDA; FA-LDA is superior to LDA in classification performance. Since the LDA algorithm simply uses the average of the within-class scatter matrices as a unified covariance matrix, it overemphasizes the scatter matrix of the easily discriminated class. In the FA-LDA algorithm, by contrast, the weights of this easily separated class are small and more attention is paid to the confusable classes.

Experiments on Real-World Data. The ORL database, which contains 400 images of 40 individuals, is used for the real-world experiment. The images were captured at different times and with different variations, including expression and facial details, as can be seen in Figure 2. In order to evaluate the superiority of FA-LDA, we compare it with LDA and current weighted algorithms, where WLDA1 [17] is based on Euclidean distance and WLDA2 [19] is based on the confusion matrix. For training, we randomly selected different numbers (three, four, five, six, seven, eight, and nine) of images per individual and used the rest of the images for testing. Each such trial was independently performed 10 times, and the average recognition results were then calculated. The results are shown in Figure 3. The classification performance of LDA is the worst, which illustrates the validity of the weighting scheme based on the factor analysis criterion. The FA-LDA algorithm achieves the best performance because it not only weights [S.sub.b] to reduce the influence of far-apart classes, but also weights [S.sub.w] to improve the anti-interference ability of the discriminant criterion.

4.2. FA-LPP. To evaluate the proposed FA-LPP algorithm, we compare it with LPP, MFA, and DLA on the ORL and Yale databases. The Yale database was built by the Computational Vision and Control Center at Yale University. Each subject has 11 images varying in expression (blinking, glad, sad, or surprised), illumination condition (frontal, left, or right), and facial details (glasses or no glasses), as shown in Figure 4.

For the ORL and Yale databases, the image set is partitioned into different gallery and probe sets, where [G.sub.m]/[P.sub.n] indicates that m images per person are randomly selected for training and the remaining n images are used for testing. For LPP, the important parameter is k (the number of nearest neighbors). For the MFA and DLA algorithms, the important parameters are [k.sub.1] (the number of same-class nearest neighbors) and [k.sub.2] (the number of different-class nearest neighbors). For the proposed FA-LPP, the only important parameter is k (the number of same-class nearest neighbors).
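For reference, the neighborhood graph that these parameters control can be built as in the following sketch, using the standard heat-kernel weighting from the LPP literature; the parameters k and t here are illustrative defaults, not the values tuned in the experiments.

```python
import numpy as np

def lpp_graph(X, k, t=1.0):
    """Symmetric k-nearest-neighbor affinity matrix with heat-kernel
    weights, plus the graph Laplacian L = D - W used by LPP-style
    embeddings (a minimal sketch)."""
    # pairwise squared Euclidean distances
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]  # k nearest, excluding the point itself
        W[i, nbrs] = np.exp(-D2[i, nbrs] / t)
    W = np.maximum(W, W.T)  # symmetrize: keep an edge if either endpoint chose it
    L = np.diag(W.sum(axis=1)) - W
    return W, L
```

In supervised variants such as MFA, DLA, and FA-LPP, the neighbor search above would be restricted by class label ([k.sub.1] same-class and, for MFA/DLA, [k.sub.2] different-class neighbors), which is what the parameter counts in the text refer to.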

Each experiment is repeated 10 times, and the average classification rates on the ORL and Yale databases are listed in Tables 2 and 3, which show the superiority of our algorithm. The LPP algorithm uses only the nearest neighborhood for partition, while the MFA, DLA, and FA-LPP algorithms consider not only the nearest neighborhood but also the class information; thus, their classification accuracies are higher than that of LPP. Since FA-LPP additionally incorporates the weighting scheme, its classification accuracy is higher than those of MFA and DLA. Moreover, FA-LPP has only a single nearest-neighbor parameter to tune.

4.3. Comprehensive Experiments. To comprehensively analyze the performance of FA-LDA and FA-LPP, we conducted a facial age group classification experiment and compared them with LDA, WLDA1, WLDA2, LPP, and DLA. The images for age group classification were collected from multiple sources: 1002 facial images from the FG-NET aging database, 500 images from Google, and 600 images from scanned photographs, for a total of 2102 sample facial images. A few of them are shown in Figure 5. For experimental purposes, the 2102 images were split into training and test sets. The training set consists of 700 images from the FG-NET aging database, 300 images from Google, and 400 images from scanned photographs, for a total of 1400 images. The test set consists of 302 images from FG-NET, 200 images from Google, and 200 images from scanned photographs, for a total of 702 images.

For the four-class classification, faces in the training dataset are assigned to age groups according to their labeled ages. Since every face is labeled with an age value, the dataset provides the data necessary for building the factors analysis framework.

In this experiment, Gabor features were first extracted from the original facial images, and then the FA-LDA, FA-LPP, LDA, WLDA1, WLDA2, LPP, and DLA algorithms were applied for feature extraction and dimensionality reduction. The comparative results in Figure 6 show that our methods (FA-LDA and FA-LPP) outperformed the others. The classification accuracies of DLA and of our FA-LDA and FA-LPP are higher than those of the LDA, WLDA1, WLDA2, and LPP algorithms. The reason is that LDA and LPP perform dimensionality reduction without considering weight design, while WLDA1 and WLDA2 do not consider partition design. Furthermore, since LPP is based on manifold learning, which suits nonlinear problems such as facial age group classification, LPP and FA-LPP outperform LDA and FA-LDA, respectively.
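The Gabor feature extraction step can be illustrated with a minimal filter bank. The kernel size, wavelength, and orientations below are assumptions for the sketch, as the paper does not specify its Gabor parameters.

```python
import numpy as np

def gabor_kernel(size=15, wavelength=4.0, theta=0.0, sigma=3.0, gamma=0.5):
    """Real part of a 2-D Gabor filter: a Gaussian envelope modulating
    a cosine carrier rotated by angle theta (illustrative parameters)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_features(img, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Concatenate filter-response magnitudes over several orientations;
    circular convolution via FFT keeps the sketch short."""
    feats = []
    for th in thetas:
        k = gabor_kernel(theta=th)
        padded = np.zeros_like(img, dtype=float)
        padded[:k.shape[0], :k.shape[1]] = k
        resp = np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(padded)).real
        feats.append(np.abs(resp).ravel())
    return np.concatenate(feats)
```

The resulting high-dimensional Gabor vectors are what the compared algorithms (FA-LDA, FA-LPP, LDA, etc.) then project down for classification; in practice a real pipeline would also vary the filter scale and downsample the responses.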

Recently, deep learning has attracted considerable attention as a new area of machine learning research [25, 26]. Our factors analysis framework can be embedded into a deep-learning-based object recognition system as the feature dimensionality reduction component, and can therefore be seamlessly integrated with deep learning methods.

5. Conclusions

In this paper, we aim to provide insight into the relationships among state-of-the-art dimensionality reduction algorithms and to facilitate the design of new ones. A general framework, termed factors analysis, together with its linearization, kernelization, and tensorization, has been proposed to provide a unified perspective for understanding and comparing many popular dimensionality reduction algorithms. Moreover, the factors analysis framework can serve as a general platform for developing new dimensionality reduction algorithms. Based on this framework, we have proposed two novel algorithms, FA-LDA and FA-LPP, by designing objective functions for factor separation that characterize the weight matrix and the partition, and by optimizing the corresponding criteria. These algorithms are shown to effectively overcome the data distribution assumptions of the traditional LDA and LPP algorithms; FA-LDA and FA-LPP are therefore more general algorithms for discriminant analysis.

A byproduct of this paper is a series of kernelized and tensorized versions of the factors analysis. In future work, we plan to systematically compare all possible extensions of the algorithms mentioned in this paper.

http://dx.doi.org/10.1155/2016/9347838

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported partially by the China Postdoctoral Science Foundation (no. 2015M582355), the Doctoral Scientific Research Start-up Project of Hubei University of Science and Technology (no. BK1418), the National Natural Science Foundation of China (NSFC) (nos. 61271256 and 61373108), and the Natural Science Foundation of Hubei Province (no. 2015cfc778).

References

[1] Q. Zhang, L. Zhang, Y. Yang, Y. Tian, and L. Weng, "Local patch discriminative metric learning for hyperspectral image feature extraction," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 3, pp. 612-616, 2014.

[2] Y. Fu, S. Yan, and T. S. Huang, "Classification and feature extraction by simplexization," IEEE Transactions on Information Forensics & Security, vol. 3, no. 1, pp. 91-100, 2008.

[3] A. M. Martinez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.

[4] J. Ye, R. Janardan, C. H. Park, and H. Park, "An optimization criterion for generalized discriminant analysis on undersampled problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 982-994, 2004.

[5] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, "Face recognition using Laplacianfaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 328-340, 2005.

[6] Y. Su, Y. Fu, X. Gao, and Q. Tian, "Discriminant learning through multiple principal angles for visual recognition," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1381-1390, 2012.

[7] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.

[8] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.

[9] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., pp. 585-591, 2001.

[10] K. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, "An introduction to kernel-based learning algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181-201, 2001.

[11] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H.-J. Zhang, "Discriminant analysis with tensor representation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 526-532, IEEE, June 2005.

[12] J. Yang, D. Zhang, A. F. Frangi, and J.-Y. Yang, "Two-dimensional PCA: a new approach to appearance-based face representation and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 1, pp. 131-137, 2004.

[13] J. Ye, "Generalized low rank approximations of matrices," in Proceedings of the 21st International Conference on Machine Learning (ICML '04), p. 112, Banff, Canada, July 2004.

[14] J. Ye, R. Janardan, and Q. Li, "Two-dimensional linear discriminant analysis," Advances in Neural Information Processing Systems, vol. 17, pp. 1569-1576, 2005.

[15] V. Popa, J. Nurminen, and M. Gabbouj, "A study of bilinear models in voice conversion," Journal of Signal and Information Processing, vol. 2, no. 2, pp. 125-139, 2011.

[16] H. Tan, B. Cheng, J. Feng, G. Feng, and Y. Zhang, "Tensor recovery via multi-linear augmented lagrange multiplier method," in Proceedings of the 6th International Conference on Image and Graphics (ICIG '11), pp. 141-146, Hefei, China, August 2011.

[17] Y. Li, Y. Gao, and H. Erdogan, "Weighted pairwise scatter to improve linear discriminant analysis," in Proceedings of the 6th International Conference on Spoken Language Processing, vol. 4, pp. 608-611, Beijing, China, October 2000.

[18] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass linear dimension reduction by weighted pairwise Fisher criteria," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 7, pp. 762-766, 2001.

[19] H.-S. Lee and B. Chen, "Linear discriminant feature extraction using weighted classification confusion information," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 2254-2257, Brisbane, Australia, September 2008.

[20] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: a general framework for dimensionality reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, 2007.

[21] T. Zhang, D. Tao, X. Li, and J. Yang, "Patch alignment for dimensionality reduction," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1299-1313, 2009.

[22] J. Yang, S. Yan, and T. S. Huang, "Ubiquitously supervised subspace learning," IEEE Transactions on Image Processing, vol. 18, no. 2, pp. 241-249, 2009.

[23] R. Lotlikar and R. Kothari, "Fractional-step dimensionality reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 623-627, 2000.

[24] X. Zhao, X. Li, Z. Wu, Y. Fu, and Y. Liu, "Multiple subcategories parts-based representation for one sample face identification," IEEE Transactions on Information Forensics & Security, vol. 8, no. 10, pp. 1654-1664, 2013.

[25] Y. Jia, E. Shelhamer, J. Donahue et al., "Caffe: convolutional architecture for fast feature embedding," http://arxiv.org/abs/1408.5093.

[26] S. Yang, P. Luo, C. Loy, K. W. Shum, and X. Tang, "Deep representation learning with target coding," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), Austin, Tex, USA, January 2015.

Ningbo Hao, (1,2) Jie Yang, (1) Haibin Liao, (3,4) and Wenhua Dai (3)

(1) School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China

(2) School of International College, Huanghuai University, Zhumadian, Henan 463000, China

(3) School of Computer Science and Technology, Hubei University of Science and Technology, Xianning 437100, China

(4) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Correspondence should be addressed to Haibin Liao; liao_haibing@163.com

Received 22 October 2015; Revised 7 March 2016; Accepted 28 March 2016

Caption: Figure 1: Face factors analysis (the content factors, free of illumination and pose variation, are what we want to extract; the style factors, carrying illumination and pose variation, are what we want to remove).

Caption: Figure 2: The sample facial images of ORL database.

Caption: Figure 3: Best recognition rates of four algorithms on the testing sets of ORL.

Caption: Figure 4: The sample facial images of Yale database.

Caption: Figure 5: Sample images of FG-Net aging database (the same person with face images of different ages).

Caption: Figure 6: Comparisons of the performance of different approaches on facial age database.
```Table 1: Classification accuracies of LDA and FA-LDA on the
artificial database.

3 classes

FA-LDA           LDA

83.5 ± 7.45      62 ± 8.34
81 ± 9.01        56.7 ± 7.45
83.5 ± 9.07      59.5 ± 10.53
80.5 ± 8.67      57 ± 9.57
81 ± 9.01        53.5 ± 12.45
81.5 ± 6.26      61.5 ± 9.57
84.5 ± 6.58      60 ± 6.7
82.5 ± 8.08      59.5 ± 11.56
81 ± 8.45        64.5 ± 6.87

4 classes

FA-LDA           LDA

75.5 ± 5.6       38.54 ± 10.21
73 ± 6.68        45.85 ± 7.37
73.5 ± 5.62      46.33 ± 5.36
76.67 ± 5.6      42.64 ± 7.6
78.56 ± 5.21     43.85 ± 7.49
78.85 ± 5.34     40.28 ± 6.35
80.21 ± 8.63     42.84 ± 5.49
78.26 ± 5.23     41.45 ± 8.16
75.95 ± 6.72     43.58 ± 6.53

5 classes

FA-LDA           LDA

78.26 ± 5.4      45.26 ± 7.85
75.66 ± 6.28     46.15 ± 8.21
75.5 ± 7.25      39.54 ± 9.25
79.45 ± 5.5      48.26 ± 6.51
76.5 ± 7.50      44.58 ± 9.21
70.57 ± 9.14     39.83 ± 5.68
75.64 ± 5.60     46.25 ± 7.35
76.42 ± 5.60     42.35 ± 5.36
77.38 ± 6.58     46.25 ± 8.24

Table 2: Best recognition rates of four algorithms on the testing sets
of ORL.

[G.sub.3]/[P.sub.7]  [G.sub.4]/[P.sub.6]  [G.sub.5]/[P.sub.5]

LPP          85.0% (60)           89.6% (141)          96.0% (158)
MFA       89.3% (39, 2, 2)     91.4% (46, 2, 1)     96.0% (42, 4, 2)
DLA        94% (42, 2, 2)      92.5% (53, 2, 1)     95.4% (40, 4, 2)
FA-LPP      94.5% (39, 2)         92% (45, 2)         96.6% (40, 4)

For LPP, the numbers in the parentheses are the selected subspace
dimensions. For FA-LPP, the first numbers in the parentheses are the
selected subspace dimensions and the second numbers are the
parameters. For MFA and DLA, the first numbers in the parentheses are
the selected subspace dimensions and the second and the third numbers
are the parameters [k.sub.1] and [k.sub.2], respectively.

Table 3: Best recognition rates of four algorithms on the testing sets
of Yale.

[G.sub.3]/[P.sub.8]  [G.sub.5]/[P.sub.6]  [G.sub.7]/[P.sub.4]

LPP          63.5% (13)           72.3% (14)           80.5% (14)
MFA       61.0% (13, 2, 2)     74.5% (14, 2, 1)     79.7% (21, 4, 2)
DLA       66.8% (15, 2, 2)     78.9% (23, 2, 1)     81.2% (13, 4, 2)
FA-LPP      73.2% (14, 2)        79.5% (16, 2)        85.4% (15, 4)

For LPP, the numbers in the parentheses are the selected subspace
dimensions. For FA-LPP, the first numbers in the parentheses are the
selected subspace dimensions and the second numbers are the
parameters. For MFA and DLA, the first numbers in the parentheses are
the selected subspace dimensions and the second and the third numbers
are the parameters [k.sub.1] and [k.sub.2], respectively.
```