# Size aware correlation filter tracking with adaptive aspect ratio estimation.

Abstract: Correlation Filter-based Trackers (CFTs) have recently gained popularity for their effectiveness and efficiency. To deal with changes in target size, which may degrade tracking performance, scale estimation has been introduced in existing CFTs. However, variations of the aspect ratio, which also influence the size of the target, have usually been neglected. In this paper, Size Aware Correlation Filter Trackers (SACFTs) are proposed to deal with this problem. The SACFTs not only determine the translation and scale variations but also take aspect ratio changes into consideration, so the size of the target is estimated more accurately and the overall tracking performance improves. Experiments conducted on two large-scale datasets show that competitive results are achieved compared with state-of-the-art methods.

Keywords: Object tracking, visual tracking, correlation filter, correlation filter-based trackers, aspect ratio

1. Introduction

As an active topic in the computer vision community, object tracking has been used in many scenarios, including augmented reality, human-computer interaction, surveillance, robotics, and motion analysis, to name but a few. According to the number of objects to be tracked, object tracking can be divided into single-object tracking and multi-object tracking. In this paper, we confine ourselves to the former, and object tracking refers to single-object tracking unless explicitly stated otherwise.

Given the location and size of the target in the first frame (usually in the form of a bounding box), object tracking aims to estimate its new location and size in the following frames. Although great achievements have been made, object tracking is still challenging due to appearance changes (including size variations, deformation, etc.), background clutter, illumination variations, fast motion, and occlusion [1].

Existing approaches for object tracking can be broadly classified into two categories: generative methods and discriminative methods. In generative methods, a model of the target is first built using a subspace [2], sparse representation [3], etc., and tracking is then performed by searching for the region that best matches the model in each frame. Satisfactory performance can be achieved provided that the target undergoes little variability. However, only the information from the foreground target is taken into consideration, and the potentially useful information from the background is usually neglected.

For discriminative methods, tracking is considered as a binary classification problem. Both foreground and background are sampled to extract features for training a classifier using naive Bayes [4], logistic regression [5,6], SVM [7,8], boosting [9-11], random forest [12,13], etc. The classifier is then used to distinguish the target from its surrounding background. Discriminative methods are believed to outperform their generative counterparts in complicated environments [14,15] and thus have attracted far more attention. A thorough comparison of different methods is beyond the scope of this paper, and readers can refer to two distinguished surveys on object tracking [1,16] for more details.

Among the discriminative methods, Correlation Filter-based Trackers (CFTs) have recently gained popularity for their effectiveness and efficiency. A correlation filter is a template used for cross-correlating with signals and has been widely used in pattern recognition. Bolme et al. first introduced the correlation filter into object tracking and proposed a filter named Minimum Output Sum of Squared Error (MOSSE) [17]. As with traditional correlation filters, the location corresponding to the highest correlation output is taken as the position of the tracked object. Moreover, the filter can be updated online to adapt to the changing appearance of the target. Because the Fast Fourier Transform (FFT) is adopted in the correlation process, the tracking frame rate can reach 600 frames per second (fps).

After analyzing the relation between correlation filters and ridge regression with cyclically shifted samples, Henriques et al. proposed a discriminative tracker named Circulant Structure of tracking-by-detection with Kernels (CSK) [6]. Both positive and negative samples are densely collected in the target's neighborhood for classifier training. The cyclically shifted dense sampling does not make the computational cost high. On the contrary, it induces the circulant structure and makes the use of FFT in training and detection possible, which speeds up the tracker dramatically. The kernel trick [18] was introduced to enhance the discriminative ability of the classifier. In the evaluation of 29 trackers recently performed by Wu et al. [1] using a large scale Online Tracking Benchmark (OTB), the highest frame rate (over 300 fps) was achieved by CSK [6] among those ranking top 10.

After these pioneering works, numerous improvements and variants have been proposed; readers can refer to [19] for a survey. One aspect that has received much attention is the use of more powerful feature representations in classifier training to better discriminate the target from the background. In a tracker named Kernelized Correlation Filter (KCF) [5], Henriques et al. introduced the Histogram of Oriented Gradients (HOG) feature [20] and extended the use of one-channel features to multi-channel features. Danelljan et al. [21] investigated the contribution of different color features to tracking and proposed to integrate the Color Names feature [22]. Danelljan et al. [23] proposed to use convolutional features learned by Convolutional Neural Networks (CNNs). Experiments demonstrate that these powerful features improve the performance of CFTs compared with the originally used grayscale intensity feature.

Another aspect that has received considerable effort is using correlation filter theory to estimate more state parameters of the target. Early CFTs sample image patches of a fixed size and track the position of the target in successive frames, which in essence estimates only the translation of the target. However, other state parameters also vary because of the target's ego-motion (including translations, in-plane rotations, and out-of-plane rotations), the motion of the camera, and changes of viewpoint, to name but a few. To additionally estimate the scale changes of the target, Li et al. [24] and Danelljan et al. [25] proposed two CFTs named SAMF and DSST respectively, which together with the KCF [5] mentioned above ranked in the top 3 in the Visual Object Tracking challenge 2014 (VOT2014) [26]. Readers can refer to [26] for detailed results and comparisons. Besides scale changes, Zhang et al. [27] also estimated orientation changes by transforming the image patch from the Cartesian coordinate system to the Log-Polar coordinate system.

Despite the success existing CFTs have achieved, to the best of our knowledge, the estimation of aspect ratio changes was usually neglected. The aspect ratio refers to the ratio of the target's width and height. Considering that the aspect ratio and the scale jointly determine the size of the target, estimating both of them would further improve the performance of CFTs.

In this paper, we propose two novel Size Aware Correlation Filter Trackers (SACFTs) on the basis of SAMF [24] and DSST [25] respectively. The SACFTs not only estimate the translations and scale changes of the target like SAMF [24] and DSST [25], but also take the variations of the aspect ratio into consideration to realize a better estimation of the size.

The remainder of the paper is organized as follows. To keep the paper self-contained, we introduce the background and related works in Section 2. Then two novel Size Aware Correlation Filter Trackers (SACFTs) are proposed in Section 3. Afterwards, we conduct thorough experiments using the OTB dataset [1] and the Visual Object Tracking challenge 2015 (VOT2015) dataset [28] in Section 4, followed by the conclusion in Section 5.

2. Background and related works

2.1 Correlation filter and MOSSE [17]

Correlation filters were originally proposed for pattern recognition applications such as target recognition and detection [29]. An example of car detection using a correlation filter is shown in Fig. 1, which includes two sub-tasks: 1) determining how probable it is that an image patch contains the car and 2) localizing the car. Assuming that the search window is M in height and N in width, the correlation filter fulfills these tasks by sliding the filter (or template) within the search window to get M x N samples and cross-correlating with them to obtain the correlation output. The correlation responses {[y.sub.i] | i = 1,2,..., M x N} of the given filter [omega] and the features {[x.sub.i] | i = 1,2,..., M x N} extracted from these samples (grayscale intensity in this case) can be calculated according to

$$y = \omega \otimes x \tag{1}$$

where $\otimes$ indicates the cross-correlation operation.

To facilitate the localization, the responses can be rearranged in a matrix indexed by (u, v)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2)

where mod(i, N) is the modulo operation, which returns the remainder of i after division by N. (u, v) is the center of the corresponding image patch represented in the image coordinate system and can also be taken as the translation with respect to the origin.

The output matrix can be visualized in 2D or 3D. As shown in Fig. 1, the hotter the color, the stronger the response and the more probable it is that the corresponding image patch contains the car. The peak value [y.sub.(u, v)] in the output matrix is obtained only when the filter is correlated with the image patch marked by the red bounding box, which is the same as the template and assumed to be centered at the coordinate (u, v). The car is thus detected and localized simultaneously.

A similar idea is used by MOSSE [17], in which a training and detection framework is adopted. Unlike the constant filter shown in Fig. 1, an online-updating filter is used to adapt to appearance changes. MOSSE [17] repeatedly detects the target by finding the translation corresponding to the peak correlation response of the global translation filter [[omega].sub.global_t], and then updates [[omega].sub.global_t] using the local translation filters [[omega].sub.local_t] trained on each frame.

As shown within the red box of Fig. 2, the fast detection is conducted in the frequency domain for speed concerns. According to the cross-correlation theorem, Equation (1) can be converted to

$$y = \mathcal{F}^{-1}\left(\hat{\omega}_{global\_t}^{*} \odot \hat{z}\right) \tag{3}$$

where $\odot$ indicates the Hadamard product, * indicates the complex conjugate, $\mathcal{F}$ and $\mathcal{F}^{-1}$ indicate the Fast Fourier Transform (FFT) and the inverse Fast Fourier Transform (iFFT) respectively, and variables with a hat denote their FFTs. Here we replace x with z to explicitly distinguish features for training from features for detection. The sampled features are preprocessed using a cosine window to mitigate artifacts caused by wrapped-around edges as well as to emphasize the center regions.
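As a minimal sketch of the cross-correlation theorem behind Equation (3), the following compares a directly computed circular cross-correlation with its Fourier-domain counterpart on a 1D signal. The helper names (`dft`, `idft`, `correlate_direct`, `correlate_fourier`) and the toy template are illustrative assumptions, and a naive discrete Fourier transform stands in for the FFT to keep the example dependency-free.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]


def correlate_direct(w, z):
    """Circular cross-correlation: y[l] = sum_m w[m] * z[(m + l) % n]."""
    n = len(z)
    return [sum(w[m] * z[(m + l) % n] for m in range(n)) for l in range(n)]


def correlate_fourier(w, z):
    """Same responses via conj(w_hat) * z_hat, for real-valued w and z."""
    w_hat, z_hat = dft(w), dft(z)
    y_hat = [a.conjugate() * b for a, b in zip(w_hat, z_hat)]
    return [value.real for value in idft(y_hat)]


# Toy data (assumption): the "image" is the template cyclically shifted by 3.
template = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shift = 3
signal = [template[(i - shift) % len(template)] for i in range(len(template))]

direct = correlate_direct(template, signal)
fourier = correlate_fourier(template, signal)
peak = max(range(len(fourier)), key=fourier.__getitem__)
print("peak response at shift", peak)
```

Both routes give the same responses, and the peak index recovers the shift applied to the template, which is how the translation is read off the response map.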

As shown within the red box of Fig. 3, fast training is also conducted in the frequency domain. The training aims to find a local translation filter [[omega].sub.local_t] minimizing the sum of squared errors between the actual correlation responses on features {[x.sub.i] |i = 1,2...,n} extracted from n samples and the desired responses {[y.sub.i] |i = 1, 2..., n}

$$\omega_{local\_t} = \arg\min_{\omega} \sum_{i=1}^{n} \left\| \omega \otimes x_i - y_i \right\|^2 \tag{4}$$

where the desired responses are calculated according to a 2D Gaussian function whose peak location coincides with the target center. Equation (4) has a closed-form solution

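$$\hat{\omega}_{local\_t}^{*} = \left(\sum_{i=1}^{n} \hat{y}_i \odot \hat{x}_i^{*}\right) \oslash \left(\sum_{i=1}^{n} \hat{x}_i \odot \hat{x}_i^{*}\right) = N_{local\_t} \oslash D_{local\_t} \tag{5}$$

where $\oslash$ indicates the Hadamard (element-wise) division, and [N.sub.local_t] and [D.sub.local_t] denote the numerator term and denominator term respectively. A small constant [lambda] can also be added to the denominator term to avoid division by zero.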
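The step from the least-squares objective to the closed-form solution can be sketched as follows. This is a hedged outline that assumes the convention of Equation (3), namely that the FFT of a cross-correlation $\omega \otimes x$ is $\hat{\omega}^{*} \odot \hat{x}$, and treats $\hat{\omega}$ and $\hat{\omega}^{*}$ as independent variables when differentiating; $E$ is a symbol introduced here for the objective.

```latex
\begin{align*}
E &\propto \sum_{i=1}^{n} \left\| \hat{\omega}^{*} \odot \hat{x}_i - \hat{y}_i \right\|^2
  && \text{(Parseval's theorem; separable per frequency)} \\
0 &= \frac{\partial E}{\partial \hat{\omega}}
   = \sum_{i=1}^{n} \left( \hat{\omega}^{*} \odot \hat{x}_i - \hat{y}_i \right) \odot \hat{x}_i^{*} \\
\hat{\omega}^{*} &= \Bigl( \sum_{i=1}^{n} \hat{y}_i \odot \hat{x}_i^{*} \Bigr)
  \oslash \Bigl( \sum_{i=1}^{n} \hat{x}_i \odot \hat{x}_i^{*} \Bigr)
\end{align*}
```

Because every frequency is solved independently, training costs only a few FFTs and element-wise operations.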

Both the initialization and the update of the global translation filter [[omega].sub.global_t] used in the detection are performed according to the local translation filters [[omega].sub.local_t] trained on each frame. The [[omega].sub.local_t] trained on the first frame is used for the initialization, while the [[omega].sub.local_t] trained on the following frames are used for the update according to

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (6)

where [[eta].sub.t] is the learning rate, which determines how fast the filter adapts to appearance changes. The higher [[eta].sub.t] is, the faster the adaptation.
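The effect of the learning rate can be illustrated with a small sketch. Equation (6) is not reproduced in this text, so the running-average form used below (new global filter = (1 - rate) x old global filter + rate x local filter) is an assumption, as are the function names `update_filter` and `frames_to_adapt`.

```python
def update_filter(global_filter, local_filter, learning_rate):
    """Blend the per-frame (local) filter into the running (global) filter.

    Assumed form: g <- (1 - rate) * g + rate * local, applied element-wise.
    """
    return [(1.0 - learning_rate) * g + learning_rate * loc
            for g, loc in zip(global_filter, local_filter)]


def frames_to_adapt(learning_rate, fraction=0.5):
    """Count updates until the old filter's weight (1 - rate)^k is at most `fraction`."""
    weight, frames = 1.0, 0
    while weight > fraction:
        weight *= 1.0 - learning_rate
        frames += 1
    return frames


blended = update_filter([1.0, 0.0], [0.0, 1.0], 0.25)
slow = frames_to_adapt(0.01)    # rate quoted for translation estimation
fast = frames_to_adapt(0.015)   # rate quoted for scale / aspect ratio estimation
print(blended, slow, fast)
```

Under this assumed form, a larger rate forgets the old appearance in fewer frames, which matches the statement that a higher learning rate gives faster adaptation.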

The sampling of MOSSE [17] differs between training and detection. In the training stage, image patches are randomly sampled around the target center, as shown by the green bounding boxes in Fig. 4a. In the detection stage, however, dense sampling is performed explicitly by sliding the filter, as shown in Fig. 1. Moreover, MOSSE [17] only uses one-channel grayscale intensity as the feature.

2.2 CSK [6] and KCF [5]

In CSK [6], a local linear classifier [y = [[[omega].sup.T.sub.local]x + b]] is trained in every frame using ridge regression. Assuming that n samples are obtained in the frame, CSK [6] aims to solve

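$$\omega_{local}^{*} = \arg\min_{\omega_{local}} \sum_{i=1}^{n} \left(\omega_{local}^{T} x_i - y_i\right)^2 + \lambda \left\|\omega_{local}\right\|^2 \tag{7}$$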

where {[x.sub.i] | i = 1, 2,...n} are the features extracted from the samples, {[y.sub.i] |i = 1, 2..., n} are the corresponding desired responses, and [lambda] is the regularization parameter. Equation (7) has a closed-form solution

$$\omega_{local}^{*} = \left(X^{T}X + \lambda I\right)^{-1} X^{T} Y \tag{8}$$

where X =[[[x.sub.1],[x.sub.2]...[x.sub.n]].sup.T], Y = [[[y.sub.1],[y.sub.2]...[y.sub.n]].sup.T], I denotes the identity matrix, and T indicates the transpose.
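For readers who want the intermediate step, the closed-form ridge regression solution follows from setting the gradient of the objective to zero. The sketch below assumes the bias term b is omitted (or absorbed into the features) and writes $J$ for the objective, a symbol introduced here for illustration.

```latex
\begin{align*}
J(\omega) &= \left\| X\omega - Y \right\|^2 + \lambda \left\| \omega \right\|^2 \\
\nabla_{\omega} J &= 2X^{T}\left(X\omega - Y\right) + 2\lambda\omega = 0 \\
\left(X^{T}X + \lambda I\right)\omega &= X^{T}Y \\
\omega^{*} &= \left(X^{T}X + \lambda I\right)^{-1} X^{T} Y
\end{align*}
```

The matrix $X^{T}X + \lambda I$ is positive definite for any $\lambda > 0$, so the inverse always exists.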

The sampling in the training of CSK [6] is quite different from MOSSE [17]. As shown in Fig. 4b, CSK [6] firstly pads the target (the red bounding box) to get the enlarged image patch which contains more information from the background (the green bounding box). The enlarged image patch is called the base sample because dense sampling is achieved by shifting it cyclically with the permutation matrix P.

Assume that the one-channel d-dimensional feature (also grayscale intensity in CSK [6]) extracted from the base sample is x = [[[x.sup.1], [x.sup.2]...[x.sup.d]].sup.T], and the corresponding response is y. The dense sampling then renders the matrix X mentioned above circulant:

$$X = C(x) = \left[P^{0}x,\; P^{1}x,\; \ldots,\; P^{d-1}x\right]^{T} \tag{9}$$

where [P.sup.l] shifts x by l elements and each row of X corresponds to the feature extracted from a sample. Thus Equation (8) can be converted to Equation (10) by taking advantage of the properties of circulant matrices

$$\hat{\omega}_{local}^{*} = \left(\hat{y} \odot \hat{x}^{*}\right) \oslash \left(\hat{x} \odot \hat{x}^{*} + \lambda\right) \tag{10}$$

By comparison, we can see that Equation (5) and Equation (10) are equal when only one base sample is used by MOSSE [17] for training. Exploiting the circulant structure makes explicit shifting and sampling unnecessary, and all the information needed for training [[omega].sub.local] is x. The rapid computation performed in the Fourier domain further speeds up the training process.
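The claim that the circulant structure removes the need for explicit sampling can be checked numerically. The sketch below is an illustration under stated assumptions: rows of X are right cyclic shifts of x, a naive DFT stands in for the FFT, and all function names and toy values are hypothetical. Whether a conjugate appears in the Fourier-domain formula depends on the shift and transform conventions; the form used here matches this particular convention and is verified against the normal equations $(X^{T}X + \lambda I)\omega = X^{T}Y$.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]


def circulant(x):
    """Data matrix whose row number `row` is x cyclically shifted right by `row`."""
    d = len(x)
    return [[x[(j - row) % d] for j in range(d)] for row in range(d)]


def ridge_fourier(x, y, lam):
    """Ridge regression on all cyclic shifts of x, solved per frequency."""
    x_hat, y_hat = dft(x), dft(y)
    w_hat = [xh * yh / (abs(xh) ** 2 + lam) for xh, yh in zip(x_hat, y_hat)]
    return [value.real for value in idft(w_hat)]


def normal_equation_residual(x, y, lam, w):
    """Largest violation of (X^T X + lam I) w = X^T y using the explicit matrix."""
    X = circulant(x)
    d = len(x)
    Xw = [sum(X[row][k] * w[k] for k in range(d)) for row in range(d)]
    worst = 0.0
    for j in range(d):
        lhs = lam * w[j] + sum(X[row][j] * Xw[row] for row in range(d))
        rhs = sum(X[row][j] * y[row] for row in range(d))
        worst = max(worst, abs(lhs - rhs))
    return worst


x = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0]   # toy base-sample feature
y = [1.0, 0.6, 0.1, 0.0, 0.1, 0.6]   # toy desired responses, one per shift
w = ridge_fourier(x, y, lam=0.1)
residual = normal_equation_residual(x, y, 0.1, w)
print("max normal-equation residual:", residual)
```

The Fourier-domain solver never builds X; the explicit matrix appears only in the check.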

To enhance the discriminative ability, CSK [6] also introduced the kernel trick [18] in the training stage to represent the original solution [[omega].sub.local] in the primal space by [[alpha].sub.local] in the dual space

$$\omega_{local} = \sum_{i=1}^{n} \alpha_{local}^{i}\, \Phi(x_i) \tag{11}$$

where [[alpha].sup.i.sub.local] is the ith element of [[alpha].sub.local], and [PHI] maps the linear features to nonlinear ones.

The classifier trained with the nonlinear regression is

$$\alpha_{local} = \left(K + \lambda I\right)^{-1} Y \tag{12}$$

where K is the kernel matrix with the element [K.sub.ij] determined by [[PHI].sup.T] ([x.sub.i])[PHI]([x.sub.j]). Since the samples in CSK [6] are generated by shifting the base sample cyclically, K is determined by x only. It has also been proved that some kernels, for example the Gaussian kernel, preserve the circulant structure of the inputs, i.e. K is circulant. Taking advantage of the circulant matrix again, Equation (12) can be converted to

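$$\hat{\alpha}_{local} = \hat{y} \oslash \left(\hat{k}^{xx} + \lambda\right) \tag{13}$$

where [k.sup.xx] is the first row of the kernel matrix K. However, the explicit calculation of K is unnecessary to get [k.sup.xx], and rapid computation can be performed instead. For example, if the commonly used Gaussian kernel is adopted, the kernel correlation of two features x and x' can be calculated according to

$$k^{xx'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|x\|^{2} + \|x'\|^{2} - 2\,\mathcal{F}^{-1}\left(\hat{x}^{*} \odot \hat{x}'\right)\right)\right) \tag{14}$$

where $\sigma$ is the bandwidth of the Gaussian kernel.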

Similarly, the detection can also be performed rapidly by taking advantage of the circulant matrix induced by cyclically shifting the base sample. Assuming that the one-channel d-dimensional feature extracted from the base sample for detection is z, CSK [6] computes the responses of all the d samples according to

$$f(z) = \mathcal{F}^{-1}\left(\hat{k}^{x_{global}z} \odot \hat{\alpha}_{global}\right) \tag{15}$$

where f(z) is a vector, each element of which corresponds to a response. [x.sub.global] and [[alpha].sub.global] are calculated according to the [x.sub.local] and [[alpha].sub.local] obtained in each frame: the results from the first frame are used for initialization, and the others are used for the update according to

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (16)

where [mu] and [xi] are the learning rates, and the kernel correlation between [x.sub.global] and z used in Equation (15) can also be computed rapidly by using Equation (14).
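The kernelized training and detection steps can be sketched end to end on a 1D signal. This is a hedged illustration, not the authors' implementation: a naive DFT replaces the FFT, the Gaussian labels are symmetric about shift 0 (so the trained coefficients are real in the Fourier domain and conjugation conventions do not matter), and the names `gaussian_correlation`, `train`, and `detect` as well as the values of `sigma` and `lam` are assumptions.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]


def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between z and every cyclic shift of x, without building K."""
    x_hat, z_hat = dft(x), dft(z)
    cross = idft([a.conjugate() * b for a, b in zip(x_hat, z_hat)])
    norms = sum(v * v for v in x) + sum(v * v for v in z)
    return [math.exp(-max(norms - 2.0 * c.real, 0.0) / sigma ** 2) for c in cross]


def train(x, y, sigma, lam):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat + lam)."""
    k_hat = dft(gaussian_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]


def detect(alpha_hat, x, z, sigma):
    """Responses for all cyclic shifts of z: inverse DFT of k_hat * alpha_hat."""
    k_hat = dft(gaussian_correlation(x, z, sigma))
    return [value.real for value in idft([a * k for a, k in zip(alpha_hat, k_hat)])]


d = 16
x = [0.0, 0.0, 1.0, 3.0, 5.0, 2.0, 1.0] + [0.0] * 9      # toy base-sample feature
y = [math.exp(-0.5 * min(l, d - l) ** 2) for l in range(d)]  # labels peaked at shift 0
alpha_hat = train(x, y, sigma=5.0, lam=0.01)

true_shift = 5
z = [x[(j - true_shift) % d] for j in range(d)]           # target moved by 5 elements
responses = detect(alpha_hat, x, z, sigma=5.0)
peak = max(range(d), key=responses.__getitem__)
print("estimated shift:", peak)
```

The location of the peak response gives the displacement of the target relative to the training sample, which is the translation estimate used by the tracker.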

However, only one-channel features, such as grayscale intensity, can be used in CSK [6]. KCF [5] introduces the use of multi-channel features by changing Equation (14) to

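$$k^{xx'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|x\|^{2} + \|x'\|^{2} - 2\,\mathcal{F}^{-1}\left(\sum_{l=1}^{c} \hat{x}_{l}^{*} \odot \hat{x}'_{l}\right)\right)\right) \tag{17}$$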

where c is the number of channels of the features used.

3. Our proposed algorithm

As mentioned in the introduction, SAMF [24] and DSST [25] partially solve the problem of target size variations by estimating the scale changes. On the basis of them, we propose two Size Aware Correlation Filter Trackers (SACFTs), which not only determine the location and scale variations of the target, but also take changes of aspect ratios into consideration to provide a better estimation of the size.

The framework of the proposed SACFTs is shown in Fig. 5. First, global filters are initialized using the given state variables, including the translation t, scale s, and aspect ratio a, in the first frame. Then, fast detection is performed in a new frame using the global filters to detect the target and update these state variables, with the t, s, and a from the last frame as the initialization. After that, fast training is conducted in the current frame according to the updated state variables to get local filters, which are used to update the global filters for detection. Both detection and training take the features obtained by sampling as the input. Tracking is thus realized by alternating between detection and training in every frame.

3.1 SACFT1

On the basis of KCF [5], SAMF [24] deals with scale changes using a scale pool for searching. The sampling for training is the same as in KCF [5], as shown in Fig. 6b, and only a translation filter is trained in each frame by using Equation (13) and Equation (14). However, the sampling for detection is quite different. As shown in Fig. 6a, a scale pool S={[s.sub.i] |i = 1,2,...[N.sub.S]} is defined in SAMF [24], where each element corresponds to a scale in the searching space along the vertical axis. To further tackle the aspect ratio variations, we introduce the aspect ratio pool A={[a.sub.j]|j = 1,2,...[N.sub.A]} in SACFT1, where each element corresponds to an aspect ratio in the searching space along the horizontal axis.

Assume that the height and width of the image patch corresponding to the target (the red bounding box) are [M.sub.0] and [N.sub.0] respectively. The patch is padded to the size [M.sub.1] x [N.sub.1] to get the enlarged base sample (the green bounding box). After that, the scale sampling and aspect ratio sampling are jointly performed to obtain [N.sub.S] x [N.sub.A] base samples with the sizes {[s.sub.i][a.sub.j][M.sub.1] x [s.sub.i][N.sub.1] | [s.sub.i] [member of] S, [a.sub.j] [member of] A}. All these base samples are then resized to the size [M.sub.1] x [N.sub.1] so that the features extracted from them have the same dimension.

As in SAMF [24], the feature used is the concatenation of the grayscale intensity, HOG [20], and Color Names [22]. To estimate the translation t, scale s, and aspect ratio a, implicit dense sampling is performed on the [N.sub.S] x [N.sub.A] base samples. The responses are calculated by using Equation (15) [N.sub.S] x [N.sub.A] times, once per base sample, to obtain one response vector for each pair of scale and aspect ratio. If the peak response is obtained in the kth element of the response vector corresponding to the scale [s.sub.i] and the aspect ratio [a.sub.j], then the scale is determined to be [s.sub.i], the aspect ratio is determined to be [a.sub.j], and the translation is determined to be the displacement corresponding to the kth element.
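The joint search over the scale pool and the aspect ratio pool can be sketched as a lookup of the global peak among all response vectors. The code below is a minimal illustration with made-up pools and responses; `candidate_sizes` and `select_state` are hypothetical names, and in a real tracker each response vector would come from evaluating Equation (15) on the corresponding resized base sample.

```python
def candidate_sizes(height, width, scale_pool, aspect_pool):
    """Sample sizes (s * a * height, s * width) for every (scale, aspect ratio) pair."""
    return {(s, a): (s * a * height, s * width)
            for s in scale_pool for a in aspect_pool}


def select_state(responses):
    """Pick the (scale, aspect ratio, index) of the largest response.

    `responses` maps each (scale, aspect ratio) pair to its response vector.
    """
    best_key, best_index, best_value = None, None, float("-inf")
    for key, vector in responses.items():
        index = max(range(len(vector)), key=vector.__getitem__)
        if vector[index] > best_value:
            best_key, best_index, best_value = key, index, vector[index]
    return best_key[0], best_key[1], best_index


# Illustrative pools (assumption; the pools used in the experiments are given in Section 4.1).
scale_pool = [0.9, 1.0, 1.1]
aspect_pool = [0.95, 1.0, 1.05]
sizes = candidate_sizes(100, 50, scale_pool, aspect_pool)

# Made-up response vectors: one per candidate, with the global peak planted at (1.1, 0.95).
responses = {key: [0.1, 0.2, 0.1] for key in sizes}
responses[(1.1, 0.95)] = [0.1, 0.3, 0.9]
state = select_state(responses)
print(len(sizes), state)
```

The number of candidates grows as the product of the two pool sizes, which is why this exhaustive variant is the slower of the two trackers.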

3.2 SACFT2

Unlike SAMF [24], in which only a translation filter is used to determine the translations and scale variations, DSST [25] solves the problem by adopting a divide-and-conquer strategy. This is based on the assumption that the scale variation between two successive frames is smaller than the translation, which usually holds true in visual tracking. As shown within the purple box of Fig. 7d, the estimations of translation and scale are decoupled and solved sequentially in DSST [25]. We assume that the aspect ratio variation between two successive frames is smaller than the scale change, and use an extra aspect ratio filter [[omega].sub.global_a] to estimate it.

The feature used in SACFT2 is the same as in SACFT1. The estimation of the translation is performed using the translation filter [[omega].sub.global_t], which is similar to MOSSE [17] except that only the base sample is used in training and detection. Besides, when calculating the responses of the translation filter, Equation (15) is used instead of Equation (3), which is adopted in MOSSE [17] and DSST [25].

A scale filter [[omega].sub.global_s] is used for scale estimation. After determining the translation in the current frame, the updated t is used in the scale sampling. As in SAMF [24], the scale sampling is conducted by cropping [N.sub.S] image patches according to the defined scale pool S={[s.sub.i]|i = 1,2,...[N.sub.S]}. But unlike SAMF [24], no padding is performed, i.e. the sizes of these samples are {[s.sub.i][M.sub.0] x [s.sub.i][N.sub.0] | [s.sub.i] [member of] S}, and all the samples are resized to a specific size [M.sub.2] x [N.sub.2], as shown in Fig. 7b.

As shown in the middle column of Fig. 3, to get the local scale filter [[omega].sub.local_s] in each frame, the features {[x.sub.i] |i = 1,2,...[N.sub.S]} extracted from these samples are preprocessed with a cosine window and then used for training according to

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (18)

where [y.sub.i] is the desired response calculated according to a 1D Gaussian function whose peak value is achieved at the scale s=1.

As shown in the middle column of Fig. 2, the scale is determined to be the one corresponding to the peak response in {[y.sub.i] | i = 1,2,...[N.sub.S]} calculated by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (19)

where [z.sub.i] is the feature extracted from the sample with the scale [s.sub.i], and [[omega].sub.global_s] is initialized by the local scale filter [[omega].sub.local_s] trained in the first frame and updated using Equation (6).

An aspect ratio filter [[omega].sub.global_a] is used for aspect ratio estimation. After determining the scale in the current frame, the updated s is used in the aspect ratio sampling, which is conducted by cropping [N.sub.A] image patches according to the defined aspect ratio pool A={[a.sub.j] |j = 1,2,...[N.sub.A]} around the target center. The sizes of these samples are {s[a.sub.j][M.sub.0] x s[N.sub.0] | [a.sub.j] [member of] A}, and all the samples are resized to a specific size [M.sub.2] x [N.sub.2], as shown in Fig. 7c.

As shown in the right column of Fig. 3, similar to the training of the scale filter, to get the local aspect ratio filter [[omega].sub.local_a] in each frame, the features {[x.sub.j] |j = 1,2,...[N.sub.A]} extracted from these samples are preprocessed with a cosine window and then used for training according to

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (20)

where y is the desired response calculated according to a 1D Gaussian function whose peak value is obtained at the aspect ratio a=1.

As shown in the right column of Fig. 2, the aspect ratio is determined to be the one corresponding to the peak response in {[y.sub.j] | j = 1,2,...[N.sub.A]} calculated by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (21)

where [z.sub.j] is the feature extracted from the sample with the aspect ratio [a.sub.j], and [[omega].sub.global_a] is initialized by the local aspect ratio filter [[omega].sub.local_a] trained in the first frame and updated using Equation (6).
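The decoupled, sequential estimation in SACFT2 can be summarized by the control flow below. This is only a structural sketch: the three estimator callables are hypothetical stand-ins for the translation, scale, and aspect ratio filters, and the assumption that each of them returns a relative factor that is accumulated into the state is ours, not a detail confirmed by the text.

```python
def track_frame(state, estimate_translation, estimate_scale, estimate_aspect):
    """One detection pass: translation first, then scale, then aspect ratio.

    `state` holds the target center, the accumulated scale and aspect ratio,
    and the initial target size (M0, N0). Each estimator sees the state as
    updated by the previous step, mirroring the sequential scheme in the text.
    """
    state = dict(state, center=estimate_translation(state))
    state = dict(state, scale=state["scale"] * estimate_scale(state))
    state = dict(state, aspect=state["aspect"] * estimate_aspect(state))
    return state


def target_size(state):
    """Height s * a * M0 and width s * N0, following the sample sizes in the text."""
    m0, n0 = state["base_size"]
    return state["scale"] * state["aspect"] * m0, state["scale"] * n0


# Stub estimators (assumption) that record the order in which they are called.
calls = []

def fake_translation(state):
    calls.append("translation")
    return (12, 8)

def fake_scale(state):
    calls.append("scale")
    return 1.02

def fake_aspect(state):
    calls.append("aspect")
    return 1.005

initial = {"center": (10, 10), "scale": 1.0, "aspect": 1.0, "base_size": (40, 20)}
updated = track_frame(initial, fake_translation, fake_scale, fake_aspect)
height, width = target_size(updated)
print(calls, updated["center"], height, width)
```

Each filter is evaluated once per frame, in contrast to the exhaustive pool search of SACFT1.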

4. Experiments

4.1 Experimental setup

In this section, we thoroughly compare our proposed SACFTs with state-of-the-art methods using the Online Tracking Benchmark (OTB) [1] and the Visual Object Tracking challenge 2015 (VOT2015) dataset [28]. All the experiments were performed on a PC with a 2.3 GHz i5-4200U CPU and 12 GB RAM.

The SACFTs are implemented in Matlab. For SACFT1, the scale pool used is S = {[1.005.sup.n] | n [member of] {-3,-2,...,3}}, which is the same as SAMF [24], and the aspect ratio pool is set empirically to A = {[1.005.sup.n] | n [member of] {-3,-2,...,3}}. For SACFT2, the scale pool used is S = {[1.02.sup.n] | n [member of] {-17,-16,...,17}}, which is the same as DSST [25], and the aspect ratio pool used is also A = {[1.005.sup.n] | n [member of] {-3,-2,...,3}}. The learning rate [[eta].sub.t] for the translation estimation in SACFT1 and SACFT2 is 0.01, and the learning rates [mu] and [xi] for the scale and aspect ratio estimation in SACFT2 are both set to 0.015.
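For concreteness, the pools described above can be generated as follows. The function name `geometric_pool` is a hypothetical helper; the bases and exponent ranges are the ones quoted in the setup.

```python
def geometric_pool(base, half_width):
    """Return [base**n for n = -half_width, ..., half_width]."""
    return [base ** n for n in range(-half_width, half_width + 1)]


sacft1_scales = geometric_pool(1.005, 3)    # same as SAMF
sacft2_scales = geometric_pool(1.02, 17)    # same as DSST
aspect_ratios = geometric_pool(1.005, 3)    # shared by SACFT1 and SACFT2

print(len(sacft1_scales), len(sacft2_scales), len(aspect_ratios))
print(min(sacft2_scales), max(sacft2_scales))
```

The middle element of every pool is 1, i.e. "no change", and the pools are symmetric in the exponent, so shrinking and growing are searched equally.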

4.2 Experiments using the OTB [1]

Built in 2013, the OTB [1] contains 50 fully annotated sequences with nearly 30,000 frames. In each sequence, the single object to be tracked is specified in the first frame by a bounding box, and ground-truth bounding boxes are provided for the following frames for evaluation. To analyze the ability of a tracker to deal with different challenges, these sequences are also annotated with 11 attributes such as scale variation, out-of-plane rotation, deformation, etc. Besides, the results of 29 trackers, including CSK [6], Struck [8], TLD [12], etc., are also provided.

The evaluation tools used in the OTB [1] are the Precision plot and the Success plot. The Precision plot is used to evaluate the Euclidean location errors between the estimated target centers and the ground truths. Given a location error threshold, the corresponding precision indicates the percentage of frames whose location errors are within the threshold. The Success plot is used to evaluate the overlap between the estimated bounding boxes and the ground truths. Given an overlap threshold, the corresponding success rate indicates the percentage of frames whose overlaps are larger than the threshold. From the definitions we can see that the larger the precision and success rate, the better the performance.
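The two measures can be computed per sequence as in the sketch below. It assumes boxes are given as (x, y, width, height) with (x, y) the top-left corner and that overlap means intersection over union; the function names and the toy boxes are illustrative.

```python
import math


def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, width, height) boxes."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    inter_w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    inter_h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(inter_w, 0.0) * max(inter_h, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(predicted, truth, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(predicted, truth))
    return hits / len(truth)


def success_rate(predicted, truth, threshold=0.5):
    """Fraction of frames whose overlap is larger than `threshold`."""
    hits = sum(overlap(p, g) > threshold for p, g in zip(predicted, truth))
    return hits / len(truth)


# Toy sequence (assumption): a static 10 x 10 target and three tracker outputs.
truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (30, 0, 10, 10)]
p20 = precision(predicted, truth)
s50 = success_rate(predicted, truth)
print(p20, s50)
```

The second frame shows why the two measures differ: a 5-pixel center error passes the 20-pixel precision test, but the overlap of one third fails the 0.5 success test.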

To further evaluate the robustness of the trackers, temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE) are introduced in the OTB [1]. Unlike the conventional one-pass evaluation (OPE), which initializes the tracker with the ground truth in the first frame and then calculates the precision and success rate, the TRE and SRE perturb the initializations temporally and spatially respectively.

In our experiments, 34 trackers are quantitatively evaluated, including the 29 trackers mentioned above, KCF [5], SAMF [24], DSST [25], SACFT1, and SACFT2. For the TRE, each sequence is evaluated 20 times by performing tracking from different start frames to the end frame. And for the SRE, each sequence is evaluated 12 times by perturbing the initialization in the first frame with center shifts, corner shifts and scale variations.

The Precision and Success plots of OPE, SRE, and TRE of the 34 trackers over the 50 sequences are obtained, and only the top 10 are shown in Fig. 8 for clarity. The rankings of the trackers are shown in the legends. For the Precision plots, the ranking is according to the precision at the location error threshold of 20 pixels. For the Success plots, the ranking is according to the area under curve (AUC).
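The AUC used for ranking the Success plots can be approximated from per-frame overlaps as sketched below. The evenly spaced threshold grid and the use of the mean success rate as the area are assumptions of this illustration, and the function names are hypothetical.

```python
def success_curve(overlaps, thresholds):
    """Success rate (fraction of frames with overlap > t) for each threshold t."""
    count = len(overlaps)
    return [sum(o > t for o in overlaps) / count for t in thresholds]


def area_under_curve(values):
    """Approximate the area over [0, 1] by the mean of evenly spaced samples."""
    return sum(values) / len(values)


thresholds = [i / 20 for i in range(21)]      # 0, 0.05, ..., 1 (assumed grid)
overlaps = [1.0, 1.0 / 3.0, 0.0]              # toy per-frame overlaps
curve = success_curve(overlaps, thresholds)
auc = area_under_curve(curve)
print(round(auc, 4))
```

Unlike the success rate at a single threshold, the AUC rewards trackers whose boxes overlap well across the whole range of thresholds.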

As shown in Fig. 8, both SAMF [24] and DSST [25] outperform the state-of-the-art trackers of 2013, which is consistent with their top-2 ranking in the VOT2014 challenge [26]. Our proposed SACFT1 slightly improves on SAMF [24] in both precision and success scores while maintaining the robustness to temporal and spatial perturbations. SACFT2 surpasses DSST [25] in precision and success scores by 5.6% and 4.0% respectively, and at the same time gains approximately 5% in temporal and spatial robustness.

We also conduct an analysis according to the 11 annotated attributes. By comparison, we find that the proposed SACFTs improve the performance of SAMF [24] and DSST [25] in nearly all the attributes. The Success plots of OPE of the top 10 trackers over sequences annotated with scale variation, out-of-plane rotation, and deformation, in which obvious size changes tend to occur, are shown in Fig. 9. SACFT1 and SACFT2 solve the same problem in different manners. Although both of them achieve desirable results, some results may seem similar in Fig. 8. This is because the results are averaged across all sequences, in some of which the target does not change in size. In Fig. 9, by contrast, the differences between SACFT1 and SACFT2 are more obvious. Because incorporating aspect ratio estimation allows our proposed methods to better estimate size variations, we name them Size Aware Correlation Filter Trackers (SACFTs).

Some results of the top 8 trackers on sequences with obvious size changes are shown in Fig. 10. In (a) trellis and (c) freeman1, the out-of-plane rotation of the faces changes the aspect ratios, while in (b) CarScale1, the car's driving from far to near leads to the variations of the scales and aspect ratios. Our proposed SACFTs, especially the SACFT2 shown with red bounding box, can provide better tracking results.

The average frame rates of different methods are shown in Table 1. We can see that SAMF [24] and DSST [25] improve the performance of KCF [5] by incorporating scale estimation at the cost of speed. SACFT1 slightly enhances the performance of SAMF [24] by taking the estimation of the aspect ratio into consideration, but further lowers the frame rate because [N.sub.A] times more base samples are collected. SACFT2 surpasses DSST [25] in performance by a relatively large margin. Moreover, SACFT2 is faster despite the extra computations introduced by the estimation of the aspect ratio, because the computation of translation filter responses in SACFT2 using Equation (15) is much more efficient than Equation (3) used in DSST [25].

In each frame, SACFT1 detects with the translation filter [N.sub.S] x [N.sub.A] times and trains the translation filter once, while the detection and training of the translation filter, scale filter, and aspect ratio filter in SACFT2 are each performed once. Thus, disregarding the dimension and channels of the input features of the different filters, SACFT2 is approximately [N.sub.S] x [N.sub.A] /3 times faster than SACFT1. However, the inputs of the scale filter and aspect ratio filter are formed by concatenating features extracted from image patches with different scales or aspect ratios, which increases the number of channels of the new feature and complicates the computation a little. Thus the speedup may be smaller than [N.sub.S] x [N.sub.A] /3 in practice. This is validated by the results in Table 1, where [N.sub.S] and [N.sub.A] both equal 7 in our experimental setup.
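With the pool sizes stated for SACFT1 in the experimental setup, the idealized upper bound on the speedup works out as follows. This is simple arithmetic on the figures given in the text, counting only the number of detection passes per frame.

```latex
\[
\frac{N_S \times N_A}{3} = \frac{7 \times 7}{3} = \frac{49}{3} \approx 16.3
\]
```

That is, SACFT1 evaluates 49 translation responses per frame, whereas SACFT2 evaluates three filters once each.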

4.3 Experiments using the VOT2015 dataset [28]

Started in 2013, the annual Visual Object Tracking (VOT) challenge aims to evaluate the performance of single-target, short-term, model-free visual trackers. VOT2015 [28] contains 60 fully annotated sequences and reports the results of 62 trackers; readers can refer to [28] for detailed information on these trackers. The evaluation measures used in VOT2015 [28] are accuracy and robustness. Accuracy is calculated by averaging the overlap ratios between the bounding boxes estimated by a tracker and the ground truth. Robustness is obtained by computing the probability that a tracker has not failed after 100 frames. For both measures, a higher score indicates better performance.
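The two measures can be sketched as follows. This is an illustrative sketch rather than the VOT toolkit: the function names are hypothetical, boxes are assumed to be axis-aligned `(x, y, width, height)` tuples, and the robustness score assumes an exponential failure model, exp(-S * F / N), with F failures over N frames and S = 100.

```python
import math


def overlap_ratio(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy(estimates, ground_truths):
    """Average overlap ratio over all frames of a sequence."""
    overlaps = [overlap_ratio(e, g) for e, g in zip(estimates, ground_truths)]
    return sum(overlaps) / len(overlaps)


def reliability(n_failures, n_frames, horizon=100):
    """Probability of not failing within `horizon` frames.

    Assumption: exponential failure model with rate n_failures / n_frames.
    """
    return math.exp(-horizon * n_failures / n_frames)


# Toy example: the second estimate is shifted by half the box width.
estimates = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truths = [(0, 0, 10, 10), (5, 0, 10, 10)]

print(f"accuracy:    {accuracy(estimates, ground_truths):.3f}")
print(f"reliability: {reliability(n_failures=2, n_frames=1000):.3f}")
```

Because the overlap ratio depends on both the width and the height of the estimated box, a tracker that keeps a fixed aspect ratio loses accuracy whenever the target's aspect ratio changes, even if its center and scale are correct.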

In the experiments, we compare only SACFT2 with the 62 trackers, because SACFT1 is too slow to be of practical use. The results are shown in Fig. 11 using the accuracy-robustness (AR) score plot and the AR rank plot. In the AR score plot, the vertical axis denotes accuracy and the horizontal axis denotes robustness. In the AR rank plot, the coordinates of a tracker correspond to its rankings with respect to accuracy and robustness. In both plots, the closer a tracker lies to the top-right corner, the better it performs. As shown in Fig. 11, SACFT2 ranks high in accuracy and outperforms the VOT2014 winners SAMF [24] and DSST [25].

The AR ranks of some methods are shown in Table 2. A smaller accuracy rank means that a tracker performs better in terms of accuracy, and a smaller robustness rank means that it performs better in terms of robustness. MDNet [30] is the winner of VOT2015 [28] but is one of the slowest trackers because of the CNNs it uses. ASMS [31] is considered to be the best tracker with real-time performance (faster than 20 fps). The proposed SACFT2 achieves a high accuracy ranking and a decent robustness ranking, and it runs at 43 fps on average.

5. Conclusion

In this paper, we propose two novel size aware correlation filter trackers (SACFTs). The SACFTs not only determine the translation and scale variations but also take aspect ratio changes into consideration, so that the size of the target is estimated better and the overall tracking performance improves. Experiments performed on two large-scale datasets validate their effectiveness and show results competitive with state-of-the-art methods. In particular, SACFT2 achieves high tracking accuracy at a real-time frame rate.

References

[1] Y. Wu, J. Lim, and M. H. Yang, "Online object tracking: a benchmark," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2411-2418, June 23-28, 2013.

[2] D. A. Ross, J. Lim, R. S. Lin, and M. H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, Vol. 77, No. 1, pp. 125-141, May, 2008.

[3] X. Mei, and H. Ling, "Robust visual tracking using ℓ1 minimization," in Proc. of the IEEE International Conference on Computer Vision, pp. 1436-1443, September 29-October 2, 2009.

[4] K. Zhang, L. Zhang, and M. H. Yang, "Real-time compressive tracking," in Proc. of the European Conference on Computer Vision, pp. 864-877, October 7-13, 2012.

[5] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, No. 3, pp. 583-596, August, 2014.

[6] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in Proc. of the European Conference on Computer Vision, pp. 702-715, October 7-13, 2012.

[7] S. Avidan, "Support vector tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 8, pp. 1064-1072, August, 2004.

[8] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in Proc. of the IEEE International Conference on Computer Vision, pp. 263-270, November 6-13, 2011.

[9] H. Grabner, M. Grabner, and H. Bischof, "Real-time tracking via on-line boosting," in Proc. of the British Machine Vision Conference, pp. 47-56, September 4-7, 2006.

[10] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," in Proc. of the European Conference on Computer Vision, pp. 234-247, October 12-18, 2008.

[11] B. Babenko, M. H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 983-990, June 20-25, 2009.

[12] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No. 7, pp. 1409-1422, August, 2011.

[13] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, "On-line random forests," in Proc. of the International Conference on Computer Vision Workshops, pp. 1393-1400, September 27-October 4, 2009.

[14] N. Wang, J. Wang, and D. Y. Yeung, "Online robust non-negative dictionary learning for visual tracking," in Proc. of the IEEE International Conference on Computer Vision, pp. 657-664, December 1-8, 2013.

[15] N. Wang, and D. Y. Yeung, "Learning a deep compact image representation for visual tracking," in Proc. of the Neural Information Processing Systems 2013, pp. 809-817, 2013.

[16] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, "Visual tracking: An experimental survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, No. 7, pp. 1442-1468, July, 2014.

[17] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544-2550, June 13-18, 2010.

[18] B. Scholkopf, and A. J. Smola, Learning with kernels: Support vector machines, regularization, optimization, and beyond, MIT Press, 2002.

[19] Z. Chen, Z. Hong, and D. Tao, "An experimental survey on correlation filter-based tracking," Japanese Circulation Journal, Vol. 53, No. 6025, pp. 68-83, 2015.

[20] N. Dalal, and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 886-893, June 20-25, 2005.

[21] M. Danelljan, F. S. Khan, M. Felsberg, and J. v. d. Weijer, "Adaptive color attributes for real-time visual tracking," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1090-1097, June 23-28, 2014.

[22] J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," IEEE Transactions on Image Processing, Vol. 18, No. 7, pp. 1512-1523, July, 2009.

[23] M. Danelljan, G. Hager, F. Khan, and M. Felsberg, "Convolutional features for correlation filter based visual tracking," in Proc. of the IEEE International Conference on Computer Vision Workshops, pp. 58-66, December 7-13, 2015.

[24] Y. Li, and J. Zhu, "A scale adaptive kernel correlation filter tracker with feature integration," in Proc. of the European Conference on Computer Vision Workshops, pp. 254-265, September 6-7, 2014.

[25] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in Proc. of the British Machine Vision Conference, September, 2014.

[26] M. Kristan, J. Matas, A. Leonardis, and T. Vojir, "A novel performance evaluation methodology for single-target trackers," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[27] M. Zhang, J. Xing, J. Gao, and X. Shi, "Joint scale-spatial correlation tracking with adaptive rotation estimation," in Proc. of the IEEE International Conference on Computer Vision Workshop, pp. 595-603, December 7-13, 2015.

[28] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, "The visual object tracking vot2015 challenge results," in Proc. of the IEEE International Conference on Computer Vision Workshops, pp. 1-23, December 7-13, 2015.

[29] B. V. K. V. Kumar, J. A. Fernandez, A. Rodriguez, and V. N. Boddeti, "Recent advances in correlation filter theory and application," in Proc. of the SPIE Defense + Security, pp. 909404-909413, May 5, 2014.

[30] H. Nam, and B. Han, "Learning multi-domain convolutional neural networks for visual tracking," CoRR, 2015.

[31] T. Vojir, J. Noskova, and J. Matas, "Robust scale-adaptive mean-shift for tracking," Pattern Recognition Letters, Vol. 49, pp. 250-258, November, 2014.

Xiaozhou Zhu (1), Xin Song (1), Xiaoqian Chen (1), Yuzhu Bai (1) and Huimin Lu (2)

(1) College of Aerospace Science and Engineering, National University of Defense Technology Changsha, 410073, P. R. China

[e-mail: work_ranger@163.com, song_xin@139.com, baiyuzhu@hotmail.com, chenxiaoqian@nudt.edu.cn]

(2) College of Mechatronics and Automation, National University of Defense Technology Changsha, 410073, P. R. China

[e-mail: lhmnew@nudt.edu.cn]

(*) Corresponding author: Yuzhu Bai

Received August 7, 2016; revised November, 2016; revised December 1, 2016; accepted December 28, 2016; published February 28, 2017

Xiaozhou Zhu received the M.S. degree in control science and engineering from National University of Defense Technology, China in 2012. He is currently a Ph.D. student in the College of Aerospace Science and Engineering at National University of Defense Technology. His research interests include visual tracking and robot vision.

Xin Song is currently an associate researcher in the College of Aerospace Science and Engineering at National University of Defense Technology, China. His research interests include robot vision and star trackers.

Xiaoqian Chen is currently a professor in the College of Aerospace Science and Engineering at National University of Defense Technology, China. His recent research interests include aeronautical engineering, optimization, robot vision, etc.

Yuzhu Bai is currently an associate researcher in the College of Aerospace Science and Engineering at National University of Defense Technology, China. His research interests include orbit dynamics and robot vision.

Huimin Lu received the Ph.D. degree in control science and engineering from the National University of Defense Technology, China in 2010. He is currently an associate professor in the College of Mechatronics and Automation at National University of Defense Technology. His recent research interests include robot vision, omnidirectional vision, and robot soccer.

Table 1. The average frame rates of different methods

| Method | KCF [5] | SAMF [24] | SACFT1 | DSST [25] | SACFT2 | Struck [8] | TLD [12] |
|---|---|---|---|---|---|---|---|
| Frame rate (fps) | 167 | 8 | 2 | 27 | 31 | 28 | 20 |

Table 2. The AR ranks of different methods

| Method | MDNet [30] | ASMS [31] | SACFT2 | SAMF [24] | DSST [25] |
|---|---|---|---|---|---|
| Accuracy rank | 1.00 | 6.83 | 1.17 | 3.00 | 3.83 |
| Robustness rank | 1.33 | 9.83 | 13.83 | 14.67 | 22.33 |


Author: Zhu, Xiaozhou; Song, Xin; Chen, Xiaoqian; Bai, Yuzhu; Lu, Huimin

Publication: KSII Transactions on Internet and Information Systems

Article Type: Report

Date: Feb 1, 2017
