
A New Fuzzy Logic-Based Similarity Measure Applied to Large Gap Imputation for Uncorrelated Multivariate Time Series.

1. Introduction

Huge time series are now commonplace thanks to effective low-cost sensors, the wide deployment of remote sensing systems, internet-based measurement networks, etc. However, collected data are often incomplete for various reasons such as sensor errors, transmission problems, incorrect measurements, bad weather conditions (outdoor sensors), or manual maintenance. This is particularly the case for marine samples [1], which we consider in this paper. For example, the MAREL-Carnot database characterizes sea water in the eastern English Channel, in France [2]. The data contain nineteen time series measured by sensors every 20 minutes, such as nitrate, fluorescence, phosphate, and pH. Analyzing these data, remarkable in size and shape, allows marine biologists to reveal events such as algal blooms, understand phytoplankton processes [3] in detail, or detect sea pollution. But the data have many missing values: 62.2% for phosphate, 59.9% for nitrate, 27.22% for pH, etc., and the size of the missing segments varies from 20 minutes to several months.

Most models proposed for multivariate time series analysis have difficulty processing incomplete datasets, despite their powerful techniques; they usually require complete data. The question, then, is how missing values should be dealt with. Ignoring or deleting them is a simple workaround, but serious problems regularly arise when applying it. This is especially true for time series, where each value depends on the previous ones. Furthermore, an analysis based on the systematic differences between observed and unobserved data leads to biased and unreliable results [4]. Thus, it is important to propose a new technique to estimate the missing values. Imputation is a conventional way to handle incompleteness problems [5].

Imputation methods for multivariate time series commonly take advantage of the correlations between variables to predict missing data [6-11]. These relations make it possible to use the values of available features to estimate the missing values of other features. However, in multivariate datasets with low or no correlations (for instance the MAREL-Carnot dataset), the observed values of complete variables cannot be used to fill in attributes containing missing values. To handle missing data in this case, we must use the observed values of the very variable that contains the missing data to compute the incomplete values. The proposed method therefore has to manage the high level of uncertainty of this kind of signal.

In particular, imperfect time series can be modelled using fuzzy sets. The fuzzy approach makes it possible to handle incomplete data and vague or imprecise circumstances [12], which create a highly uncertain environment for decision-making. This property enables modelling and short-term forecasting of traffic flow in urban arterial networks using multivariate traffic data [13, 14]. Recent works on urban traffic flow prediction [15] and on lane-change prediction [16] have been proposed with success. Furthermore, the successful use of fuzzy-based similarity measures in pattern recognition [17], in retrieval systems [12], and in recommendation systems [18] leads us to study their ability to complete missing values in uncorrelated multivariate time series. Wang et al. [19] proposed using information granules and fuzzy clustering for long-term time series forecasting. But to our knowledge, no application is devoted to completing large gap(s) in uncorrelated multivariate time series using a fuzzy-weighted similarity measure.

Thus, this paper aims to propose a new approach, named FSMUMI, to fill large missing values in low/uncorrelated multivariate time series by developing a new similarity measure based on fuzzy logic. However, estimating the distribution of missing values and whole signals is very difficult, so our approach makes an assumption of effective patterns (or recurrent data) on each signal.

The rest of this paper is organized as follows. In Section 2, related works to imputation methods and fuzzy similarity measure are reviewed. Section 3 introduces our approach for completing large missing subsequences in low/uncorrelated multivariate time series. Next, Section 4 demonstrates our experimental protocol for the imputation task. Section 5 presents results and discussion. Conclusions are drawn and future work is presented in the last section.

2. Related Works

This section presents, first, related work about multivariate imputation methods, followed by a review on the fuzzy similarity measure and its applications.

2.1. Classical Multivariate Imputation Methods. Up to now, numerous studies have been devoted to completing missing data in multivariate time series, such as [10, 11, 20-28]. Imputation techniques can be categorized from different perspectives: model-based, machine learning-based, and clustering-based imputation techniques.

In view of the model-based imputation, two main methods were proposed. The first method was introduced by Schafer [20]. With the hypothesis that all variables follow a multivariate normal distribution, this approach is based on the multivariate normal (MVN) model to determine completion values. And, the second method, namely, MICE, was developed by van Buuren et al. [21] and Raghunathan et al. [22]. This method uses chained equations to fill in incomplete data: for each variable with missing values, MICE computes the imputation data by exploiting the relations between all other variables.

Regarding machine learning-based imputation, many studies focus on the completion of missing data in multivariate time series. Stekhoven and Buhlmann [6] implemented missForest based on the Random Forest (RF) method for multivariate imputation. Bonissone et al. [29] proposed a fuzzy version of RF that they named fuzzy random forest (FRF). At the moment FRF is devoted only to classification, and in our case FRF may only be of interest to separate correlated and uncorrelated variables in multivariate time series if necessary. In [25], Shah et al. investigated a variant of MICE which fills in each variable using the estimation generated from RF. The results showed that the combination of MICE and RF was more efficient than the original methods for multivariate imputation. K-Nearest Neighbors (k-NN)-based imputation is also a popular method for completing missing values, as in [11, 26, 27, 30-32]. This approach identifies the k most similar patterns in the space of available features to impute missing data.

Besides these principal techniques, clustering-based imputation approaches are considered powerful tools for completing missing values thanks to their ability to detect similar patterns. The objective of these techniques is to separate the data into several clusters while satisfying the following conditions: maximizing the intracluster similarity and minimizing the intercluster similarity. Li et al. [33] proposed the k-means clustering imputation technique, which estimates missing values using the final cluster information. Fuzzy c-means (FcM) clustering is a common extension of k-means. The squared norm is applied to measure the similarity between cluster centers and data points. Different applications based on FcM have been investigated for the imputation task, as in [7-9, 34-38]. Wang et al. [19] used FcM based on DTW to successfully predict time series in long-term forecasting.

In general, most of the imputation algorithms for multivariate time series take advantage of dependencies between attributes to predict missing values.

2.2. Methods Based on Fuzzy Similarity Measure. Indeed similarity-based approaches are a promising tool for time series analysis. However, many of these techniques rely on parameter tuning, and they may have shortcomings due to dependencies between variables. The objective of this study is to fill large missing values in uncorrelated multivariate time series. Thus, we have to deal with a high level of uncertainty. Mikalsen et al. [39] proposed using GMM (Gaussian mixture models) and cluster kernel to deal with uncertainty. Their method needs ensemble learning with numerous learning datasets that are not available in our case at the moment (marine data). So we have chosen to model this global uncertainty using fuzzy sets (FS) introduced by Zadeh [40]. These techniques consider that measurements have inherent vagueness rather than randomness.

Uncertainty is classically presented using three conceptually distinctive characteristics: fuzziness, randomness and incompleteness. This classification is interesting for many applications, like sensor management (image processing, speech processing, and time series processing) and practical decision-making. This paper focuses on (sensor) measurements treatment but is also relevant for other applications.

Incompleteness often affects time series prediction (time series obtained from marine data such as salinity and temperature). So it seems natural to use fuzzy similarity between subsequences of time series to deal with these three kinds of uncertainties (fuzziness, randomness, and incompleteness). Fuzzy sets are now well known, and we only need to recall the basic definition of an FS. Considering the universe $X$, a fuzzy set $A$ in $X$ is characterized by a fuzzy membership function $\mu_A$:

$$\mu_A : X \to [0, 1], \tag{1}$$

where $\mu_A(x)$ represents the membership of $x$ to $A$ and is associated with the uncertainty of $x$. In our case, we will consider similarity values between the subsequences as defined in the following. One solution to deal with the uncertainty brought by multivariate time series is to use the concept of fuzzy time series [41]. In this framework, the variable observations are considered as fuzzy numbers instead of real numbers. In our case the same modelling is applied to distance measures between subsequences, and we then compute the fuzzy similarity between these subsequences to find similar windows in order to estimate the missing values in observations.

Fuzzy similarity is a generalization of the classical concept of equivalence and defines the resemblance between two objects (here subsequences of time series). Similarity measures of fuzzy values have been compared in [42] and have been extended in [43]. In [42], Pappis and Karacapilidis presented three main kinds of similarity measures of fuzzy values, including

(i) measures based on the operations of union and intersection,

(ii) measures based on the maximum difference,

(iii) measures based on the difference and the sum of membership grades.

In [44, 45], the authors used these definitions to propose a distance metric for a space of linguistic summaries based on fuzzy protoforms. Almeida et al. extended this work to put forward linguistic summaries of categorical time series [46]. The introduced similarity measure takes into account not only the linguistic meaning of the summaries but also the numerical characteristic attached to them. In the same way, Gupta et al. [12] used this approach to create a hybrid similarity measure based on fuzzy logic, which is used to retrieve relevant documents. In other research, Al-shamri and Al-Ashwal presented fuzzy weightings of popular similarity measures for memory-based collaborative recommender systems [18].

Concerning the similarity between two subsequences of time series, we can use the DTW cost as a similarity measure. However, to deal with the high level of uncertainty of the processed signals, numerous other measures can be used to compute similarity, such as the cosine similarity, the Euclidean distance, or the Pearson correlation coefficient. Moreover, a fuzzy-weighted combination of scores generated from different similarity measures can achieve better retrieval results than a single similarity measure [12, 18].

Based on the same concepts, we propose using a fuzzy rules interpolation scheme between grades of membership of fuzzy values. This method makes it possible to build a new hybrid similarity measure for finding similar values between subsequences of time series.

3. Proposed Approach

The proposed imputation method is based on the retrieval and the similarity comparison of available subsequences. In order to compare the subsequences, we create a new similarity measure applying a multiple fuzzy rules interpolation. This section is divided into two parts. Firstly, we focus on the way to compute a new similarity measure between subsequences. Then, we provide details of the proposed approach (namely, Fuzzy Similarity Measure Based Uncorrelated Multivariate Imputation, FSMUMI) to impute the successive missing values of low/uncorrelated multivariate time series.

3.1. Fuzzy-Weighted Similarity Measure between Subsequences. To introduce a new similarity measure using multiple fuzzy rules interpolation to solve the missing problem, we have to define an information granule, as introduced by Pedrycz [47]. The principle of justifiable granularity of experimental data is based on two conditions: (i) the numeric evidence accumulated within the bounds of numeric data has to be as high as possible and, (ii) at the same time, the information granule should be as specific as possible [19].

To answer the first condition, we take into account 3 different distance measures between two subsequences $Q = \{q_i, i = 1, \dots, T\}$ and $R = \{r_i, i = 1, \dots, T\}$: the Cosine distance and the Euclidean distance (these two measures are widely used in the literature), and the Similarity distance (presented in our previous study [48]). These three measures are defined as follows:

(i) Cosine distance is computed by (2). This coefficient presents the cosine of the angle between Q and R

$$\mathrm{Cosine}(Q, R) = \frac{\sum_{i=1}^{T} q_i \times r_i}{\sqrt{\sum_{i=1}^{T} q_i^2 \times \sum_{i=1}^{T} r_i^2}} \tag{2}$$

(ii) Euclidean distance is calculated by

$$\mathrm{ED}^{*}(Q, R) = \sqrt{\sum_{i=1}^{T} (q_i - r_i)^2} \tag{3}$$

To satisfy the input condition of fuzzy logic rules, we normalize this distance to [0, 1] with the function $\mathrm{ED}(Q, R) = 1 / (1 + \mathrm{ED}^{*}(Q, R))$.

(iii) Similarity measure is defined by the function (4). This measure indicates the similarity percentage between Q and R

$$\mathrm{Sim}(Q, R) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{1 + \frac{|q_i - r_i|}{\max(Q) - \min(Q)}} \tag{4}$$
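The three measures can be sketched in a few lines of Python. This is only an illustration, not the authors' R implementation: the function names are hypothetical, the cosine is written with the usual square-root normalization, the Sim measure is read in the same form as the evaluation index (7), and the handling of degenerate windows (all zeros, constant query) is an assumption.

```python
import math
from typing import Sequence


def cosine_measure(q: Sequence[float], r: Sequence[float]) -> float:
    """Cosine of the angle between Q and R, following (2)."""
    dot = sum(qi * ri for qi, ri in zip(q, r))
    norm = math.sqrt(sum(qi * qi for qi in q) * sum(ri * ri for ri in r))
    # Assumption: return 0.0 for an all-zero window instead of dividing by zero.
    return dot / norm if norm > 0 else 0.0


def euclidean_measure(q: Sequence[float], r: Sequence[float]) -> float:
    """Normalized Euclidean measure ED = 1 / (1 + ED*), following (3)."""
    ed_star = math.sqrt(sum((qi - ri) ** 2 for qi, ri in zip(q, r)))
    return 1.0 / (1.0 + ed_star)


def sim_measure(q: Sequence[float], r: Sequence[float]) -> float:
    """Similarity measure (4), scaled by the range of the query Q."""
    span = max(q) - min(q)
    if span == 0:
        # Assumption: a constant query only matches an identical window.
        return 1.0 if list(q) == list(r) else 0.0
    return sum(1.0 / (1.0 + abs(qi - ri) / span) for qi, ri in zip(q, r)) / len(q)
```

All three functions return 1 for identical windows, and the last two always stay in (0, 1], which is what the fuzzy rules expect as input. The cosine can be negative for windows with values of opposite sign; how such values are mapped to the unit interval is not specified here.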

To answer the second condition, we use these 3 distance measures (or attributes) to generate 4 fuzzy similarities (see Figure 2), then applied to a fuzzy inference system (see Figure 1) using the cylindrical extension of the 3 attributes which provides 3 coefficients to calculate a new similarity measure. The universe of discourse of each distance measure is normalized to the value 1.

And, finally, the new similarity measure is determined by

$$\mathrm{FBSM} = w_1 \cdot \mathrm{Cosine}(Q, R) + w_2 \cdot \mathrm{ED}(Q, R) + w_3 \cdot \mathrm{Sim}(Q, R) \tag{5}$$

where $w_1$, $w_2$, and $w_3$ are the weights of the Cosine, ED, and Sim measures, respectively. Thus the uncertainty modelled using FS is kept during the similarity computation, which makes it possible to deal with a high level of uncertainty, as shown in the sequel. The coefficients $w_i$ are generated by the fuzzy interpolation system (Figure 1). We use the FuzzyR R package [49] to develop this system. All input and output variables are expressed by 4 linguistic terms: low, medium, medium-high, and high. A trapezoidal membership function is used to map the input and output spaces to a degree of membership (Figure 2). Multiple rules interpolation is applied to create the fuzzy rule base, so that 64 fuzzy rules are introduced (4 terms for each of the 3 inputs). Each fuzzy rule has the following form:

Rule R: IF (Cosine is lv1) and (ED is lv2) and (Sim is lv3) THEN (w1 is lw1) and (w2 is lw2) and (w3 is lw3), in which lvi, lwi $\in$ {low, medium, medium-high, high} and i = 1, 2, 3.
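To make the ingredients of this system more concrete, the following Python sketch shows a trapezoidal membership function, the fuzzification of a measure into the 4 linguistic terms, and the weighted sum (5). The breakpoints in `TERMS` are hypothetical (the actual ones are those of Figure 2), and the weights are passed in by hand: the 64-rule inference that produces them is not reproduced here.

```python
from typing import Dict, Tuple


def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints on the normalized universe [0, 1]; not taken from the paper.
TERMS: Dict[str, Tuple[float, float, float, float]] = {
    "low": (0.0, 0.0, 0.15, 0.35),
    "medium": (0.15, 0.35, 0.45, 0.65),
    "medium-high": (0.45, 0.65, 0.75, 0.90),
    "high": (0.75, 0.90, 1.0, 1.0),
}


def fuzzify(x: float) -> Dict[str, float]:
    """Degree of membership of a normalized measure in each linguistic term."""
    return {name: trapezoid(x, *params) for name, params in TERMS.items()}


def fbsm(cosine: float, ed: float, sim: float,
         w1: float, w2: float, w3: float) -> float:
    """Weighted combination (5); the weights would come from the fuzzy inference system."""
    return w1 * cosine + w2 * ed + w3 * sim
```

For example, `fuzzify(0.4)` gives a full membership in "medium" and none in the other terms with these breakpoints, and `fbsm(1.0, 0.5, 0.75, 0.5, 0.25, 0.25)` evaluates to 0.8125.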

3.2. FSMUMI Approach. Let us introduce some notation for multivariate time series and the concept of a large gap. A multivariate time series is represented as a matrix $X_{N \times M}$ with $M$ collected signals of size $N$. $x(t, i)$ is the value of the $i$th signal at time $t$. $x_t = \{x(t, i), i = 1, \dots, M\}$ is the feature vector at the $t$-th observation of all variables. $X$ is called an incomplete time series when it contains missing values. We define a gap of size $T$ at position $t$ as a portion of $X$ where at least one signal of $X$ contains consecutive missing values between $t$ and $t + T - 1$ ($\exists i \mid \forall t' \in [t, t + T - 1],\ x(t', i) = \mathrm{NA}$).

Here, we deal with large gaps in low/uncorrelated multivariate time series. For isolated missing values ($T = 1$) or small T-gaps, conventional techniques can be applied, such as the mean or the median of available values [50, 51]. A T-gap is large when the duration $T$ is longer than the known change process. For instance, in phytoplankton studies, $T$ is equal to one hour to characterize Langmuir cells and one day for algal bloom processes [52]. For small time series ($N < 10{,}000$) without prior knowledge of an application and its change process, we consider a gap large when $T \ge 5\% N$.
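As an illustration of these definitions, the sketch below locates the gaps of a single signal and applies the default 5% rule. It assumes that missing values are encoded as `None` or NaN and uses 0-based indices; the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def find_gaps(signal: Sequence[Optional[float]]) -> List[Tuple[int, int]]:
    """Return (start_index, size) for every run of consecutive missing values."""
    gaps: List[Tuple[int, int]] = []
    start = None
    for idx, value in enumerate(signal):
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if missing and start is None:
            start = idx                       # a new gap opens here
        elif not missing and start is not None:
            gaps.append((start, idx - start))  # the gap closed just before idx
            start = None
    if start is not None:                     # gap running to the end of the signal
        gaps.append((start, len(signal) - start))
    return gaps


def is_large_gap(gap_size: int, n: int, ratio: float = 0.05) -> bool:
    """Default rule from the text (no prior knowledge): large when T >= 5% of N."""
    return gap_size >= ratio * n
```

When the change process of the application is known (for instance one day for algal blooms), the threshold would be set from that duration rather than from the 5% ratio.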

The mechanism of the FSMUMI approach is shown in Figure 3. Without loss of generality, in this figure, we consider a multivariate time series including 3 variables whose correlations are low. The proposed approach involves three major stages. The first stage builds two queries Qa and Qb. The second stage finds the windows most similar to the queries. It includes two minor steps: comparing sliding windows to the queries using the new similarity measure, and selecting the similar windows Qbs and Qas. Finally, the imputation values are computed by averaging the values of the window following Qbs and the one preceding Qas to complete the gap.

This method concentrates on filling missing values in low/uncorrelated multivariate time series. For this type of data, we cannot take advantage of the relations between features to estimate missing values. So we must rely on the observed values of each signal to complete its own missing data. This means that we complete missing data on each variable, one by one. Further, an important point of our approach is that each incomplete signal is processed as two separate time series, one before the considered gap and one after it. This increases the search space for similar values. Moreover, by processing signals one by one, FSMUMI makes it possible to handle the problem of wholly missing variables (missing data at the same index in all variables).

The proposed model is described in Algorithm 1 and is mainly divided into three phases:

(i) The first phase: Building queries (cf. 1 in Figure 3)

For each incomplete signal and each T-gap, two reference databases are extracted from the original time series and two query windows are built to retrieve similar windows. The data before the gap (denoted Db) and the data after the gap (denoted Da) are considered as two separate time series. We denote by Qb the subsequence just before the gap and by Qa the subsequence just after the gap. These query windows have the same size T as the gap.

(ii) The second phase: Finding the most similar windows (cf. 2 and 3 in Figure 3)

For the Db database, we build sliding reference windows (noted R) of size T. From these R windows, we retrieve the most similar window (Qbs) to the Qb query using the new similarity measure fbsm as previously defined in Section 3.1. Details are in the following:

We first find the threshold above which two windows are considered similar. At each increment step_threshold, we compute the fbsm similarity measure between a sliding window R and the query Qb. The threshold is the maximum of all the fbsm values calculated (Step a in Algorithm 1).

We then find the window most similar to the query Qb. At each increment step_sim_win, the fbsm between a sliding reference window R and the query Qb is computed. We compare this fbsm to the threshold to determine whether R is similar to the query Qb. We finally choose as Qbs the window with the maximum fbsm among all the similar windows (Step b in Algorithm 1).

The same process is performed to find the most similar window Qas in Da data.

In the proposed approach, the dynamics and the shape of data before and after a gap are a key-point of our method. This means we take into account both queries Qa (after the gap) and Qb (before the gap). This makes it possible to find out windows that have the most similar dynamics and shape to the queries.

(iii) The third phase: Filling the gap (cf. 4 in Figure 3)

When results from both referenced time series are available, we fill in the gap by averaging values of the window preceding Qas and the one following Qbs. The average values are used in our approach because model averaging makes the final results more stable and unbiased [53].

Algorithm 1: FSMUMI algorithm.

Input: X = {x^1, x^2, ..., x^M}: incomplete uncorrelated multivariate time series
       N: size of the time series
       t: index of a gap (position of the first missing value of the gap)
       T: size of the gap
       step_threshold: increment for finding a threshold
       step_sim_win: increment for finding a similar window
Output: Y: completed (imputed) time series

for each incomplete signal x^j ∈ X do
  for each gap at index t in x^j do
    Divide x^j into two separate time series Da, Db:
      Da = x^j[t + T : N], Db = x^j[1 : t - 1]
    Complete all lines containing a missing parameter in Da, Db by a max trapezoid function
    Construct queries Qa, Qb (temporal windows after and before the gap):
      Qa = Da[1 : T], Qb = Db[t - T : t - 1]
    for Db data do
      Step a: Find the threshold in the Db database
      i ← 1; FBSM ← NULL
      while i ≤ length(Db) do
        k ← i + T - 1
        Create a reference window: R(i) = Db[i : k]
        Calculate the fuzzy-based similarity measure between Qb and R(i): fbsm
        Save the fbsm to FBSM
        i ← i + step_threshold
      end while
      return threshold = max{FBSM}
      Step b: Find similar windows in the Db database
      i ← 1; Lopb ← NULL
      while i ≤ length(Db) do
        k ← i + T - 1
        Create a reference window: R(i) = Db[i : k]
        Calculate the fuzzy-based similarity measure between Qb and R(i): fbsm
        if fbsm ≥ threshold then
          Save the position of R(i) to Lopb
        end if
        i ← i + step_sim_win
      end while
      return position of Qbs, the window most similar to Qb, having the maximum fuzzy similarity measure in the Lopb list
    end for
    for Da data do
      Perform Step a and Step b for Da data
      return position of Qas, the window most similar to Qa
    end for
    Replace the missing values at position t by the average vector of the window after Qbs and the window before Qas
  end for
end for
return Y: imputed time series
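
The core of Algorithm 1 can be sketched in Python for one signal and one gap. This is a simplified illustration under several assumptions, not the reference implementation: indices are 0-based, both sides of the gap are fully observed and at least 2T long, candidate windows are restricted so that the window used for filling exists, and the similarity is a stand-in (the Sim measure alone) rather than the full fuzzy-weighted FBSM. All function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Similarity = Callable[[Sequence[float], Sequence[float]], float]


def sim_measure(q: Sequence[float], r: Sequence[float]) -> float:
    """Stand-in similarity (Sim measure only); NOT the fuzzy-weighted FBSM."""
    span = max(q) - min(q)
    if span == 0:
        return 1.0 if list(q) == list(r) else 0.0
    return sum(1.0 / (1.0 + abs(a - b) / span) for a, b in zip(q, r)) / len(q)


def most_similar_start(db: Sequence[float], query: Sequence[float], first: int,
                       last: int, step_threshold: int, step_sim_win: int,
                       similarity: Similarity) -> int:
    """Start index in db of the window most similar to query (Steps a and b)."""
    size = len(query)

    def score(i: int) -> float:
        return similarity(query, db[i:i + size])

    # Step a: the threshold is the best score on the coarse grid.
    coarse = list(range(first, last + 1, step_threshold))
    threshold = max(score(i) for i in coarse)
    # Step b: keep the windows reaching the threshold, then take the best one.
    similar = [i for i in range(first, last + 1, step_sim_win) if score(i) >= threshold]
    if not similar:  # assumption: fall back to the coarse grid if the fine grid misses
        similar = [i for i in coarse if score(i) >= threshold]
    return max(similar, key=score)


def fill_gap(signal: Sequence[Optional[float]], t: int, T: int,
             step_threshold: int = 1, step_sim_win: int = 1,
             similarity: Similarity = sim_measure) -> List[Optional[float]]:
    """Fill signal[t:t+T] by averaging the window after Qbs and the one before Qas."""
    db, da = list(signal[:t]), list(signal[t + T:])
    if len(db) < 2 * T or len(da) < 2 * T:
        raise ValueError("need at least 2*T observed values on each side of the gap")
    qb, qa = db[-T:], da[:T]
    # Before the gap: each candidate needs a following window of size T inside Db.
    ib = most_similar_start(db, qb, 0, len(db) - 2 * T,
                            step_threshold, step_sim_win, similarity)
    after_qbs = db[ib + T:ib + 2 * T]
    # After the gap: each candidate needs a preceding window of size T inside Da.
    ia = most_similar_start(da, qa, T, len(da) - T,
                            step_threshold, step_sim_win, similarity)
    before_qas = da[ia - T:ia]
    filled = list(signal)
    filled[t:t + T] = [(u + v) / 2 for u, v in zip(after_qbs, before_qas)]
    return filled
```

On a strictly periodic signal such as `[0, 1, 2, 3, 4, 0, 1, ...]` with a gap of one period removed, this sketch recovers the removed values exactly, since an identical window exists on both sides of the gap. Real signals only recur approximately, which is where the choice of similarity measure matters.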


4. Experiment Protocol

The experiments are performed on three multivariate time series with the same experiment process and the same gaps, described in detail below.

4.1. Datasets Description. For the assessment of the proposed approach and the comparison of its performance to several published algorithms, we use 3 multivariate time series, one from UCI Machine Learning repository, one simulated dataset (this allows us to handle the correlations between variables and percentage of missing values), and finally a real time series hourly sampled by IFREMER (France) in the eastern English Channel.

(i) Synthetic dataset [54]: The data are synthetic time series, including 10 features, 100,000 sampled points. All data points are in the range -0.5 to +0.5. The data appear highly periodic but never exactly repeated. They have structure at different resolutions. Each of the 10 features is generated by independent invocations of the function:

$$y = \sum_{i=3}^{7} \frac{1}{2^{i}} \sin\left(2\pi \left(2^{2+i} + \mathrm{rand}(2^{i})\right) t\right), \quad 0 \le t \le 1 \tag{6}$$

where rand(x) produces a random integer between 0 and x.
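
A possible reading of (6) in Python is given below. It assumes that one random integer is drawn per term and per feature (and kept fixed over t), and that t is sampled uniformly on [0, 1]; neither point is stated explicitly, and the function name is hypothetical.

```python
import math
import random
from typing import List, Optional


def synthetic_feature(n_points: int, seed: Optional[int] = None) -> List[float]:
    """One feature generated from (6), sampled at n_points (>= 2) values of t in [0, 1]."""
    rng = random.Random(seed)
    orders = range(3, 8)  # i = 3, ..., 7
    # Assumption: rand(2^i) is drawn once per term and per feature.
    freqs = [2 ** (2 + i) + rng.randint(0, 2 ** i) for i in orders]
    ts = [k / (n_points - 1) for k in range(n_points)]
    return [
        sum(math.sin(2 * math.pi * f * t) / 2 ** i for i, f in zip(orders, freqs))
        for t in ts
    ]
```

Since the amplitudes sum to 31/128 (about 0.24), every generated value lies well inside the range -0.5 to +0.5 quoted above.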

These data are very large so we choose only a subset of 3 signals for performing experiments.

(ii) Simulated dataset: In the second experiment, a simulated dataset including 3 signals is produced as follows. For the first variable, we use 5 sine functions with different frequencies and amplitudes, F = {f1, f2, f3, f4, f5}. Next, 3 different noise levels are added to the data F: S = {F, F + noise1, F + noise2, F + noise3}. We then repeat S 4 times (this dataset has 32,000 sampled points). In this study, we deal with missing data in low/uncorrelated multivariate time series. To satisfy this condition, the two remaining signals are generated from the first signal so that the correlations between these signals are low ($\le$ 0.1%). We apply the Corgen function of the ecodist R package [55] to create the second and third variables.

(iii) MAREL-Carnot dataset [2]: The third experiment is conducted on the MAREL-Carnot dataset. This dataset consists of nineteen series, such as phosphate, salinity, turbidity, water temperature, fluorescence, and water level, that characterize sea water. These signals were collected from 1 January 2005 to 9 February 2009 every 20 minutes. Here they are sampled hourly, so they have 35,334 time samples. But the data include many missing values, and the size of the missing data varies from signal to signal. To assess the performance of the proposed method and compare it with other approaches, we choose a subgroup including fluorescence, water level, and water temperature (the water level and fluorescence signals are complete, while water temperature contains isolated missing values and many gaps). We selected these signals because their correlations are low.

After completing missing values, completion data will be compared with the actual values in the completed series to evaluate the ability of different imputation methods. Therefore, it is necessary to fill missing values in the water temperature. To ensure the fairness of all algorithms, filling in the water temperature series is performed by using the na.interp method ([56]).

4.2. Multivariate Imputation Approaches. In the present study, we perform a comparison of the proposed algorithm with 7 other approaches (comprising Amelia II, FcM, MI, MICE, missForest, na.approx, and DTWUMI) for the imputation of multivariate time series. We use R language to execute all these algorithms.

(1) Amelia II (Amelia II R-package) [57]: The algorithm uses the familiar expectation-maximization algorithm on multiple bootstrapped samples of the original incomplete data to draw values of the complete data parameters. The algorithm then draws imputed values from each set of bootstrapped parameters, replacing the missing values with the drawn values.

(2) FcM (Fuzzy c-means based imputation): This approach involves 2 steps. The first step groups the whole data into k clusters using the fuzzy c-means technique. A cluster membership for each sample and a cluster center are generated for each feature. The second step fills in the incomplete data using the membership degrees and the cluster centroids [33]. We follow the principles of [33] and use the c-means function [58] to develop this approach.

(3) MI: Multiple Imputation (MI R-package) [59]: This method uses predictive mean matching to estimate missing values of continuous variables. For each missing value, its imputation value is randomly selected from a set of observed values that are the closest predicted mean to the variable with the missing value.

(4) MICE: Multivariate Imputation via Chained Equations (MICE R-package) [60]: For each incomplete variable under the assumption of MAR (missing at random), the algorithm performs a completion by full conditional specification of predictive models. The same process is implemented with other variables having missing data.

(5) missForest (missForest R-package) [6]: This algorithm uses random forest method to complete missing values. For each variable containing missing data, missForest builds a random forest model on the available data. To estimate missing data this model is applied in the variable, repeating the procedure until it meets a stopping condition.

(6) Linear interpolation: na.approx (zoo R-package) [61]: This method is based on an interpolation function to predict each missing point.

(7) DTWUMI [62]: For each gap, this approach finds the most similar window to the subsequence after (resp. before) the gap based on the combination of shape-features extraction and Dynamic Time Warping algorithms. Then, the previous (resp. following) window of the most similar one in the incomplete signal is used to complete the gap.

4.3. Imputation Performance Measurements. In order to estimate the quantitative performance of imputation approaches, six usual criteria in the literature are used as follows:

(1) Similarity evaluates the percentage of similarity between the estimated values (y) and the respective real values (x). This index is defined by

$$\mathrm{Sim}(y, x) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{1 + \frac{|y_i - x_i|}{\max(x) - \min(x)}} \tag{7}$$

where T is the number of missing values. The similarity tends to 1 when the two curves are identical and tends to 0 when the amplitudes are strongly different.

(2) $R^2$ score is determined as the square of the correlation coefficient between the two variables y and x. This indicator makes it possible to assess the quality of an imputation model. A method performs better when its score is higher ($R^2 \in [0, 1]$).

(3) RMSE (Root Mean Square Error) is computed as the square root of the average squared difference between y and x. This is an appropriate coefficient to measure the global ability of a completion method. In general, a lower RMSE indicates a better imputation performance.

$$\mathrm{RMSE}(y, x) = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (y_i - x_i)^2} \tag{8}$$

It is now widely accepted that good imputation performance does not automatically lead to good estimation performance. This is why other indices such as FSD, FA2, and FB (which make it possible to evaluate the shape of the two signals) are used in this study.

(4) FSD (Fraction of Standard Deviation) is defined as

$$\mathrm{FSD}(y, x) = 2 \cdot \frac{|\mathrm{SD}(y) - \mathrm{SD}(x)|}{\mathrm{SD}(y) + \mathrm{SD}(x)} \tag{9}$$

This fraction indicates whether a method is acceptable or not. Applied to the imputation task, the closer the FSD value is to 0, the better the imputation method.

(5) FB (Fractional Bias) determines whether the predicted values y are overestimated or underestimated relative to the observed values x. This indicator is given by (10). An imputation model is considered ideal when its FB equals 0.

$$\mathrm{FB}(y, x) = 2 \cdot \left| \frac{\mathrm{mean}(y) - \mathrm{mean}(x)}{\mathrm{mean}(y) + \mathrm{mean}(x)} \right| \tag{10}$$

(6) FA2 is the fraction of data points for which the ratio between y and x lies within a factor of two. It is described by

$$\mathrm{FA2}(y, x) = \frac{\mathrm{length}(0.5 \le y/x \le 2)}{\mathrm{length}(x)} \tag{11}$$

When FA2 value is close to 1, a model is considered perfect.
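
The six indices can be sketched in Python as follows, following (7) to (11). This is an illustrative version with hypothetical function names: the standard deviation is the sample one, and pairs with a zero reference value are counted as outside the factor-of-two band in FA2, both of which are assumptions.

```python
import math
import statistics
from typing import Sequence


def similarity(y: Sequence[float], x: Sequence[float]) -> float:
    """Index (7); assumes x is not constant."""
    span = max(x) - min(x)
    return sum(1.0 / (1.0 + abs(yi - xi) / span) for yi, xi in zip(y, x)) / len(x)


def r2_score(y: Sequence[float], x: Sequence[float]) -> float:
    """Square of the Pearson correlation coefficient between y and x."""
    my, mx = statistics.fmean(y), statistics.fmean(x)
    cov = sum((yi - my) * (xi - mx) for yi, xi in zip(y, x))
    var_y = sum((yi - my) ** 2 for yi in y)
    var_x = sum((xi - mx) ** 2 for xi in x)
    return cov * cov / (var_y * var_x)


def rmse(y: Sequence[float], x: Sequence[float]) -> float:
    """Index (8): square root of the mean squared difference."""
    return math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(x))


def fsd(y: Sequence[float], x: Sequence[float]) -> float:
    """Index (9), using the sample standard deviation."""
    sd_y, sd_x = statistics.stdev(y), statistics.stdev(x)
    return 2 * abs(sd_y - sd_x) / (sd_y + sd_x)


def fb(y: Sequence[float], x: Sequence[float]) -> float:
    """Index (10): fractional bias of the means."""
    my, mx = statistics.fmean(y), statistics.fmean(x)
    return 2 * abs((my - mx) / (my + mx))


def fa2(y: Sequence[float], x: Sequence[float]) -> float:
    """Index (11): share of pairs with 0.5 <= y/x <= 2 (pairs with x == 0 do not count)."""
    inside = sum(1 for yi, xi in zip(y, x) if xi != 0 and 0.5 <= yi / xi <= 2)
    return inside / len(x)
```

For a perfect imputation (y equal to x), the sketch returns 1 for similarity, $R^2$, and FA2, and 0 for RMSE, FSD, and FB, matching the ideal values stated above.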

4.4. Experimental Process. The ability of imputation methods cannot be evaluated directly on real gaps because the actual values are lacking. So we must produce artificial missing data on complete time series in order to compare the performance of imputation approaches. We use a three-step technique to assess the results, detailed in the following:

(i) The first step: Generate simulated missing values by removing data values from full time series.

(ii) The second step: Apply the imputation methods to fill in missing data.

(iii) The third step: Evaluate the ability of the proposed approach and compare it with state-of-the-art methods using the performance indices mentioned above.
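The three steps above can be sketched as a small Python experiment. Everything here is an illustrative assumption rather than the paper's code: the helper names are hypothetical, a single contiguous gap is removed, linear interpolation stands in for the imputation method under test, and a sine wave stands in for a complete series.

```python
import numpy as np

def make_gap(series, gap_size, rng):
    """Step (i): hide one contiguous block of gap_size values at a random position."""
    # Keep at least one observed value on each side of the gap.
    start = int(rng.integers(1, len(series) - gap_size))
    incomplete = series.copy()
    incomplete[start:start + gap_size] = np.nan
    return incomplete, start

def impute_linear(incomplete):
    """Step (ii): stand-in imputer (linear interpolation), not FSMUMI."""
    filled = incomplete.copy()
    missing = np.isnan(filled)
    idx = np.arange(len(filled))
    filled[missing] = np.interp(idx[missing], idx[~missing], filled[~missing])
    return filled

def rmse(y, x):
    return float(np.sqrt(np.mean((y - x) ** 2)))

rng = np.random.default_rng(0)
full = np.sin(2 * np.pi * np.arange(1000) / 50)  # toy complete series
gap_size = 10                                    # 1% of 1000 points

incomplete, start = make_gap(full, gap_size, rng)
filled = impute_linear(incomplete)

# Step (iii): score the imputed values against the hidden true values.
gap = slice(start, start + gap_size)
score = rmse(filled[gap], full[gap])
print(score)
```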

In this paper, we perform experiments with seven missing data levels on three large datasets. On each signal, we create simulated gaps at seven rates: 1%, 2%, 3%, 4%, 5%, 7.5%, and 10% of the data in the complete signal (here the biggest gap of MAREL-Carnot data is 3,533 missing values, corresponding to 5 months of hourly samples). For every missing ratio, the approaches are run 5 times with randomly chosen positions for the missing data. We thus perform 35 iterations (7 missing rates, 5 runs each) for each dataset.
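As a quick check of the experimental schedule, the following sketch enumerates the runs and the gap sizes for the synthetic dataset. The constant names are hypothetical, and the series length of 100,000 points is taken from the caption of Table 1.

```python
from itertools import product

MISSING_RATES = [0.01, 0.02, 0.03, 0.04, 0.05, 0.075, 0.10]
RUNS_PER_RATE = 5
N_POINTS = 100_000  # synthetic dataset size, from the Table 1 caption

# One entry per (missing rate, repetition) pair.
schedule = list(product(MISSING_RATES, range(RUNS_PER_RATE)))

# Gap size in points for each missing rate.
gap_sizes = {rate: round(rate * N_POINTS) for rate in MISSING_RATES}

print(len(schedule))                      # 35 iterations per dataset
print(gap_sizes[0.01], gap_sizes[0.05])   # 1000 5000
```

The gap sizes of 1000 and 5000 points agree with the gap sizes quoted in the captions of Figures 4 and 5.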

5. Results and Discussion

This section presents the experimental results obtained with the proposed approach and compares them with those of seven published approaches. Results are discussed in three parts: quantitative performance, visual performance, and execution times.

5.1. Quantitative Performance Comparison. Tables 1, 2, and 3 report the average performance of the imputation methods on the synthetic, simulated, and MAREL-Carnot time series using the 6 indices defined previously. For each missing level, the best results are highlighted in bold. These results demonstrate the improved performance of FSMUMI for completing missing data in low/uncorrelated multivariate time series.

Synthetic Dataset. Table 1 presents a comparison of 8 imputation methods on the synthetic dataset at 7 missing data levels (1-10%). The results clearly show that when the gap size is greater than 2%, the proposed method yields the highest similarity, $R^2$, and FA2, and the lowest RMSE and FB. With this dataset, na.approx gives the best performance at the smallest missing data level for all indices and is ranked second at other ratios of missing values: for similarity and FA2 (2-5%), for RMSE (2-4%), and for $R^2$ (1st rank at the 2% missing rate, 2nd at 3% and 5%). These results can be explained by the fact that the synthetic data are generated by a function (6), while the na.approx method applies an interpolation function to estimate missing values. It is therefore easy to find a function that generates values close to the real values when missing data rates are small. This task becomes more difficult as the missing sample size rises, which is why the ability of na.approx decreases as missing data levels increase, especially at the 7.5% and 10% rates. Although this dataset never exactly repeats itself and our approach is proposed under the assumption of recurrent data, the FSMUMI approach proves its performance for the imputation task even when the missing size increases.

Among the considered methods, the FcM-based approach is less accurate at lower missing rates but it provides better results at larger missing ratios as regards the accuracy indices.

Simulated Dataset. Table 2 reports the evaluation results of the imputation algorithms on the simulated dataset. The best values for each missing level are highlighted in bold. Our proposed method outperforms the other methods on the accuracy indices: the highest similarity and $R^2$, and the lowest RMSE at every missing ratio. However, on the shape indices FA2, FSD, and FB, FSMUMI no longer stands out: it is best only at the 4% rate for the FB index and at the 10% ratio for FA2. In contrast to FSMUMI, DTWUMI provides the best results for the FSD indicator at all missing levels and for FA2 at the first 5 missing ratios (from 1% to 5%).

Unlike on the synthetic dataset, on the simulated dataset the FcM-based method is always ranked third at all missing rates for the similarity and RMSE indicators. The missForest algorithm follows FcM for both indices.

Although the data in the second experiment are built from various functions, these functions are quite complex, so na.approx does not provide good results.

MAREL-Carnot Dataset. Once again, as reported in Table 3, our algorithm demonstrates its capability for the imputation task. The FSMUMI method generates the best results on the accuracy indices at almost all missing ratios (excluding the 2% missing level on all indices, and the 5% missing rate on the $R^2$ score). But when considering shape indicators, FSMUMI only provides the highest FA2 values at several missing levels (3%, 5%-10%). In particular, our method is able to fill in incomplete data with large missing rates (7.5% and 10%): the highest similarity, $R^2$, and FA2, and the lowest RMSE, FSD (excluding 7.5%), and FB. These gaps correspond to 110.4 and 147.2 days sampled at hourly frequency.
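The gap durations quoted above can be recovered from the largest MAREL-Carnot gap of 3,533 hourly values reported in the experimental process. The short sketch below assumes that the 7.5% gap is exactly three quarters of the 10% gap; the variable names are hypothetical.

```python
HOURS_PER_DAY = 24

gap_10_percent = 3533                      # largest gap, in hourly samples
gap_7_5_percent = gap_10_percent * 7.5 / 10  # assumption: proportional to the rate

days_10 = gap_10_percent / HOURS_PER_DAY
days_7_5 = gap_7_5_percent / HOURS_PER_DAY

print(round(days_7_5, 1), round(days_10, 1))  # 110.4 147.2
```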

In contrast to the two datasets above, on the MAREL-Carnot data na.approx gives quite good results: it is consistently ranked second or third on the accuracy indices (first at the 5% missing rate on the $R^2$ score), with the lowest FSD (from 3% to 5% missing rates) and the lowest FB at some other levels of missing data. But the shape of the imputation values generated by this method is clearly the worst (Figure 6).

Other approaches (including FcM-based imputation, MI, MICE, Amelia, and missForest) exploit the relations between attributes to estimate missing values. However, the three considered datasets have low correlations between variables (roughly 0.2 for MAREL-Carnot data, $\le$ 0.1 for the simulated and synthetic datasets). These methods therefore do not perform well when completing missing values in low/uncorrelated multivariate time series. In contrast, our algorithm shows its ability and stability when applied to the imputation task for this kind of data.

The DTWUMI approach was proposed to fill large gaps in low/uncorrelated multivariate time series. However, this method is not as powerful as the FSMUMI method. DTWUMI only produces the best results at the 2% missing level on the MAREL-Carnot dataset and is always ranked second or third at all the remaining missing rates on the MAREL-Carnot and simulated datasets. This is because the DTWUMI method only finds the most similar window to a query either before or after a gap, and it uses only one similarity measure, the DTW cost, to retrieve the most similar window. Another reason may be that DTWUMI directly uses data from the window following or preceding the most similar window to complete the gap.
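To make the one-sided window search described above concrete, here is a deliberately simplified univariate sketch. It is not the DTWUMI or FSMUMI implementation: the function names are hypothetical, only the query before the gap is used, and the Euclidean distance stands in for the DTW cost. The values that follow the best-matching window are copied into the gap.

```python
import numpy as np

def most_similar_window(series, query, search_end):
    """Slide over series[:search_end] and return the start of the closest window.

    Euclidean distance is used here as a stand-in for the DTW cost.
    """
    w = len(query)
    best_start, best_dist = -1, np.inf
    for s in range(search_end - w + 1):
        dist = float(np.linalg.norm(series[s:s + w] - query))
        if dist < best_dist:
            best_start, best_dist = s, dist
    return best_start, best_dist

def fill_gap_from_past(series, gap_start, gap_size, query_size):
    """Fill a gap with the values that follow the window most similar to the query."""
    query = series[gap_start - query_size:gap_start]
    # Candidate windows must leave room for gap_size donor values before the gap.
    start, _ = most_similar_window(series, query, search_end=gap_start - gap_size)
    filled = series.copy()
    donor = series[start + query_size:start + query_size + gap_size]
    filled[gap_start:gap_start + gap_size] = donor
    return filled, start

# Toy recurrent series (illustrative only): ten exact repetitions of one period.
full = np.tile(np.sin(2 * np.pi * np.arange(50) / 50), 10)
incomplete = full.copy()
incomplete[300:320] = np.nan

filled, start = fill_gap_from_past(incomplete, gap_start=300, gap_size=20, query_size=30)
print(start)                         # 20: first window matching the query exactly
print(np.array_equal(filled, full))  # True on this perfectly periodic toy series
```

On a perfectly recurrent series the copy is exact, which illustrates why the approach relies on patterns that reappear elsewhere in the data.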

5.2. Visual Performance Comparison. In this paper, we also compare the visual performance of the completion values yielded by the various algorithms. Figures 4 and 5 show the shape of the imputed values generated by the different approaches on the synthetic series at two missing ratios, 1% and 5%.

At a 1% missing rate, the shape of the imputation values produced by the na.approx method is closer to that of the true values than the shape of the completion values given by our approach. However, at a 5% level of missing data, na.approx no longer performs well (Figure 5). In this case, the proposed method proves its relevance for the imputation task: the shape of FSMUMI's imputation data is almost identical to that of the true values (Figure 5).

Looking at Figure 6, FSMUMI once again proves its capability for uncorrelated multivariate time series imputation: the completion values yielded by FSMUMI are virtually identical to the real data on the MAREL-Carnot dataset. When comparing DTWUMI with FSMUMI, it is clear that FSMUMI gives improved results (Figures 4, 5, and 6).

5.3. Computation Time. We also compare the computational time of each method on the synthetic series (in seconds). Table 4 indicates that the na.approx method requires the shortest running time and the DTWUMI approach takes the longest computing time. The proposed method, FSMUMI, demands more execution time as missing rates increase. However, considering the quantitative and visual performance of FSMUMI for the imputation task (Table 1, Figures 5 and 6), the required time of the proposed approach is fully acceptable.

6. Conclusion

This paper proposes a novel approach for uncorrelated multivariate time series imputation using a fuzzy logic-based similarity measure, namely FSMUMI. This method makes it possible to manage uncertainty with the comprehensibility of linguistic variables. FSMUMI has been tested on different datasets and compared with published algorithms (Amelia II, FcM, MI, MICE, missForest, na.approx, and DTWUMI) on accuracy and shape criteria. The visual ability of these approaches is also investigated. The experimental results clearly show that the proposed approach yields improved accuracy over previous methods in the case of multivariate time series having large gaps and low or no correlation between variables. However, applying the algorithm requires the assumption of recurrent data and a sufficiently large dataset. This means that our approach needs the patterns (in our case the two queries, before and after the considered gap) to exist somewhere in the database. This enables us to predict missing values if the patterns occur in the data preceding or following the considered position. Thus a sufficiently large dataset is required.

In future work, we plan to (i) combine the FSMUMI method with other algorithms such as Random Forest or Deep Learning in order to efficiently fill incomplete values in any type of multivariate time series; (ii) investigate this approach applied to short-term/long-term forecasts in multivariate time series. We could also investigate complex fuzzy sets [63], which have given good results using an adaptive scheme in the case of bivariate time series with small datasets, instead of ordinary fuzzy sets.

https://doi.org/10.1155/2018/9095683

Data Availability

The data used to support this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was kindly supported by the Ministry of Education and Training Vietnam International Education Development, the French government, and FEDER, the region Hauts-de-France (CPER 2014-2020 MARCO). The experiments were carried out using the CALCULCO computing platform, supported by SCoSI/ULCO (Univ. Littoral).

References

[1] H. T. Ceong, H. J. Kim, and J. S. Park, "Discovery of and Recovery from Failure in a Coastal Marine USN Service," Journal of Information and Communication Convergence Engineering, vol. 10, no. 1, pp. 11-20, 2012.

[2] A. Lefebvre, MAREL Carnot Data and Metadata from Coriolis Data Centre, SEANOE, 2015.

[3] K. Rousseeuw, E. Poisson Caillault, A. Lefebvre, and D. Hamad, "Hybrid Hidden Markov Model for Marine Environment Monitoring," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 1, pp. 204-213, 2015.

[4] G. Hawthorne and P. Elliott, "Imputing cross-sectional missing data: Comparison of common techniques," Australian & New Zealand Journal of Psychiatry, vol. 39, no. 7, pp. 583-590, 2005.

[5] H. Junninen, H. Niska, K. Tuppurainen, J. Ruuskanen, and M. Kolehmainen, "Methods for imputation of missing values in air quality data sets," Atmospheric Environment, vol. 38, no. 18, pp. 2895-2907, 2004.

[6] D. J. Stekhoven and P. Buhlmann, "Missforest--Nonparametric missing value imputation for mixed-type data," Bioinformatics, vol. 28, no. 1, pp. 112-118, 2012.

[7] H. Ichihashi, K. Honda, A. Notsu, and T. Yagi, "Fuzzy c-means classifier with deterministic initialization and missing value imputation," in Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence, FOCI 2007, pp. 214-221, USA, April 2007.

[8] P. Saravanan and P. Sailakshmi, "Missing value imputation using fuzzy possibilistic c means optimized with support vector regression and genetic algorithm," Journal of Theoretical and Applied Information Technology, vol. 72, no. 1, pp. 34-39, 2015.

[9] T. Furukawa, Ohnishi Shin-ichi, Yamanoi Takahiro. Missing Categorical Data Imputation for FCM Clusterings of Mixed Incomplete Data, 2014.

[10] Y. Deng, C. Chang, M. S. Ido, and Q. Long, "Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data," Scientific Reports, vol. 6, Article ID 21689, 2016.

[11] S. Oehmcke, O. Zielinski, and O. Kramer, "kNN ensembles with penalized DTW for multivariate time series imputation," in Proceedings of the 2016 International Joint Conference on Neural Networks, IJCNN 2016, pp. 2774-2781, Canada, July 2016.

[12] Y. Gupta, A. Saini, and A. K. Saxena, "Fuzzy logic-based approach to develop hybrid similarity measure for efficient information retrieval," Journal of Information Science, vol. 40, no. 6, pp. 846-857, 2014.

[13] L. Dimitriou, T. Tsekeris, and A. Stathopoulos, "Adaptive hybrid fuzzy rule-based system approach for modeling and predicting urban traffic flow," Transportation Research Part C: Emerging Technologies, vol. 16, no. 5, pp. 554-573, 2008.

[14] A. Stathopoulos, M. G. Karlaftis, and L. Dimitriou, "Fuzzy rule-based system approach to combining traffic count forecasts," Transportation Research Record, no. 2183, pp. 120-128, 2010.

[15] H. B. Yin, S. C. Wong, J. Xu, and C. K. Wong, "Urban traffic flow prediction using a fuzzy-neural approach," Transportation Research Part C: Emerging Technologies, vol. 10, no. 2, pp. 85-98, 2002.

[16] J. Tang, F. Liu, W. Zhang, R. Ke, and Y. Zou, "Lane-changes prediction based on adaptive fuzzy neural network," Expert Systems with Applications, vol. 91, pp. 452-463, 2018.

[17] S. Shahmoradi and S. Bagheri Shouraki, "Evaluation of a novel fuzzy sequential pattern recognition tool (fuzzy elastic matching machine) and its applications in speech and handwriting recognition," Applied Soft Computing, vol. 62, pp. 315-327, 2018.

[18] M. Y. Al-Shamri and N. H. Al-Ashwal, "Fuzzy-weighted similarity measures for memory-based collaborative recommender systems," Journal of Intelligent Learning Systems and Applications, vol. 6, no. 1, pp. 1-10, 2014.

[19] W. N. Wang, W. Pedrycz, and X. D. Liu, "Time series long-term forecasting model based on information granules and fuzzy clustering," Engineering Applications of Artificial Intelligence, vol. 41, pp. 17-24, 2015.

[20] J. L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, New York, NY, USA, 1997.

[21] S. Van Buuren, H. C. Boshuizen, and D. L. Knook, "Multiple imputation of missing blood pressure covariates in survival analysis," Statistics in Medicine, vol. 18, no. 6, pp. 681-694, 1999.

[22] E. R. Trivellore, M. L. James, H. Van John, and P. Solenberger, "A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models," Survey methodology, vol. 27, no. 1, pp. 85-96, 2001.

[23] J. M. Engels and P. Diehr, "Imputation of missing longitudinal data: A comparison of methods," Journal of Clinical Epidemiology, vol. 56, no. 10, pp. 968-976, 2003.

[24] P. Royston, "Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring," Stata Journal, vol. 7, no. 4, pp. 445-464, 2007.

[25] A. D. Shah, J. W. Bartlett, J. Carpenter, O. Nicholas, and H. Hemingway, "Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study," American Journal of Epidemiology, vol. 179, no. 6, pp. 764-774, 2014.

[26] S. G. Liao, Y. Lin, D. D. Kang et al., "Missing value imputation in high-dimensional phenomic data: imputable or not, and how?" BMC Bioinformatics, vol. 15, no. 1, 2014.

[27] S. A. Rahman, Y. Huang, J. Claassen, N. Heintzman, and S. Kleinberg, "Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data," Journal of Biomedical Informatics, vol. 58, pp. 198-207, 2015.

[28] G. Andrew, H. Jennifer, S. Yu-Sung et al., Su Yu-Sung, 2015.

[29] P. Bonissone, J. M. Cadenas, M. C. Garrido, and R. A. Diaz-Valladares, "A fuzzy random forest," International Journal of Approximate Reasoning, vol. 51, no. 7, pp. 729-747, 2010.

[30] H.-H. Hsu, A. C. Yang, and M.-D. Lu, "KNN-DTW based missing value imputation for microarray time series data," Journal of Computers, vol. 6, no. 3, pp. 418-425, 2011.

[31] C. Y. Andy, H. Hui-Huang, and L. Ming-Da, in Microarray Gene Expression Data. In, Kinmen, Taiwan, 2009.

[32] E. Kostadinova, V. Boeva, L. Boneva, and E. Tsiporkova, "An integrative DTW-based imputation method for gene expression time series data," in Proceedings of the 2012 6th IEEE International Conference Intelligent Systems, IS 2012, pp. 258-263, Bulgaria, September 2012.

[33] D. Li, J. Deogun, W. Spaulding, and B. Shuart, "Towards missing data imputation: a study of fuzzy K-means clustering method," in Rough sets and current trends in computing, vol. 3066 of Lecture Notes in Comput. Sci., pp. 573-579, Springer, Berlin, 2004.

[34] J. Tang, G. Zhang, Y. Wang, H. Wang, and F. Liu, "A hybrid approach to integrate fuzzy C-means based imputation method with genetic algorithm for missing traffic volume data estimation," Transportation Research Part C: Emerging Technologies, vol. 51, pp. 29-40, 2015.

[35] I. B. Aydilek and A. Arslan, "A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm," Information Sciences, vol. 233, pp. 25-35, 2013.

[36] T. Furukawa, S.-I. Ohnishi, and T. Yamanoi, "On a fuzzy c-means algorithm for mixed incomplete data using partial distance and imputation," in Proceedings of the International MultiConference of Engineers and Computer Scientists, IMECS 2014, pp. 319-323, Hong Kong, March 2014.

[37] S. Azim and S. Aggarwal, "Hybrid model for data imputation: using fuzzy c means and multi layer perceptron," in Proceedings of the 4th IEEE International Advance Computing Conference (IACC '14), pp. 1281-1285, Gurgaon, India, February 2014.

[38] J. Tang, Y. Wang, S. Zhang, H. Wang, F. Liu, and S. Yu, "On missing traffic data imputation based on fuzzy c-means method by considering spatial-temporal correlation," Transportation Research Record, vol. 2528, pp. 86-95, 2015.

[39] K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, and R. Jenssen, "Time series cluster kernel for learning similarities between multivariate time series with missing data," Pattern Recognition, vol. 76, pp. 569-581, 2018.

[40] L. A. Zadeh, "Fuzzy sets," Information and Computation, vol. 8, pp. 338-353, 1965.

[41] H. J. Sadaei, F. G. Guimaraes, C. Silva, M. H. Lee, and T. Eslami, "Short-term load forecasting method based on fuzzy time series, seasonality and long memory process," International Journal of Approximate Reasoning, vol. 83, pp. 196-217, 2017.

[42] C. P. Pappis and N. I. Karacapilidis, "A comparative assessment of measures of similarity of fuzzy values," Fuzzy Sets and Systems, vol. 56, no. 2, pp. 171-174, 1993.

[43] S. M. Chen, M. S. Yeh, and P. Y. Hsiao, "A comparison of similarity measures of fuzzy values," Fuzzy Sets and Systems, vol. 72, no. 1, pp. 79-89, 1995.

[44] A. Wilbik and J. M. Keller, "A distance metric for a space of linguistic summaries," Fuzzy Sets and Systems, vol. 208, pp. 79-94, 2012.

[45] A. Wilbik and J. M. Keller, "A fuzzy measure similarity between sets of linguistic summaries," IEEE Transactions on Fuzzy Systems, vol. 21, no. 1, pp. 183-189, 2013.

[46] R. J. Almeida, M.-J. Lesot, B. Bouchon-Meunier, U. Kaymak, and G. Moyses, "Linguistic summaries of categorical time series for septic shock patient data," in Proceedings of the 2013 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2013, India, July 2013.

[47] W. Pedrycz and F. Gomide, Fuzzy Systems Engineering: Toward Human-Centric Computing, John Wiley, Hoboken, NJ, USA, 2007.

[48] T.-T. Phan, E. Poisson Caillault, A. Lefebvre, and A. Bigand, "Dynamic time warping-based imputation for univariate time series data," Pattern Recognition Letters, 2017.

[49] J. Garibaldi, C. Chao, and F. Tajul, "FuzzyR: Fuzzy Logic Toolkit for R," R package version 2.1, 2017.

[50] D. A. Paul, Missing Data Quantitative Applications in the Social Sciences, vol. 136, Sage Publication, 2001.

[51] M. B. Christopher, Pattern Recognition and Machine Learning (Information Science and Statistics), Secaucus, NJ, USA, Springer-Verlag, 2006.

[52] T. D. Dickey, "Emerging ocean observations for interdisciplinary data assimilation systems," Journal of Marine Systems, vol. 40-41, pp. 5-48, 2003.

[53] M. Schomaker and C. Heumann, "Model selection and model averaging after multiple imputation," Computational Statistics & Data Analysis, vol. 71, pp. 758-770, 2014.

[54] E. J. Keogh and M. J. Pazzani, "An indexing scheme for fast similarity search in large time series databases," in Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM '99), pp. 56-67, Cleveland, Ohio, USA, July 1999.

[55] S. C. Goslee and D. L. Urban, "The ecodist package for dissimilarity-based analysis of ecological data," Journal of Statistical Software, vol. 22, no. 7, pp. 1-19, 2007.

[56] R. J. Hyndman and Y. Khandakar, "Automatic time series forecasting: the forecast package for R," Journal of Statistical Software, vol. 27, no. 3, pp. 1-22, 2008.

[57] J. Honaker, G. King, and M. Blackwell, "Amelia II: a program for missing data," Journal of Statistical Software, vol. 45, no. 7, pp. 1-47, 2011.

[58] M. David, E. Dimitriadou, K. Hornik, A. Weingessel, and L. Friedrich, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2015. R package version 1.6-7.

[59] Y.-S. Su, A. Gelman, J. Hill, and M. Yajima, "Multiple imputation with diagnostics (mi) in R: Opening windows into the black box," Journal of Statistical Software, vol. 45, no. 2, pp. 1-31, 2011.

[60] S. van Buuren and K. Groothuis-Oudshoorn, "Mice: multivariate imputation by chained equations in R," Journal of Statistical Software, vol. 45, no. 3, pp. 1-67, 2011.

[61] A. Zeileis and G. Grothendieck, Andrews Felix. Zoo: S3 Infrastructure for Regular and Irregular Time Series (Z's Ordered Observations), vol. 14, 2016.

[62] T. Phan, E. P. Caillault, A. Lefebvre, and A. Bigand, "Which DTW method applied to marine univariate time series imputation," in Proceedings of the OCEANS 2017--Aberdeen, pp. 1-7, Aberdeen, United Kingdom, June 2017.

[63] O. Yazdanbakhsh and S. Dick, "A systematic review of complex fuzzy sets and logic," Fuzzy Sets and Systems, vol. 338, pp. 1-22, 2018.

Thi-Thu-Hong Phan (1, 2), Andre Bigand (1), and Emilie Poisson Caillault (1)

(1) Univ. Littoral Cote d'Opale, EA 4491-LISIC, F-62228 Calais, France

(2) Vietnam National University of Agriculture, Department of Computer Science, Hanoi, Vietnam

Correspondence should be addressed to Thi-Thu-Hong Phan; hongptvn@gmail.com and Andre Bigand; bigand@univ-littoral.fr

Received 30 May 2018; Accepted 25 July 2018; Published 9 August 2018

Academic Editor: Shyi-Ming Chen

Caption: Figure 1: Computing scheme of the new similarity measure.

Caption: Figure 2: Membership function of fuzzy similarity values.

Caption: Figure 3: Scheme of the completion process: (1) building queries, (2) comparing sliding windows, (3) selecting the most similar windows, and (4) completing gap.

Caption: Figure 4: Visual comparison of completion data of different imputation approaches with real data on the 1st signal of synthetic series with the gap size of 1000.

Caption: Figure 5: Visual comparison of completion data of different imputation approaches with real data on the 1st signal of synthetic series with the gap size of 5000.

Caption: Figure 6: Visual comparison of completion data of different imputation approaches with real data on the 2nd signal of MAREL-Carnot dataset with the gap size of 353.
Table 1: Average imputation performance indices of various imputation algorithms on synthetic dataset (100,000 collected points).

Gap size        Method                  Accuracy indices
                              1-Sim        1-$R^2$           RMSE

1%              FSMUMI        0.136           0.261          0.051
                Amelia        0.275           0.999          0.143
                 FcM          0.231           0.722          0.096
                  MI          0.275           0.999          0.142
                 MICE         0.258           0.944          0.13
              missForest      0.248           0.915          0.122
              na.approx       0.052           0.066          0.019
                DTWUMI        0.257           0.713          0.88

2%              FSMUMI         0.1            0.295          0.046
                Amelia        0.259           0.998          0.147
                 FcM          0.208           0.686          0.104
                  MI          0.259           0.998          0.147
                 MICE         0.244           0.968          0.14
              missForest      0.239           0.968          0.133
              na.approx       0.104           0.278          0.047
                DTWUMI        0.237           0.775          0.867

3%              FSMUMI        0.113           0.341          0.056
                Amelia        0.218           0.911          0.127
                 FcM          0.214           0.601           0.1
                  MI          0.253           0.993          0.141
                 MICE         0.21            0.873          0.118
              missForest      0.188           0.796          0.102
              na.approx       0.148           0.43           0.072
                DTWUMI        0.231           0.799          0.874

4%              FSMUMI        0.06            0.146          0.037
                Amelia        0.208             1            0.14
                 FcM          0.155           0.759          0.095
                  MI          0.208           0.999          0.14
                 MICE         0.209           0.987          0.138
              missForest      0.196           0.968          0.127
              na.approx       0.145           0.721          0.092
                DTWUMI        0.148           0.586          0.918

5%              FSMUMI        0.055           0.132          0.032
                Amelia        0.214           0.997          0.15
                 FcM          0.179           0.715          0.108
                  MI          0.231           0.996          0.167
                 MICE         0.221           0.968          0.152
              missForest      0.212           0.944          0.143
              na.approx       0.16             0.8           0.118
                DTWUMI        0.186           0.885          0.88

7.5%            FSMUMI        0.049           0.071          0.027
                Amelia        0.197           0.998          0.147
                 FcM          0.158           0.809          0.104
                  MI           0.2            0.992          0.15
                 MICE         0.205           0.988          0.15
              missForest      0.188           0.97           0.136
              na.approx       0.192           0.971          0.142
                DTWUMI        0.133           0.653          0.908

10%             FSMUMI        0.061           0.181          0.043
                Amelia        0.202           0.999          0.147
                 FcM          0.164           0.872          0.104
                  MI          0.21            0.997          0.155
                 MICE         0.209           0.996          0.15
              missForest      0.194           0.97           0.135
              na.approx       0.183           0.997          0.129
                DTWUMI        0.155           0.782          0.893

Gap size        Method                  Shape indices
                               FSD           FB          1-FA2

1%              FSMUMI        0.358        3.253         0.364
                Amelia        0.409        2.252         0.773
                 FcM          1.889        2.208         0.996
                  MI          0.421        2.091         0.773
                 MICE         0.406        2.452         0.72
              missForest      0.389        3.976         0.744
              na.approx       0.054         0.29         0.074
                DTWUMI        0.725        0.405         0.69

2%              FSMUMI        0.155        0.395         0.337
                Amelia        0.275        2.005         0.803
                 FcM          1.863        2.289         0.987
                  MI          0.268         2.11         0.81
                 MICE         0.255        7.616         0.759
              missForest      0.279        3.156         0.792
              na.approx       0.224        0.398         0.347
                DTWUMI        0.509        8.449         0.646

3%              FSMUMI        0.219        0.852         0.322
                Amelia        0.133        6.128         0.76
                 FcM          1.832        1.759         0.989
                  MI          0.236        2.295         0.775
                 MICE         0.208        5.118         0.703
              missForest      0.215        1.846         0.627
              na.approx       0.372        2.382         0.577
                DTWUMI        0.332        27.952        0.69

4%              FSMUMI        0.099        0.738         0.299
                Amelia        0.213        2.171         0.807
                 FcM          1.85          2.09         0.986
                  MI          0.196        2.302         0.807
                 MICE         0.22         3.748         0.801
              missForest      0.216         3.94         0.827
              na.approx       0.252        5.251         0.689
                DTWUMI        0.185        12.688        0.719

5%              FSMUMI        0.058        0.098         0.201
                Amelia        0.147        2.238         0.79
                 FcM          1.818        2.194         0.993
                  MI          0.206        3.094         0.808
                 MICE         0.222         2.3          0.79
              missForest      0.315        4.547         0.819
              na.approx       0.352        18.217        0.622
                DTWUMI        0.213        0.723         0.694

7.5%            FSMUMI        0.069        0.505         0.184
                Amelia        0.045        1.305         0.792
                 FcM          1.813        1.866         0.991
                  MI          0.038        1.645         0.797
                 MICE         0.057        10.744        0.799
              missForest      0.284        4.396         0.812
              na.approx       0.669        2.163         0.712
                DTWUMI        0.064        1.113         0.571

10%             FSMUMI        0.114        0.511         0.26
                Amelia        0.034        4.062         0.788
                 FcM          1.837        2.201         0.992
                  MI          0.12         2.954         0.785
                 MICE         0.055        3.994         0.779
              missForest      0.308        3.024         0.811
              na.approx       0.372        1.455         0.719
                DTWUMI        0.026        1.182         0.626

Table 2: Average imputation performance indices of various imputation algorithms on simulated dataset (32,000 collected points).

Gap size         Method                   Accuracy indices
                                1-Sim        1-$R^2$           RMSE

1%               FSMUMI         0.083                0.515     1.033
                 Amelia         0.157                    1     2.206
                  FcM           0.118                0.998     1.483
                   MI           0.16                 0.999     2.241
                  MICE          0.159                0.998     2.201
               missForest       0.127                0.998     1.608
               na.approx        0.146                0.992     1.901
                 DTWUMI         0.09                 0.552     1.156

2%               FSMUMI         0.068                0.487     1.166
                 Amelia         0.12                 0.998     2.312
                  FcM           0.093                0.999     1.672
                   MI           0.12                     1     2.307
                  MICE          0.119                0.999     2.282
               missForest       0.096                    1     1.769
               na.approx        0.118                    1     2.261
                 DTWUMI         0.074                0.523     1.545

3%               FSMUMI         0.068                0.453     1.053
                 Amelia         0.13                 0.999     2.212
                  FcM           0.098                0.999     1.526
                   MI           0.13                 0.999     2.197
                  MICE          0.129                    1     2.19
               missForest       0.102                0.999     1.626
               na.approx        0.116                0.997     1.938
                 DTWUMI         0.073                0.526     1.189

4%               FSMUMI         0.064                0.412     1.067
                 Amelia         0.122                    1     2.305
                  FcM           0.096                    1     1.607
                   MI           0.125                    1     2.261
                  MICE          0.124                0.999     2.233
               missForest       0.101                    1     1.726
               na.approx        0.109                    1     1.99
                 DTWUMI         0.066                0.465     1.172

5%               FSMUMI         0.063                0.404     1.062
                 Amelia         0.122                    1     2.273
                  FcM           0.092                    1     1.619
                   MI           0.123                    1     2.287
                  MICE          0.121                    1     2.267
               missForest       0.097                0.999     1.731
               na.approx        0.114                    1     1.988
                 DTWUMI         0.063                0.454     1.166

7.5%             FSMUMI         0.06                 0.408     1.063
                 Amelia         0.117                    1     2.232
                  FcM           0.09                     1     1.605
                   MI           0.119                0.999     2.259
                  MICE          0.118                    1     2.238
               missForest       0.094                0.999     1.695
               na.approx        0.108                    1     1.958
                 DTWUMI         0.065                0.477     1.19

10%              FSMUMI         0.061               0.4226     1.086
                 Amelia         0.117                    1     2.269
                  FcM           0.089                    1     1.607
                   MI           0.118               0.9996     2.233
                  MICE          0.118               0.9998     2.254
               missForest       0.094               0.9999     1.702
               na.approx        0.11                     1     1.958
                 DTWUMI         0.067               0.5371     1.293

Gap size         Method                   Shape indices
                                 FSD           FB         1-FA2

1%               FSMUMI         0.159         2.51        0.574
                 Amelia         0.232        3.619        0.794
                  FcM           1.98         2.015        0.998
                   MI            0.2         0.915        0.799
                  MICE          0.214        1.449        0.801
               missForest       0.836        12.034       0.861
               na.approx        0.393        18.997       0.777
                 DTWUMI         0.007        6.022        0.562

2%               FSMUMI         0.194        1.971        0.611
                 Amelia         0.107        2.191        0.794
                  FcM           1.985         1.96        0.998
                   MI           0.123        3.949        0.789
                  MICE          0.114        8.881        0.789
               missForest       0.941        2.777        0.858
               na.approx        0.721        2.059        0.786
                 DTWUMI         0.008        3.686        0.583

3%               FSMUMI         0.076        10.649       0.582
                 Amelia         0.062        3.779        0.794
                  FcM           1.984         2.22        0.997
                   MI           0.078        9.374        0.795
                  MICE          0.067        1.938        0.792
               missForest       0.855        2.407        0.851
               na.approx        0.518        1.974        0.818
                 DTWUMI         0.01         8.725        0.567

4%               FSMUMI         0.061        1.374        0.568
                 Amelia         0.032        2.446        0.764
                  FcM           1.982        2.325        0.997
                   MI           0.043        2.391        0.792
                  MICE          0.045        42.495       0.791
               missForest       0.876        2.901        0.854
               na.approx        0.475         1.94        0.811
                 DTWUMI         0.004        2.079        0.547

5%               FSMUMI         0.062        4.508        0.577
                 Amelia         0.028        4.109        0.798
                  FcM           1.984        2.192        0.998
                   MI           0.024        5.582        0.797
                  MICE          0.044        2.326        0.792
               missForest       0.923        2.473        0.859
               na.approx        0.567        2.247        0.809
                 DTWUMI         0.003        1.594        0.545

7.5%             FSMUMI         0.049        4.843        0.566
                 Amelia         0.034        3.306        0.792
                  FcM           1.981        3.562        0.998
                   MI           0.025        1.946        0.793
                  MICE          0.032        9.359        0.794
               missForest       0.907        1.259        0.858
               na.approx        0.461        3.089        0.816
                 DTWUMI         0.004        3.851        0.566

10%              FSMUMI         0.051        5.558        0.572
                 Amelia         0.021        3.074        0.793
                  FcM           1.981        2.683        0.997
                   MI           0.02          2.05        0.793
                  MICE          0.018        3.424        0.793
               missForest       0.909         1.87        0.857
               na.approx        0.541        2.006        0.798
                 DTWUMI         0.012        3.093        0.577

Table 3: Average imputation performance indices of various
imputation algorithms on the MAREL-Carnot dataset (35,334
collected points).

Gap size        Method                  Accuracy indices
                              1-Sim          1-R^2          RMSE

1%              FSMUMI        0.051          0.156          1.532
                Amelia        0.187          0.544          5.132
                  FcM         0.156          0.342          4.037
                  MI          0.192          0.561          5.282
                 MICE         0.166          0.608          5.596
              missForest      0.165          0.472          4.422
               na.approx      0.061          0.171          1.748
                DTWUMI        0.084          0.181          2.466

2%              FSMUMI        0.045          0.037          1.446
                Amelia        0.146          0.369          4.743
                  FcM         0.116           0.06          3.418
                  MI          0.146          0.364          4.72
                 MICE         0.129          0.369          4.711
              missForest      0.116          0.155          3.575
               na.approx       0.06           0.07          2.012
                DTWUMI        0.042          0.018          1.095

3%              FSMUMI        0.053           0.11          1.294
                Amelia        0.176          0.503          4.694
                  FcM         0.139          0.251          3.35
                  MI           0.17          0.531          4.474
                 MICE         0.157          0.552          4.905
              missForest      0.139          0.345          3.556
               na.approx      0.068          0.224          1.79
                DTWUMI        0.096          0.216          2.587

4%              FSMUMI        0.059          0.058          1.466
                Amelia        0.171           0.44          4.389
                  FcM         0.126          0.152          2.779
                  MI          0.166           0.41          4.234
                 MICE          0.15          0.379          4.15
              missForest      0.129          0.234          3.134
               na.approx      0.077           0.13          2.006
                DTWUMI         0.07          0.105          1.77

5%              FSMUMI        0.051           0.22          2.025
                Amelia        0.151          0.551          4.924
                  FcM         0.113          0.337          3.606
                  MI          0.143          0.567          4.612
                 MICE         0.131          0.523          4.75
              missForest      0.104          0.371          3.443
               na.approx      0.065          0.213          2.071
                DTWUMI        0.067          0.275          2.363

7.5%            FSMUMI        0.043          0.056          1.52
                Amelia         0.14           0.42          4.546
                  FcM         0.104          0.123          3.12
                  MI          0.142          0.427          4.624
                 MICE         0.126           0.38          4.375
              missForest      0.112          0.202          3.587
               na.approx      0.073          0.081          2.043
                DTWUMI         0.06          0.102          1.999

10%             FSMUMI        0.053          0.098          1.642
                Amelia         0.14           0.3           4.294
                  FcM          0.1           0.098          3.68
                  MI           0.14          0.112          4.294
                 MICE          0.12           0.42          4.066
              missForest      0.097          0.461          3.049
               na.approx      0.071          0.529          1.873
                DTWUMI        0.081          0.381          3.293

Gap size        Method                   Shape indices
                               FSD            FB           1-FA2

1%              FSMUMI        0.044          0.081         0.191
                Amelia        0.378          0.354         0.482
                  FcM          0.4           0.347         0.338
                  MI          0.396          0.365         0.497
                 MICE         0.423          0.35          0.436
              missForest      0.385          0.355         0.381
               na.approx      0.067          0.06          0.161
                DTWUMI        0.214          0.149         0.198

2%              FSMUMI        0.053          0.083         0.182
                Amelia        0.211          0.222         0.429
                  FcM         0.415          0.237         0.231
                  MI          0.218          0.228         0.435
                 MICE         0.197          0.21          0.413
              missForest       0.33          0.193         0.258
               na.approx      0.045          0.094         0.214
                DTWUMI        0.029          0.066         0.154

3%              FSMUMI        0.134          0.08          0.166
                Amelia        0.426          0.224         0.478
                  FcM         0.441          0.237         0.314
                  MI          0.354          0.221         0.476
                 MICE          0.34          0.184         0.429
              missForest      0.422          0.184         0.346
               na.approx      0.062          0.056         0.169
                DTWUMI        0.329          0.136         0.223

4%              FSMUMI        0.094          0.101         0.183
                Amelia        0.287           0.2          0.456
                  FcM         0.285          0.203         0.727
                  MI          0.277          0.204         0.444
                 MICE         0.268          0.19          0.411
              missForest       0.23          0.187         0.303
               na.approx      0.068          0.135         0.268
                DTWUMI         0.15          0.12          0.138

5%              FSMUMI        0.227          0.152         0.167
                Amelia        0.303          0.189         0.461
                  FcM         0.301          0.199         0.254
                  MI          0.249          0.123         0.448
                 MICE         0.274          0.188         0.419
              missForest      0.229          0.147         0.274
               na.approx      0.175          0.038         0.233
                DTWUMI         0.22          0.157         0.242

7.5%            FSMUMI        0.075          0.039         0.189
                Amelia        0.191          0.197         0.437
                  FcM         0.328          0.198          0.23
                  MI          0.222          0.222         0.443
                 MICE         0.206          0.208         0.437
              missForest      0.329          0.228         0.288
               na.approx      0.092          0.107         0.243
                DTWUMI        0.071          0.074         0.215

10%             FSMUMI        0.083          0.055         0.191
                Amelia         0.24          0.142         0.442
                  FcM         0.136          0.101         0.303
                  MI           0.24          0.142         0.442
                 MICE         0.152          0.077         0.383
              missForest      0.104          0.117         0.255
               na.approx      0.098          0.094         0.253
                DTWUMI        0.119          0.124         0.224
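
The six index columns used in the tables above can be illustrated with a short sketch. It assumes the definitions commonly given for these indices (Sim as a range-normalised mean closeness, R^2 as the squared Pearson correlation, FSD and FB as fractional differences of standard deviations and means, FA2 as the share of points whose ratio y/x lies in [0.5, 2]); the formulas stated earlier in the article take precedence if they differ. The function names, the use of the population standard deviation, and the handling of constant series and zero reference values are choices made for this illustration only. Here `y` is the imputed series and `x` the true one.

```python
import math
from statistics import mean, pstdev


def one_minus_sim(y, x):
    """1 - Sim, Sim being the mean of 1 / (1 + |y_i - x_i| / (max(x) - min(x)))."""
    span = max(x) - min(x)
    if span == 0:
        raise ValueError("reference series x must not be constant")
    sim = mean(1.0 / (1.0 + abs(yi - xi) / span) for yi, xi in zip(y, x))
    return 1.0 - sim


def one_minus_r2(y, x):
    """1 - squared Pearson correlation between y and x."""
    my, mx = mean(y), mean(x)
    cov = sum((yi - my) * (xi - mx) for yi, xi in zip(y, x))
    var_y = sum((yi - my) ** 2 for yi in y)
    var_x = sum((xi - mx) ** 2 for xi in x)
    if var_y == 0 or var_x == 0:
        # Assumption of this sketch: an undefined correlation counts as no agreement.
        return 1.0
    return 1.0 - cov * cov / (var_y * var_x)


def rmse(y, x):
    """Root mean square error between y and x."""
    return math.sqrt(mean((yi - xi) ** 2 for yi, xi in zip(y, x)))


def fsd(y, x):
    """Fraction of standard deviation: 2 * |SD(y) - SD(x)| / (SD(y) + SD(x))."""
    sy, sx = pstdev(y), pstdev(x)
    return 2.0 * abs(sy - sx) / (sy + sx)


def fb(y, x):
    """Fractional bias: 2 * |(mean(y) - mean(x)) / (mean(y) + mean(x))|."""
    my, mx = mean(y), mean(x)
    return 2.0 * abs((my - mx) / (my + mx))


def one_minus_fa2(y, x):
    """1 - FA2, FA2 being the share of points with 0.5 <= y_i / x_i <= 2."""
    # Assumption of this sketch: points with x_i == 0 are counted as outside the band.
    inside = sum(1 for yi, xi in zip(y, x) if xi != 0 and 0.5 <= yi / xi <= 2.0)
    return 1.0 - inside / len(x)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]  # true values (toy data)
    y = [2.0, 4.0, 6.0, 8.0]  # imputed values (toy data)
    for name, fn in [("1-Sim", one_minus_sim), ("1-R^2", one_minus_r2),
                     ("RMSE", rmse), ("FSD", fsd), ("FB", fb),
                     ("1-FA2", one_minus_fa2)]:
        print(f"{name:>6}: {fn(y, x):.3f}")
```

Under these assumed definitions all six indices are zero for a perfect imputation, and FSD approaches 2 when the imputed values are nearly constant, which would be consistent with the FcM entries close to 1.98 in Table 2.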

Table 4: Computational time of different methods on the
synthetic series, in seconds (s).

Method           1%          2%          3%          4%          5%         7.5%         10%

FSMUMI          353.9       427.5       701.9      1037.8      1423.6      2525.5      3556.8
Amelia           3.2         3.4         5.2         3.2         3.2         3.2         3.2
FcM             40.9        39.8        40.0        41.1        41.2        46.7        45.6
MI              844.1       714.0       739.1       723.3       724.5       719.7       726.5
MICE           7021.1      9187.7      21909.6     13041.9     14833.9     19417.7     23812.6
missForest     26833.8     24143.8     22969.9     32056.6     36485.8     42424.1     28521.1
na.approx       0.11        0.089       0.167       0.09        0.088       0.088       0.094
DTWUMI         5002.67     15714.8    37645.82    64669.71    86435.38   180887.78    273879
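
The running times in Table 4 are easier to compare as ratios. The sketch below copies the values from the table and computes, for each gap size, how many times longer one method takes than another, as well as how much a method's own time grows between the 1% and 10% gaps. The helper names `ratio_to` and `growth` are hypothetical and only serve this illustration.

```python
# Computational times in seconds, copied from Table 4 (synthetic series).
gap_sizes = ["1%", "2%", "3%", "4%", "5%", "7.5%", "10%"]
seconds = {
    "FSMUMI": [353.9, 427.5, 701.9, 1037.8, 1423.6, 2525.5, 3556.8],
    "Amelia": [3.2, 3.4, 5.2, 3.2, 3.2, 3.2, 3.2],
    "FcM": [40.9, 39.8, 40.0, 41.1, 41.2, 46.7, 45.6],
    "MI": [844.1, 714.0, 739.1, 723.3, 724.5, 719.7, 726.5],
    "MICE": [7021.1, 9187.7, 21909.6, 13041.9, 14833.9, 19417.7, 23812.6],
    "missForest": [26833.8, 24143.8, 22969.9, 32056.6, 36485.8, 42424.1, 28521.1],
    "na.approx": [0.11, 0.089, 0.167, 0.09, 0.088, 0.088, 0.094],
    "DTWUMI": [5002.67, 15714.8, 37645.82, 64669.71, 86435.38, 180887.78, 273879.0],
}


def ratio_to(reference, method):
    """Time of `method` divided by time of `reference`, per gap size."""
    return [m / r for m, r in zip(seconds[method], seconds[reference])]


def growth(method):
    """Factor by which a method's time grows from the 1% gap to the 10% gap."""
    times = seconds[method]
    return times[-1] / times[0]


if __name__ == "__main__":
    for gap, r in zip(gap_sizes, ratio_to("FSMUMI", "DTWUMI")):
        print(f"{gap:>5}: DTWUMI takes {r:.1f}x the time of FSMUMI")
    for method in seconds:
        print(f"{method:>10}: time grows {growth(method):.1f}x from 1% to 10%")
```

Read this way, the table shows the two similarity-based methods (FSMUMI and DTWUMI) becoming slower as the gap widens, while the times of Amelia, FcM and MI stay roughly flat across gap sizes.
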
COPYRIGHT 2018 Hindawi Limited

Article Details
Title Annotation:Research Article
Author:Phan, Thi-Thu-Hong; Bigand, Andre; Caillault, Emilie Poisson
Publication:Applied Computational Intelligence and Soft Computing
Article Type:Report
Geographic Code:1USA
Date:Jan 1, 2018
Words:11554