# Statistical Characterization, Pattern Identification, and Analysis of Big Data

1. INTRODUCTION

Data-based modeling, prediction, forecasting, and the associated decision-making are becoming indispensable in many engineering applications, business operations, the prevention and mitigation of natural disasters, and the analysis of social events [1]. In the Big Data era, extracting useful information from the sea of data and finding its general trends and basic structures are key to comprehending the world, winning business, and surviving. Statistics and probability are the foundation of Big Data analysis, and the occurrence of an event can often be appropriately described with a probabilistic distribution function.

A probabilistic distribution function roughly consists of two parts: the middle and the tails. Based on the contribution of the tails to the overall damage or impact, probabilistic events can be divided into two categories: (1) spectrum events, and (2) extreme/rare events [2]. The damage caused by the constituents of spectrum events is comparable across events and is oftentimes dominated by the mean behavior. The traditional continuous two-parameter normal, lognormal, and Weibull distribution functions are often used to describe the probabilistic distribution of spectrum events such as fatigue life. By contrast, the damage caused by extreme/rare events, such as a 100-year flood, a mega-earthquake, or a nuclear accident, is controlled by the tails. Probabilistic distribution functions such as the Pareto power law and exponential functions are more appropriate than the traditional functions for interpreting these extreme/rare events. It should be emphasized here that many phenomena in both the natural and social sciences have power law statistics (Pareto distribution). These phenomena include city sizes, incomes, word frequencies, earthquake magnitudes, and many other natural and manmade engineering events [3]. A power law distribution implies that small occurrences are extremely common, whereas large instances are extremely rare. In extreme/rare events, the very large and rare events have more significant impact and consequence than the small and common events. It should be noted that extreme/rare events are very complex. It has been found that, despite the fact that a power law models the tails of the empirical distribution well, the largest events are significantly outlying, meaning much larger or smaller than what would be expected under the power law. Such events are interpreted as "Dragon Kings" as they indicate a departure from the generic process underlying the power law [4, 5].
Extreme/rare events are difficult to understand in their nature and are traditionally treated as "Black Swan" phenomena [6]: they rarely happen, carry significant consequences when they do, but are unpredictable. However, because these phenomena bear significant consequences, a deep understanding of "Dragon Kings" behavior and substantial development of the related predictive tools are in high demand [4, 5]. These two categories of probabilistic events are different in nature. How to characterize and categorize these probabilistic events and subsequently analyze the data is critical in the engineering, natural, and social areas in which the analysis of Big Data is concerned.

In this paper, the general probabilistic description of events is given first, with emphasis on the Weibull and Pareto power law distributions. Level of measurement, which is fundamental to the quantitative assessment of probabilistic events, is also reviewed and discussed. Event categorization based on commonality/rarity and impact/effect is described subsequently, and an events quadrant is proposed for ease of categorization. To facilitate damage analysis and distinguish between extreme/rare events and spectrum events, the concept of a damage density function is proposed and some analysis results are discussed. Finally, statistical data from a wide variety of engineering, natural, and social events are analyzed to demonstrate the concepts and ideas developed in this paper.

2. CHARACTERIZATION OF PROBABILISTIC EVENTS

2.1. Description of a probabilistic event

The probabilistic behavior of an event can be simply and effectively described using the probability (P - x) diagram shown in Fig. 1. The horizontal axis in Fig. 1 represents the variable of interest, while P along the vertical axis represents the cumulative distribution function (CDF), which relates to the probability density function (PDF) p(x) through $P(x) = \int_a^x p(\xi)\,d\xi$. The lower limit a and the upper limit b are finite values for most real problems; the infinite limits $a = -\infty$ and $b = \infty$ are often a mathematical idealization.

For a sequence of n independent and random data points, there will be a corresponding discrete sequence of values of x, say $x_1, x_2, \ldots, x_n$ (ranked in ascending order). No matter what p(x) looks like, P(x) can always be estimated from the ranked data. The accuracy of the estimation depends on the sample size and the estimation model: for a given reasonably accurate model, the larger the sample size, the more accurate the estimated P(x). The simplest estimate of P(x) is i/n [7], where i is the rank order number and n is the total number of data points; this estimate obviously has some issues. For example, when i = n, the formula gives $P(x_n) = 1$, which contradicts the fact that any randomly generated data point can be close to but never reach P = 1. Several other simple formulae, such as (i - 1)/n, (i - 1/2)/n, and i/(n + 1), have similar issues [7]. Based on the cumulative binomial distribution and the median rank (MR) formula, the so-called Benard's approximation, Eq. (1), has been developed; it is now widely used in industry to estimate P(x) [7].

$P(x_i) \approx \mathrm{MR} = \dfrac{i - 0.3}{n + 0.4}$

(1)
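As a quick numerical sketch (assuming Python; the helper name is ours), Benard's approximation of Eq. (1) can be computed directly from the rank order:

```python
# Benard's median-rank approximation, Eq. (1): P(x_i) ~ (i - 0.3)/(n + 0.4).
# A minimal sketch; the function name is illustrative.
def median_ranks(n):
    """Median-rank estimates for n ranked data points (i = 1..n)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

ranks = median_ranks(5)
# Unlike the naive estimate i/n, the largest rank stays strictly below 1.
```

Note that the estimate for the largest data point never reaches 1, consistent with the discussion above.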

It should be noted that the randomly generated data, as projected onto the vertical axis P in Fig. 1, are uniformly distributed because each event has an equal opportunity of occurrence in terms of the cumulative function. Therefore, the binomial distribution can be assumed regardless of the exact forms of p(x) and P(x). It should be emphasized here that the generation of uniformly distributed random numbers is also a key step in Monte Carlo simulation for producing random numbers from any specific distribution, such as the Weibull or lognormal.
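The role of uniform random numbers in Monte Carlo generation can be sketched as follows (a minimal illustration, assuming Python; the inverse CDF of the two-parameter Weibull distribution is used as the target):

```python
import math
import random

# Inverse-transform sampling: a uniform random number u in (0, 1) is mapped
# through the inverse CDF of the target distribution. For the two-parameter
# Weibull, P(x) = 1 - exp[-(x/eta)^beta] inverts to x = eta*(-ln(1-u))^(1/beta).
def weibull_inverse_cdf(u, eta, beta):
    return eta * (-math.log(1.0 - u)) ** (1.0 / beta)

random.seed(42)
samples = [weibull_inverse_cdf(random.random(), 2.0, 1.5) for _ in range(10000)]
```

The sample median of such draws converges to the theoretical Weibull median $\eta(\ln 2)^{1/\beta}$, illustrating that the uniform draws carry all the randomness needed for any target distribution.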

The working mechanism of creating a probability plot using the median rank approach is schematically illustrated in the P - x diagram shown in Fig. 1. The distribution p(x) is determined by the location of the data points on the x axis. P(x) can be obtained by integrating p(x), theoretically, or by using the approximation formula, Eq. (1). The data horizontally projected onto the vertical axis from the P(x) curve are uniformly distributed. In other words, corresponding to the uniformly distributed data points on the vertical axis P, the intervals between adjacent points on the x axis are unequally spaced: dense near the peak of the p(x) curve and coarser as the data points move further from the peak.

2.2. Level of Measurement

The exact forms of P(x) and p(x) also depend upon the level (unit) and the scale of the x axis. Level of measurement, or scale of measure, is a classification that describes the nature of the information within the numbers assigned to variables. Some physical quantities, such as cycles to failure, energy, and city population, can be used directly. Units in log or exponential form and other objective and subjective measures are also very common. In addition to the precise probabilistic descriptions illustrated in Section 2.1, classifications with 3-, 4-, 5-, 10-, 12-, or 100-level measures are frequently used for ease of communication and simplicity. For example, the Beaufort wind force scale (0-12) is an empirical measure that relates wind speed to observed conditions at sea or on land. The Richter magnitude scale assigns a magnitude number to quantify the energy or moment released by an earthquake; it is a base-10 logarithmic scale, which defines magnitude as the logarithm of the ratio of the amplitude of the seismic waves to an arbitrary, minor amplitude. Earthquakes are also classified as moderate if their magnitude is in the range 5-5.9, strong in the range 6-6.9, major in the range 7-7.9, and great if the magnitude is 8 or larger [8]. The decibel (dB) is a logarithmic unit used to express the ratio of two values of a physical quantity, often power or intensity. The Numeric Rating Scale (NRS-11) is an 11-point scale for patient self-reporting of pain. FICO credit scores are designed to measure the risk of default by taking into account various factors in a person's financial history; the generic or classic FICO score is between 300 and 850.

There are other measures that use no numbering system but instead colors and graphic symbols (ideograms) to represent severity level. For example, Consumer Reports [9] uses five modified Harvey Balls to assess the reliability performance of products, in order from better to worse: Open-Red-Circle [??], Half-Red-Circle [??], Open-Black-Circle [??], Half-Black-Circle [??], and Filled-Black-Circle [??]. J.D. Power [10] uses Power Circle ratings of five, four, three, and two circles: Among-The-Best [??], Better-Than-Most [??], Above-Average [??], and The-Rest [??]. J.D. Power does not publish ratings lower than two Power Circles. A similar example is the US Homeland Security Advisory System, a color-coded terrorism threat advisory scale: Green (Low), Blue (Guarded), Yellow (Elevated), Orange (High), and Red (Severe). Clearly, different measures of the variables result in different statistical characterization methods and associated probabilistic distributions. Therefore, the empirical interpretation of probabilistic behavior must be consistently based on a given level of measurement.

2.3. Two Types of Probabilistic Distribution Functions

The commonly used Weibull distribution for spectrum events and the Pareto distribution for extreme/rare events are presented below for events that can be modeled with a continuous probabilistic distribution function.

2.3.1. Weibull Distribution Functions

In many engineering applications, Weibull distribution functions are often used to describe a spectrum event, with which all of the spectra under normal conditions are evaluated so that all major contributions can be properly considered [1]. These spectra include the loads experienced by components in nuclear power plants, ground vehicles, and the landing gear of aircraft. The three-parameter Weibull distribution, Eq. (2), is such a probabilistic distribution function for describing spectrum events.

$p(x) = \dfrac{\beta}{\eta}\left(\dfrac{x - a}{\eta}\right)^{\beta - 1}\exp\left[-\left(\dfrac{x - a}{\eta}\right)^{\beta}\right]$

$(2)_1$

$P(x) = 1 - \exp\left[-\left(\dfrac{x - a}{\eta}\right)^{\beta}\right]; \quad 0 \le a \le x < \infty,\ \eta, \beta > 0$

$(2)_2$

When a = 0 in Eq. (2), the three-parameter Weibull distribution reduces to the conventional two-parameter Weibull distribution. Both the three-parameter and two-parameter Weibull functions have been implemented in commercial software, such as Minitab [11]; therefore, a detailed analysis is not provided here.
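As an illustration of such software-based fitting (a sketch assuming Python with SciPy available, not the commercial tools named above; in `scipy.stats.weibull_min`, `c` is the shape $\beta$, `scale` is $\eta$, and `loc` is the threshold a):

```python
import numpy as np
from scipy.stats import weibull_min

# Synthetic data from a three-parameter Weibull (beta = 2, a = 5, eta = 10)
rng = np.random.default_rng(7)
data = weibull_min.rvs(c=2.0, loc=5.0, scale=10.0, size=3000, random_state=rng)

# Three-parameter fit: shape, threshold (loc), and scale all free
c3, a3, eta3 = weibull_min.fit(data)

# Two-parameter fit: threshold fixed at a = 0
c2, a2, eta2 = weibull_min.fit(data, floc=0.0)
```

With a nonzero true threshold, the three-parameter fit recovers parameters close to the generating values, while the constrained two-parameter fit compensates with a distorted shape parameter.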

2.3.2. Pareto (Power Law) Distribution Function

When dealing with crises and extremes, power law tails are "normal" [3]. The unique property of power laws is that they are scale-invariant (self-similar, fractal). This property implies that all events, both large and small, are generated by the same mechanism. A continuous variable with a power law distribution has a probability p(x)dx of taking a value in the interval from x to x + dx, where $p(x) = H x^{-\alpha}$ with $\alpha > 0$ [3]. The constant H is given by the normalization requirement $\int_{x_{\min}}^{\infty} p(x)\,dx = 1$ (convergence requires $\alpha > 1$), which yields $H = (\alpha - 1)\,x_{\min}^{\alpha - 1}$. Here $x_{\min}$ is the lower bound of the distribution, above which the power law is obeyed. Finally, the Pareto PDF and CDF can be expressed as in Eq. (3).

$p(x) = \dfrac{\alpha - 1}{x_{\min}}\left(\dfrac{x}{x_{\min}}\right)^{-\alpha}$

$(3)_1$

$P(x) = 1 - \left(\dfrac{x_{\min}}{x}\right)^{\alpha - 1}$

$(3)_2$

The fit parameter $\alpha$ of the Pareto distribution can be estimated by the least squares method or the maximum likelihood method. However, results show that the maximum likelihood method provides better data correlation than the least squares method [3]. The log-likelihood function is shown in Eq. (4).

$L(\alpha) = \ln\prod_{i=1}^{n} p(x_i) = n\ln(\alpha - 1) - n\ln x_{\min} - \alpha\sum_{i=1}^{n}\ln\dfrac{x_i}{x_{\min}}$

(4)

The maximum of the likelihood is found by differentiating $L(\alpha)$ in Eq. (4) with respect to the parameter $\alpha$ and setting the result equal to zero. Upon rearrangement, this yields the estimator equation

$\hat{\alpha} = 1 + n\left[\sum_{i=1}^{n}\ln\dfrac{x_i}{x_{\min}}\right]^{-1}$

(5)

where $x_i$ (i = 1, ..., n) are the n data points with $x_i \ge x_{\min}$. In practical situations $x_{\min}$ usually corresponds not to the smallest value of x measured but to the smallest value for which the power law behavior holds [3].
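The estimator of Eq. (5) can be implemented in a few lines (a minimal sketch, assuming Python; the function name is ours):

```python
import math
import random

# Maximum-likelihood estimate of the Pareto exponent, Eq. (5):
# alpha_hat = 1 + n * [sum(ln(x_i / x_min))]^(-1), using only x_i >= x_min.
def pareto_alpha_mle(xs, x_min):
    tail = [x for x in xs if x >= x_min]
    n = len(tail)
    return 1.0 + n / sum(math.log(x / x_min) for x in tail)

# Sanity check on synthetic Pareto data drawn by inverse-transform sampling
random.seed(3)
alpha_true, x_min = 2.5, 1.0
xs = [x_min * (1.0 - random.random()) ** (-1.0 / (alpha_true - 1.0))
      for _ in range(20000)]
alpha_hat = pareto_alpha_mle(xs, x_min)
```

On synthetic data generated with a known exponent, the estimate converges to the true value as the sample grows.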

Identifying power law behavior is important, but the process can be tricky [3]. The standard strategy makes use of the fact that a histogram of a quantity with a power law distribution appears as a straight line when plotted on logarithmic scales. However, bin size selection is always an issue for this treatment, and the resulting noise can hinder data interpretation.

Another, and in many ways superior, method of plotting the data is to calculate a CDF [3]. Instead of plotting a histogram of the data, one plots the complementary CDF (C-CDF or CP), the probability that the variable takes a value greater than or equal to x: $CP(x) = \int_x^{\infty} p(x')\,dx'$. If the distribution follows a power law, then

$CP(x) = \dfrac{H}{\alpha - 1}\, x^{-(\alpha - 1)}$

(6)

Thus, the C-CDF also follows a power law, but with an exponent of $\alpha - 1$, so it appears as a straight line on logarithmic scales with a shallower slope if the power law holds [3]. This can be used as a check of whether a distribution follows a power law. Notice that there is no need to bin the data at all to calculate CP(x): by its definition, CP(x) is well defined for every value of x and so can be plotted without binning.
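The binning-free C-CDF can be computed directly from the ranked data (a minimal sketch, assuming Python):

```python
# Empirical complementary CDF: for each sorted value x_i, CP(x_i) is the
# fraction of observations >= x_i. Plotting these pairs on log-log axes
# should give a straight line of slope -(alpha - 1) under a power law.
def empirical_ccdf(xs):
    xs = sorted(xs)
    n = len(xs)
    return [(x, (n - i) / n) for i, x in enumerate(xs)]

pairs = empirical_ccdf([3.0, 1.0, 2.0])
```

Every data point contributes its own plotted point, so no information is lost to bin-edge choices.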

2.4. Events Quadrant

In the natural, social, and engineering world, some events may be very common, such as the low-amplitude fatigue vibration experienced by a vehicle and low-amplitude earthquakes, but their impact may not be remarkable, whereas other events may be rare but their consequences significant, such as a mega-earthquake. Based on the combination of commonality/rarity and impact/effect, an event can be roughly categorized into one of the four quadrants of the events quadrant shown in Fig. 2.

The first plot in Fig. 2(a) schematically shows a monotonically decreasing probabilistic function, which could be of Pareto type. Common events thus occur on the left side and rare events on the right side. It is assumed that the curve of damage per event (D/E) shown in the second plot of Fig. 2(a) is a monotonically increasing function of the magnitude, which is often the case for fatigue and earthquakes. Clearly, an event on the right side of the D/E curve has a higher impact than one on the left side. Fig. 2(b) shows four typical combinations of the events in terms of total possible damage (D): Quadrant-I: Rare-High Impact; Quadrant-II: Rare-Low Impact; Quadrant-III: Common-Low Impact; and Quadrant-IV: Common-High Impact.

2.5. Event Characterization with Damage Density Function

How to quantitatively discriminate between spectrum events and extreme/rare events is an interesting and important topic. In this paper, a damage density function is developed as a parameter to characterize the two types of events. The damage density function is proposed and expressed in Eq. (7)

$g(S) = p(S)\,D(S)$

(7)

g(S) in Eq. (7) consists of two parts: (1) the occurrence probability p(S) at level S (earthquake amplitude, for example), and (2) the damage D(S) done at level S. Clearly, g(S) measures the damage probability per event at level S, hence the name damage density function. The total damage E[D], obtained by integrating g(S) over all levels, is an important parameter and is valid as long as all of the damaging events are properly considered. However, E[D] alone cannot determine which events dominate and which are negligible in terms of damage. In these cases, the damage density function g(S), Eq. (7), provides some unique advantages. Several examples are provided below to demonstrate the properties and effectiveness of g(S).

Without loss of generality, the damage per event (D/E) can be written as Eq. (8).

$D(S) = (S/C)^{m}$

(8)

Eq. (8) can be derived from a power law relationship between load amplitude and characteristic life or damage. A specific example is the fatigue S-N curve [2], Eq. (9).

$S = C\,N^{-1/m}$

(9)

Eq. (8) can be obtained from Eq. (9) with the linear damage assumption D = 1/N. Three different p(S) are considered to demonstrate the power of the damage density function g(S): (1) the Pareto power law, (2) the exponential distribution, and (3) the two-parameter Weibull distribution.

1. Pareto Power-Law

For the Pareto power law distribution function, Eq. $(3)_1$, the damage density function of Eq. (7), as a function of stress amplitude or range, can be derived with the help of Eq. (8); the formula is expressed in Eq. (10).

$g(S) = \dfrac{\alpha - 1}{S_{\min}}\left(\dfrac{S}{S_{\min}}\right)^{-\alpha}\left(\dfrac{S}{C}\right)^{m}$

(10)

Clearly, there are three scenarios for the Pareto power law occurrence probability p(S): (1) the damage density increases with S when $m > \alpha$, (2) it decreases with S when $m < \alpha$, and (3) it is a constant, $g(S) = (\alpha - 1)\,S_{\min}^{\alpha - 1}/C^{\alpha}$, when $m = \alpha$. The implications of these observations are: for $m > \alpha$, the right tail dominates the contribution to damage; similarly, for $m < \alpha$, the left tail plays the dominant role. These two cases generally represent extreme/rare events. For $m = \alpha$, the damage contributions from all constituents are comparable, so the event belongs to the category of spectrum events. This example clearly demonstrates the capability of the damage density function g(S) to distinguish between extreme/rare events and spectrum events.
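The three scenarios can be checked numerically (a sketch, assuming Python; the parameter values are illustrative):

```python
# Damage density g(S) = p(S) * D(S) under the Pareto law, Eq. (10).
# With S_min = C = 1, g(S) is proportional to S^(m - alpha): it increases
# with S for m > alpha, decreases for m < alpha, and is flat for m = alpha.
def pareto_damage_density(S, alpha, m, S_min=1.0, C=1.0):
    p = (alpha - 1.0) / S_min * (S / S_min) ** (-alpha)
    return p * (S / C) ** m

flat_lo = pareto_damage_density(2.0, alpha=2.0, m=2.0)
flat_hi = pareto_damage_density(8.0, alpha=2.0, m=2.0)   # equal: m = alpha
rising = pareto_damage_density(8.0, alpha=2.0, m=3.0)    # right tail dominates
```

For m = alpha the damage density is identical at every level, while for m > alpha it grows with S, confirming the tail-dominance argument.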

2. Exponential Distribution Function

For the exponential distribution function, Eq. (11), the damage density function can be expressed as Eq. (12).

$p(S) = \theta\exp(-\theta S)$

(11)

$g(S) = \dfrac{\theta}{C^{m}}\, S^{m}\exp(-\theta S)$

(12)

Setting the derivative of g(S) with respect to S to zero, $\partial g(S)/\partial S = 0$, leads to $S = m/\theta$. Differentiating once more gives $\partial^{2} g(S)/\partial S^{2} = -(m\theta S^{m-2}/C^{m})\exp(-\theta S) < 0$ at $S = m/\theta$. Therefore, the damage density has a maximum at $S = m/\theta$ and tails off on both the left and right sides of the distribution. In this case, the damage is dominated by the middle part of the probabilistic distribution function while the contributions from the tails are smaller; therefore, the event belongs to the category of spectrum events.
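The location of the maximum at $S = m/\theta$ can be verified numerically (a sketch, assuming Python; the parameter values are illustrative):

```python
import math

# Damage density for an exponential spectrum, Eq. (12):
# g(S) = (theta / C^m) * S^m * exp(-theta*S), with its maximum at S = m/theta.
def exp_damage_density(S, theta, m, C=1.0):
    return theta / C ** m * S ** m * math.exp(-theta * S)

theta, m = 0.5, 3.0
S_peak = m / theta  # predicted peak location: 6.0
grid = [0.01 * k for k in range(1, 3001)]
S_num = max(grid, key=lambda s: exp_damage_density(s, theta, m))
```

The grid search recovers the analytical peak location, with the density falling away on both sides as the derivation predicts.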

3. Two-Parameter Weibull Distribution Function

For the two-parameter Weibull distribution, Eq. $(2)_1$ with a = 0, the damage density function can be expressed as Eq. (13).

$g(S) = \dfrac{\beta}{\eta}\left(\dfrac{S}{\eta}\right)^{\beta - 1}\exp\left[-\left(\dfrac{S}{\eta}\right)^{\beta}\right]\left(\dfrac{S}{C}\right)^{m}$

(13)

Setting $\partial g(S)/\partial S = 0$ leads to $S = \eta\left[(m + \beta - 1)/\beta\right]^{1/\beta}$. Taking the derivative of g(S) twice shows that $\partial^{2} g(S)/\partial S^{2} < 0$ at $S = \eta\left[(m + \beta - 1)/\beta\right]^{1/\beta}$ for $m > 1 - \beta$, which is true for most applications, so the damage density has a maximum there.

When $\beta = 1$ and a = 0, Eq. (2) reduces to Eq. (11), the exponential distribution. When $\beta = 2$ and a = 0, Eq. (2) is the Rayleigh distribution. The analysis of the exponential and two-parameter Weibull distributions shows that the dominant damage events are located in the middle of the distributions, so the load profiles at the left and right tails do not dominate and can be ignored. By contrast, the Pareto power law often leads to extreme/rare events in which the tails make the dominant contributions.
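The Weibull-case peak location $S^* = \eta[(m + \beta - 1)/\beta]^{1/\beta}$ can be checked the same way (a sketch, assuming Python; the parameter values are illustrative):

```python
import math

# Damage density for a two-parameter Weibull spectrum, Eq. (13), whose
# maximum is predicted at S* = eta * ((m + beta - 1)/beta)^(1/beta).
def weibull_damage_density(S, eta, beta, m, C=1.0):
    p = (beta / eta) * (S / eta) ** (beta - 1.0) * math.exp(-((S / eta) ** beta))
    return p * (S / C) ** m

eta, beta, m = 2.0, 2.0, 3.0
S_star = eta * ((m + beta - 1.0) / beta) ** (1.0 / beta)  # = 2*sqrt(2)
grid = [0.005 * k for k in range(1, 3001)]
S_num = max(grid, key=lambda s: weibull_damage_density(s, eta, beta, m))
```

The numerically located maximum agrees with the closed-form peak, confirming that the damage is concentrated in the middle of the distribution rather than in the tails.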

3. CASE STUDIES AND THE RESULTS

The concepts developed in the previous sections will be demonstrated in this section with the following examples from a wide variety of engineering, natural, and social events:

1. Vehicle reliability survey results, consisting of Consumer Reports' reliability survey, J.D. Power's dependability survey results, and Consumer Reports' vehicle road test scores. The purpose of these examples is to show that in many applications a sophisticated probabilistic description may not be convenient; instead, coarser levels of measurement, such as the modified Harvey Balls, Power Circles, and other ideograms for the visual communication of qualitative information, are often used. A discrete score is a more quantitative way to describe a probabilistic event.

2. Warranty analysis, prediction, and forecasting. In this example, the three-parameter and two-parameter Weibull distribution functions are used to describe the warranty of a product, which can be considered as a spectrum event.

3. Salary rate of faculty and staff of the University of Michigan at its three campuses.

4. The population distribution of cities in three countries: China, the USA, and the UK.

5. The distribution of earthquakes worldwide and in the conterminous USA.

The last three examples demonstrate that the distributions of salaries, city populations, and earthquakes can be treated as extreme/rare events, which can be appropriately described using the Pareto power law.

3.1. Case-I: Vehicle Reliability and Dependability

3.1.1. Consumer Reports' Reliability Survey

The Consumer Reports (CR) survey [9] on vehicle reliability is the largest of its kind. CR subscribers report (magazine and website) on any serious problems they had with their vehicles during the past 12 months in any of 17 trouble spots. There were 740,000 responses to the 2015 Annual Auto Survey (Web subscribers) and 1.1 million responses to the 2014 survey (web + magazine subscribers). Table 1 shows the CR reliability history of a Ford F-150 model for five model years. Eight trouble spots out of the total of 17 are listed in Table 1. The modified Harvey Balls are used in the reliability assessment. Table 1 shows that the reliability of Engine Major and Exhaust is stable and high for all five model years; the Drive System has continuously improved over the five model years, from low to high; and the Audio System has followed a wavy path of ups and downs. More information about the application of these reliability assessment methods to other vehicles and programs can be found in [9].

3.1.2. J.D. Power's Dependability Survey

The 2016 U.S. Vehicle Dependability Study [10] is based on responses from 33,560 original owners of 2013 model-year vehicles after three years of ownership. The study covers 177 specific problem symptoms grouped into eight major vehicle categories: Exterior, Engine/Transmission, Audio/Communication/Entertainment/Navigation (ACEN), Interior, The Driving Experience, Features/Controls/Displays (FCD), Heating/Ventilation/Air Conditioning (HVAC), and Seats. Dependability is a measure of a system's availability, reliability, maintainability, and maintenance support performance, and, in some cases, other characteristics such as durability, safety, and security.

Table 2 provides the J.D. Power 2016 U.S. Vehicle Dependability Study (VDS) results on Powertrain Dependability by Make. Power Circles are used to assess the dependability of each make; Lexus is the only make in the Among-The-Best category. The J.D. Power website provides more information about the survey results [10].

3.1.3. Consumer Reports' Vehicle Road Test Score

The road test scores of more than 270 vehicles are collected and reported in Consumer Reports' comprehensive test program [9]. The scores are presented on a 100-point scale and are based on the results of more than 50 individual tests and evaluations, including performance, comfort and convenience, fuel economy, and more [9]. For performance assessment, the vehicles are divided by category and ranked according to their overall test scores. The categories are: electric cars/plug-in hybrids, subcompact cars, small 2-door cars, compact cars, midsized cars, large cars, luxury compact cars, luxury convertibles, luxury midsized cars, ultra-luxury cars, sports cars, wagons (all-wheel drive), small SUVs, midsized SUVs, large SUVs, luxury compact SUVs, pickup trucks, etc. [9]. Some models are included in multiple categories, as appropriate. Fig. 3 shows the histogram of the road test scores of all of the vehicles. A histogram divides the sample values into intervals called bins; bars represent the number of observations falling within each bin (its frequency).

From Fig. 3 it is seen that most vehicles fall in the score range [60, 85], resembling a normal distribution, but the distribution is bounded on both the right and left sides. It should be noted that each subcategory may show a different pattern; however, this paper does not attempt to investigate the subcategories because of the very limited number of vehicle models in each. Overall, Fig. 3 provides the general features and patterns of the distribution of the vehicles as well as the bounds of the scores of the tested vehicles. The measurement system is bounded by 100 on the right side, while the rest of the data tail off to the left. The best vehicle according to the scores is the Tesla Model S P85D (100), in both the electric cars/plug-in hybrids and ultra-luxury cars categories, followed by the BMW 750i xDrive (99) and the BMW M235i (98). The vehicle with the lowest score is the Jeep Wrangler Limited (20), followed by the Mitsubishi Mirage ES (29). Even though 20 is the lowest value in the ranking, it does not mean it is the lower bound of the distribution. First, a test score could be worse. Second, a vehicle must satisfy some basic functions and therefore could not score 0; this primarily depends on how the scoring system is predetermined. Furthermore, no quantitative probabilistic fit is attempted for this set of data because the scores contain subjective measures, such as comfort, which are tester dependent.

3.2. Warranty Forecasting of an LCD Product

Warranty data for an LCD product collected from January 2009, as published in the literature [12], are reanalyzed here to demonstrate the importance of the probabilistic distribution function in warranty forecasting and warranty policy making. The monthly sales data are tabulated in Table 3, which provides detailed information about how many products were sold in each month and how many of the products sold in a given month failed over time. For example, 0, 9, and 33 products sold in January-09 failed in January-09, February-09, and March-09, respectively. The 33 units of January-09 sales that failed in March-09 constitute 0.32% of the January-09 sales. These warranty data were traced only over a limited number of months; for example, the January-09 failure data were traced until March-10, and beyond the 15-month period the failures were right censored. The warranty data were previously analyzed with the two-parameter Weibull distribution function [12]. In this paper, the three-parameter Weibull distribution, Eq. (2), is used to reanalyze the data, with emphasis on the importance of the probabilistic model to warranty prediction and forecasting.

The values of the fit parameters of the two-parameter (taken from [12]) and three-parameter Weibull functions are listed in Table 4. The comparison of the fit results is shown in Fig. 4 for the January sales. Clearly, compared to the two-parameter Weibull distribution, the three-parameter Weibull distribution provides overall better data correlation and goodness of fit for the January sales shown in Fig. 4(a) and (b), as judged by visual examination. It should be noted that a goodness-of-fit measure, such as the Anderson-Darling (AD) statistic, would help compare the two- and three-parameter Weibull distributions. However, this is not done here because AD statistical analysis of a single data set can be misleading for data with small and varying sample sizes. This point can be understood from the fact that, compared to a linear equation, a polynomial equation yields a smaller AD statistic (theoretically, the smaller the AD statistic, the better the fit) for a data set with a small sample size even when the underlying relationship is linear.

The differences made by the two probabilistic distribution models can be demonstrated with the following simple examples. For the product sold in January, over 5 months the accumulated failure percentage by the two-parameter Weibull distribution is 0.82%, whereas the three-parameter Weibull value is 1.21%, which is closer to the observed warranty percentage of 1.29%. The relative difference between the two predictions is about 30%. The most important use of the warranty data is long-term forecasting. For 50 months, the two-parameter model predicts 52.24% failures, whereas the three-parameter model predicts only 22.10%. This difference is significant for long-term prediction and forecasting. The results of these two examples and their comparison are listed in Table 5. It should be noted that, since the warranty data beyond 50 months of sales for the product sold in January were not traced, which distribution function is more accurate and reliable is essentially unknown. However, there is no doubt that the selection of the probabilistic reliability model is critical in both short-term and long-term warranty prediction and forecasting.
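To make such divergence concrete, the two models can be compared at short and long horizons. The parameter values below are hypothetical stand-ins (the actual fitted values are those in Table 4, not reproduced here); the point is only that two models agreeing reasonably within the traced window can diverge sharply at 50 months:

```python
import math

# Weibull CDF (accumulated failure fraction) with optional threshold a.
def weibull_cdf(t, beta, eta, a=0.0):
    if t <= a:
        return 0.0
    return 1.0 - math.exp(-(((t - a) / eta) ** beta))

# Hypothetical parameter sets (NOT the paper's Table 4 values)
two_p = dict(beta=1.1, eta=60.0)             # two-parameter fit (a = 0)
three_p = dict(beta=0.6, eta=400.0, a=0.5)   # three-parameter fit

f2_50 = weibull_cdf(50.0, **two_p)    # long-horizon forecast, model 1
f3_50 = weibull_cdf(50.0, **three_p)  # long-horizon forecast, model 2
```

With these illustrative parameters, the two-parameter model forecasts more than twice the 50-month failure fraction of the three-parameter model, mirroring the kind of gap reported above.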

Based on the warranty data and the reliability models, the accumulated failures that could occur within the current warranty period can be forecasted. These failure forecasts would also allow manufacturers to decide whether to extend the warranty length or to change strategic marketing plans.

3.3. The Salary Rate of Faculty and Staff of the University of Michigan (UM)

The Salary Record report for individuals on each of the three campuses (UM-Ann Arbor, UM-Dearborn, and UM-Flint) includes all regular faculty and staff who have full-time appointments, which may cover 12 months, 8 months, or other periods. The report [13] does not include graduate students, other student intern appointments, appointments without effort, overload appointments, or temporary staff appointments. Also excluded are monies that may be received by some University staff, such as summer teaching and/or summer research, overload appointments, extension teaching, awards for distinguished professors, incentive payments, clinical service, overtime, and temporary administrative differentials. The report indicates that at the UM-Ann Arbor campus the highest salary is $871,250 and the lowest is $23,315, a ratio of 37.4. Similarly, at UM-Dearborn the highest salary is $348,798 and the lowest is $28,970, a ratio of 12.0. Finally, at UM-Flint the highest salary is $309,000 and the lowest is $25,103, a ratio of 12.3. The histogram of the salaries of the UM-Ann Arbor faculty and staff with full-time appointments is shown in Fig. 5. The total number of faculty and staff is 31,277. Clearly, the distribution shown in Fig. 5 does not follow the traditional normal or Weibull distributions. First, there is a threshold on the left side, the lowest salary of $23,315. In Fig. 5, the bars for salaries above $330,000 are barely discernible because of the significant decreasing trend toward the higher salary side. The concentration of people on the left side of the data pattern suggests that a Pareto power law would be applicable to this data set.

In order to test whether the Pareto power law distribution fits the data well, the complementary cumulative distribution function (C-CDF), Eq. (6), as a function of salary is shown in Fig. 6 in a log-log plot. In the calculation, [x.sub.min] is set as \$80000. The data of UM-Dearborn and UM-Flint show decent linear behaviors. By contrast, the salary of UM-Ann Arbor shows overall good linearity, but deviates from linearity at the right tail (high salary side) as well as in the middle. Overall, the Pareto law can be used to fit the data, and the plots are shown in Fig. 7. The median rank results obtained using Benard's approximation, Eq. (1), and the predicted results using Eq. [(3).sub.1] match very well. The values of the fit parameter [alpha] and the related information are listed in Table 6.

3.4. City Population

The population distribution of cities in three countries, China, the USA, and the UK, is analyzed here. The data are taken from the website [14], which has compiled a large amount of data from various sources. The histograms of the city population of the three countries are shown in Fig. 8(a), (b), and (c), respectively. Clearly, small cities are far more numerous than large cities. The city population of China was reported as of November 1, 2010; the largest city is Shanghai (20,217,748), followed by Beijing (16,446,857). The city population of the USA was reported as of July 1, 2015; the largest city is New York (8,009,185), followed by Los Angeles (3,485,398). The city population of the UK was reported as of June 30, 2011; the largest city is London (8,618,552), followed by Birmingham (1,115,791). In order to test whether the Pareto power law distribution fits the data well, the complementary cumulative distribution functions (C-CDF), Eq. (6), as a function of city population are shown in Fig. 9 in a log-log plot. The data of the USA show a decent linear behavior. By contrast, the city populations of China and, in particular, the UK show good linearity for small city populations but deviate from linearity for large city populations. Overall, the Pareto law can be used to fit the data, and the plots are shown in Fig. 10. The values of the fit parameter [alpha] and the related information are listed in Table 7. The median rank results obtained using Benard's approximation, Eq. (1), and the predicted results match very well.
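The log-log C-CDF plots of Figs. 6, 9, and 12 are built from the empirical complementary cumulative distribution. A minimal sketch, using an illustrative sample rather than the data set of [14]:

```python
def empirical_ccdf(data):
    """Empirical complementary CDF: for each distinct value x in the
    sample, the fraction of observations that are >= x."""
    xs = sorted(data)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        if i == 0 or x != xs[i - 1]:          # one point per distinct x
            points.append((x, (n - i) / n))   # n - i observations are >= x
    return points

# Illustrative city-size sample (hypothetical values, not from [14]).
pops = [60000, 75000, 90000, 120000, 250000, 8000000]
for x, p in empirical_ccdf(pops):
    print(x, p)

# On log-log axes, a straight line of slope -(alpha - 1) indicates a
# Pareto tail; a "bending down" at large x signals deviations such as
# the Dragon King events discussed below.
```

The median-rank and fitted-line comparisons in Figs. 7, 10, and 13 are overlaid on exactly these empirical points.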

3.5. Earthquakes

Worldwide earthquakes with magnitude 6.5 or above and conterminous USA earthquakes with magnitude 4 or above are analyzed here. The conterminous U.S. refers to a rectangular region covering the lower 48 states and some surrounding areas outside the U.S. The data are taken from the website [15]. The worldwide earthquakes occurred between January 1, 1900 and July 17, 2016; the conterminous USA earthquakes occurred between January 1, 1980 and July 17, 2016. The largest recorded earthquake worldwide during this time period is the one that occurred in Chile in 1960, with magnitude 9.5. The histograms of the two earthquake data sets are shown in Fig. 11(a) and (c), respectively. Fig. 11(b) is the same as Fig. 11(a) but starts from magnitude 8.0, so that the events too rare to be visible in Fig. 11(a) can be revealed. Similar to the city population, small earthquakes occur far more frequently than large ones. In order to test whether the Pareto power law distribution fits the data well, the complementary cumulative distribution functions (C-CDF) as a function of earthquake magnitude are shown in Fig. 12 in a log-log plot. The worldwide earthquake data show a good linear behavior, but the conterminous USA data show a clear "bending down" phenomenon. The Pareto law is used to fit the data, and the plots are shown in Fig. 13. The values of the fit parameter [alpha] and related information are listed in Table 8.

5. DISCUSSION

The earthquakes in the conterminous USA and the city populations in the UK and China show the "Dragon King" phenomenon. Uncertainty is certainly higher for extreme events such as mega cities and large earthquakes. However, the uncertainty in the extremely large events cannot completely conceal the general trends, as shown in Fig. 9 and Fig. 12. In order to empirically describe the "Dragon King" phenomenon, the power law distribution should be extended by introducing more parameters, which could provide not only a better fit to the observed data but also insight into the fundamental underlying mechanisms. For example, the size of a mega city depends strongly on location, climate, geopolitical conditions, and policy. Additionally, the global occurrence of magnitude 9 earthquakes is 1-3 per century [16, 17], so a longer time is required to collect reliable data, which limits the capability to predict and forecast extreme/rare events. Furthermore, the salary data of UM-Ann Arbor show not only the "Dragon King" phenomenon at the right tail but also a bumped "Dragon Body" in the salary range of \$150000-\$220000. These phenomena must be related to the structure of the institution and many other details. Finally, the cumulative distribution function is not visually sensitive to "Dragon King" events: the CDF generally provides satisfactory visual fits for almost all of the data studied, even though the C-CDF plots show clear "Dragon King" phenomena. Therefore, other measures, such as the probability density function, should be used as complementary tools.

6. CONCLUSIONS

1. Probabilistic events can be categorized into spectrum events and extreme/rare events. The mean behavior of data can be described with the conventional two-parameter Weibull distribution. Extreme and rare events are located at either the left tail or right tail of a probabilistic distribution and can often be modeled with the Pareto distribution.

2. Based on the combination of commonality/rarity and impact/effect, probabilistic events can be categorized into four events quadrants: (a) Quadrant-I: Rare-High Impact, (b) Quadrant-II: Rare-Low Impact, (c) Quadrant-III: Common-Low Impact, and (d) Quadrant-IV: Common-High Impact.

3. A new damage density function is proposed to differentiate the spectrum events from the extreme/rare events, and several examples have demonstrated its capability in discriminating between the two types of events.

4. Several case studies, i.e., vehicle reliability, vehicle road test scores, warranty, city population in three countries, and earthquakes worldwide and in the conterminous USA, have been provided to demonstrate the concepts and procedures developed in this paper.

REFERENCES

(1.) NASA/SP-2011-3421, Probabilistic Risk Assessment Procedures Guide for NASA Managers and Practitioners, Second Edition, Editors: Stamatelatos M. and Dezfuli H. (2011).

(2.) Wei, Z., Gao, L., Luo, L., Nikbin, K., Modeling and analysis of extreme/rare events and spectrum events, Tenneco Internal Report, 2016.

(3.) Newman, M.E.J., Power laws, Pareto distributions and Zipf's law, Contemporary Physics, 2005, 46, 323-351.

(4.) Pisarenko, V. F., Sornette, D., Robust statistical tests of Dragon-Kings beyond power law distributions, The European Physical Journal Special Topics, 2012, 205, 95-115.

(5.) Janczura, J., Weron, R., Black Swans or Dragon-Kings? A simple test for deviation from the power law, The European Physical Journal Special Topics, 2012, 205, 79-93.

(6.) Taleb, N. N., The Black Swan: The Impact of the Highly Improbable, Penguin, London, UK, 2007.

(7.) Benard, A., Bos-Levenbach, E. C., The plotting of observations on probability paper, Statistica Neerlandica, 1953, 7, 163-173.

(8.) Gudmundsson, A., Elastic energy release in great earthquakes and eruptions, Frontiers in Earth Science, 2014, 2, Doi:10.3389/feart.2014.00010.

(9.) http://www.consumerreports.org/cro/index.htm

(10.) http://www.jdpower.com/

(11.) Ryan, B., Joiner, B., Cryer, J., Minitab Handbook: Updated for Release 16, Sixth Edition, Boston, MA, USA, 2012.

(12.) Gurel, U., Cakmakci, M., Impact of reliability on warranty: a study of application in a large size company of electronics industry, Measurement, 2013, 46, 1297-1310.

(13.) https://hr.umich.edu/sites/default/files/salary-disclosure-2015.pdf

(14.) www.citypopulation.de

(15.) http://earthquake.usgs.gov/earthquakes/

(16.) McCaffrey, R., Global frequency of magnitude 9 Earthquakes, Geology, 2008, 36, 263-266.

(17.) McCaffrey, R., The next great earthquake, Science, 2007, 315, 1675-1676.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of SAE International.

Positions and opinions advanced in this paper are those of the author(s) and not necessarily those of SAE International. The author is solely responsible for the content of the paper.

Zhigang Wei

Tenneco Inc.

Kamran Nikbin

Imperial College London

doi:10.4271/2017-01-0236
```
Table 3. Failures along months since the sales in January 2009
(total sales: 10243 units).

Month     Failure     Cumulative failure
          quantity    percentage (%)

Jan-09        0           0
Feb-09        9           0.09
Mar-09       33           0.41
Apr-09       47           0.87
May-09       43           1.29
Jun-09       31           1.59
Jul-09       47           2.05
Aug-09       29           2.33
Sep-09       46           2.78
Oct-09       47           3.24
Nov-09       64           3.87
Dec-09       37           4.23
Jan-10       58           4.79
Feb-10       58           5.36
Mar-10       42           5.77
Total       591           5.77

Table 4. The values of the fit parameters in the Weibull distribution
functions.

Distribution function   Parameter     Fit value

2-P Weibull             [beta]          1.95133
                        [eta]          58.3829
3-P Weibull             [beta]          1.13
                        [eta]         164.98
                        a([delta])      1.67

Table 5. The short-term and long-term forecasted values of the
cumulative distribution functions based on the two-parameter and the
three-parameter Weibull distributions.

CDF          2-P Weibull   3-P Weibull   Observed

5th month     0.82%         1.21%         1.29%
50th month   52.24%        22.10%          --

Table 6. Pareto power law fit for the salary of faculty and staff at
the University of Michigan.

Campus         [x.sub.min]   Number of faculty and staff   [alpha]
               (\$)          with salary equal to or
                             greater than [x.sub.min]

UM-Ann Arbor   80000          9285                         3.4803
UM-Dearborn    80000           222                         4.2948
UM-Flint       80000           133                         4.1041

Table 7. The values of fit parameter [alpha] for the city population
in China, the USA, and the UK.

Country   Number of    [x.sub.min]   [alpha]
          cities, n

China     128          750000        2.2595
USA       315          100000        2.4359
UK        198           50000        2.3690

Table 8. The fit values of parameter [alpha] for the earthquakes.

Earthquake region   Number of          [x.sub.min]   [alpha]
                    earthquakes, n

Worldwide           3919               6.5           18.4903
Conterminous USA    3906               4.0           11.3662
```