<< Outlier-Tests Outlier-Tests Tools >>

samplestat >> samplestat > Outlier-Tests > Overview - Outlier Tests

Overview - Outlier Tests

An overview of the SampleSTAT's Outlier Functions

Introduction

Outliers are extreme values that stand out from the other values of a sample. Outliers normally have a considerable influence on the calculation of statistics (see e.g. the leverage effect with linear regression) and should be removed in most cases. You should also note that outliers may result simply from the fact that you assume a distribution which does not fit the real distribution of the data.

Typical examples of outliers are errors in measurement, errors in acquisition (human influences...), or (rare) outstanding values. An important question concerning outliers is whether it is legitimate to remove a particular value after it has been recognized as an outlier. Of course, statistical tests cannot decide, if it is appropriate to remove such values. They can only give you a hint if a significant deviation exists (basically, outlier tests are based on the probabilities to belong to the assumed distribution).

Tests on outliers in data sets can be used to

Several outlier tests are available, each of them having its own special advantages and drawbacks. Following is a short description of the most commonly used strategies to identify outliers:

Nonetheless there is one common rule which should be strictly obeyed: do not eliminate outliers successively. To be specific, do not identify an outlier, eliminate it and recalculate the statistics to perform another outlier detection and elimination:

All outliers must be removed in one single calculation.

The outlier tests of SampleSTAT are applicable for normally distributed data ONLY.

Test recommendations

Dean-Dixon (ST_deandixon):

Best for small samples sizes (less than 30). It can just consider one value on each side of the sorted data an outlier which is not a problem for small data sets.

Pearson-Hartley's Significance of Extreme Values (ST_personhardley):

For larger sample sizes (more than 30)

Nalimov (ST_nalimov):

For small and larger sample sizes. It has a very strict algorithm and is controversially discussed. It was very common in Germany. Use it with care.

Basic rules (ST_outlier):

Up from medium sample sizes (larger than 10, better more than 25).

Threshold are based on population standard deviation ("sd" mode) or interquantile range distance ("iqr15", "iqr30" modes). The latter allows to distinguish weak from strong outliers.

Important Note:

\begin{eqnarray}
	\bar{x}: \text{arithmetic mean} \quad \Rightarrow \quad \bar{x} = {1 \over{n}}\sum_{i=1}^{n}x_i \\
	s: \text{sample standard deviation} \quad \Rightarrow \quad s = \sqrt{{1 \over{n-1}}\sum_{i=1}^{n}(x_i-\bar{x})^2} \\
	\sigma: \text{population standard deviation, S.D.} \quad \Rightarrow \quad \sigma = \sqrt{{1 \over{n}}\sum_{i=1}^{n}(x_i-\bar{x})^2}
	\end{eqnarray}

Dean-Dixon (ST_deandixon)

A test for outliers of normally distributed data which is particularly simple to apply has been developed by J.W. Dixon. This test is eminently suitable for small sample sizes (less than 30 samples); for samples having more than 30 observations the test for significance of Pearson and Hartley can be used as well.

It sorts the distribution in ascending or descending order, then takes the minimum and maximum values (xi) and calculates the respective Q value for both xi values. This is compared with the critical value from a table (Qcrit). If one of the two or both Q values greater than the corresponding Qcrit value, one or both xi values are outliers.

\begin{eqnarray}
Q = \left | x_{i+1}-x_i \right |/\left | x_n-x_i \right | \quad ; \quad Q > Q_{crit} \quad \Rightarrow \quad x_i = \text{outlier}
\end{eqnarray}

Only one outlier can be found on each side of the sorted distribution.

Significance of Extreme Values (ST_pearsonhartley)

For random samples larger than 30 objects possible outliers may be identified by using the significance thresholds of Pearson and Hartley.

xi is regarded to be an outlier if the test statistic q exceeds the critical threshold qcrit for a given level of significance alpha (statistical confidence level) and a sample size n.

\begin{eqnarray}
q=\left| \frac{1}{s}(x_i - \bar{x}) \right| \quad ; \quad q > q_{crit} \quad  \Rightarrow \quad x_i = \text{outlier} \\
x_i: \text{test value} \quad ; \quad \bar{x}: \text{arithmetic mean} \quad ; \quad s: \text{sample standard deviation}
\end{eqnarray}

Nalimov (ST_nalimov)

Nalimov is applicable for data sets between 3 and 1000 values. It calculates for all values the test value "q". It compares these q-values with the appropriate qcrit value from a table (Kaiser/Gottschalk 1972).

\begin{eqnarray}
q = \left | \frac{1}{s}(x_i- \bar{x}) \right | \sqrt{\frac{n}{n-1}} \quad;\quad > q_{crit}\;\Rightarrow \; x_i=\text{outlier} \\
x_i: \text{test value} \quad ; \quad \bar{x}: \text{arithmetic mean} \\
s: \text{sample standard deviation} \quad ; \quad n: \text{number of values}
\end{eqnarray}

The Nalimov test widely known in Germany and German-speaking countries. But it has it flaws. It indicates outliers very strict. That could eliminates valid data in some circumstances. There is some scientific discussion about this test. It is thus recommended to use other tests instead of the Nalimov test (Dean-Dixon test for small samples, the test of Pearson and Hartley for larger ones).

If you want to use Nalimov, use it with care.

ST_outlier

Provide two basic tests on outliers based on population standard deviation (S.D.) or inter-quartile range (IQR). It should be applied to sample sizes larger than 10 better more than 25. For small sample sizes the Dean-Dixon test is recommended.

Based on Standard deviation:

If we assume a normal distribution, a single value may be considered as an outlier if it falls outside a certain range of the population standard deviation. In many cases a factor of 2.5 is used, which means that approx. 99 % of the data belonging to a normal distribution fall inside this range.

\begin{eqnarray}
(\bar{x} - 2.5\sigma) > x_i > (\bar{x} + 2.5\sigma) \; \text{with} \quad \sigma = \sqrt{{1 \over n}\sum_{i=1}^{n}(x_i-\bar{x})^2} \quad \Rightarrow \quad x_i = \text{outlier}\\
x_i: \text{value} \quad ; \quad \bar{x}: \text{arithmetic mean} \\
\sigma: \text{population standard deviation} \quad ; \quad n: \text{number of values}
\end{eqnarray}

Based on he interquartile range (IQR)

This approach is used for skewed data in the first place but is - of course - applicable for normally distributed data as well. It allows to distinguish between weak and strong outliers.

x0.25 is the first quartile and x0.75 the third and IQR is the difference.

\begin{eqnarray}
IQR = x_{0.75} - x_{0.25} \\
(x_{0.25} - 1.5 \cdot IQR)  < x_i < (x_{0.25} + 1.5 \cdot IQR)  \quad \Rightarrow \quad x_i = \text{outliers (iqr15 mode)} \\
(x_{0.25} - 3.0 \cdot IQR)  < x_i < (x_{0.25} + 3.0 \cdot IQR)  \quad \Rightarrow \quad x_i = \text{strong outlier (iqr30 mode)}
\end{eqnarray}

Values between the thresholds 3.0xIQR and 1.5xIQR are called weak outliers. SampleSTAT provides the modes "iqr15" (1.5xIQR) and "iqr30" (3.0xIQR) for testing on weak or strong outliers.

Please note that these basic tests require at least 10 observations (better 25, or more).

Authors

Hani A. Ibrahim - hani.ibrahim@gmx.de

Credits

Most of the text and images are from Lohringer, H., "Fundamentals of Statistics", Oct, 10th, 2012, http://www.statistics4u.info/fundstat_eng/

Bibliography

Lohringer, H., "Fundamentals of Statistics", Oct, 10th, 2012, http://www.statistics4u.info/fundstat_eng/

R. Kaiser, G. Gottschalk; "Elementare Tests zur Beurteilung von Meßdaten", BI Hochschultaschenbücher, Bd. 774, Mannheim 1972.


Report an issue
<< Outlier-Tests Outlier-Tests Tools >>