An Overview of SampleSTAT's Distribution Functions
Testing for normality is a commonly needed procedure, because many statistical procedures assume normally distributed data. SampleSTAT's routines require normally distributed data.
![]() | However, measured data in the natural sciences are commonly normally distributed, so a normal distribution can often be assumed in these cases. It can still make sense to check for normality anyway. |
The images below show normally and non-normally distributed samples as histograms/step charts.
Image source: http://www.biostathandbook.com/normality.html
or as boxplots.
Image source: Wikimedia
Boxplots and histograms are useful for distinguishing normally from non-normally distributed data, but they can be hard for inexperienced users to interpret and are of limited use for small samples of fewer than 20 values. For such sample sizes, "Individual Value Plots" are more informative.
In general, the test for normality can be carried out by applying a goodness-of-fit method (e.g. the chi-square test or the Kolmogorov-Smirnov test). These two tests, however, have rather low power. Therefore other tests have been developed, which have various advantages but also some drawbacks. The power of the Shapiro-Wilk test is good; its calculation procedure is rather cumbersome by hand but easy to carry out with a computer. It returns an easy-to-interpret result: true (normal distribution) or false (non-normal distribution).
The Shapiro-Wilk test is a statistical significance test of the hypothesis that the underlying population of a sample is normally distributed. The Shapiro-Wilk test exhibits high power, leading to good results even with a small number of observations. In contrast to other comparison tests, the Shapiro-Wilk test can only be used to check for normality.
The test can be used for sample sizes from 3 to 50 values.
![]() | The test reacts very sensitively to outliers, both one-sided and two-sided. Outliers can strongly distort the distribution pattern, so that the normality assumption may be erroneously rejected. The test is also relatively susceptible to ties: if there are many identical values, its power is strongly affected. |
![]() | Although the Shapiro-Wilk test has high power, especially for smaller sample sizes, it should not be used blindly, for the reasons mentioned above. Check the results graphically with a histogram, QQ-plot or box-plot for sample sizes of 20 and up, or with individual value plots for smaller sample sizes. Box-plots and QQ-plots are provided in the toolbox STIXBOX. |
The basic idea behind the Shapiro-Wilk test is to estimate the variance of the sample in two ways: (1) the regression line in the QQ-plot allows the variance to be estimated, and (2) the variance of the sample can also be regarded as an estimator of the population variance. Both estimates should be approximately equal in the case of a normal distribution and thus result in a quotient close to 1.0. If the quotient is significantly lower than 1.0, the null hypothesis (of a normal distribution) should be rejected.
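The sample of size $n$ ($x_1, x_2, \ldots, x_n$) has to be sorted in increasing order; the resulting sorted sample is designated by $y_1, y_2, \ldots, y_n$ ($y_1 \le y_2 \le \ldots \le y_n$).
Calculate the sum of squared deviations from the sample mean $\bar{y}$:
$$S^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
a) if $n$ is even, then $b$ is calculated using $k = n/2$:
$$b = \sum_{i=1}^{k} a_{n-i+1} \, (y_{n-i+1} - y_i)$$
b) if $n$ is odd, $b$ is calculated with the same sum using $k = (n-1)/2$; the median is not included. The parameters $a_{n-i+1}$ depend on the sample size and have to be taken from a table published by Shapiro and Wilk.
Calculate the test statistic $W$:
$$W = \frac{b^2}{S^2}$$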
If the test statistic W is smaller than the critical threshold, the assumption of a normal distribution has to be rejected. ST_shapirowilk returns %T (true) if the data can be regarded as normally distributed and %F (false) if not.
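The steps above can be sketched in a few lines of Python. This is only an illustration of the calculation, not the code behind SampleSTAT's Scilab routine: the function name `shapiro_wilk_w` and the dictionary `SW_COEFFS` are hypothetical, the coefficients are four-decimal values quoted from the Shapiro-Wilk table for n = 3, 4 and 5 only, and the critical threshold is not computed but has to be looked up separately.

```python
from statistics import fmean

# Coefficients a_n, a_(n-1), ... for a few small sample sizes, quoted to four
# decimals from the table in Shapiro and Wilk (1965). Illustrative subset only.
SW_COEFFS = {
    3: [0.7071],
    4: [0.6872, 0.1677],
    5: [0.6646, 0.2413],
}


def shapiro_wilk_w(sample, coeffs):
    """Return W = b^2 / S^2 for `sample`, given the coefficients a_n, a_(n-1), ..."""
    y = sorted(sample)                 # y_1 <= y_2 <= ... <= y_n
    n = len(y)
    k = n // 2                         # n/2 for even n, (n-1)/2 for odd n
    if len(coeffs) != k:
        raise ValueError(f"expected {k} coefficients for n = {n}")

    mean = fmean(y)
    s2 = sum((v - mean) ** 2 for v in y)   # sum of squared deviations
    if s2 == 0:
        raise ValueError("all values are identical, W is undefined")

    # Pair the largest with the smallest value, the second largest with the
    # second smallest, and so on; for odd n the median is left out.
    b = sum(a * (y[n - 1 - i] - y[i]) for i, a in enumerate(coeffs))
    return b * b / s2


if __name__ == "__main__":
    clean = [2.1, 2.9, 3.0, 3.2, 3.8]
    with_outlier = [2.1, 2.9, 3.0, 3.2, 9.0]
    print("W (clean sample):  ", round(shapiro_wilk_w(clean, SW_COEFFS[5]), 3))
    print("W (with outlier):  ", round(shapiro_wilk_w(with_outlier, SW_COEFFS[5]), 3))
```

In this example a single outlier pulls W down from roughly 0.96 to roughly 0.69. Assuming the commonly tabulated 5% critical value of about 0.762 for n = 5, the first sample would pass and the second would be rejected, which illustrates how sensitive the test is to outliers.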
Individual value plots (IVPs) are well suited for evaluating and comparing distributions of sample data. An IVP displays a point for the actual value of each observation in a group, making it easy to identify outliers and see the dispersion of the distribution. An IVP is especially recommended for small sample sizes, in contrast to histograms, box-plots and QQ-plots, which need at least 20 values to be meaningful.
IVPs are therefore well suited to checking very small samples for normality when outliers or ties could be present and the Shapiro-Wilk test cannot be reliably applied.
If your sample size is larger than 50 values, you should use a histogram or box-plot instead.
![]() | Please be advised that ST_ivplot is EXPERIMENTAL and very basic. At the moment it can handle only one data set at a time. |
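To make the idea of an individual value plot concrete, here is a minimal text-based sketch in Python. It is not how ST_ivplot works and does not reproduce its output; the function name `individual_value_plot` and the `width` parameter are assumptions made for this illustration. Each observation is drawn as one `o`, and observations that fall into the same column are stacked.

```python
def individual_value_plot(values, width=40):
    """Return a text plot with one 'o' per observation, stacked per column."""
    lo, hi = min(values), max(values)
    span = hi - lo

    # Map every value to a column between 0 and width - 1 and count the hits.
    counts = [0] * width
    for v in values:
        col = 0 if span == 0 else round((v - lo) / span * (width - 1))
        counts[col] += 1

    # Draw the stacks from the top row down to the axis.
    rows = []
    for level in range(max(counts), 0, -1):
        row = "".join("o" if c >= level else " " for c in counts)
        rows.append(row.rstrip())

    # Axis line and the minimum/maximum as labels.
    lo_label, hi_label = f"{lo:g}", f"{hi:g}"
    rows.append("-" * width)
    rows.append(lo_label.ljust(width - len(hi_label)) + hi_label)
    return "\n".join(rows)


if __name__ == "__main__":
    sample = [4.9, 5.0, 5.1, 5.0, 5.2, 9.8]
    print(individual_value_plot(sample))
```

With this small sample, five points cluster at the left edge and a single point sits far to the right, so the outlier is visible at a glance even though six values are too few for a useful histogram.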
Hani A. Ibrahim - hani.ibrahim@gmx.de
Most of the text and one image are from Lohninger, H., "Fundamentals of Statistics", Oct. 10th, 2012, http://www.statistics4u.info/fundstat_eng/
Lohninger, H., "Grundlagen der Statistik", Oct. 10th, 2012, http://www.statistics4u.info/
Shapiro, Wilk: "An Analysis of Variance Test for Normality", Biometrika, Vol. 52, No. 3/4. (Dec., 1965), pp. 591-611.