<< Distfun Distfun Tutorial >>

Distfun >> Distfun > Distfun

Distfun

An overview of the Distfun toolbox.

Purpose

The goal of this toolbox is to provide accurate distribution functions. The provided functions are designed to be compatible with Matlab.

The goals of this toolbox are the following.

The design is similar to Matlab's distribution functions. A significant difference with Matlab's function is that both the upper and lower tails are available in the CDF of the "distfun" toolbox, while Matlab only provides the lower tail. Hence, "distfun" should provide a better accuracy when probabilities close to 1 are computed (e.g. p=1 - 1.e-4).

The differences with Scilab is that a consistent set of functions is provided. First, Scilab currently does not provide the PDFs. Users may write their own functions: this is not as easy as it seems, and may lead to very innaccurate results if floating point issues are ignored. Secondly, Scilab does not provide a consistent sets of functions: the CDF and the random number generators are provided in two different toolboxes, with no consistency. Thirdly, Scilab requires that we provide the argument Q (which is mathematically equal to 1-P), no matter if we want the lower or the upper tail. Fourth, the inverse CDF functions in Scilab do not manage extreme values of p (i.e. zero or one). Fifth, inverse discrete distributions produces doubles with fractional values instead of integer values (e.g. 1.3234887 instead of 2, with the Poisson distribution, for example). Sixth, the cdf* functions allows to compute one parameter from the others, which is unnecessary for the parameters of the distribution (which is computed, in practice, either with the moments methods, or with the maximum likelyhood function, for example).

The difference with Stixbox is that the current function are tested, accurate, with consistent help pages.

The toolbox is based on both macros and compiled source code.

Distributions

For each distribution "x", we provide five functions :

Distributions available :
Distribution x
Beta beta
Binomial bino
Chi-Squared chi2
Extreme Value ev
Exponential exp
F f
Gamma gam
Geometric geo
Hypergeometric hyge
Kolmogorov-Smirnov ks
LogNormal logn
LogUniform logu
Multinomial mn
Multivariate Normal mvn
Negative Binomial nbin
Noncentral F ncf
Noncentral T nct
Noncentral Chi-Squared ncx2
Normal norm
Poisson poi
T t
Truncated Normal tnorm
Uniform Discrete unid
Uniform unif
Weibull wbl

Quick start

Reference: Introduction to probability and statistics for engineers and scientists, 3rd Edition, Sheldon Ross, Elsevier, 2004

Example 2.5a, p.33 Assume that X is a Normal random variable with mean m and standard deviation s:

m = 70.571
s = 14.354

What is the probability that X falls in the interval [m-s,m+s]?

To answer to this question, we use the distfun_normcdf function, which computes the cumulated density function of the normal distribution. We first compute P(X<m-s), P(X<m+s) and then compute the difference to evaluate P(m-s<X<m+s).

p1=distfun_normcdf(m-s,m,s)
p2=distfun_normcdf(m+s,m,s)
p2-p1

The previous script produces the following output.

      -->p1=distfun_normcdf(m-s,m,s)
      p1  =
      0.1586553
      -->p2=distfun_normcdf(m+s,m,s)
      p2  =
      0.8413447
      -->p2-p1
      ans  =
      0.6826895
    

Therefore, approximately 68% of the outcomes should fall in the interval [m-s,m+s].

Example 5.1a. A store sells disks, which are defective with probability 0.01 independently of each other. The store sells them in packages of 10 disks. The store offers a new package if more than 1 disk is defective in the package.

What proportion of packages is replaced? Let X be the number of defective disks in a package. We assume that X is a binomial random variable, with parameters N=10 and pr=0.01. We are searching for P(X>1). To do this, we use the lowertail option of the distfun_binocdf function.

pfail = distfun_binocdf(1,10,0.01,%f)

The previous script produces the following output.

      -->pfail = distfun_binocdf(1,10,0.01,%f)
      pfail  =
      0.0042662
    

We could as well evaluate 1-P(X≤1):

pfail = 1-distfun_binocdf(1,10,0.01)

which returns almost the same result (but not exactly). In fact, this may be less numerically accurate if the probability is small (see the section "The upper and lower tails" below for details on this topic). Notice that Ross indicates .005 in the text, which is not as not correctly rounded.

Therefore, approximately 0.4% of the packages will be replaced.

If I buy three packages, what is the probability that exactly one of them will have to be replaced?

Let Y be the number of defective packages in a set of 3 packages. We assume that Y is a binomial variable with parameters N=3 and pr=pfail.

pfail = 1-distfun_binocdf(1,10,0.01)
distfun_binopdf(1,3,pfail)

The previous script produces the following output.

      -->distfun_binopdf(1,3,pfail)
      ans  =
      0.0126896
    

In other words, the probability that exactly one package contains a defective disk out of three packages is 1.2%.

The PDF and the CDF

For each PDF, we present the mathematical definition of the function that we use. This removes potential ambiguities.

In general, we do not give the mathematical definition of the CDF function. This is because it can be derived directly from the PDF by integration (for a continuous distribution) or by summation (for a discrete distribution).

Assume that the continuous PDF function f has the support (-inf,inf). Therefore, the CDF function F is defined by:

for any real value x. This situation is presented in the following figure, where the CDF is the grey area (i.e. the integral) under the PDF.

For example, the following script plots the PDF and the CDF of the Normal distribution, with parameters μ=0 and σ=1.

x = linspace(-5,5,1000);
y = distfun_normpdf(x,0,1);
p = distfun_normcdf(x,0,1);
scf();
plot(x,y,"r")
plot(x,p,"b")
xtitle("Normal Distribution","x");
legend(["f(x)" "F(x)"]);

The previous script produces the following figure.

Secondly, assume that the discrete PDF function f has the support x≥0, where x is an integer. Therefore, the CDF function F is defined by:

for any integer positive value x.

For example, the following script plots the PDF and the CDF of the Binomial distribution, with parameters N=20 and pr=0.5..

x = (0:20)';
y = distfun_binopdf(x,20,0.5);
p = distfun_binocdf(x,20,0.5);
scf();
plot(x,y,"ro-")
distfun_plotintcdf(x,p,"b");
xtitle("Binomial Distribution","x","");
legend(["$P(X=x)$" "$P(X\leq x)$"]);

The previous script produces the following figure.

Inversion of discrete CDF

In this section, we analyse why it is not straightforward to invert a CDF when the distribution is discrete.

Let us first consider the case where the distribution is continuous. Let us denote by F the CDF and by G the inverse CDF. Assume that x is the point where we evaluate the CDF:

where x is an integer. Therefore, the inverse CDF satisfies:

Secondly, let us consider the case where the distribution is discrete. Let us denote by x the output of the inverse CDF:

The quantile x is the smallest x such that p≤F(x). Therefore, x satisfies the inequalities:

This situation is presented in the following figure.

The important fact is that the computed value of F(x) may not the one which achieves the minimum distance to p. As presented in the previous figure, the value of F(x-1) may be closer to p, even if x is the correct inverse.

As an example of this situation, let us consider the following example, which is based on the Hypergeometric function.

M=1030;
k=515;
N=500;
p=1.6013707e-280;
x=distfun_hygeinv(p,M,k,N)
distfun_hygecdf(x,M,k,N)

The previous script produces the following output.

      -->x=distfun_hygeinv(p,M,k,N)
      x  =
      1.
      -->distfun_hygecdf(x,M,k,N)
      ans  =
      2.57D-276
    

The previous result may seem weird, since the value of the CDF at x=0 is much closer to the value of p.

      -->distfun_hygecdf(0,M,k,N)
      ans  =
      1.60D-280
    

But this is a consistent result, since the inequalities are satisfied.

p=1.6013707e-280;
M=1030;
k=515;
N=500;
x=1;
distfun_hygecdf(x-1,M,k,N) < p
p <= distfun_hygecdf(x,M,k,N)

Notice that, in the previous result, the 7th digits in p completely changes the quantile x. The previous analysis explains why, in the following script, the variables x and xx may not be equal.

M=80;
k=50;
N=30;
x=0
p=distfun_hygecdf(x,M,k,N)
xx=distfun_hygeinv(p,M,k,N)

The previous script produces the following output.

      -->x=0
      x  =
      0.
      -->p=distfun_hygecdf(x,M,k,N)
      p  =
      1.127D-22
      -->xx=distfun_hygeinv(p,M,k,N)
      xx  =
      1.
    

Indeed, the rounding errors in the evaluation of distfun_hygecdf may combine in such a way that the quantile jumps from one value to another.

We now consider the properties of the complementary quantile. Let us denote q=P(X>x) the probability that we are searching for. We have q=1-p, where p=P(X≤x). The complementary quantile x is the smallest x such that q≥1-F(x). Therefore, x satisfies the inequalities:

This situation is presented in the following figure.

The upper and lower tails

In this section, we present the purpose of the lowertail option of the distfun_*cdf functions.

The lower tail of the CDF is the function P(X≤x). The upper tail of the CDF (i.e. the complementary CDF) is the function P(X > x).

The following script plots the Normal CDF and complementary CDF functions.

x = linspace(-5,5,1000);
p = distfun_normcdf(x,0,1);
q = distfun_normcdf(x,0,1,%f);
scf();
plot(x,p,"r")
plot(x,q,"b")
xtitle("Normal CDF","x");
legend(["$P(X\leq x)$" "$P(X>x)$"]);

The previous script produces the following output.

Mathematically, we have

But the numerical evaluation of the previous equation may lead to a different result in some cases.

Let us consider the distribution function distfun_dcdf, where d is some distribution (e.g. d=norm). Hence, for most arguments x, we have

      distfun_dcdf(x,%f) almost-equal 1-distfun_dcdf(x,%t)
    

but there can be a significant difference between the two numerical evaluations in some cases.

The lowertail argument of the distfun_dcdf function controls the tail which is computed. The default value, lowertail=%t computes the lower tail of the CDF, i.e. P(X≤x). When lowertail=%f, we compute the upper tail of the CDF, i.e. P(X>x).

This option is illustrated in the following example.

1-distfun_normcdf(3,4,1)
distfun_normcdf(3,4,1,%f)

The previous example produces the following output.

      -->1-distfun_normcdf(3,4,1)
      ans  =
      0.8413447
      -->distfun_normcdf(3,4,1,%f)
      ans  =
      0.8413447
    

But the two computations can be numerically different, when the value of the CDF is close to 1. To see this, let us consider the following example.

1-distfun_normcdf(15,4,1)
distfun_normcdf(15,4,1,%f)

The previous example produces the following output.

      -->1-distfun_normcdf(15,4,1)
      ans  =
      0.
      -->distfun_normcdf(15,4,1,%f)
      ans  =
      1.911D-28
    

The correct result is 1.911D-28.

The previous session is explained by the limitation of the representation of floating point numbers which are close to 1. Indeed, the computed value of distfun_normcdf(15,4,1) is mathematically so close to 1 that it is numerically rounded to 1. Hence, the expression 1-distfun_normcdf(15,4,1) is evaluated as zero. Instead, the lowertail=%f option performs the correct computation.

Whenever the value of P(X≤x) is greater than, say, 0.99, there is some rounding in 1-P(X≤x) which reduces the number of significant digits of this computation. When the probability P(X≤x) comes closer and closer to 1, each time a new 9 digit is added, the computed result 1-P(X≤x) loses a significant decimal digit, until a point where there is no significant digit anymore. On the other hand, when we directly compute P(X>x), most the significant digits are correct.

Hence, the lowertail=%f argument should be used to compute P(X>x) when the associated probability is very small, that is, when P(X≤x) is close to 1.

Integer parameters

In this section, we present why some functions were numerically extended to manage non-integer parameters.

Some distributions are associated with parameters, which mathematically appear as integers. For some of these distribution, it appears that the evaluation is based on functions which can be used for non-integer values. This is why these distributions were extended, so that non-integer values can be used for the parameters. This feature improves the compatibility with other environments. Indeed, other softwares (such as R or Matlab), provide this feature.

The distribution for which non-integer values of the parameters are available are:

On the contrary, there are distributions for which the parameters have integer values, and it would be an error to do otherwise.

The distribution for which only integer values of the parameters are available are:

Whenever possible, large values of the parameters are authorized, even if these values are larger than 2^31. In other words, the C "int" datatype is not used in the implementation whenever this is possible, in order to avoid to limit the range of available values of the parameters of the distribution. If the parameter must have an integer value, the function checks that the double is a floating point number which has no fractional part.

In the following example, we use the hypergeometric function with parameters which are much larger than 2^31.

-->x=20;
-->M=1.e10;
-->k=50;
-->N=4.e9;
-->distfun_hygepdf(x,M,k,N)
 ans  =
    0.1145586

Authors


Report an issue
<< Distfun Distfun Tutorial >>