regress

Multiple linear regression

Calling Sequence

B = regress(y,X)
B = regress(y,X,level)
[B,bint] = regress(...)
[B,bint,r] = regress(...)
[B,bint,r,rint] = regress(...)
[B,bint,r,rint,stats] = regress(...)
[B,bint,r,rint,stats,fullstats] = regress(...)

Parameters

y :

an m-by-1 matrix of doubles, the responses.

X :

an m-by-n matrix of doubles, the inputs, where m is the number of observations and n is the number of variables.

level :

a 1-by-1 matrix of doubles, the significance level of the confidence intervals (default level=0.05, which corresponds to a 95% confidence level). level is expected to be in the range [0., 0.5].

B :

an n-by-1 matrix of doubles, the coefficients of the linear model.

bint :

an n-by-2 matrix of doubles, the confidence intervals of the coefficients. The column bint(:,1) contains the lower bounds and bint(:,2) contains the upper bounds.

r :

an m-by-1 matrix of doubles, the residuals y-X*B.

rint :

an m-by-2 matrix of doubles, intervals on the residuals, which can be used to diagnose outliers.

stats :

a 4-by-1 matrix of doubles, the statistics. stats(1) is the R-squared statistic, stats(2) is the F statistic, stats(3) is the p-value of the F statistic and stats(4) is an estimate of the error variance.

fullstats :

a struct containing the statistics; see below for details.

Description

This function computes the linear model with n independent variables x1, x2, ..., xn which best fits the data in the least squares sense. The method is based on the Singular Value Decomposition (SVD), which allows rank-deficient problems to be handled.

The linear model we compute is

y = B(1)*x1 + B(2)*x2 +...+ B(n)*xn.
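
For example, with two input variables stored as column vectors, the design matrix is built by concatenating the columns. The following sketch (with made-up data) also illustrates a rank-deficient case: because the solver is SVD-based, regress still returns a least squares solution when the columns of X are linearly dependent.

// Build a design matrix from two input variables (made-up data)
x1 = [1;2;3;4;5];
x2 = [2;1;4;3;6];
y = 3*x1 + 4*x2;
B = regress(y,[x1,x2]) // expected: [3;4]
// The third column duplicates the first: X is rank-deficient,
// but regress still returns a least squares solution.
Bdef = regress(y,[x1,x2,x1])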

The fields in fullstats include the following quantities.

The residual r is:

r = y-X*B

The residual sum of squares ResidualSS is:

ResidualSS = sum(r.^2)

The Log-Likelihood is:

LogLikelihood = -0.5*m*(log(2*%pi*sigma2)+1)

where sigma2 is the mean residual sum of squares:

sigma2 = ResidualSS/m

If a linear model has a good fit, the residual sum of squares is close to zero (and positive), so that sigma2 is small and the Log-Likelihood is large.
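
As a check, these quantities can be recomputed directly from the outputs of regress. A minimal sketch with made-up data, assuming the fullstats fields are named ResidualSS and LogLikelihood as in the formulas above:

// Recompute the residual statistics by hand (a sketch;
// the field names ResidualSS and LogLikelihood are
// assumed from the formulas above).
x = (1:10)';
y = 2 + 3*x + distfun_normrnd(0,1,10,1); // made-up noisy data
X = [ones(x),x];
m = size(y,"r");
[B,bint,r,rint,stats,fullstats] = regress(y,X);
ResidualSS = sum(r.^2);
sigma2 = ResidualSS/m;
LogLikelihood = -0.5*m*(log(2*%pi*sigma2)+1);
// Compare with the corresponding fields of fullstats
disp([ResidualSS,fullstats.ResidualSS])
disp([LogLikelihood,fullstats.LogLikelihood])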

Assuming that X and y are column vectors, the simplest way to perform a simple linear regression is to use the statement

B = regress(y,[ones(X),X])

This adds a column of 1s, so that the model has a constant term.

The statistic fullstats.R2 is in the range [0,1]. This coefficient of determination measures the proportion of the variation in the response that is explained by the input variables. A value of fullstats.R2 near 1 indicates a good fit, while a value near zero indicates a poor fit.

The F statistic tests the null hypothesis B(1)=B(2)=...=B(n)=0 against the alternative hypothesis that B(i) is nonzero for some i=1,2,...,n. The F test does not tell which of the B(i) is nonzero, only that at least one of them is linearly related to the response. When the p-value of the F statistic is small (e.g. less than 0.001), this provides evidence against the null hypothesis; in other words, at least one of the parameters B(i) is linearly related to the response.

Caution! If we are interested in the F statistic and its p-value, then the X input argument should contain a column of 1s, so that the model has a constant term.
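
For example, the R-squared, the F statistic and its p-value can be read from the stats output. A minimal sketch with made-up data, where X contains the required column of 1s:

// Read R2, the F statistic and its p-value from stats
// (a sketch with made-up data).
x = (1:20)';
y = 2 + 3*x + distfun_normrnd(0,1,20,1);
X = [ones(x),x];
[B,bint,r,rint,stats] = regress(y,X);
R2 = stats(1) // coefficient of determination
F = stats(2) // F statistic
p = stats(3) // p-value of the F statistic
if p < 0.001 then
    mprintf("Evidence against B(1)=B(2)=0 (p=%g)\n",p);
end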

Compatibility note

This function is mostly compatible with Matlab, except that Matlab has no fullstats output argument, which provides many more outputs than the entries in stats.

In R, this feature is provided by the lm function, and the anova function prints the analysis of variance table. In this toolbox, the feature is provided by the regresprint function, which uses the fields of the fullstats variable; this would not have been possible with the stats variable alone.

The regress function name was chosen for compatibility with Matlab: a regress function already exists in Scilab, but Scilab's regress is not Matlab-compatible and has far fewer features.

Examples

// A simple linear regression
X = [
57.
64.
69.
82.
92.
111.
114.
132.
144.
146.
];
Y = [
121.
129.
140.
164.
188.
217.
231.
264.
289.
294.
];
B = regress(Y,[ones(X),X])
expected = [4.8222564;1.9671389];
// Visually check the result:
scf();
plot(X,Y,"bo");
L = B(1)+B(2)*X;
plot(X,L,"r-")
legend(["Data","Linear Fit"]);
xtitle("Linear Regression","X","Y");

// Compute a 95% confidence interval.
// Visually check the results.
// Visually check the residuals.
// Reference : [2]
Height = [
1.47    1.50    1.52    1.55    1.57    ..
1.60    1.63    1.65    1.68    1.70    ..
1.73    1.75    1.78    1.80    1.83
];
Mass =[
52.21    53.12    54.48    55.84    57.20    ..
58.57    59.93    61.29    63.11    64.47    ..
66.28    68.10    69.92    72.19    74.46
];
Height = Height(:);
Mass = Mass(:);
[B,bint] = regress(Mass,[ones(Height),Height])
Bexpected = [
-39.062
61.272
];
Bintexpected = [
-45.4 -32.7
57.4 65.1
];
// Visually check the result:
scf();
plot(Height,Mass,"bo");
L = B(1)+B(2)*Height;
plot(Height,L,"r-")
legend(["Data","Linear Fit"]);
xtitle("Linear Regression","Height","Mass");
// Compute the residuals
[B,bint,r] = regress(Mass,[ones(Height),Height]);
// Visually see the residuals
scf();
plot(Height,r,"bo");
xtitle("Linear Regression","Height","Residuals");
// A quadratic fit would be better:
// we see a quadratic pattern in the residuals.
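// A sketch: fit the quadratic model by adding a squared
// column to the design matrix, then check the residuals.
[B2,bint2,r2] = regress(Mass,[ones(Height),Height,Height.^2]);
scf();
plot(Height,r2,"bo");
xtitle("Quadratic Regression","Height","Residuals");
// The quadratic pattern in the residuals should be gone.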

// Longley.dat contains 1 Response Variable y,
// 6 Predictor Variables x and 16 Observations.
// Source : [4]
[data,txt] = getdata(24);
Y=data(:,1);
X=data(:,2:7);
[B,bint,r,rint,stats,fullstats] = regress(Y,X)
// Print an analysis of variance table
regresprint(fullstats)

// Create an example with known coefficients
// The number of observations
N = 100;
// The exact coefficients
Bexact = [2000;3;4]
// The input 1
X1min = 50; X1max = 150;
X1=distfun_unifrnd(X1min,X1max,N,1);
// The input 2
X2min = -500; X2max = -200;
X2=distfun_unifrnd(X2min,X2max,N,1);
// Make the observations
X = [ones(X1),X1,X2];
// Compute the output from the input
y = X*Bexact;
// Add a normal noise to the output
sigma = 50;
e = distfun_normrnd(0,sigma,N,1);
y = int(y + e);
// Linear regression
[B,bint,r,rint,stats,fullstats] = regress(y,X);
// Compare B (exact) with B (estimate)
[B,Bexact]
L = X*B;
// Check visually
scf();
plot(y,L,"bo");
plot(y,y,"r-");
title(msprintf("Regression - R2=%.2f%%",fullstats.R2*100))
xlabel("Observations");
ylabel("Predictions");

// A short example of univariate linear regression
N=30;
X = distfun_normrnd(0.,1.,N,1);
noise = distfun_normrnd(0.,1.,N,1);
Y = 2+3*X+noise;
model=[ones(X),X];
B = regress(Y,model);
scf();
plot(X,model*B,"r-")
plot(X,Y,"bo")
xlabel("X")
ylabel("Y")
legend(["Regression","Observations"],loc=2);
// To see which points are there, please use:
// identify(X,Y,1)
// Click middle button to leave.

// Draw Parity Plot
[B,bint,r,rint,stats,fullstats] = regress(Y,model);
scf();
title(msprintf("Regression - R2=%.2f%%",fullstats.R2*100))
plot(Y,model*B,"bo")
plot(Y,Y,"r-")
xlabel("Observations")
ylabel("Predictions")

Authors

Bibliography

[1] "Introduction to probability and statistics for engineers and scientists.", Third Edition, Sheldon Ross, Elsevier Academic Press, 2004

[2] http://en.wikipedia.org/wiki/Linear_regression

[3] Octave's regress, William Poetra Yoga Hadisoeseno

[4] http://www.itl.nist.gov/div898/strd/lls/data/LINKS/DATA/Longley.dat

