Multiple linear regression
B = regress(y,X)
B = regress(y,X,level)
[B,bint] = regress(...)
[B,bint,r] = regress(...)
[B,bint,r,rint] = regress(...)
[B,bint,r,rint,stats] = regress(...)
[B,bint,r,rint,stats,fullstats] = regress(...)
y : an m-by-1 matrix of doubles, the responses.
X : an m-by-n matrix of doubles, the inputs, where m is the number of observations and n is the number of variables.
level : a 1-by-1 matrix of doubles, the confidence level (default level=0.05, meaning that the default confidence level is 95%). level is expected to be in the range [0., 0.5].
B : an n-by-1 matrix of doubles, the coefficients of the linear model.
bint : an n-by-2 matrix of doubles, the confidence intervals on the coefficients at the given confidence level. The column bint(:,1) contains the lower bounds and bint(:,2) the upper bounds.
r : an m-by-1 matrix of doubles, the residuals y-X*B.
stats : a 4-by-1 matrix of doubles, the statistics. stats(1) is the R-squared statistic, stats(2) is the F statistic, stats(3) is the p-value of the F statistic, stats(4) is an estimate of the error variance.
fullstats : a struct, the statistics, see below for details.
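For illustration, the following minimal sketch (with made-up data) shows how the level argument changes the confidence intervals: passing level=0.1 requests 90% intervals instead of the default 95% intervals.

// Minimal usage sketch with illustrative data
X = [ones(5,1), (1:5)'];        // design matrix with a constant column
y = [1.1; 2.0; 2.9; 4.2; 5.1];  // observed responses
[B, bint] = regress(y, X, 0.1); // level=0.1 => 90% confidence intervals
disp(B)    // estimated coefficients
disp(bint) // [lower, upper] bounds for each B(i)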
This function computes a linear model with n independent variables x1, x2, ..., xn that best fits the data in the least squares sense. The method is based on the SVD decomposition, which allows rank-deficient problems to be handled.
The linear model we compute is
y = B(1)*x1 + B(2)*x2 +...+ B(n)*xn.
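As a rough illustration of the least squares solution (not the toolbox implementation itself), the coefficients are conceptually those returned by the pseudo-inverse of X, which Scilab computes from the SVD and which remains well defined for rank-deficient matrices:

// Conceptual sketch: least squares coefficients via the pseudo-inverse.
// The third column is collinear with the second, so X is rank-deficient,
// yet pinv still returns the minimum-norm least squares solution.
x1 = [1;2;3;4];
X = [ones(4,1), x1, 2*x1];
y = [3; 5; 7; 9];
Bsketch = pinv(X) * y;
disp(Bsketch)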
The fields in fullstats are:
fullstats.RegressionSS: the sum of squares of the regression
fullstats.RegressionDof: the number of degrees of freedom of the regression
fullstats.RegressionMean: the mean of the sum of squares of the regression
fullstats.ResidualSS: the sum of squares of the residuals
fullstats.ResidualDof: the number of degrees of freedom of the residuals, i.e. ResidualDof = m-n-1
fullstats.ResidualMean: the mean of the sum of squares of the residuals. This is equal to ResidualMean = ResidualSS/ResidualDof
fullstats.F: the F statistic
fullstats.pval: the p-value corresponding to the F statistic
fullstats.Bstddev: a (n+1)-by-1 matrix of doubles, the standard deviation of B
fullstats.R2: the R-squared statistic
fullstats.LogLikelihood: the Log-Likelihood of the regression model with respect to the data
The residual r is
r = y-X*B
and the residual sum of squares is
ResidualSS = sum(r.^2)
The Log-Likelihood is
LogLikelihood = -0.5*m*(log(2*%pi*sigma2)+1)
where
sigma2 = ResidualSS/m
If a linear model has a good fit, the residual sum of squares is close to zero (and positive), which produces a large Log-Likelihood.
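As an illustration of these formulas, the following sketch (with made-up data) recomputes ResidualSS, sigma2 and the Log-Likelihood from the residuals and compares them with the corresponding fullstats fields:

// Recompute ResidualSS, sigma2 and LogLikelihood from the residuals
X = [ones(6,1), (1:6)'];
y = [2.1; 3.9; 6.2; 7.8; 10.1; 11.9];
[B, bint, r, rint, stats, fullstats] = regress(y, X);
m = size(X, "r");
ResidualSS = sum(r.^2);
sigma2 = ResidualSS / m;
LogLikelihood = -0.5 * m * (log(2*%pi*sigma2) + 1);
disp([ResidualSS, fullstats.ResidualSS])       // should agree
disp([LogLikelihood, fullstats.LogLikelihood]) // should agree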
Assuming that X and y are column vectors, the simplest way to perform a simple linear regression is to use the statement
B = regress(y,[ones(X),X])
The statistic fullstats.R2 is in the range [0,1]. On output, the coefficient of determination fullstats.R2 measures the proportion of the variation in the response that is explained by the input variables. A value of fullstats.R2 near 1 indicates a good fit, while a value near zero indicates a poor fit.
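Assuming the usual sum-of-squares decomposition for a model with a constant term, R2 can be recomputed from the fullstats fields, as in this sketch with made-up data:

// Check that R2 = RegressionSS / (RegressionSS + ResidualSS)
X = [ones(6,1), (1:6)'];
y = [1.2; 2.1; 2.8; 4.3; 4.9; 6.2];
[B, bint, r, rint, stats, fullstats] = regress(y, X);
R2 = fullstats.RegressionSS / (fullstats.RegressionSS + fullstats.ResidualSS);
disp([R2, fullstats.R2, stats(1)]) // the three values should agree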
The F statistic tests the null hypothesis B(1)=B(2)=...=B(n)=0 against the alternative hypothesis that at least one B(i) is nonzero, for some i=1,2,...,n. The F test does not tell which of the B(i) is nonzero, but only that at least one of them is linearly related to the response. When the p-value of the F statistic is small (e.g. less than 0.001), this provides evidence against the null hypothesis. In other words, if the p-value is small, then at least one of the parameters B(i) is linearly related to the response.
Caution! If we are interested in the F statistic and its p-value, then the X input argument should contain a column of ones, so that the model has a constant term.
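With a constant column in X, the F statistic is the ratio of the regression mean square to the residual mean square. The following sketch (made-up data) recomputes it from the fullstats fields and shows where the p-value is reported:

// The F statistic is RegressionMean / ResidualMean
X = [ones(8,1), (1:8)'];
y = [2.2; 2.9; 4.1; 5.2; 5.8; 7.1; 8.2; 8.8];
[B, bint, r, rint, stats, fullstats] = regress(y, X);
F = fullstats.RegressionMean / fullstats.ResidualMean;
disp([F, fullstats.F, stats(2)]) // should agree
disp([fullstats.pval, stats(3)]) // p-value of the F test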
Compatibility note
This function is mostly compatible with Matlab, except that Matlab has no fullstats output argument, which provides many more outputs than the entries in stats.

In R, this feature is provided by the lm function, and the anova function prints the analysis of variance table. In this toolbox, this feature is provided by the regresprint function, which uses the fields of the fullstats variable. This would not have been possible with the stats variable alone.

The regress function name was chosen even though regress already exists in Scilab, because Scilab's regress is not Matlab-compatible and has far fewer features.
// A simple linear regression
X = [
57.
64.
69.
82.
92.
111.
114.
132.
144.
146.
];
Y = [
121.
129.
140.
164.
188.
217.
231.
264.
289.
294.
];
B = regress(Y,[ones(X),X])
expected = [4.8222564;1.9671389];
// Visually check the result:
scf();
plot(X,Y,"bo");
L = B(1)+B(2)*X;
plot(X,L,"r-")
legend(["Data","Linear Fit"]);
xtitle("Linear Regression","X","Y");

// Compute a 95% confidence interval.
// Visually check the results.
// Visually check the residuals.
// Reference : [2]
Height = [
1.47 1.50 1.52 1.55 1.57 ..
1.60 1.63 1.65 1.68 1.70 ..
1.73 1.75 1.78 1.80 1.83
];
Mass = [
52.21 53.12 54.48 55.84 57.20 ..
58.57 59.93 61.29 63.11 64.47 ..
66.28 68.10 69.92 72.19 74.46
];
Height = Height(:);
Mass = Mass(:);
[B,bint] = regress(Mass,[ones(Height),Height])
Bexpected = [
-39.062
61.272
];
Bintexpected = [
-45.4 -32.7
57.4 65.1
];
// Visually check the result:
scf();
plot(Height,Mass,"bo");
L = B(1)+B(2)*Height;
plot(Height,L,"r-")
legend(["Data","Linear Fit"]);
xtitle("Linear Regression","Height","Mass");
// Compute the residuals
[B,bint,r] = regress(Mass,[ones(Height),Height]);
// Visually see the residuals
scf();
plot(Height,r,"bo");
xtitle("Linear Regression","Height","Residuals");
// A quadratic fit would be better:
// we see a quadratic pattern in the residuals.

// Longley.dat contains 1 Response Variable y,
// 6 Predictor Variables x and 16 Observations.
// Source : [4]
[data,txt] = getdata(24);
Y = data(:,1);
X = data(:,2:7);
[B,bint,r,rint,stats,fullstats] = regress(Y,X)
// Print an analysis of variance table
regresprint(fullstats)

// Create an example with known coefficients
// The number of observations
N = 100;
// The exact coefficients
Bexact = [2000;3;4]
// The input 1
X1min = 50;
X1max = 150;
X1 = distfun_unifrnd(X1min,X1max,N,1);
// The input 2
X2min = -500;
X2max = -200;
X2 = distfun_unifrnd(X2min,X2max,N,1);
// Make the observations
X = [ones(X1),X1,X2];
// Compute the output from the input
y = X*Bexact;
// Add a normal noise to the output
sigma = 50;
e = distfun_normrnd(0,sigma,N,1);
y = int(y + e);
// Linear regression
[B,bint,r,rint,stats,fullstats] = regress(y,X);
// Compare B (exact) with B (estimate)
[B,Bexact]
L = X*B;
// Check visually
scf();
plot(y,L,"bo");
plot(y,y,"r-");
title(msprintf("Regression - R2=%.2f%%",fullstats.R2*100))
xlabel("Observations");
ylabel("Predictions");

// A short example of univariate linear regression
N = 30;
X = distfun_normrnd(0.,1.,N,1);
noise = distfun_normrnd(0.,1.,N,1);
Y = 2+3*X+noise;
model = [ones(X),X];
B = regress(Y,model);
scf();
plot(X,model*B,"r-")
plot(X,Y,"bo")
xlabel("X")
ylabel("Y")
legend(["Regression","Observations"],loc=2);
// To see which points are there, please use:
// identify(X,Y,1)
// Click middle button to leave.
// Draw Parity Plot
[B,bint,r,rint,stats,fullstats] = regress(Y,model);
scf();
title(msprintf("Regression - R2=%.2f%%",fullstats.R2*100))
plot(Y,model*B,"bo")
plot(Y,Y,"r-")
xlabel("Observations")
ylabel("Predictions")
[1] "Introduction to probability and statistics for engineers and scientists.", Third Edition, Sheldon Ross, Elsevier Academic Press, 2004
[2] http://en.wikipedia.org/wiki/Linear_regression
[3] Octave's regress, William Poetra Yoga Hadisoeseno
[4] http://www.itl.nist.gov/div898/strd/lls/data/LINKS/DATA/Longley.dat