lsselect

Select a predictor subset for regression

Calling Sequence

[Q, I, B, BB,R2] = lsselect(y,x,crit,how,pmax,level)

Parameters

y :

an m-by-1 matrix of doubles, the dependent variate (column vector)

x :

an m-by-n matrix of doubles, the regressor variates

crit :

a string, the selection criterion (default = 'CMV'):
'HT' : Hypothesis Test (default level = 0.05)
'AIC' : Akaike's Information Criterion
'BIC' : Bayesian Information Criterion
'CMV' : Cross Model Validation (inner criterion RSS)

how :

a string, the search strategy (default = 'FI'):
'AS' : All Subsets
'FI' : Forward Inclusion
'BE' : Backward Elimination

pmax :

a scalar, the maximum number of included parameters (default pmax = n).

level :

a double, optional; the p-value reference used for inclusion or deletion when crit is 'HT' (default = 0.05).

Q :

criterion as a function of the number of parameters; might be interpreted as an estimate of the prediction standard deviation. For the method 'HT', Q is instead the successive p-values for inclusion or elimination.

I :

a 1-by-n matrix of doubles, index numbers of the included columns, i.e. the selected variables.

B :

an n-by-1 matrix of doubles, the vector of coefficients, i.e. the suggested model is Y = X*B.

BB :

an n-by-pmax matrix of doubles; column p of BB is the best B of parameter size p.

R2 :

a double, the R-squared statistic

Description

Selects a good subset of regressors in a multiple linear regression model. The selection criterion is determined by the third argument, crit, and the search strategy by the fourth argument, how; the accepted values of both are listed under Parameters.
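
As an illustration of how crit and how combine, the following sketch runs a backward elimination under the AIC criterion. It assumes the stixbox module is loaded; the synthetic data, the seed and the variable names are assumptions made for this example only.

```scilab
// Synthetic data (assumption for illustration):
// y depends on columns 1 and 3 of x only
m = 50;
rand("seed", 0);
x = rand(m, 4, "normal");
y = 2*x(:,1) - 3*x(:,3) + 1 + 0.1*rand(m, 1, "normal");
// Intercept column placed last
X = [x, ones(m,1)];
// Third argument: criterion; fourth argument: search strategy
[Q, I, B, BB, R2] = lsselect(y, X, "AIC", "BE");
disp(I);   // index numbers of the selected columns
disp(B');  // coefficients of the suggested model
```

Swapping "BE" for "AS" or "FI" changes only the search strategy; the criterion used to compare models stays the same.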

The fifth argument, pmax, limits the number of included parameters. The returned Q is the criterion as a function of the number of parameters; it might be interpreted as an estimate of the prediction standard deviation. For the method 'HT' the reported Q is instead the successive p-values for inclusion or elimination.

The last column of the regressor matrix x must be an intercept column, i.e. all of its elements are ones. This column is never excluded in the search for a good model. If the intercept column is not present, it is added. The output I contains the index numbers of the included columns. For the method 'HT', the optional input argument level is the p-value reference used for inclusion or deletion. The output B is the vector of coefficients, i.e. the suggested model is Y = X*B. Column p of BB is the best B of parameter size p.
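
The sketch below shows one way to pass level with the 'HT' criterion. Since level is the sixth argument, pmax is given explicitly, here set to the number of columns of X, which matches its documented default. The data, the seed and the 0.01 threshold are assumptions chosen for illustration, and the stixbox module is assumed to be loaded.

```scilab
// Synthetic data (assumption for illustration):
// y depends on column 2 of x only
m = 40;
rand("seed", 1);
x = rand(m, 3, "normal");
y = 1.5*x(:,2) + 0.5 + 0.2*rand(m, 1, "normal");
// The intercept column of ones goes last
X = [x, ones(m,1)];
n = size(X, "c");
// Forward inclusion with hypothesis tests at a stricter 0.01 level
[Q, I, B] = lsselect(y, X, "HT", "FI", n, 0.01);
// Predictions from the suggested model Y = X*B
yhat = X*B;
mprintf("Norm of the residuals: %f\n", norm(y - yhat));
```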

This function is not highly optimized for speed but rather for flexibility. It would be faster if 'all subsets' were in a separate routine and 'forward' and 'backward' were in another routine, especially for CMV.

Examples

// Longley.dat contains 1 Response Variable y,
// 6 Predictor Variables x and 16 Observations.
// Source : [4]
[data,txt] = getdata(24);
Y=data(:,1);
X=data(:,2:7);
// Add a column of 1s at the end
nobs=size(X,"r");
X = [X,ones(nobs,1)];
[Q,I,B,BB,R2]=lsselect(Y,X);
// Draw parity plot
scf();
title(msprintf("Regression with selection - R2=%.2f%%",R2*100))
plot(Y,X*B,"bo")
plot(Y,Y,"r-")
xlabel("Observations")
ylabel("Predictions")
