Select a predictor subset for regression
[Q, I, B, BB,R2] = lsselect(y,x,crit,how,pmax,level)
y : an m-by-1 matrix of doubles, the dependent variate (column vector)
x : an m-by-n matrix of doubles, the regressor variates
crit : (string) selection criterion: 'HT' : Hypothesis Test (default level = 0.05), 'AIC' : Akaike's Information Criterion, 'BIC' : Bayesian Information Criterion, 'CMV' : Cross Model Validation (inner criterion RSS) (default = 'CMV')
how : (string) chooses between 'AS' : All Subsets, 'FI' : Forward Inclusion, 'BE' : Backward Elimination (default = 'FI')
pmax : (scalar) limits the number of included parameters (default pmax = n)
level : (scalar) optional p-value reference used for inclusion or deletion with crit = 'HT' (default level = 0.05)
Q : the criterion as a function of the number of parameters; it may be interpreted as an estimate of the prediction standard deviation. For crit = 'HT', Q instead holds the successive p-values for inclusion or elimination.
I : a 1-by-n matrix of doubles, the index numbers of the included columns, i.e. the selected variables
B : an n-by-1 matrix of doubles, the vector of coefficients, i.e. the suggested model is Y = X*B
BB : an n-by-pmax matrix of doubles; column p of BB is the best B of parameter size p
R2 : a double, the R-squared statistic
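Selects a good subset of regressors in a multiple linear regression model. The selection criterion is determined by the third argument, crit, and is one of 'HT', 'AIC', 'BIC' or 'CMV'. The search strategy is determined by the fourth argument, how, and is one of 'AS', 'FI' or 'BE'.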
The fifth argument, pmax, limits the number of included parameters. The returned Q is the criterion as a function of the number of parameters; it might be interpreted as an estimate of the prediction standard deviation. For the method 'HT' the reported Q is instead the successive p-values for inclusion or elimination.
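As an illustration of how crit, how and pmax combine, the following sketch runs forward inclusion under BIC on synthetic data. The data, the seed and the variable names are made up for this example; only the lsselect call follows the calling sequence given above, and no particular output is claimed.

```scilab
// Hypothetical synthetic data: 30 observations and 4 candidate regressors,
// of which only columns 1 and 3 actually drive the response.
rand("seed", 0);
m = 30;
X = rand(m, 4, "normal");
Y = 2*X(:,1) - 3*X(:,3) + 5 + 0.1*rand(m, 1, "normal");

// Append the intercept column as the last column of X
X = [X, ones(m, 1)];

// Forward inclusion with BIC, limited to pmax = 3 parameters
[Q, I, B, BB, R2] = lsselect(Y, X, "BIC", "FI", 3);

disp(I);   // index numbers of the selected columns
disp(Q);   // criterion as a function of the number of parameters
mprintf("Number of candidate models stored in BB: %d\n", size(BB, "c"));
```

Rerunning the same call with a different crit or how on the same Y and X is a simple way to see how sensitive the selected subset I is to the choice of criterion.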
The last column of the regressor matrix x should be an intercept column, i.e. a column whose elements are all ones. This column is never excluded in the search for a good model. If it is not present, it is added. The output I contains the index numbers of the included columns. For the method 'HT', the optional input argument level is the p-value reference used for inclusion or deletion. Output B is the vector of coefficients, i.e. the suggested model is Y = X*B. Column p of BB is the best B of parameter size p.
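The sketch below shows one way the 'HT' criterion and the level argument might be used together, and how B can be plugged back into Y = X*B. The synthetic data and the value level = 0.01 are assumptions chosen for illustration, not recommendations.

```scilab
// Hypothetical synthetic data with an explicit intercept column
rand("seed", 1);
m = 40;
X = rand(m, 3, "normal");
Y = 1.5*X(:,2) + 4 + 0.2*rand(m, 1, "normal");
X = [X, ones(m, 1)];   // intercept column goes last

// Backward elimination driven by hypothesis tests,
// with a stricter p-value reference than the default 0.05
n = size(X, "c");
[Q, I, B] = lsselect(Y, X, "HT", "BE", n, 0.01);

disp(Q);   // for 'HT', Q holds the successive p-values
disp(I);   // index numbers of the columns kept in the model

// Evaluate the suggested model Y = X*B
res = Y - X*B;
mprintf("Root mean squared residual: %f\n", sqrt(sum(res.^2) / m));
```

Because pmax and level are positional, pmax has to be passed explicitly (here as n, its documented default) in order to reach the level argument.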
This function is not highly optimized for speed but rather for flexibility. It would be faster if 'all subsets' were in a separate routine and 'forward' and 'backward' were in another routine, especially for CMV.
// Longley.dat contains 1 Response Variable y,
// 6 Predictor Variables x and 16 Observations.
// Source : [4]
[data,txt] = getdata(24);
Y=data(:,1);
X=data(:,2:7);
// Add a column of 1s at the end
nobs=size(X,"r");
X = [X,ones(nobs,1)];
[Q,I,B,BB,R2]=lsselect(Y,X);
// Draw parity plot
scf();
title(msprintf("Regression with selection - R2=%.2f%%",R2*100))
plot(Y,X*B,"bo")
plot(Y,Y,"r-")
xlabel("Observations")
ylabel("Predictions")