Regression Analysis with Matlab
Introduction
In this note we demonstrate the use of matrices for regression analysis. The multivariate regression model is:
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki} + u_i, \qquad i = 1,\ldots,n$$
Thus, the outcome of observation i, $Y_i$, is a linear function of regressors $X_{2i},\ldots,X_{ki}$, plus an error term $u_i$. Let $b_1,\ldots,b_k$ be some estimators of the unknown coefficients $\beta_1,\ldots,\beta_k$. We define the fitted equation, which gives the predicted values of the outcome for given estimates:
$$\hat{Y}_i = b_1 + b_2 X_{2i} + \cdots + b_k X_{ki}$$
The residual of observation i is $e_i = Y_i - \hat{Y}_i$. The OLS (Ordinary Least Squares) estimator of $\beta_1,\ldots,\beta_k$ is the vector $b = (b_1,\ldots,b_k)$ which minimizes the sum of squared residuals, i.e. solves:
$$\min_{b_1,\ldots,b_k} \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - b_1 - b_2 X_{2i} - \cdots - b_k X_{ki}\right)^2$$
The first order necessary conditions for this quadratic optimization problem form a system of k linear equations with k unknowns, $b_1,\ldots,b_k$.
Statistics and Machine Learning Toolbox
Matlab's Statistics and Machine Learning Toolbox can estimate multivariate linear regression models with the command fitlm, which is similar to lm() in R.
Reading the data from the web
wage = webread("https://dl.dropboxusercontent.com/s/32m1t3iyno319j6/wage21.csv?dl=0");
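The data is imported directly into a table. Before estimating, it is worth glancing at the result (head() accepts tables in recent Matlab releases):
head(wage) %First eight rows of the imported table
summary(wage) %Variable names, classes, and basic statistics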
Estimating the model
The next command generates output very similar to that of R.
m = fitlm(wage,'EARNINGS ~ S + EXP');
m
m =

Linear regression model:
    EARNINGS ~ 1 + S + EXP

Estimated Coefficients:
                   Estimate       SE        tStat       pValue
                   ________    _______    _______    __________

    (Intercept)     -40.984     7.4496    -5.5015     5.837e-08
    S                4.1828    0.39843     10.498    1.4122e-23
    EXP             0.73835     0.2243     3.2919     0.0010607

Number of observations: 540, Error degrees of freedom: 537
Root Mean Squared Error: 22.8
R-squared: 0.171, Adjusted R-Squared: 0.168
F-statistic vs. constant model: 55.5, p-value = 1.21e-22
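The fitted model m is a LinearModel object whose properties hold the estimation results. For example, one can extract the coefficient vector and predict earnings for a new observation (the values 12 and 5 below are made up for illustration):
b_fitlm = m.Coefficients.Estimate; %Column vector of the estimated coefficients
newdata = table(12, 5, 'VariableNames', {'S','EXP'}); %Hypothetical worker: 12 years of schooling, 5 of experience
predict(m, newdata) %Predicted EARNINGS for the new observation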
Matrix Algebra
In this section we demonstrate what statistical packages do when they estimate the multivariate regression model.
OLS problem in matrix notation
The model in the introduction can be presented in matrix form. Define
$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & X_{21} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{2n} & \cdots & X_{kn} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad u = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}$$
and the model becomes
$$Y = X\beta + u$$
The fitted equation and residuals of a vector of estimates $b$ are
$$\hat{Y} = Xb, \qquad e = Y - \hat{Y} = Y - Xb$$
The OLS problem in matrix form is therefore:
$$\min_b \; e'e = (Y - Xb)'(Y - Xb)$$
The solution is
$$b = (X'X)^{-1}X'Y$$
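This formula follows from the first order condition: differentiating the sum of squared residuals with respect to $b$ gives the normal equations (a standard derivation, sketched here for completeness):
$$\frac{\partial\, e'e}{\partial b} = -2X'Y + 2X'Xb = 0 \quad\Longrightarrow\quad X'Xb = X'Y$$
provided that $X'X$ is invertible, i.e. the regressors are not perfectly collinear.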
Solving the OLS problem with matrix algebra
We prepare the vector of the dependent variable Y and the design matrix X, containing a column of 1s and the regressors S and EXP.
Y = wage(:,{'EARNINGS'}); %Extracting the dependent variable
X = wage(:,{'S','EXP'}); %Extracting the needed regressors
Y = table2array(Y); %Converting to matrix
X = table2array(X); %Converting to matrix
X = [ones(length(X),1), X]; %Adding a column of 1s to X (the intercept)
Next we use the analytical solution for the OLS coefficients, $b = (X'X)^{-1}X'Y$.
b = (X'*X)\X'*Y %Gaussian elimination (faster and more accurate than explicit inversion)
b_inv = inv(X'*X)*X'*Y %Matrix inversion
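Both commands should reproduce the coefficients reported by fitlm above; a quick check against the Coefficients table of the fitted model:
max(abs(b - m.Coefficients.Estimate)) %Maximal discrepancy, should be numerically zero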
Matrix algebra is used to obtain elegant formulas for the decomposition of the sum of squares:
$$\underbrace{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}_{TSS} = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_{i=1}^{n} e_i^2}_{RSS}$$
In matrix form, these become:
$$TSS = Y'Y - n\bar{Y}^2, \qquad ESS = b'X'Xb - n\bar{Y}^2, \qquad RSS = e'e$$
The estimated covariance matrix of the OLS estimators is:
$$\widehat{\mathrm{cov}}(b) = s^2 (X'X)^{-1}, \qquad s^2 = \frac{e'e}{n-k}$$
where n is the number of observations and k is the number of estimated coefficients. Thus, the standard errors of the OLS estimators are the square roots of the diagonal of the covariance matrix.
Implementing the above in Matlab is straightforward.
Y_hat = X*b; %Fitted (predicted) values
e = Y - Y_hat; %Residuals
n = size(X,1); %Number of observations
k = size(X,2); %Number of estimated coefficients (including the intercept)
C = (X'*X)\eye(k)*(e'*e)/(n-k); % cov(b)
se = sqrt(diag(C)) %Standard errors of b
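With the standard errors in hand, the t-statistics and p-values of the fitlm coefficient table can be replicated (a sketch using tcdf from the same toolbox):
t = b./se; %t-statistics for H0: coefficient equals zero
p = 2*(1 - tcdf(abs(t), n-k)); %Two-sided p-values, t distribution with n-k degrees of freedom
disp([b, se, t, p]) %Should reproduce the coefficient table of fitlm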
TSS = Y'*Y - n*mean(Y)^2; %Total Sum of Squares
ESS = b'*(X'*X)*b - n*mean(Y)^2; %Explained Sum of Squares
RSS = e'*e; %Residuals Sum of Squares
R2 = ESS/TSS; %Ordinary R-squared
R2_adj = 1 - (1-R2)*(n-1)/(n-k); %Adjusted R-squared
disp(['TSS = ', num2str(TSS)])
disp(['ESS = ', num2str(ESS)])
disp(['RSS = ', num2str(RSS)])
disp(['R^2 = ', num2str(R2)])
disp(['R^2_adj = ', num2str(R2_adj)])
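The F-statistic that fitlm reports against the constant-only model can be computed from the same sums of squares (fcdf is part of the same toolbox):
F = (ESS/(k-1))/(RSS/(n-k)); %F-statistic vs. constant model
pF = 1 - fcdf(F, k-1, n-k); %p-value with k-1 and n-k degrees of freedom
disp(['F = ', num2str(F), ', p-value = ', num2str(pF)])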
Comparing our results to the ones obtained by Matlab's fitlm() function:
disp(['TSS = ', num2str(m.SST)]) %Total Sum of Squares
disp(['ESS = ', num2str(m.SSR)]) %Explained Sum of Squares
disp(['RSS = ', num2str(m.SSE)]) %Residuals Sum of Squares
disp('R^2 = ') %R-squared (ordinary and adjusted)
disp(m.Rsquared)
Ordinary: 0.1713
Adjusted: 0.1683
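The standard errors and the root mean squared error can be checked the same way (the SE column of m.Coefficients and m.RMSE are properties of the fitted model):
disp([se, m.Coefficients.SE]) %Hand-computed vs. fitlm standard errors
disp([sqrt(RSS/(n-k)), m.RMSE]) %Hand-computed vs. fitlm root mean squared error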