7  Regression

7.1 Best predictor property

It turns out that the CEF m(\boldsymbol Z) = E[Y | \boldsymbol Z] is the best predictor for Y given the information contained in the random vector \boldsymbol Z:

Best predictor

The CEF m(\boldsymbol Z) = E[Y | \boldsymbol Z] minimizes the expected squared error E[(Y-g(\boldsymbol Z))^2] among all predictor functions g(\boldsymbol Z): m(\boldsymbol Z) = \argmin_{g(\boldsymbol Z)} E[(Y-g(\boldsymbol Z))^2]


Proof: Let us find the function g(\cdot) that minimizes E[(Y-g(\boldsymbol Z))^2]:

\begin{align*} &E[(Y-g(\boldsymbol Z))^2] = E[(Y-m(\boldsymbol Z) + m(\boldsymbol Z) - g(\boldsymbol Z))^2] \\ &= \underbrace{E[(Y-m(\boldsymbol Z))^2]}_{=(i)} + 2\underbrace{E[(Y-m(\boldsymbol Z))(m(\boldsymbol Z) - g(\boldsymbol Z))]}_{=(ii)} + \underbrace{E[(m(\boldsymbol Z) - g(\boldsymbol Z))^2]}_{(iii)} \end{align*}

  • The first term (i) does not depend on g(\cdot) and is finite if E[Y^2] < \infty.
  • For the second term (ii), we use the LIE and CT: \begin{align*} &E[(Y-m(\boldsymbol Z))(m(\boldsymbol Z) - g(\boldsymbol Z))] \\ &= E[E[(Y-m(\boldsymbol Z))(m(\boldsymbol Z) - g(\boldsymbol Z))| \boldsymbol Z]] \\ &= E[E[Y-m(\boldsymbol Z)| \boldsymbol Z](m(\boldsymbol Z) - g(\boldsymbol Z))] \\ &= E[(\underbrace{E[Y| \boldsymbol Z]}_{=m(\boldsymbol Z)} - m(\boldsymbol Z))(m(\boldsymbol Z) - g(\boldsymbol Z))] = 0 \end{align*}
  • The third term (iii) E[(m(\boldsymbol Z) - g(\boldsymbol Z))^2] is nonnegative and equals zero if and only if g(\boldsymbol Z) = m(\boldsymbol Z), so it is minimized by choosing g(\cdot) = m(\cdot).

Therefore, m(\boldsymbol Z) = E[Y| \boldsymbol Z] minimizes E[(Y-g(\boldsymbol Z))^2].


The best predictor for Y given \boldsymbol Z is m(\boldsymbol Z)= E[Y | \boldsymbol Z], but Y can typically be predicted only partially. The remaining prediction error (CEF error) is u = Y - E[Y | \boldsymbol Z]. The conditional expectation of the CEF error is zero and thus does not depend on \boldsymbol Z: \begin{align*} E[u | \boldsymbol Z] &= E[(Y - m(\boldsymbol Z)) | \boldsymbol Z] \\ &= E[Y | \boldsymbol Z] - E[m(\boldsymbol Z) | \boldsymbol Z] \\ &= m(\boldsymbol Z) - m(\boldsymbol Z) = 0. \end{align*}
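A small simulation can illustrate the best predictor property. The sketch below uses a hypothetical data-generating process with m(Z) = E[Y|Z] = Z^2; the CEF attains a smaller mean squared prediction error than an arbitrary alternative predictor such as g(Z) = 1 + 2Z.

```r
## Best predictor property: the CEF minimizes the expected squared error
## (simulated example with hypothetical DGP: m(Z) = Z^2)
set.seed(1)
n <- 1e5
Z <- rnorm(n)
Y <- Z^2 + rnorm(n)                     # CEF: m(Z) = E[Y | Z] = Z^2

mse_cef    <- mean((Y - Z^2)^2)         # squared-error loss of the CEF
mse_linear <- mean((Y - (1 + 2 * Z))^2) # squared-error loss of g(Z) = 1 + 2Z
c(CEF = mse_cef, linear = mse_linear)   # the CEF yields the smaller value
```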

7.2 Population regression

Consider the dependent variable Y_i and the regressor vector \boldsymbol X_i = (1, X_{i2}, \ldots, X_{ik})' for a representative individual i from the population. Assume the linear relationship: Y_i = \boldsymbol X_i' \boldsymbol \beta + u_i, where \boldsymbol \beta is the vector of population regression coefficients, and u_i is an error term satisfying E[\boldsymbol X_i u_i] = \boldsymbol 0.

The error term u_i accounts for factors affecting Y_i that are not included in the model, such as measurement errors, omitted variables, or unobserved/unmeasured variables. We assume all variables have finite second moments, ensuring that all covariances and cross-moments are finite.

To express \boldsymbol \beta in terms of population moments, compute: \begin{align*} E[\boldsymbol X_i Y_i] &= E[\boldsymbol X_i (\boldsymbol X_i' \boldsymbol \beta + u_i)] \\ &= E[\boldsymbol X_i \boldsymbol X_i'] \boldsymbol \beta + E[\boldsymbol X_i u_i]. \end{align*}

Since E[\boldsymbol X_i u_i] = \boldsymbol 0, it follows that E[\boldsymbol X_i Y_i] = E[\boldsymbol X_i \boldsymbol X_i'] \boldsymbol \beta. Assuming E[\boldsymbol X_i \boldsymbol X_i'] is invertible, we solve for \boldsymbol \beta: \boldsymbol \beta = E[\boldsymbol X_i \boldsymbol X_i']^{-1} E[\boldsymbol X_i Y_i]. Applying the method of moments, we estimate \boldsymbol \beta by replacing the population moments with their sample counterparts: \widehat{ \boldsymbol \beta} = \bigg( \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' \bigg)^{-1} \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i Y_i

This estimator \widehat{ \boldsymbol \beta} coincides with the OLS coefficient vector and is known as the OLS estimator or the method of moments estimator for \boldsymbol \beta.
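The sample-moment formula can be evaluated directly and compared with the built-in lm() fit. The following sketch uses simulated data with illustrative coefficient values; both computations give the same numbers up to numerical precision.

```r
## Method of moments / OLS estimator computed from sample moments (simulated data)
set.seed(42)
n  <- 1000
X2 <- rnorm(n)
X3 <- runif(n)
u  <- rnorm(n)
Y  <- 1 + 0.5 * X2 - 2 * X3 + u                          # true beta = (1, 0.5, -2)'

X <- cbind(1, X2, X3)                                     # regressor matrix with intercept
beta_hat <- solve(crossprod(X) / n, crossprod(X, Y) / n)  # (1/n sum X_i X_i')^{-1} (1/n sum X_i Y_i)

cbind(moments = drop(beta_hat), lm = coef(lm(Y ~ X2 + X3)))
```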

7.3 Linear regression model

Consider again the linear regression framework with dependent variable Y_i and regressor vector \boldsymbol X_i. The previous section shows that we can always write Y_i = m(\boldsymbol X_i) + u_i, \quad E[u_i | \boldsymbol X_i] = 0, where m(\boldsymbol X_i) is the CEF of Y_i given \boldsymbol X_i, and u_i is the CEF error.

In the linear regression model, we assume that the CEF is linear in \boldsymbol X_i, i.e. Y_i = \boldsymbol X_i'\boldsymbol \beta + u_i, \quad E[u_i | \boldsymbol X_i] = 0. From this equation, by the CT, it becomes clear that E[Y_i | \boldsymbol X_i] = E[\boldsymbol X_i'\boldsymbol \beta + u_i| \boldsymbol X_i] = \boldsymbol X_i'\boldsymbol \beta + E[u_i | \boldsymbol X_i] = \boldsymbol X_i'\boldsymbol \beta. Therefore, \boldsymbol X_i'\boldsymbol \beta is the best predictor for Y_i given \boldsymbol X_i.

Linear regression model

We assume that (Y_i, \boldsymbol X_i') satisfies Y_i = \boldsymbol X_i'\boldsymbol \beta + u_i, \quad i=1, \ldots, n, \tag{7.1} with

  • (A1) conditional mean independence: E[u_i | \boldsymbol X_i] = 0

  • (A2) random sampling: (Y_i, \boldsymbol X_i') are i.i.d. draws from their joint population distribution

  • (A3) large outliers unlikely: 0 < E[Y_i^4] < \infty, 0 < E[X_{il}^4] < \infty for all l=1, \ldots, k

  • (A4) no perfect multicollinearity: \sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' is invertible

In matrix notation, the model equation can be written as \boldsymbol Y = \boldsymbol X \boldsymbol \beta + \boldsymbol u, where \boldsymbol u = (u_1, \ldots, u_n)' is the error term vector, \boldsymbol Y is the dependent variable vector, and \boldsymbol X is the n \times k regressor matrix.

(A1) and (A2) define the structure of the regression model, while (A3) and (A4) ensure that OLS estimation is feasible and reliable.

7.3.1 Conditional mean independence (A1)

Assumption (A1) is fundamental to the regression model and has several key implications:

1) Zero unconditional mean

Using the Law of Iterated Expectations (LIE): E[u_i] \overset{\tiny (LIE)}{=} E[E[u_i | \boldsymbol X_i]] = E[0] = 0

The error term u_i has a zero unconditional mean.

2) Linear best predictor

The conditional mean of Y_i given \boldsymbol X_i is: \begin{align*} E[Y_i | \boldsymbol X_i] &= E[\boldsymbol X_i'\boldsymbol \beta + u_i | \boldsymbol X_i] \\ &\overset{\tiny (CT)}{=} \boldsymbol X_i'\boldsymbol \beta + E[u_i | \boldsymbol X_i] \\ &= \boldsymbol X_i'\boldsymbol \beta \end{align*}

Under (A1), the regression function \boldsymbol X_i'\boldsymbol \beta is the CEF of Y_i and therefore the best predictor of Y_i given \boldsymbol X_i. The expected value of Y_i is a linear function of the regressors.

3) Marginal effect interpretation

From the linearity of the conditional expectation: E[Y_i | \boldsymbol X_i] = \boldsymbol X_i'\boldsymbol \beta = \beta_1 + \beta_2 X_{i2} + \ldots + \beta_k X_{ik}. The partial derivative with respect to X_{ij} is: \frac{\partial E[Y_i | \boldsymbol X_i]}{ \partial X_{ij}} = \beta_j

The coefficient \beta_j represents the marginal effect of a one-unit increase in X_{ij} on the expected value of Y_i, holding all other variables constant.

Note: This marginal effect is not necessarily causal. Unobserved factors correlated with X_{ij} may influence Y_i, so \beta_j captures both the direct effect of X_{ij} and the indirect effect through these unobserved variables.
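As an illustration (simulated data, illustrative coefficients), the fitted coefficient on a regressor equals the change in the predicted conditional mean when that regressor increases by one unit while the other regressors are held fixed.

```r
## Marginal effect interpretation of a fitted coefficient (simulated example)
set.seed(7)
n  <- 500
X2 <- rnorm(n)
X3 <- rnorm(n)
Y  <- 2 + 0.5 * X2 + 1.5 * X3 + rnorm(n)
fit <- lm(Y ~ X2 + X3)

## Raise X2 by one unit, holding X3 fixed: the predicted value changes by beta_2
predict(fit, data.frame(X2 = 2, X3 = 0)) - predict(fit, data.frame(X2 = 1, X3 = 0))
coef(fit)["X2"]                          # same number
```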

4) Weak exogeneity

Using the definition of covariance: Cov(u_i, X_{il}) = E[u_i X_{il}] - E[u_i] E[X_{il}]. Since E[u_i] = 0: Cov(u_i, X_{il}) = E[u_i X_{il}]. Applying the LIE and the CT: \begin{align*} E[u_i X_{il}] &= E[E[u_i X_{il}|\boldsymbol X_i]] \\ &= E[X_{il} E[u_i|\boldsymbol X_i]] \\ &= E[X_{il} \cdot 0] = 0 \end{align*}

The error term u_i is uncorrelated with each regressor X_{il}. This property is known as weak exogeneity. It indicates that u_i captures unobserved factors that do not systematically vary with the observed regressors.

Note: Weak exogeneity does not rule out the presence of unobserved variables that affect both Y_i and \boldsymbol X_i. The coefficient \beta_j reflects the average relationship between \boldsymbol X_i and Y_i, including any indirect effects from unobserved factors that are correlated with \boldsymbol X_i.
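In a simulation where E[u_i | \boldsymbol X_i] = 0 holds by construction, the sample covariance between the error and a regressor is close to zero even though the error's variance depends on that regressor. A minimal sketch under these assumptions:

```r
## Weak exogeneity: Cov(u_i, X_il) = 0 when E[u_i | X_i] = 0 (simulated check)
set.seed(3)
n  <- 1e5
X2 <- rnorm(n)
u  <- (1 + abs(X2)) * rnorm(n)   # conditional mean zero, variance depends on X2
cov(u, X2)                       # close to zero despite the dependence in the variance
```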

7.3.2 Random sampling (A2)

1) Strict exogeneity

The i.i.d. assumption (A2) implies that \{(Y_i, \boldsymbol X_i',u_i), i=1,\ldots,n\} is an i.i.d. collection, since u_i = Y_i - \boldsymbol X_i'\boldsymbol \beta is a function of the i-th observation only, and functions of independent random vectors are themselves independent.

Therefore, u_i and \boldsymbol X_j are independent for i \neq j. The independence rule (IR) implies E[u_i | \boldsymbol X_1, \ldots, \boldsymbol X_n] = E[u_i | \boldsymbol X_i].

The weak exogeneity condition (A1) turns into a strict exogeneity property: E[u_i | \boldsymbol X] = E[u_i | \boldsymbol X_1, \ldots, \boldsymbol X_n] \overset{\tiny (A2)}{=} E[u_i | \boldsymbol X_i] \overset{\tiny (A1)}{=} 0. Additionally, Cov(u_j, X_{il}) = \underbrace{E[u_j X_{il}]}_{=0} - \underbrace{E[u_j]}_{=0} E[X_{il}] = 0.

Weak exogeneity means that the regressors of individual i are uncorrelated with the error term of the same individual i. Strict exogeneity means that the regressors of individual i are uncorrelated with the error terms of any individual j in the sample.

2) Heteroskedasticity

The i.i.d. assumption (A2) is not as restrictive as it may seem at first sight. It allows for dependence between u_i and \boldsymbol X_i = (1,X_{i2},\dots,X_{ik})'. The error term u_i can have a conditional distribution that depends on \boldsymbol X_i.

The exogeneity assumption (A1) requires that the conditional mean of u_i is independent of \boldsymbol X_i. Besides this, dependencies between u_i and X_{i2},\dots,X_{ik} are allowed. For instance, the variance of u_i can be a function of X_{i2},\dots,X_{ik}. If this is the case, u_i is said to be heteroskedastic.

The conditional variance is defined analogously to the unconditional variance: Var[Y|\boldsymbol Z] = E[(Y-E[Y|\boldsymbol Z])^2|\boldsymbol Z] = E[Y^2|\boldsymbol Z] - E[Y|\boldsymbol Z]^2.

The conditional variance of the error is: Var[u_i | \boldsymbol X] = E[u_i^2 | \boldsymbol X] \overset{\tiny (A2)}{=} E[u_i^2 | \boldsymbol X_i] =: \sigma_i^2 = \sigma^2(\boldsymbol X_i). A more restrictive assumption is homoskedasticity, which requires that the variance of u_i does not vary with the value of \boldsymbol X_i: Var[u_i | \boldsymbol X] = \sigma^2. Homoskedasticity is sometimes assumed in addition to (A1) and (A2) for convenience, but it is often unrealistic in practice, so we stick with the heteroskedastic errors framework.
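A minimal sketch of heteroskedastic errors: in the simulated example below, the conditional standard deviation of u_i is a function of X_{i2}, while the conditional mean remains zero, so (A1) is not violated.

```r
## Heteroskedastic errors: Var[u_i | X_i] = sigma^2(X_i) (simulated sketch)
set.seed(10)
n  <- 1000
X2 <- runif(n, 0, 2)
u  <- sqrt(0.5 + X2^2) * rnorm(n)   # E[u | X2] = 0, Var[u | X2] = 0.5 + X2^2
Y  <- 1 + 2 * X2 + u
plot(X2, Y, pch = 20, cex = 0.5)    # the spread of Y around the line grows with X2
```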

3) No autocorrelation

(A2) implies that u_i is independent of u_j for i \neq j, and therefore E[u_i | u_j, \boldsymbol X] = E[u_i | \boldsymbol X] = 0 by the IR. This implies

E[u_i u_j | \boldsymbol X] \overset{\tiny (LIE)}{=} E\big[E[u_i u_j | u_j, \boldsymbol X] | \boldsymbol X\big] \overset{\tiny (CT)}{=} E\big[u_j \underbrace{E[u_i | u_j, \boldsymbol X]}_{=0} | \boldsymbol X\big] = 0, and, therefore, Cov(u_i, u_j) = E[u_i u_j] \overset{\tiny (LIE)}{=} E[E[u_i u_j | \boldsymbol X]] = 0.

The conditional covariance matrix of the error term vector \boldsymbol u is \boldsymbol D := Var[\boldsymbol u | \boldsymbol X] = E[\boldsymbol u \boldsymbol u' | \boldsymbol X] = \begin{pmatrix} \sigma_1^2 & 0 & \ldots & 0 \\ 0 & \sigma_2^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_n^2 \end{pmatrix}.

It is a diagonal matrix with conditional variances on the main diagonal. We also write \boldsymbol D = diag(\sigma_1^2, \ldots, \sigma_n^2).

7.3.3 Finite moments and invertibility (A3 + A4)

Assuming (A3) excludes frequently occurring large outliers as it rules out heavy-tailed distributions. Hence, we should be careful if we use variables with large kurtosis. Assuming (A4) ensures that the OLS estimator \widehat{\boldsymbol \beta} can be computed.

7.3.3.1 Unbiasedness

(A4) ensures that \widehat{\boldsymbol \beta} is well defined. The following decomposition is useful: \begin{align*} \widehat{\boldsymbol \beta} &= (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol Y \\ &= (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' (\boldsymbol X \boldsymbol \beta + \boldsymbol u) \\ &= (\boldsymbol X' \boldsymbol X)^{-1} (\boldsymbol X' \boldsymbol X) \boldsymbol \beta + (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u \\ &= \boldsymbol \beta + (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u. \end{align*} Strict exogeneity implies E[\boldsymbol u | \boldsymbol X] = \boldsymbol 0, and E[\widehat{ \boldsymbol \beta} - \boldsymbol \beta | \boldsymbol X] = E[(\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u | \boldsymbol X] \overset{\tiny (CT)}{=} (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \underbrace{E[\boldsymbol u | \boldsymbol X]}_{= \boldsymbol 0} = \boldsymbol 0. By the LIE, E[\widehat{ \boldsymbol \beta}] = E[E[\widehat{ \boldsymbol \beta} | \boldsymbol X]] = E[\boldsymbol \beta] = \boldsymbol \beta.

Hence, the OLS estimator is unbiased: Bias[\widehat{\boldsymbol \beta}] = E[\widehat{\boldsymbol \beta}] - \boldsymbol \beta = \boldsymbol 0.
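A quick Monte Carlo check (simulated data, heteroskedastic errors with E[u_i | \boldsymbol X_i] = 0) shows that the average of the OLS estimates across many samples is close to the true coefficients:

```r
## Monte Carlo illustration of unbiasedness of the OLS estimator
set.seed(123)
nrep <- 2000
n    <- 200
est <- replicate(nrep, {
  X2 <- rnorm(n)
  u  <- (1 + abs(X2)) * rnorm(n)   # heteroskedastic, but E[u | X2] = 0
  Y  <- 1 + 0.5 * X2 + u           # true beta = (1, 0.5)'
  coef(lm(Y ~ X2))
})
rowMeans(est)                      # approximately (1, 0.5)
```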

7.3.3.2 Conditional variance

Recall the matrix rule Var[\boldsymbol A \boldsymbol Z] = \boldsymbol A Var[\boldsymbol Z] \boldsymbol A' if \boldsymbol Z is a random vector and \boldsymbol A is a matrix. Then, \begin{align*} Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] &= Var[\boldsymbol \beta + (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u | \boldsymbol X] \\ &= Var[(\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u | \boldsymbol X] \\ &= (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' Var[\boldsymbol u | \boldsymbol X] ((\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X')' \\ &= (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' \boldsymbol D \boldsymbol X (\boldsymbol X'\boldsymbol X)^{-1}. \end{align*}
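In a simulation the conditional variances \sigma_i^2 are known, so the sandwich formula can be evaluated exactly; in practice \boldsymbol D is unknown and has to be estimated, which motivates heteroskedasticity-robust standard errors. A sketch with an illustrative variance function:

```r
## Conditional variance of the OLS estimator with known sigma_i^2 (simulation only)
set.seed(5)
n  <- 500
X2 <- runif(n, 0, 2)
sigma2 <- 0.5 + X2^2                  # sigma_i^2 = Var[u_i | X_i] (assumed for illustration)
X  <- cbind(1, X2)

XtX_inv <- solve(crossprod(X))        # (X'X)^{-1}
D       <- diag(sigma2)               # D = diag(sigma_1^2, ..., sigma_n^2)
V_beta  <- XtX_inv %*% t(X) %*% D %*% X %*% XtX_inv   # (X'X)^{-1} X'DX (X'X)^{-1}
sqrt(diag(V_beta))                    # conditional standard deviations of the estimator
```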

7.3.3.3 Consistency

The conditional variance can be written as \begin{align*} Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] &= \frac{1}{n} \Big( \frac{1}{n} \boldsymbol X' \boldsymbol X \Big)^{-1} \Big(\frac{1}{n} \boldsymbol X' \boldsymbol D \boldsymbol X \Big) \Big( \frac{1}{n} \boldsymbol X' \boldsymbol X \Big)^{-1} \\ &= \frac{1}{n} \Big( \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' \Big)^{-1} \Big(\frac{1}{n} \sum_{i=1}^n \sigma_i^2 \boldsymbol X_i \boldsymbol X_i' \Big) \Big( \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' \Big)^{-1} \end{align*}

It can be shown, by the multivariate law of large numbers, that \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' \overset{p}{\rightarrow} E[\boldsymbol X_i \boldsymbol X_i'] and \frac{1}{n} \sum_{i=1}^n \sigma_i^2 \boldsymbol X_i \boldsymbol X_i' \overset{p}{\rightarrow} E[\sigma_i^2 \boldsymbol X_i \boldsymbol X_i']. For this to hold we need bounded fourth moments, i.e. (A3). In total, we have \begin{align*} &\Big( \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' \Big)^{-1} \Big(\frac{1}{n} \sum_{i=1}^n \sigma_i^2 \boldsymbol X_i \boldsymbol X_i' \Big) \Big( \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' \Big)^{-1} \\ &\overset{p}{\rightarrow} E[\boldsymbol X_i \boldsymbol X_i']^{-1} E[\sigma_i^2 \boldsymbol X_i \boldsymbol X_i'] E[\boldsymbol X_i \boldsymbol X_i']^{-1}. \end{align*} Note that the conditional variance Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] has an additional factor 1/n, which converges to zero for large n. Therefore, we have Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] \overset{p}{\rightarrow} \boldsymbol 0, which also holds for the unconditional variance, i.e. Var[\widehat{ \boldsymbol \beta}] \to \boldsymbol 0.

Therefore, since the bias is zero and the variance converges to zero, the sufficient conditions for consistency are fulfilled. The OLS estimator \widehat{\boldsymbol \beta} is a consistent estimator for \boldsymbol \beta under (A1)–(A4).
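The shrinking variance is visible in a simple simulation (illustrative DGP): as n grows, the OLS estimates concentrate around the true coefficients.

```r
## Consistency: OLS estimates concentrate around beta = (1, 0.5)' as n grows
set.seed(99)
for (n in c(100, 1000, 10000, 100000)) {
  X2 <- rnorm(n)
  u  <- (1 + abs(X2)) * rnorm(n)   # heteroskedastic, E[u | X2] = 0
  Y  <- 1 + 0.5 * X2 + u
  print(c(n = n, coef(lm(Y ~ X2))))
}
```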

7.4 R-codes

statistics-sec06.R