PS03

Due date: Tuesday, 11:59 p.m.

Instructions

Complete the questions below and show all work. You may either type or handwrite your answers. However, you must submit your problem set to Canvas as an HTML or PDF file, meaning any handwritten solutions must be scanned and uploaded. The onus is yours to deliver an organized, clear, and legible submission of the correct file type that loads properly on Canvas.¹ Double-check your submissions!

Integrity: If you are suspected of cheating, you will receive a zero—for the assignment and possibly for the course. Cheating includes copying work from your classmates, from the internet, and from previous problem sets. You are encouraged to work with your peers, but everyone must submit their own answers. Remember, the problem sets are designed to help you study for the midterm. By cheating, you do yourself a disservice.

Questions

Q01. Goodness of fit

Recall the following definitions:

\[ TSS = \sum_{i=1}^n (Y_i - \bar{Y})^2 \]

\[ ESS = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 \]

\[ RSS = \sum_{i=1}^n \hat{u}_i^2 \]

a. In words, what do \(TSS\), \(ESS\), and \(RSS\) describe?

Solution. TSS measures the total variation of the dependent variable around its mean. ESS is the portion of that variation explained by the model: the sum of squared differences between the predicted values and the mean of the dependent variable. RSS is the unexplained portion: the sum of squared residuals, that is, squared differences between the actual values and the predicted values from the model.

b. Show that \(TSS = ESS + RSS\).

Hint: The following property of the residuals may be useful: \(\sum_{i=1}^n \hat{u}_i = 0\).

Solution. Write \(Y_i = \hat{Y}_i + \hat{u}_i\). Since \(\sum_{i=1}^n \hat{u}_i = 0\), we have \(\bar{\hat{u}} = 0\), which implies \(\bar{Y} = \bar{\hat{Y}}\). Then

\[\begin{align*} \text{TSS} &= \sum_{i=1}^n (Y_i - \bar{Y})^2 \\ &= \sum_{i=1}^n \left( [\hat{Y}_i + \hat{u}_i] - [\bar{\hat{Y}} + \bar{\hat{u}}] \right)^2 \\ &= \sum_{i=1}^n \left( [\hat{Y}_i - \bar{Y}] + \hat{u}_i \right)^2 \\ &= \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n \hat{u}_i^2 + 2 \sum_{i=1}^n (\hat{Y}_i - \bar{Y})\hat{u}_i \\ &= \text{ESS} + \text{RSS} + 2 \sum_{i=1}^n \hat{Y}_i\hat{u}_i - 2 \bar{Y}\sum_{i=1}^n \hat{u}_i \end{align*}\]

It remains to show that \(2 \sum_{i=1}^n \hat{Y}_i\hat{u}_i - 2 \bar{Y}\sum_{i=1}^n \hat{u}_i = 0\). We use two properties of OLS: \(\sum_{i=1}^n \hat{u}_i=0\) and \(\sum_{i=1}^n X_i \hat{u}_i=0\). By the first property, \(- 2 \bar{Y}\sum_{i=1}^n \hat{u}_i=0\). For the remaining term, substitute \(\hat{Y}_i=\hat{\beta}_0 + \hat{\beta}_1X_i\):

\[\begin{align*} 2 \sum_{i=1}^n \hat{Y}_i\hat{u}_i - 2 \bar{Y}\sum_{i=1}^n \hat{u}_i &= 2 \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1X_i)\hat{u}_i \\ &= 2 \hat{\beta}_0 \sum_{i=1}^n \hat{u}_i + 2 \hat{\beta}_1 \sum_{i=1}^n X_i\hat{u}_i = 0 \end{align*}\]

Both sums in the last line are zero by the two properties, so \(TSS = ESS + RSS\).
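The decomposition can also be checked numerically. The sketch below is only an illustration: the five data points are made up (they are not from the problem set), and the variable names are hypothetical. It fits a simple regression by the usual OLS formulas and compares \(TSS\) with \(ESS + RSS\).

```python
# Hypothetical toy data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept for a simple regression of y on x.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Fitted values and residuals.
y_hat = [b0 + b1 * xi for xi in x]
u_hat = [yi - yh for yi, yh in zip(y, y_hat)]

tss = sum((yi - y_bar) ** 2 for yi in y)
ess = sum((yh - y_bar) ** 2 for yh in y_hat)
rss = sum(u ** 2 for u in u_hat)

# The two OLS properties used in the proof (zero up to rounding error).
sum_resid = sum(u_hat)
sum_x_resid = sum(xi * u for xi, u in zip(x, u_hat))

print(f"sum of residuals:     {sum_resid:.2e}")
print(f"sum of X * residuals: {sum_x_resid:.2e}")
print(f"TSS = {tss:.6f}, ESS + RSS = {ess + rss:.6f}")
```

Any dataset with variation in `x` should give the same agreement, up to floating-point rounding, because the result depends only on the two residual properties.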

Q02. Wage Regression

Consider a dataset obtained from a labor economics study that investigates the impact of years of education on individuals’ wages. The dataset includes a random sample of workers in a specific region. The following regression equation estimates the relationship between wages (measured in thousands of dollars) and years of education:

\[ \text{Wage}_i = \beta_0 + \beta_1 \times \text{Education}_i + u_i \]

From the regression output, you have the following estimates:

\[ \widehat{\text{Wage}} = 10 + 2 \times \text{Education} \]

  • Standard error of \(\hat{\beta}_1\), \(SE(\hat{\beta}_1) = 0.5\)
  • \(R^2 = 0.12\)
  • Number of observations, \(n = 150\)

a. Interpret the Estimates: Interpret the intercept and slope coefficients in the context of the model.

Solution. The intercept \(\hat{\beta}_0 = 10\) is the predicted wage (in thousands of dollars) for a person with zero years of education, that is, $10,000. The slope \(\hat{\beta}_1 = 2\) means that each additional year of education is associated with an increase in wage of $2,000 on average.

b. Predicted Outcome: If an individual has 12 years of education, what is the predicted wage according to the model?

Solution. Plugging \(X = 12\) into the regression equation, the predicted wage is \(10 + 2 \times 12 = 34\) thousand dollars.

c. Effect of Changing \(X\): Suppose a worker is deciding whether or not to complete her associate’s degree (two years). What change in her wage does the model predict? In other words, what is her expected increase in wage if she completes her associate’s degree?

Solution. The predicted change in wage is \(2 \times 2 = 4\) thousand dollars, reflecting the increase for 2 additional years of education.
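The arithmetic in parts b and c can be reproduced with a few lines of code. This is a minimal sketch using the estimates given in the question; the function name `predict_wage` and the 12-year starting point in the second calculation are illustrative assumptions.

```python
B0_HAT = 10.0  # estimated intercept, in thousands of dollars
B1_HAT = 2.0   # estimated slope, thousands of dollars per year of education


def predict_wage(education_years):
    """Predicted wage (thousands of dollars) from the fitted line."""
    return B0_HAT + B1_HAT * education_years


# Part b: predicted wage at 12 years of education.
wage_at_12 = predict_wage(12)

# Part c: predicted change from two more years. The starting level (12 here,
# an arbitrary choice) does not matter because the model is linear.
change_two_years = predict_wage(14) - predict_wage(12)

print(f"Predicted wage at 12 years: {wage_at_12} thousand dollars")
print(f"Predicted change from 2 more years: {change_two_years} thousand dollars")
```

Because the fitted line has a constant slope, the predicted change is always \(\hat{\beta}_1 \times \Delta X\), whatever the starting level of education.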

d. Hypothesis Test: Conduct a simple hypothesis test to determine if there is a statistically significant relationship between education and wage. State the null and alternative hypothesis and calculate the t-statistic to determine your conclusion. The critical value at the 5% significance level is approximately 1.98. Use this critical value to test the significance of the slope coefficient.

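Solution. The hypotheses are \(H_0: \beta_1 = 0\) and \(H_a: \beta_1 \neq 0\). The test statistic is \(t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{2}{0.5} = 4\). Since \(|t| = 4\) exceeds the 5% critical value of 1.98 (from a t-distribution with 148 degrees of freedom), we reject the null hypothesis and conclude that the relationship between education and wage is statistically significant.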
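The test in part d amounts to one division and one comparison. The sketch below uses the numbers given in the question; the variable names are illustrative.

```python
b1_hat = 2.0           # estimated slope
se_b1 = 0.5            # standard error of the slope
null_value = 0.0       # value of beta_1 under the null hypothesis
critical_value = 1.98  # 5% two-sided critical value given in the question

# t-statistic: distance of the estimate from the null, in standard errors.
t_stat = (b1_hat - null_value) / se_b1

# Two-sided test: reject when |t| exceeds the critical value.
reject_null = abs(t_stat) > critical_value

print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```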

e. Confidence Intervals: Calculate the 95% confidence interval for \(\beta_1\). Does your confidence interval agree with the hypothesis test from part d?

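Solution. Using the critical value from part d, the 95% confidence interval for \(\beta_1\) is \(2 \pm 1.98 \times 0.5 = [1.01, 2.99]\). The interval does not contain zero, which agrees with rejecting the null hypothesis in part d.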
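A confidence interval of this kind is the estimate plus or minus the critical value times the standard error. The sketch below uses the critical value 1.98 stated in part d (an assumption about which value is intended for this part) and checks whether zero lies inside the interval.

```python
b1_hat = 2.0           # estimated slope
se_b1 = 0.5            # standard error of the slope
critical_value = 1.98  # 5% two-sided critical value stated in part d

margin = critical_value * se_b1
lower = b1_hat - margin
upper = b1_hat + margin

# If zero is outside the interval, the two-sided test of H0: beta_1 = 0
# at the same significance level rejects.
contains_zero = lower <= 0.0 <= upper

print(f"95% CI: [{lower:.2f}, {upper:.2f}], contains zero: {contains_zero}")
```

Using the large-sample value 1.96 instead would narrow the interval only slightly and would not change the conclusion.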

f. Interpret the \(R^2\): Explain what the \(R^2\) value tells us about the model’s fit to the data.

Solution. The \(R^2\) value of 0.12 indicates that 12% of the variation in wages is explained by the number of years of education. The remaining 88% is unexplained, suggesting a relatively weak fit of the model to the data.

Q03. \(R^2\)

a. What is \(R^2\)?

Solution. The coefficient of determination, \(R^2\), is a measure of the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a measure of the goodness of fit of the model.

b. Why is \(R^2\) bounded between 0 and 1? Explain using the following equations:

\[ \begin{align*} R^2 &= \frac{ESS}{TSS} \\ R^2 &= 1 - \frac{RSS}{TSS} \end{align*} \]

Solution. \(ESS\) and \(RSS\) are sums of squares, so both are non-negative, and \(TSS\) is positive whenever \(Y\) varies in the sample. From the first equation, \(R^2 = ESS/TSS \geq 0\). From the second, \(R^2 = 1 - RSS/TSS \leq 1\) because \(RSS/TSS \geq 0\). The two expressions are equal because \(TSS = ESS + RSS\), so \(0 \leq R^2 \leq 1\).
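The bounds can be seen at work on small examples. The following sketch computes \(R^2\) both ways for three made-up datasets: one that lies exactly on a line, one where the fitted slope is zero, and one in between. The data and the function name `r_squared_both_ways` are hypothetical.

```python
def r_squared_both_ways(x, y):
    """Return (ESS/TSS, 1 - RSS/TSS) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return ess / tss, 1 - rss / tss


x = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect_fit = r_squared_both_ways(x, [3.0, 5.0, 7.0, 9.0, 11.0])  # exactly linear
no_fit = r_squared_both_ways(x, [2.0, 1.0, 0.0, 1.0, 2.0])        # fitted slope is zero
partial_fit = r_squared_both_ways(x, [2.1, 3.9, 6.2, 7.8, 10.1])  # noisy line

for label, pair in [("perfect", perfect_fit), ("none", no_fit), ("partial", partial_fit)]:
    print(f"{label:8s} ESS/TSS = {pair[0]:.4f}, 1 - RSS/TSS = {pair[1]:.4f}")
```

The exactly linear data reach the upper bound (\(RSS = 0\)), the zero-slope data reach the lower bound (\(ESS = 0\)), and in every case the two formulas agree up to rounding.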

Q04. Gauss-Markov Theorem

a. What does unbiasedness mean? What does efficiency refer to? Feel free to make a visualization to answer this question.

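Solution. Unbiasedness means that the expected value of an estimator equals the true population parameter, so its sampling distribution is centered on the truth. Efficiency refers to variance: among the unbiased estimators being compared, the efficient estimator has the smallest variance, so its estimates are the most tightly clustered around the true value.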

b. What does the Gauss-Markov theorem say about unbiasedness and efficiency of the OLS estimator?

Solution. The Gauss-Markov theorem states that, under the classical linear regression assumptions, the OLS estimator is the best linear unbiased estimator (BLUE) of the population parameters. This means that the OLS estimator is unbiased and has the smallest variance among all linear unbiased estimators.
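A small Monte Carlo experiment can make the theorem concrete. The sketch below is an illustration under assumed values: the population parameters, the error standard deviation, the regressor values, and the number of replications are all hypothetical. It compares the OLS slope with another linear unbiased estimator, the slope of the line through the first and last observations.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_B0, TRUE_B1, SIGMA = 10.0, 2.0, 5.0  # hypothetical population values
x = [float(i) for i in range(1, 21)]       # regressor values held fixed
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

ols_draws, endpoint_draws = [], []
for _ in range(2000):
    # Errors have mean zero and constant variance, as the theorem assumes.
    y = [TRUE_B0 + TRUE_B1 * xi + random.gauss(0.0, SIGMA) for xi in x]
    y_bar = sum(y) / len(y)
    # OLS slope estimate for this sample.
    ols_draws.append(sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx)
    # Alternative linear unbiased estimator: slope through the two end points.
    endpoint_draws.append((y[-1] - y[0]) / (x[-1] - x[0]))

ols_mean = statistics.mean(ols_draws)
endpoint_mean = statistics.mean(endpoint_draws)
ols_var = statistics.pvariance(ols_draws)
endpoint_var = statistics.pvariance(endpoint_draws)

print(f"OLS:      mean {ols_mean:.3f}, variance {ols_var:.4f}")
print(f"Endpoint: mean {endpoint_mean:.3f}, variance {endpoint_var:.4f}")
```

Both estimators should average close to the true slope of 2 (unbiasedness), while the OLS draws should show the smaller variance (efficiency within the class of linear unbiased estimators).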

c. In the context of the wage equation, what must we assume to be true regarding the error term \(u_i\) for the OLS estimator to be unbiased? Specifically, I am interested in the third assumption of the classical linear regression model. Give an example of a violation of this assumption in the context of the wage equation.

Solution. The third assumption of the classical linear regression model states that the error term \(u_i\) has a mean of zero conditional on the independent variables: \(E[u_i \mid \text{Education}_i] = 0\). In the wage equation, this means that the unobserved factors affecting wages average out to zero at every level of education.

An example of a violation in the context of the wage equation is omitted ability. Suppose innate ability raises wages and is also correlated with years of education. Because ability is not included in the model, it is part of \(u_i\), so the mean of \(u_i\) changes with education and \(E[u_i \mid \text{Education}_i] \neq 0\). The OLS estimator of \(\beta_1\) would then be biased.
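One way to see what a violation does is to simulate it. The sketch below assumes a hypothetical data-generating process in which an unobserved factor, labeled `ability`, raises both education and wages. All coefficients, the sample size, and the seed are invented for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_B0, TRUE_B1 = 10.0, 2.0  # hypothetical population intercept and slope
n = 5000

education, wage = [], []
for _ in range(n):
    ability = random.gauss(0.0, 1.0)  # not observed by the researcher
    # More able workers get more schooling on average.
    educ = 12.0 + 2.0 * ability + random.gauss(0.0, 1.0)
    # The error term contains ability, so its mean varies with education.
    u = 3.0 * ability + random.gauss(0.0, 1.0)
    education.append(educ)
    wage.append(TRUE_B0 + TRUE_B1 * educ + u)

e_bar = sum(education) / n
w_bar = sum(wage) / n

# OLS slope from regressing wage on education alone (ability omitted).
b1_hat = (
    sum((e - e_bar) * (w - w_bar) for e, w in zip(education, wage))
    / sum((e - e_bar) ** 2 for e in education)
)

print(f"true slope: {TRUE_B1}, OLS estimate with ability omitted: {b1_hat:.2f}")
```

In this design the estimate should land well above the true slope of 2, because education picks up part of the effect of the omitted factor. Collecting a larger sample does not remove the gap.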

Footnotes

  1. Do not simply change the file type and submit (e.g., by renaming the document from .jpg \(\rightarrow\) .pdf).