Chapter 2 Simple linear regression

2.1 Playbill

First we load the data.

playbill <- read.csv(file.path("data", "playbill.csv"))

Then we fit a linear model, \(Y=\beta_0 + \beta_1 X + e\), and summarize it in Table 2.1.

pb_fit1 <- lm(CurrentWeek ~ LastWeek, data = playbill)
kable(summary(pb_fit1)$coef,
      booktabs = TRUE,
      caption = "Coefficients our linear model.")
Table 2.1: Coefficients for our linear model.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    6804.89     9929.32     0.69      0.50
LastWeek          0.98        0.01    68.07      0.00

a

The confidence intervals for \(\beta_1\) are given by

confint(pb_fit1)[2, ]
##  2.5 % 97.5 % 
##   0.95   1.01

As per the question, a slope of 1 is plausible: it lies well inside the 95% confidence interval, and gross box office receipts are likely to be similar from one week to the next (although a slope of exactly 1 is incredibly unlikely).
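
To make this precise, we could test \(H_0\!: \beta_1 = 1\) directly, mirroring the intercept test in part (b) below; a quick sketch:

# two-sided t-test of H0: beta_1 = 1
b1 <- coef(pb_fit1)[[2]]                # slope estimate
b1_se <- summary(pb_fit1)$coef[2, 2]    # its standard error
t_b1 <- (b1 - 1) / b1_se
2 * pt(abs(t_b1), nrow(playbill) - 2, lower.tail = FALSE)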

b

We proceed to test the hypotheses \[ \begin{gathered} H_0:\beta_0 = 10000 \\ H_1:\beta_0 \neq 10000 \end{gathered} \]

by running

h_0 <- 10000                            # hypothesized intercept
h_obs <- coef(pb_fit1)[[1]]             # estimated intercept
h_obs_se <- summary(pb_fit1)$coef[1, 2] # its standard error

tobs <- (h_obs - h_0) / h_obs_se
(pobs <- 2 * pt(abs(tobs), nrow(playbill) - 2, lower.tail = FALSE))
## [1] 0.75

which leads us to fail to reject the null hypothesis, \(t(16) = -0.32\), \(p = 0.75\).
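
Equivalently, if the car package is available (an assumption; it is not used elsewhere in this document), the same test can be run as a linear hypothesis:

car::linearHypothesis(pb_fit1, "(Intercept) = 10000") # F-test equivalent of the t-test above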

c

We make a prediction, including a prediction interval, for a $400,000 box office result in the previous week:

predict(pb_fit1, data.frame(LastWeek = 400000), interval = "prediction")
##     fit    lwr    upr
## 1 4e+05 359833 439442

A prediction of $450,000 is not feasible, given that it lies far outside our 95% prediction interval.
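
For reference, the same interval can be reproduced by hand from the prediction-interval formula:

# manual 95% prediction interval at x0 = 400000
x0 <- 400000
n_pb <- nrow(playbill)
xbar <- mean(playbill$LastWeek)
sxx <- sum((playbill$LastWeek - xbar)^2)
fit_x0 <- coef(pb_fit1)[[1]] + coef(pb_fit1)[[2]] * x0
se_pred <- summary(pb_fit1)$sigma * sqrt(1 + 1 / n_pb + (x0 - xbar)^2 / sxx)
fit_x0 + c(-1, 1) * qt(0.975, n_pb - 2) * se_pred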

d

This seems like a reasonable rule given the almost perfect correlation from one week to the next; however, looking at the residuals, we see that at least three values are predicted badly (Figure 2.1). The worst offenders can be picked out from the standardized residuals, as sketched after the figure.

par(mfrow = c(2, 2))
plot(pb_fit1)

Figure 2.1: Residuals for our linear fit to the playbill data.
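
A quick check (rstandard() gives the internally studentized residuals):

order(abs(rstandard(pb_fit1)), decreasing = TRUE)[1:3] # rows with the largest standardized residuals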

2.2 Indicators

indicators <- read.csv(file.path("data", "indicators.csv"))

We begin by fitting our linear model to the data (Table 2.2).

ind_fit1 <- lm(PriceChange ~ LoanPaymentsOverdue, data = indicators)
kable(summary(ind_fit1)$coef,
      booktabs = TRUE,
      caption = "Coefficients for our linear model to the indicators data set.")
Table 2.2: Coefficients for our linear model fit to the indicators data set.

                     Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)               4.5         3.3      1.4      0.19
LoanPaymentsOverdue      -2.2         0.9     -2.5      0.02

a

The 95% confidence interval for the \(\beta_1\) estimate is, as before:

confint(ind_fit1)[2, ]
##  2.5 % 97.5 % 
##  -4.16  -0.33

Since the entire 95% confidence interval lies below zero, there is reason to believe that there is a negative trend.

b

We now create a confidence interval for \(\text{E}[Y|X=4]\):

predict(ind_fit1, data.frame(LoanPaymentsOverdue = 4), interval = "confidence")
##    fit  lwr  upr
## 1 -4.5 -6.6 -2.3

0% is not a reasonable value for \(\text{E}[Y|X=4]\), since the entire 95% confidence interval lies well below 0.
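
By contrast, a 95% prediction interval for an individual observation at \(X = 4\) would be considerably wider (a sketch; compare with Exercise 2.7):

predict(ind_fit1, data.frame(LoanPaymentsOverdue = 4), interval = "prediction")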

2.3 Invoices

a

We first find a 95% confidence interval for \(\beta_0\) using the output printed in the book.

beta0 <- 0.6417099   # intercept estimate from the book output
beta0_se <- 0.122707 # its standard error
beta0_t <- 5.248     # t value from the output (not needed below)
beta0_margin <- 1.96 * beta0_se # normal approximation to the t quantile
(beta0_95 <- c(beta0 - beta0_margin, beta0 + beta0_margin))
## [1] 0.40 0.88

Thus, the 95% confidence interval is approximately \([0.40, 0.88]\).
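
Strictly speaking, with \(n = 30\) the margin should use a t quantile with \(n - 2 = 28\) degrees of freedom rather than 1.96; the difference is small:

beta0 + c(-1, 1) * qt(0.975, 28) * beta0_se # exact t-based interval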

b

We have the two-sided hypotheses \[\begin{gathered} H_0: \beta_1 = 0.01\\ H_1: \beta_1 \neq 0.01. \end{gathered}\]

beta <- 0.01             # hypothesized value of beta_1
beta_obs_se <- 0.0008184 # standard error from the book output
beta_obs <- 0.0112916    # estimate from the book output
tval <- (beta_obs - beta) / beta_obs_se
(pobs <- 2 * pt(abs(tval), 30 - 2, lower.tail = FALSE))
## [1] 0.13

We fail to reject the null hypothesis, \(t(28) = 1.58\), \(p=0.13\). We cannot say that the true average processing time per invoice is significantly different from 0.01 hours.

c

From the exercise description we have the estimated mean function \[ \widehat{\text{Time}} = 0.6417099 + 0.0112916 \times \text{Invoices}. \] Next, we predict the processing time for 130 invoices using the output given in the exercise.

beta0 <- 0.6417099
beta1 <- 0.0112916
rse <- 0.3298 # residual standard error from the book output
n <- 30
df <- n - 2

time <- beta0 + beta1 * 130
err <- qt(0.975, df) * rse * sqrt(1 + 1 / n) # leverage term vanishes since x0 = xbar
upr <- time + err
lwr <- time - err

This results in a point estimate of 2.11 with a 95% prediction interval of \([1.42, 2.80]\).
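
For completeness, when \(x_0 \neq \bar{x}\) the leverage term no longer vanishes. A sketch of the general interval follows; xbar and sxx are not given in the book output, so they are placeholders here:

# hypothetical helper: xbar and sxx would have to come from the raw data
pred_interval <- function(x0, xbar, sxx) {
  fit <- beta0 + beta1 * x0
  se <- rse * sqrt(1 + 1 / n + (x0 - xbar)^2 / sxx)
  fit + c(-1, 1) * qt(0.975, n - 2) * se
}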

2.4 Straight-line regression through the origin

a

We shall show that \[ \hat{\beta} = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}. \] We have \[ Y_i = \beta x_i + e_i, \] with residual sum of squares \[ \text{RSS}(\beta) = \sum_{i=1}^n (y_i-\hat{y}_i)^2 = \sum_{i=1}^n(y_i-\beta x_i)^2, \] which is minimized by setting its derivative with respect to \(\beta\) to 0, like so: \begin{align} \frac{\partial}{\partial \beta}\text{RSS}(\beta) = -2 \sum_{i=1}^nx_i(y_i - \beta x_i) & = 0 \iff \\ \sum_{i=1}^nx_iy_i - \beta \sum_{i=1}^n x_i^2 & = 0 \iff \\ \hat{\beta} & = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^nx_i^2} \tag*{$\square$}. \end{align}
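
As a sanity check, the formula agrees with R's built-in fit through the origin; a sketch with simulated data:

set.seed(1)
x_sim <- runif(50, 1, 10)
y_sim <- 2 * x_sim + rnorm(50) # true beta = 2
c(manual = sum(x_sim * y_sim) / sum(x_sim^2),
  lm = unname(coef(lm(y_sim ~ x_sim - 1))))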

b

  1. Using \(E(y_i \mid X) = \beta x_i\): \[ E(\hat{\beta} \mid X) = E\left( \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \,\middle|\, X \right) = \frac{\sum_{i=1}^n x_i E(y_i \mid X)}{\sum_{i=1}^n x_i^2 } = \beta \frac{\sum_{i=1}^nx_i^2}{\sum_{i=1}^n x_i^2} = \beta \tag*{$\square$} \]
  2. By independence of the errors, \[ \text{Var}(\hat{\beta}\mid X) = \text{Var} \left( \frac{\sum_{i=1}^nx_iy_i}{\sum_{i=1}^n x_i^2}\right) = \frac{\sum_{i=1}^n x_i^2\sigma^2}{\left( \sum_{i=1}^n x_i^2 \right)^2} = \frac{\sigma^2}{\sum_{i=1}^n x_i^2} \tag*{$\square$} \]
  3. Follows from (1) and (2), since \(\hat{\beta}\) is a linear combination of the normally distributed \(y_i\) and is therefore itself normally distributed.
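
A quick simulation is consistent with (1) and (2); a sketch with \(\beta = 2\) and \(\sigma = 1\):

set.seed(1)
x_mc <- runif(20, 1, 10)
beta_hats <- replicate(5000, {
  y_mc <- 2 * x_mc + rnorm(20) # beta = 2, sigma = 1
  sum(x_mc * y_mc) / sum(x_mc^2)
})
c(mean = mean(beta_hats), var = var(beta_hats), theory_var = 1 / sum(x_mc^2))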

2.5 Multiple choices

  1. is correct. SSreg is larger in model 1 because the model explains more of the variance, whilst RSS is smaller since there is less variance left unexplained.

2.6 SST = SSreg + RSS

a

\[ y_i - \hat{y}_i = (y_i - \bar{y}) - (\hat{y}_i - \bar{y}) = (y_i - \bar{y}) - \left( (\hat{\beta}_0 + \hat{\beta}_1 x_i) - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}) \right) = (y_i - \bar{y}) - \hat{\beta}_1(x_i - \bar{x}), \] where we used that \(\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}\), since the least-squares line passes through \((\bar{x}, \bar{y})\). \(\square\)

b

\[ \hat{y}_i - \bar{y} = (\hat{\beta}_0 + \hat{\beta}_1 x_i) - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}) = \hat{\beta}_1(x_i - \bar{x}) \tag*{$\square$} \]

c

\[ \begin{gathered} \sum_{i=1}^n (y_i-\hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)\hat{\beta}_1(x_i - \bar{x}) = \\ \hat{\beta}_1 \left( \sum_{i=1}^n y_i(x_i - \bar{x}) - \hat{\beta}_0 \sum_{i=1}^n (x_i - \bar{x}) - \hat{\beta}_1 \sum_{i=1}^n x_i(x_i - \bar{x}) \right) = \\ \hat{\beta}_1(\text{SXY} - 0 - \hat{\beta}_1 \text{SXX}) = \\ \hat{\beta}_1(\text{SXY} - \text{SXY}) = 0, \end{gathered} \] since \(\sum_{i=1}^n (x_i - \bar{x}) = 0\) and \(\hat{\beta}_1 = \text{SXY}/\text{SXX}\). \(\square\)
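
The identity is easy to verify numerically, for instance on the playbill fit from Section 2.1:

y_pb <- playbill$CurrentWeek
sst <- sum((y_pb - mean(y_pb))^2)            # SST
ssreg <- sum((fitted(pb_fit1) - mean(y_pb))^2) # SSreg
rss_pb <- sum(resid(pb_fit1)^2)              # RSS
all.equal(sst, ssreg + rss_pb)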

2.7 Confidence intervals

It is possible, since the confidence interval is designed to contain the population regression line 95% of the time, not 95% of the observations, which is what the prediction interval is meant to do.
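
A small simulation illustrates the distinction (a sketch under a simple normal model): with the model correctly specified, the prediction interval captures roughly 95% of the observations, while the pointwise confidence interval captures far fewer.

set.seed(1)
x_cov <- runif(200, 0, 10)
y_cov <- 1 + 2 * x_cov + rnorm(200)
cov_fit <- lm(y_cov ~ x_cov)
newd <- data.frame(x_cov = x_cov)
pi_lim <- predict(cov_fit, newd, interval = "prediction")
ci_lim <- predict(cov_fit, newd, interval = "confidence")
mean(y_cov >= pi_lim[, "lwr"] & y_cov <= pi_lim[, "upr"]) # close to 0.95
mean(y_cov >= ci_lim[, "lwr"] & y_cov <= ci_lim[, "upr"]) # much smaller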