# Regression Analysis - ISYE-6414-0AN/Q/AsY Midterm Exam 1 - Open Book Section

## Multiple Choice Questions

### Question 1

In simple linear regression, a violation of the mean zero assumption will not lead to difficulties estimating $\beta_0$.

A. True

B. False✅

### Question 2

Multiple linear regression captures the causation of a predicting variable to the response variable, conditional on other predicting variables in the model.

A. True

B. False✅

### Question 3

In Linear Regression, Ordinary Least Squares (OLS) can be considered a closed-form estimation approach that minimizes the sum of squared errors.

A. True✅

B. False

### Question 4

In multiple linear regression with an intercept, the sampling distribution of $\hat{\sigma}^2$ follows a t-distribution with $n - p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of model predictors.

**Note**: The number of model predictors ($p$) does not include the intercept.

A. True

B. False✅

### Question 5

R-squared is the measure used to evaluate goodness of fit of a simple linear regression model.

A. True✅

B. False

### Question 6

If the constant variance assumption does not hold in simple linear regression, we typically apply a transformation to the predicting variables.

A. True

B. False✅

### Question 7

The F-test in ANOVA for equal means compares the between-group variability to the within-group variability.

A. True✅

B. False

### Question 8

In linear regression, goodness of fit describes how accurately a model fits the observed data by minimizing its residuals.

A. True✅

B. False

### Question 9

In ANOVA, the degrees of freedom for the pooled variance estimator is calculated as $n - 1$, where $n$ is the total number of observations.

A. True

B. False✅

### Question 10

In ANOVA, if one confidence interval of a pair of means in the pairwise comparison does not include zero, we conclude that the two means of the pair are plausibly equal.

A. True

B. False✅

### Question 11

In a linear regression model without an intercept, we will include 5 dummy variables in the model if we have only 1 predicting variable in the model and that predicting variable is a qualitative variable with 5 categories.

A. True✅

B. False

### Question 12

The Mean Squared Prediction Error and Precision Error are measures for computing prediction accuracy in a multiple linear regression model fitted using Ordinary Least Squares (OLS).

A. True

B. False✅

Note: Use the following information for questions 13-18

A linear regression model was fitted to estimate the response variable Roll using the predicting variable UNEM. There are 29 observations in total.

Here is the model summary, with some values missing:

```
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3957.0     4000.1   0.989   0.3313
UNEM          1133.8      513.1       X   0.0358 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3049 on Y degrees of freedom
Multiple R-squared: 0.1531, Adjusted R-squared: Z
F-statistic: 4.883 on 1 and ? DF, p-value: 0.03579
```

### Question 13

What is the degrees of freedom used to calculate the mean squared error (MSE)?

A. 29

B. 28

C. 27✅

D. 30

E. None of the above.

### Question 14

What is the value of the mean squared error (MSE) for this model?

A. 3049

B. 9296401✅

C. 55.218

D. 112.9259

E. None of the above

### Question 15

What is the correlation coefficient between the response variable, Roll, and the predictor, UNEM?

A. 0.3913✅

B. 0.1531

C. 0.0234

D. 0.3489

E. None of the above

### Question 16

Using the model output above, what is the t-value of the coefficient for UNEM, labeled X?

A. 0.45

B. 112.93

C. 2.21✅

D. 3.48

E. None of the above

### Question 17

For a significance level $\alpha = 0.01$, what can you say about the statistical significance of the predictor UNEM?

A. The predictor is statistically significant.

B. The predictor is not statistically significant.✅

C. The result is inconclusive.

D. None of the above.

### Question 18

What is the value of the adjusted R-Squared, labeled Z?

A. 0.119

B. 0.185

C. 0.153

D. 0.122✅

E. 0.878

F. None of the above.
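
The missing values in the summary for Questions 13-18 follow from the reported quantities. The sketch below recomputes them in plain Python; the variable names are illustrative, and only the numbers come from the model output.

```python
import math

n = 29               # total observations
p = 1                # predictors, excluding the intercept
estimate = 1133.8    # UNEM coefficient estimate
std_error = 513.1    # UNEM standard error
rse = 3049           # residual standard error
r_squared = 0.1531   # multiple R-squared

# Residual degrees of freedom (Y): n - p - 1
df_resid = n - p - 1

# MSE is the squared residual standard error
mse = rse ** 2

# t value (X): estimate divided by its standard error
t_value = estimate / std_error

# In simple linear regression |r| = sqrt(R^2); the sign matches the slope
r = math.copysign(math.sqrt(r_squared), estimate)

# Adjusted R-squared (Z)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_resid

print(df_resid, mse, round(t_value, 2), round(r, 4), round(adj_r_squared, 3))
# prints: 27 9296401 2.21 0.3913 0.122
```

The UNEM p-value of 0.0358 is above 0.01, which is why the predictor is not significant at $\alpha = 0.01$ even though it is significant at 0.05.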

## Multiple Choice & Multiple Answers - Section 2

### Question 19

A dataset on sports related concussions contains **303 data points** and **three sports**. Group A is football, Group B is rugby, and Group C is hockey. What are the degrees of freedom of the F-test when comparing treatments using ANOVA?

A. F (303, 3)

B. F (3, 303)

C. F (2, 300)✅

D. F (2, 303)
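
As a quick check of the degrees of freedom in a one-way ANOVA F-test, the following minimal sketch uses a hypothetical helper, `anova_f_df`, that returns the numerator ($k - 1$) and denominator ($N - k$) degrees of freedom for $N$ observations in $k$ groups.

```python
def anova_f_df(n_total: int, n_groups: int) -> tuple[int, int]:
    """Return (numerator df, denominator df) for the one-way ANOVA F-test."""
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least 2 groups and more observations than groups")
    return n_groups - 1, n_total - n_groups


print(anova_f_df(303, 3))  # (2, 300)
```

The denominator value, $N - k$, is also the degrees of freedom of the pooled variance estimator.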

### Question 20

Select all of the statements that are TRUE.

[Select all that apply]

Note: for the multiple answer questions, an incorrect answer cancels out a correct answer.

A. In simple linear regression, to evaluate the assumption of uncorrelated errors, we can use the scatter plot of the residuals vs. fitted values✅

B. In simple linear regression, we have three parameters to estimate✅

C. The residuals cannot be used as proxies for the error terms in a simple linear regression model

D. In simple linear regression, the prediction interval is used to provide an interval estimate for the true average value of y for all members of the population with a particular value of x*

### Question 21

Select all of the statements that are TRUE.

[Select all that apply]

Note: for the multiple answer questions, an incorrect answer cancels out a correct answer.

A. Large sample size data can inflate statistical significance in multiple linear regression.✅

B. In multiple linear regression, the larger the Variance Inflation Factor (VIF), the smaller the confidence interval.

C. In multiple linear regression, Variance Inflation Factor (VIF) evaluates the correlation of pairs of predicting variables and not the correlation between a predicting variable and the linear combinations of the other predicting variables.

D. In multiple linear regression, the standard error can be inflated due to a high Variance Inflation Factor (VIF)✅

E. In multiple linear regression, a Variance Inflation Factor (VIF) of 1 means the tested predictor is not correlated with the other predictors.✅
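
To make the VIF statements concrete, here is a minimal sketch assuming the usual definition $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The function name and the $R_j^2$ values are hypothetical.

```python
import math


def vif(r_squared_j: float) -> float:
    """VIF for predictor j, given the R^2 from regressing it on the other predictors."""
    if not 0 <= r_squared_j < 1:
        raise ValueError("r_squared_j must be in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


# Hypothetical R_j^2 values, for illustration only
for r2 in (0.0, 0.5, 0.9):
    v = vif(r2)
    # The coefficient's standard error scales with sqrt(VIF)
    print(f"R_j^2={r2:.1f}  VIF={v:.1f}  SE inflation factor={math.sqrt(v):.2f}")
```

A VIF of 1 corresponds to $R_j^2 = 0$ (no linear relationship with the other predictors), while larger VIF values inflate the standard error and therefore widen the confidence interval.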

### Question 22

In simple linear regression, what do we mean when we say a predictor lacks statistical significance?

[Select all that apply]

Note: for the multiple answer questions, an incorrect answer cancels out a correct answer.

A. The conditional probability of our observed responses, given the predictor, is not great enough to suggest it is due to anything other than randomness within the data✅

B. The marginal probability of the predictor is not great enough to suggest that it departs from normality

C. The predictor variable has a large effect size.

D. The predictor variable does not cause changes in the response variable.

E. The predictor variable does not have a statistically significant association with the response variable.✅

### Question 23

Select all of the statements that are TRUE.

[Select all that apply]

Note: for the multiple answer questions, an incorrect answer cancels out a correct answer.

A. In multiple linear regression, when performing the F-test to test the overall significance of a model, the null hypothesis is that all the regression coefficients including the intercept are 0 versus the alternative that at least one is not 0.

B. In ANOVA, when testing for equal means, the alternative hypothesis is that at least one pair of means are statistically significantly different from each other.✅

C. ANOVA and SLR share all model assumptions in common except for constant variance.

D. ANOVA is a linear regression model with categorical variable(s).✅

E. For small sample sizes, to test whether an explanatory variable in simple linear regression is statistically significantly positive, we conduct a one-tailed test using a Z distribution.

F. In linear regression, a confidence interval is always more narrow than its corresponding prediction interval.✅

### Question 24

Select all of the statements that are TRUE.

[Select all that apply]

Note: for the multiple answer questions, an incorrect answer cancels out a correct answer.

A. A partial F-test can be used to test the null hypothesis that the regression coefficients associated with a subset of the predicting variables in a multiple linear regression model are all equal to zero.✅

B. In multiple linear regression, the uncertainty of the prediction of a new response comes from the newness of the observation and the estimation of the regression coefficients.✅

C. To evaluate the assumptions of multiple linear regression, we must always use the estimated residuals without standardization.

D. In multiple linear regression, the adjusted $R^2$ can be used to compare models, and its value will always be greater than or equal to that of $R^2$.
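
For reference, a standard form of the adjusted $R^2$, written with $n$ observations and $p$ predictors (intercept excluded, matching the note in Question 4), makes the comparison in statement D explicit:

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

Because $\frac{n - 1}{n - p - 1} \ge 1$ whenever $p \ge 1$, and $1 - R^2 \ge 0$, the adjusted value can never exceed $R^2$.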

### Question 25

Select all of the statements that are TRUE.

[Select all that apply]

Note: for the multiple answer questions, an incorrect answer cancels out a correct answer.

A. Multicollinearity can cause incorrect conclusions about the statistical significance of the regression coefficients in a multiple linear regression model.✅

B. In multiple linear regression, the sub-sampling approach is not recommended for addressing the p-value problem with large samples.✅

C. In multiple linear regression, the regression coefficients estimated using the least squares method are unbiased estimates of the true regression coefficients.✅

D. When performing residual analysis for a simple linear regression model, using data from observational studies, we check for the lack of correlation not for independence.

## Coding Questions

### Instructions

The R Markdown/Jupyter Notebook file includes the questions, the empty code chunk sections for your code, and the text blocks for your responses. Answer the questions below by completing the R Markdown/Jupyter Notebook file. You may make slight adjustments to get the file to knit/convert but otherwise keep the formatting the same.

Once you've finished answering the questions, submit your responses in a single knitted file as *HTML* only.

*Next Steps:*

Save the .Rmd/.ipynb file in your R working directory, the same directory into which you will download the "home_energy_consumption.csv" data file. Having both files in the same directory will help in reading the "home_energy_consumption.csv" file.

Read the question and create the R code necessary within the code chunk section immediately below each question. Knitting this file will generate the output and insert it into the section below the code chunk.

Type your answer to the questions in the text block provided immediately after the response prompt.

Once you've finished answering all questions, knit this file and submit the knitted file *as HTML* on Canvas.

* Make sure to start submission of the exam at least 10 minutes before the end of the exam time. It is your responsibility to keep track of your time and submit before the time limit.
* If you are unable to knit your file as HTML for whatever reason, you may upload your Rmd/ipynb/PDF/Word file instead. However, you will be penalized 10%.
* If you are unable to upload your exam file for whatever reason, you may IMMEDIATELY attach the file to the exam page as a comment via Grades -> Midterm Exam 1 - Open Book Section (R) - Part 2 -> Comment box.
* Note that you will be penalized 10% (or more) if the submission is made within 5 minutes after the exam time has expired and a higher penalty if more than 5 minutes. Furthermore, you will receive zero points if the submission is made after 15 minutes of the exam time expiring. We will not allow later submissions or re-taking of the exam.
* If you upload your file after the exam closes, let the instructors know via a private Piazza post. Please DON'T attach the exam file via a private Piazza post to the instructors since you could compromise the exam process. Any submission received via Piazza will not be considered.
* Commented out code will be graded for partial credit and the submitted file must be HTML.

#### Mock Example Question

This will be the exam question - each question is already copied from Canvas and inserted into individual text blocks below. You do not need to copy/paste the questions from the online Canvas exam.

**Mock Response to Example Question**: This is the section where you type your written answers to the question.

**Ready? Let's begin.**

### Background

In this exam, you will be considering various attributes to predict the monthly energy consumption of a household.

The dataset contains the following variables:

Household Size: The number of people living in the household. (Quantitative variable)

Home Size: The size of the home in square feet. (Quantitative variable)

Number of Rooms: The total number of rooms in the home. (Quantitative variable)

Household Income: The household's annual income in US dollars. (Quantitative variable)

Type of Home: The classification of the home, such as "Detached house," "Townhouse," or "Semi-detached house." (Qualitative variable)

Heating System Type: The type of heating system used in the home, such as "Solar" or "Gas." (Qualitative variable)

Cooling System Type: The type of cooling system used in the home, such as "Window units," "Central AC," or "None." (Qualitative variable)

Insulation Quality: A rating (from 1 to 5) of the quality of the home's insulation, with 5 being the best quality. (Quantitative variable)

Ownership Status: Whether the household owns or rents the home (e.g., "Owner" or "Renter"). (Qualitative variable)

Work from Home Frequency: The number of days per week that household members work from home. (Quantitative variable)

Smart Home Devices: Indicates whether the household has smart home devices installed ("Yes" or "No"). (Qualitative variable)

Solar Panel Installation: Indicates whether the home has solar panels installed ("Yes" or "No"). (Qualitative variable)

Monthly Energy Consumption: The household's monthly energy consumption in kilowatt-hours (kWh) (Response variable)

```
import pandas as pd
import numpy as np
# The seed is set to 100 so that results are reproducible. DO NOT CHANGE THIS SEED.
np.random.seed(100)
# Read the CSV file
energy_consumption = pd.read_csv("home_energy_consumption.csv")
# Convert relevant columns to categorical (factor) data types
energy_consumption['Type_of_Home'] = energy_consumption['Type_of_Home'].astype('category')
energy_consumption['Heating_System_Type'] = energy_consumption['Heating_System_Type'].astype('category')
energy_consumption['Cooling_System_Type'] = energy_consumption['Cooling_System_Type'].astype('category')
energy_consumption['Ownership_Status'] = energy_consumption['Ownership_Status'].astype('category')
energy_consumption['Smart_Home_Devices'] = energy_consumption['Smart_Home_Devices'].astype('category')
energy_consumption['Solar_Panel_Installation'] = energy_consumption['Solar_Panel_Installation'].astype('category')
# Splitting the data into training and testing sets (80% train, 20% test)
test_rows = np.random.choice(energy_consumption.index, size=int(0.2 * len(energy_consumption)), replace=False)
testData = energy_consumption.loc[test_rows]
trainData = energy_consumption.drop(test_rows).reset_index(drop=True)
# Display the first few rows of the training data
print(trainData.head())
```

### Question 1: Data Exploration (11 points)

Use the full dataset "energy_consumption" for this question.

1a) (3 points) Calculate average monthly energy consumption by type of home. Which type of house shows the highest average monthly energy consumption?

`### Answer to Question 1a:`

1b) (3 points) Provide a boxplot showing the distribution of monthly energy consumption across different types of heating systems. Explain your interpretation of the plot.

`### Answer to Question 1b:`

1c) (5 points) Create scatterplots of the response variable against the following predictors:

a) Household size b) Household income c) Work from home frequency

i) Describe the general trend of each plot.

`### Answer to Question 1c(i):`

ii) Calculate the correlation coefficient between the response variable and

a) Household size b) Household income c) Work from home frequency.

Interpret the correlation coefficient of each of the predictors with the response variable. Use the following ranges while interpreting the correlation coefficients.

0.7 to 0.9 (or -0.7 to -0.9): Strong positive (or negative) correlation.

0.5 to 0.7 (or -0.5 to -0.7): Moderate positive (or negative) correlation.

0.3 to 0.5 (or -0.3 to -0.5): Weak positive (or negative) correlation.

0 to 0.3 (or 0 to -0.3): Very weak or no linear correlation.

`### Answer to Question 1c(ii):`
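
The interpretation ranges above can be applied mechanically. The sketch below uses a hypothetical helper, `describe_correlation`, that maps a correlation coefficient to the listed labels. Two choices are assumptions, not part of the exam text: each boundary value (0.3, 0.5, 0.7) is assigned to the stronger category, and magnitudes above 0.9, which the listed ranges do not cover, are labeled strong.

```python
def describe_correlation(r: float) -> str:
    """Map a correlation coefficient to the interpretation labels listed above."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)
    direction = "positive" if r > 0 else "negative"
    if size < 0.3:
        return "Very weak or no linear correlation"
    if size < 0.5:
        return f"Weak {direction} correlation"
    if size < 0.7:
        return f"Moderate {direction} correlation"
    # Assumption: magnitudes above 0.9 are also treated as strong
    return f"Strong {direction} correlation"


# Hypothetical coefficients, for illustration only
for r in (0.12, -0.41, 0.62, -0.85):
    print(r, "->", describe_correlation(r))
```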

### Question 2: Multiple Regression Model (17 points)

2a) (6 points)

i) Using the dataset "trainData", change the baseline for Heating_System_Type to "Solar". Use this baseline for all the models created in the exam.

ii) Using the dataset "trainData", perform a multiple linear regression to predict the monthly energy consumption using the predicting variables "Number_of_Rooms" and "Heating_System_Type". Call it model1. Display the summary.

iii) How many model parameters are there?

`### Answer to Question 2a(iii):`

iv) Interpret the coefficient for the "Heating_System_TypeElectric" in the context of the problem. State any assumptions while interpreting the coefficient.

`### Answer to Question 2a(iv):`

v) How many residual degrees of freedom are there, and how are they calculated?

`### Answer to Question 2a(v):`

2b) (8 points) Create a full linear regression model using all the predictors in the dataset "trainData". Call it model2. Display the summary.

i) What is the estimate of the error variance? Is it different from that of model1? If yes, why?

`### Answer to Question 2b(i):`

ii) Interpret the coefficient corresponding to "Household_Size" in the context of the problem. State any assumptions while interpreting the coefficient.

`### Answer to Question 2b(ii):`

iii) Compare the R-squared and Adjusted R-squared values of the reduced and full models (model1 and model2). What do you observe? Explain the theoretical differences between R-squared and Adjusted R-squared. What does each measure?

`### Answer to Question 2b(iii):`

iv) Which coefficients of model2 are statistically insignificant at an alpha level of 0.01? Should we remove those coefficients from our model? Explain with reasoning.

`### Answer to Q2b(iv):`

2c) (3 points) Compare model1 and model2 with a partial F-test at an alpha level of 0.01.

State your conclusion based on the test.

`### Answer to Question 2c:`
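
For orientation, the partial F-statistic compares the drop in the residual sum of squares (SSE) between a reduced and a full model to the full model's mean squared error. The sketch below is a minimal illustration with hypothetical SSE and degrees-of-freedom values; none of the numbers come from the exam data, and the function name is made up for this example.

```python
def partial_f_statistic(sse_reduced: float, df_reduced: int,
                        sse_full: float, df_full: int) -> float:
    """Partial F-statistic for nested models (df values are residual degrees of freedom)."""
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual degrees of freedom")
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Hypothetical values, for illustration only
f_stat = partial_f_statistic(sse_reduced=1200.0, df_reduced=95,
                             sse_full=900.0, df_full=90)
print(f_stat)  # 6.0
```

The statistic is compared with an F distribution on (df_reduced - df_full, df_full) degrees of freedom; the null hypothesis that the extra coefficients are all zero is rejected when the p-value falls below the chosen alpha.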

### Question 3: Model Diagnostics (11 points)

3a) (4 points) Perform the following model diagnostics on model2 (the full model).

i) Check for constant variance.

ii) Check for normality. (Both a Q-Q plot and a histogram are required to check this assumption.) For the Q-Q plot, a 95% confidence envelope should be plotted.

Explain your findings based on the diagnostic plots.

`## Answer to Question 3a:`

3b) (7 points) Create a linear regression model named model3 that uses the log-transformed response variable. Include all predictors from the dataset trainData, and add an interaction term by multiplying the predictors: Household Size, Home Size, and Number of Rooms.

Tip: Interaction term = Household Size * Home Size * Number of Rooms

i) Is the interaction term statistically significant at an alpha level of 0.01?

`### Answer to Question 3b(i):`

ii) Compare the R-squared and adjusted R-squared values of model2 and model3.

`### Answer to Question 3b(ii):`

iii) Perform the same model diagnostics (constant variance and normality assumptions) on model3 as performed in Q3a on model2. Explain ways to address the constant variance and normality assumptions in a model when they do not hold.

`### Answer to Q3b(iii)`

### Question 4: Multicollinearity and Outliers (12 points)

4a) (2 points) Diagnose multicollinearity in model2 created in Question 2b by calculating the Variance Inflation Factor (VIF) for each predictor. Based on the calculated VIF values, is multicollinearity a concern?

`### Answer to Question 4a:`

4b) (3 points) Use Cook’s distance to count outliers in the data based on model2.

i) Plot the Cook's distance for each observation.

ii) Using the threshold 4/n, state clearly the number of outliers.
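
The 4/n rule can be sketched as follows, assuming the Cook's distances have already been computed from the fitted model. The helper name and the distance values are hypothetical and serve only to show the counting step.

```python
def flag_outliers(cooks_d):
    """Return the 4/n cutoff and the indices whose Cook's distance exceeds it."""
    n = len(cooks_d)
    if n == 0:
        raise ValueError("cooks_d must not be empty")
    cutoff = 4 / n
    flagged = [i for i, d in enumerate(cooks_d) if d > cutoff]
    return cutoff, flagged


# Hypothetical Cook's distances for 8 observations
distances = [0.01, 0.02, 0.75, 0.03, 0.10, 0.52, 0.04, 0.05]
cutoff, flagged = flag_outliers(distances)
print(cutoff, flagged, len(flagged))  # 0.5 [2, 5] 2
```

The flagged indices are the rows that would be dropped from the training data before refitting.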

4c) (7 points) Remove the outliers (indicated in 4b) from the dataset "trainData". Create a linear regression model, using the dataset without the outliers. Use all the predictors. Call it model4.

i) Compare the R-squared and adjusted R-squared values of model4 and model2. Which model performed better?

`### Answer to Question 4c(i):`

ii) How does the presence or absence of these outliers affect the model’s regression coefficients? Do you observe any significant changes? Explain.

`### Answer to Question 4c(ii):`

iii) What is the sampling distribution of the estimated regression coefficient corresponding to "Number of Rooms" in model4?

`### Answer to Question 4c(iii):`

iv) Is this estimated coefficient of "Number of Rooms" statistically significant at an alpha level of 0.01 in model4?

`### Answer to Question 4c(iv):`

### Question 5: Prediction (9 points)

Note: Use "testData" for all questions in Q5

5a) (6 points) Using testData, predict the monthly energy consumption for each of the models below:

i) model2 (question 2b)

ii) model3 (question 3b)

iii) model4 (question 4c)

Calculate the precision measure for the predictions of each model. Which model performed the best according to the precision measure? Why do we focus on the precision measure?

`### Response to Q5a`
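
As a reference for the calculation, the sketch below assumes the usual definition of the precision measure: the sum of squared prediction errors divided by the sum of squared deviations of the observed responses from their mean. The function name and the data are hypothetical.

```python
def precision_measure(y_true, y_pred):
    """Sum of squared prediction errors divided by the total variation in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    sse = sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred))
    sst = sum((yt - mean_y) ** 2 for yt in y_true)
    return sse / sst


# Hypothetical observed and predicted responses, for illustration only
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 16.0]
print(precision_measure(y_true, y_pred))  # 0.1
```

Smaller values indicate better predictions. Because the measure is a ratio, it does not depend on the scale of the response. For a model fitted on a log-transformed response, the predictions would need to be exponentiated back to the original scale before being compared with the observed values.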

5b) (3 points) Extract the first row of testData. Using model2, what is the 99% prediction interval (PI) of the monthly consumption of energy (kWh)? Provide an interpretation of your results.

`### Response to Q5b`
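
A common textbook form of the prediction interval for a new response at predictor values $x^*$ is sketched below. The symbols $X$ (the design matrix, including the intercept column), $\hat{\beta}$ (the estimated coefficients), and $\hat{\sigma}^2$ (the MSE) are notation assumed here; $n$ and $p$ are the number of observations and the number of predictors excluding the intercept.

$$
x^{*\top}\hat{\beta} \;\pm\; t_{\alpha/2,\; n-p-1}\,\sqrt{\hat{\sigma}^2\left(1 + x^{*\top}(X^\top X)^{-1}x^*\right)}
$$

For a 99% interval, $\alpha = 0.01$. The leading 1 inside the parentheses is the extra variance of a new observation; dropping it gives the confidence interval for the mean response, which is why that interval is narrower.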

### End of Exam
