# Advanced Linear Modelling and Classification Assessed Computer Practical 2

PLEASE READ CAREFULLY

This computer practical counts for 25% of your final mark on this unit. It is due by 3pm on 2nd May 2024 and should be submitted on Blackboard. You should submit your answers as a single PDF file containing your knitted Markdown file and, if applicable, your handwritten answers – please name the file after your first name and surname. You should number your answers, both in the Markdown file and, if applicable, in your handwritten notes.

FAILURE TO SUBMIT YOUR ASSESSED COMPUTER PRACTICAL IN THE REQUIRED FORMAT WILL RESULT IN IT NOT BEING MARKED.

The objective of this computer practical is to explore the factors that influence the price of houses using the Boston dataset (available from the R package MASS).

The dataset contains the median value of owner-occupied homes in $1000s (variable medv) in 506 suburbs of Boston. To simplify, we consider only five variables to explain the housing values in the different suburbs, namely the proportion of non-retail business acres (variable indus), the nitrogen oxides concentration (variable nox), the proportion of owner-occupied units built prior to 1940 (variable age), the pupil-teacher ratio (variable ptratio) and the percentage of the population of lower status (variable lstat).

The dataset that we will use below can be obtained as follows:

```
library(MASS)
# Keep columns 3, 5, 7, 11, 13 and 14: indus, nox, age, ptratio, lstat and medv
boston <- Boston[, c(3, 5, 7, 11, 13, 14)]
head(boston)
## indus nox age ptratio lstat medv
## 1 2.31 0.538 65.2 15.3 4.98 24.0
## 2 7.07 0.469 78.9 17.8 9.14 21.6
## 3 7.07 0.469 61.1 17.8 4.03 34.7
## 4 2.18 0.458 45.8 18.7 2.94 33.4
## 5 2.18 0.458 54.2 18.7 5.33 36.2
## 6 2.18 0.458 58.7 18.7 5.21 28.7
```

Fit a linear regression model with medv as response variable and the variables indus, nox, age, ptratio and lstat as predictors. Comment on your results.
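As a starting point, the fit described above could be set up along the following lines. This is only a minimal sketch: the object name `fit_lm` is an arbitrary choice, and the interpretation of the output is left to your answer.

```r
library(MASS)

# Same subset of Boston as defined earlier in this sheet
boston <- Boston[, c(3, 5, 7, 11, 13, 14)]

# Linear model with medv as response and the five predictors
# (fit_lm is an illustrative name, not one prescribed by the practical)
fit_lm <- lm(medv ~ indus + nox + age + ptratio + lstat, data = boston)

# Coefficient estimates, standard errors, R-squared and residual summary
summary(fit_lm)
```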

Make a plot with the observed values for medv on the y-axis and the ones predicted by the linear model on the x-axis, and fit a linear regression model with medv as response variable and its predicted value as predictor. Comment on your results.
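The same observed-versus-predicted diagnostic is requested for every model in this practical, so it may be convenient to wrap it in a small function. The sketch below assumes a hypothetical helper called `obs_vs_pred` (not part of any package) and applies it to the linear model.

```r
library(MASS)
boston <- Boston[, c(3, 5, 7, 11, 13, 14)]

# Hypothetical helper: plots observed against predicted values and returns
# the regression of the observed values on the predicted values
obs_vs_pred <- function(observed, predicted, label) {
  plot(predicted, observed,
       xlab = paste("Predicted medv:", label), ylab = "Observed medv")
  abline(0, 1, lty = 2)          # reference line: observed = predicted
  calib <- lm(observed ~ predicted)
  abline(calib, col = "red")     # fitted regression line
  calib
}

fit_lm <- lm(medv ~ indus + nox + age + ptratio + lstat, data = boston)
calib_lm <- obs_vs_pred(boston$medv, fitted(fit_lm), "linear model")
summary(calib_lm)
```

For in-sample least-squares fitted values, this regression has intercept 0 and slope 1 by construction, so for the linear model the informative part is the shape of the scatter around the reference line rather than the estimated coefficients.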

Fit an additive model with medv as response variable and the variables indus, nox, age, ptratio and lstat as predictors, using GCV for choosing the penalty parameters and natural cubic splines as basis functions.

Your answer should notably contain the following:

- A justification for the number of basis functions that you use for each smooth term of the model.
- If applicable, some justification that thinning can be applied without losing “too much” information.
- Some evidence that the fitted model you obtain does not over-fit the data.

In addition, discuss the relevance of having a smooth term for each variable.
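One possible way to set up such a fit is with the mgcv package, sketched below. Several things here are assumptions rather than requirements of the practical: the use of mgcv itself, the basis `bs = "cr"` (mgcv's cubic regression spline, taken here to correspond to the natural cubic spline basis of the unit), and the basis dimension `k = 10`, which is only a placeholder that you would need to justify or change.

```r
library(MASS)
library(mgcv)
boston <- Boston[, c(3, 5, 7, 11, 13, 14)]

# Additive model with one smooth term per predictor.
# k = 10 is a placeholder basis dimension, not a recommended value.
fit_am <- gam(
  medv ~ s(indus, bs = "cr", k = 10) + s(nox, bs = "cr", k = 10) +
    s(age, bs = "cr", k = 10) + s(ptratio, bs = "cr", k = 10) +
    s(lstat, bs = "cr", k = 10),
  data = boston, family = gaussian(),
  method = "GCV.Cp"   # smoothing parameters chosen by GCV
)

summary(fit_am)    # effective degrees of freedom of each smooth term
gam.check(fit_am)  # residual plots and a basis-dimension check
```

Comparing the effective degrees of freedom of each term with its basis dimension is one way to support the choice of `k`; a smooth whose effective degrees of freedom is close to 1 behaves almost linearly, which is relevant when discussing whether that variable needs a smooth term.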

Make a plot with the observed values for medv on the y-axis and the ones predicted by the model fitted in the previous question on the x-axis, and fit a linear regression model with medv as response variable and its predicted value as predictor. Comment on your results.

The additive model can be seen as a GAM where the exponential family of distributions used to define the model is the family of Gaussian distributions. Propose another exponential family of distributions relevant for the considered dataset and fit the resulting GAM, following the instructions given in Question 3 for fitting the additive model.
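As an illustration only, one candidate is the Gamma family, on the assumption that the positivity of medv is the feature you want the model to respect; other families may be equally defensible and the choice must be argued in your answer. The log link, the mgcv package and the placeholder `k = 10` in the sketch below are likewise assumptions.

```r
library(MASS)
library(mgcv)
boston <- Boston[, c(3, 5, 7, 11, 13, 14)]

# GAM with a Gamma response and log link (illustrative choice of family);
# k = 10 is a placeholder basis dimension
fit_gam <- gam(
  medv ~ s(indus, bs = "cr", k = 10) + s(nox, bs = "cr", k = 10) +
    s(age, bs = "cr", k = 10) + s(ptratio, bs = "cr", k = 10) +
    s(lstat, bs = "cr", k = 10),
  data = boston, family = Gamma(link = "log"),
  method = "GCV.Cp"
)

# Predictions on the scale of medv rather than on the link scale
pred_gam <- predict(fit_gam, type = "response")
summary(fit_gam)
```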

Make a plot with the observed values for medv on the y-axis and the ones predicted by the model fitted in the previous question on the x-axis, and fit a linear regression model with medv as response variable and its predicted value as predictor. Comment on your results.

Fit a kernel ridge regression model with medv as response variable and the variables indus, nox, age, ptratio and lstat as predictors, using a kernel k of your choice such that the corresponding function space H is infinite dimensional. Define precisely the kernel you decide to use and use cross-validation for choosing its parameter(s) and the penalty parameter of the model.

If you numerically minimize the GCV criterion, you need to perform the optimization for several different starting values. This is to assess the sensitivity of the procedure that you use to the initialization of the optimization algorithm.
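A minimal base-R sketch of kernel ridge regression is given below, under stated assumptions: a Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), whose function space is infinite dimensional; the penalised criterion (1/n) * sum of squared residuals + lambda * ||f||^2, which gives coefficients (K + n lambda I)^{-1} y; standardised predictors; a centred response in place of an explicit intercept; and 5-fold cross-validation over a small illustrative grid. The function names `gauss_kernel`, `krr_fit`, `krr_predict` and `cv_error` are hypothetical helpers, and the grid values are placeholders.

```r
library(MASS)
boston <- Boston[, c(3, 5, 7, 11, 13, 14)]
# Standardised once on the full data for simplicity (a slight leak into the CV folds)
X <- scale(as.matrix(boston[, 1:5]))
y <- boston$medv

# Gaussian kernel matrix between the rows of A and the rows of B
gauss_kernel <- function(A, B, sigma) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-pmax(d2, 0) / (2 * sigma^2))
}

# Solve (K + n * lambda * I) alpha = y - mean(y)
krr_fit <- function(X, y, sigma, lambda) {
  n <- nrow(X)
  K <- gauss_kernel(X, X, sigma)
  alpha <- solve(K + n * lambda * diag(n), y - mean(y))
  list(X = X, alpha = alpha, y_bar = mean(y), sigma = sigma)
}

krr_predict <- function(fit, X_new) {
  drop(gauss_kernel(X_new, fit$X, fit$sigma) %*% fit$alpha) + fit$y_bar
}

# 5-fold cross-validated mean squared error for one (sigma, lambda) pair
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(X)))
cv_error <- function(sigma, lambda) {
  mean(sapply(1:5, function(k) {
    test <- folds == k
    fit <- krr_fit(X[!test, , drop = FALSE], y[!test], sigma, lambda)
    mean((y[test] - krr_predict(fit, X[test, , drop = FALSE]))^2)
  }))
}

# Illustrative grid; the ranges are placeholders to be refined
grid <- expand.grid(sigma = c(0.5, 1, 2, 4), lambda = 10^(-4:0))
grid$cv <- mapply(cv_error, grid$sigma, grid$lambda)
best <- grid[which.min(grid$cv), ]

fit_krr <- krr_fit(X, y, best$sigma, best$lambda)
pred_krr <- krr_predict(fit_krr, X)
best
```

A grid search avoids the dependence on starting values, at the cost of resolution. If you instead minimise the criterion numerically, for example with `optim` over log(sigma) and log(lambda), the restarts from different initial values requested above become necessary.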

Make a plot with the observed values for medv on the y-axis and the ones predicted by the model fitted in the previous question on the x-axis, and fit a linear regression model with medv as response variable and its predicted value as predictor. Comment on your results.

Which of the above four models would you recommend using for the considered task? Justify your answer precisely using the results you obtained above.
