HW1 Part 2 - MGT 6203
For Homework 1 Part 2, please use this R notebook in Vocareum to submit your solutions. Vocareum is an educational cloud platform for programming in several languages; it is based on the Jupyter notebook environment. This platform allows us to move homework assignments to the cloud. The advantages are that all of you will be working in the same coding environment and peer reviewers will be able to run your R code easily. This eliminates issues that can arise when everyone works locally, such as library installations and RStudio OS requirements; R notebooks also work on mobile platforms and tablets.
**With R notebooks, you will learn a new way of presenting data analysis reports that is neat and flexible, where formatted (English) text and (R) code can easily coexist on the same page. Notebooks can also be collaborative when needed. For now, we are asking each of you to do your own work for homework. Think of R notebooks as interactive, program-based Google Docs or Microsoft Office 365 documents; these are gradually replacing local files on our computers.**
Many of you are new to R notebooks and the Vocareum platform. TAs will help on Piazza if you have questions about specific code. Here we list some important things to get you started. Please read through them carefully.
Even though we are moving from your local environment to the cloud, our expectations for your homework remain the same. The same goes for the rubrics.
Vocareum has its own cloud-based file system. The data files you will use for the assignments are stored in the cloud at the path "../resource/asnlib/publicdata/FILENAME.csv". You can import them the same way you do in RStudio; simply substitute the path with the one specified in the instructions. You won't be able to modify these data files.
You will be able to find the data files on Canvas/EdX if you would like to explore them offline.
For coding questions, you will be graded on the R code as well as the output in your submission.
For interpretations or short response questions, please type the answers in the notebook's markdown cells. To change a code cell to a markdown cell, click on the cell, and in the dropdown menu above, switch the type of the cell block from "code" to "markdown". Adding print statements to code cells for short response/interpretation questions is also fine, as long as we can clearly see the output of your response.
You don't need to, but if you would like to learn more about how to format your markdown cells, visit the following site: https://www.earthdatascience.org/courses/intro-to-earth-data-science/file-formats/use-text-files/format-text-with-markdown-jupyter-notebook/. Jupyter notebooks also support LaTeX.
Feel free to delete or add as many cells as you need, but please keep your notebook clean and keep your solution to each question directly under that question to avoid confusion.
You may delete the #SOLUTION BEGINS/ENDS HERE comments from the cell blocks; they are just pointers that indicate where to put your solutions.
When you have finished the assignment, remember to rerun your notebook to check that it runs correctly. You can do so by going to Kernel-> Restart & Run All. You may lose points if your solutions do not run successfully.
Click the "Submit" button on the top right corner to turn in your assignment. Your assignment will enter the next phase for peer review.
You are allowed a total of 2 submissions for this assignment. So make sure that you submit your responses carefully. You will be able to come back and resubmit your assignment as long as it is before the start of the peer review period.
Please remember to finish the peer reviews after you have submitted your assignment. You are responsible for grading the work of three of your peers thoroughly and in adherence to the rubrics, and you will be held accountable for your peer grading. There will be a 30% penalty to your grade if you fail to complete one or more peer reviews properly.
Feel free to post your questions, concerns, and feedback on Piazza. We will keep working to improve the process going forward.
Good Luck!
About Package Installation:
Most (if not all) of the packages that you will need to complete this assignment are already installed in this environment. An easy way to check is to run the command library(PackageName). If the command runs successfully, the package was already installed and has been attached to your session. If the command gives an error saying the package was not found, follow the steps below to install the package and attach it:
Use the installed.packages() command to return a table of the packages that are preinstalled in the environment.
To attach a preinstalled library in Vocareum, simply use library(PackageName).
To install a package that does not come with the provided environment, please use the following syntax:
install.packages("PackageName", lib="../work/")
To attach a library you just installed, use the following syntax:
library(PackageName, lib.loc="../work/")
Make sure the library location is the same as in the code snippets above ("../work/").
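The check-then-install steps above can be wrapped in a small helper. The following is only a sketch: `is_installed` and `attach_package` are hypothetical names introduced for illustration, not part of the assignment or of Vocareum, and the install branch assumes that a package repository is reachable from the notebook.

```r
# Hypothetical helper (not part of the assignment): TRUE if `pkg` is found
# in any of the given library directories.
is_installed <- function(pkg, lib_dirs = .libPaths()) {
  pkg %in% rownames(installed.packages(lib.loc = lib_dirs))
}

# Hypothetical helper: attach `pkg`, installing it into `work_dir` first
# only when it is not already available.
attach_package <- function(pkg, work_dir = "../work/") {
  if (is_installed(pkg)) {
    # Preinstalled in the environment: a plain library() call is enough.
    library(pkg, character.only = TRUE)
  } else {
    # Not preinstalled: install into ../work/ (unless already there),
    # then attach from that same location.
    if (!is_installed(pkg, lib_dirs = work_dir)) {
      install.packages(pkg, lib = work_dir)
    }
    library(pkg, character.only = TRUE, lib.loc = work_dir)
  }
  invisible(pkg)
}

# "stats" ships with R, so this call only attaches it.
attach_package("stats")
```

Because `character.only = TRUE` is set, the package name is passed as a string, which makes it easy to loop over a vector of package names.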
Q1. Use airbnb_data for the following questions. Use the code below to read and prepare the data. (32 points total)
# Load the data
airbnb_data = read.csv("../resource/asnlib/publicdata/airbnb_data.csv",header = TRUE)
# Change headers to lower case (my personal preference for all analysis)
names(airbnb_data) = tolower(names(airbnb_data))
# Strip to non-ID fields
removeMe = c("room_id", "survey_id", "host_id", "city")
myDF = airbnb_data[, !(names(airbnb_data) %in% removeMe)]
# Explore the data (structure first, then summary statistics)
str(myDF)
summary(myDF)
# Update room type as factor
myDF$room_type = as.factor(myDF$room_type)
'data.frame': 854 obs. of 6 variables:
$ room_type : chr "Shared room" "Shared room" "Shared room" "Shared room" ...
$ reviews : int 0 32 4 24 152 20 52 14 3 30 ...
$ overall_satisfaction: num 0 5 4.5 4.5 4.5 4.5 4.5 4.5 5 5 ...
$ accommodates : int 4 4 2 6 6 4 5 2 6 5 ...
$ bedrooms : int 1 1 1 1 1 1 1 1 3 2 ...
$ price : int 91 77 35 31 36 29 20 31 51 168 ...
room_type reviews overall_satisfaction accommodates
Length:854 Min. : 0.00 Min. :0.00 Min. : 1.000
Class :character 1st Qu.: 8.00 1st Qu.:4.50 1st Qu.: 2.000
Mode :character Median : 28.00 Median :5.00 Median : 3.000
Mean : 49.11 Mean :4.18 Mean : 3.412
3rd Qu.: 65.00 3rd Qu.:5.00 3rd Qu.: 4.000
Max. :602.00 Max. :5.00 Max. :17.000
bedrooms price
Min. : 0.000 Min. : 20.0
1st Qu.: 1.000 1st Qu.: 75.0
Median : 1.000 Median : 102.0
Mean : 1.352 Mean : 140.9
3rd Qu.: 2.000 3rd Qu.: 153.8
Max. :10.000 Max. :4625.0
A. Fit a multiple linear regression model using price as the response variable, all others as predictor variables, and "Private room" as the base case. (Note: Do not fit a model using id columns and city as predictors).
Which variables are statistically significant at the alpha = 0.01 level? (6 points)
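As a rough illustration of how a base case can be set and significance read off, the sketch below uses a small synthetic data frame (`toy`, with made-up prices; it is not the Airbnb data). It assumes `relevel()` is an acceptable way to choose the reference level and that the p-values are taken from the `Pr(>|t|)` column of the coefficient table.

```r
set.seed(1)

# Synthetic stand-in for the real data: three room types and one numeric predictor.
toy <- data.frame(
  room_type = factor(rep(c("Entire home/apt", "Private room", "Shared room"),
                         each = 20)),
  bedrooms  = sample(1:3, 60, replace = TRUE)
)
toy$price <- 100 + 40 * toy$bedrooms -
  30 * (toy$room_type == "Private room") -
  60 * (toy$room_type == "Shared room") +
  rnorm(60, sd = 10)

# Make "Private room" the reference (base) level of the factor.
toy$room_type <- relevel(toy$room_type, ref = "Private room")

# Regress price on all remaining columns.
fit <- lm(price ~ ., data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(fit)$coefficients

# Names of the terms whose p-value is below alpha = 0.01.
alpha <- 0.01
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < alpha]
print(significant)
```

After releveling, the base level no longer has its own row in the coefficient table; every other room type is reported as a difference from it.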
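Estimate: This is the estimated coefficient for each variable. For a numeric variable, it is the amount by which the response (here, price) is expected to increase (or decrease, if the coefficient is negative) when that variable increases by one unit, holding the other predictors constant. For a factor or dummy variable, it is the difference in the response relative to the reference category (here, "Entire home/apt" is most likely the reference category, since it does not appear in the output; by default R uses the alphabetically first level as the reference). To use "Private room" as the base case, as part A asks, the factor has to be releveled before fitting.
Std. Error: This is the standard error of the coefficient estimate. It tells us how precise (stable) the estimate is.
t value: This is the t statistic, obtained by dividing the coefficient estimate by its standard error. It is used to test whether a given coefficient differs significantly from 0.
The line under "Signif. codes" shows how p-value ranges map to the significance markers.
Other information below the table, such as "Residual standard error" and "Multiple R-squared", describes the overall fit of the model.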
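To make the relationship between these columns concrete, the sketch below recomputes the t values and two-sided p-values by hand. It uses R's built-in `mtcars` data purely as a convenient example; the model itself has nothing to do with the assignment.

```r
# Any fitted lm works here; mtcars ships with R.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

# t value = Estimate / Std. Error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

# Side-by-side comparison with the values reported by summary().
print(cbind(t_reported = coefs[, "t value"], t_manual,
            p_reported = coefs[, "Pr(>|t|)"], p_manual))
```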
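For 'entire home' (dummy variable) and 'bedrooms':
room_typeShared room (compared with "Entire home/apt"): The coefficient is -74.31, which means that, holding all other variables constant, a shared room is on average 74.31 units cheaper than an entire home/apt (Entire home/apt being the baseline).
'entire home' (dummy variable): The output does not report the "Entire home" versus "Private room" contrast directly, but it can be recovered from the Private room coefficient. With "Entire home/apt" as the baseline, the Private room coefficient is the Private room price minus the Entire home/apt price, so the reverse contrast is simply its negative. Assuming the Private room coefficient is -2.18 (the absolute value quoted is 2.18), an entire home/apt costs on average 2.18 units more than a private room, holding all other variables constant. The Shared room coefficient (-74.31) does not enter this comparison; it describes shared rooms relative to entire homes, so a shared room is on average 74.31 - 2.18 = 72.13 units cheaper than a private room.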
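The way contrasts change with the reference level can be checked numerically. The sketch below fits the same model twice on a synthetic data frame (`toy`, made-up numbers, not the Airbnb data), once with each base level, and compares the coefficients.

```r
set.seed(2)

# Synthetic data with the same three room types.
toy <- data.frame(
  room_type = factor(rep(c("Entire home/apt", "Private room", "Shared room"),
                         times = 30)),
  bedrooms  = sample(1:4, 90, replace = TRUE)
)
toy$price <- 120 + 35 * toy$bedrooms -
  25 * (toy$room_type == "Private room") -
  70 * (toy$room_type == "Shared room") +
  rnorm(90, sd = 15)

# Default reference level: "Entire home/apt" (first level alphabetically).
fit_entire <- lm(price ~ room_type + bedrooms, data = toy)

# Same model with "Private room" as the reference level.
toy_private <- toy
toy_private$room_type <- relevel(toy$room_type, ref = "Private room")
fit_private <- lm(price ~ room_type + bedrooms, data = toy_private)

# Private vs. entire (from the first fit) and entire vs. private (from the second)
# are the same contrast with opposite sign.
private_vs_entire <- coef(fit_entire)["room_typePrivate room"]
entire_vs_private <- coef(fit_private)["room_typeEntire home/apt"]

# Shared vs. private is the difference of the two dummy coefficients in the first fit.
shared_vs_private <- coef(fit_entire)["room_typeShared room"] -
  coef(fit_entire)["room_typePrivate room"]

print(c(private_vs_entire, entire_vs_private, shared_vs_private))
```

Changing the reference level only reparameterizes the model: fitted values and the coefficients of the other predictors stay the same.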
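Q2. Use the direct_marketing data set to answer the following questions. (12 points total)
Create indicator variables for the 'history' column. Treat None as the base case (i.e., Low = 1 if history is Low, 0 otherwise; medium = 1 if history is Medium, 0 otherwise; and so on), and create interaction terms lowsalary, mediumsalary, and highsalary based on the customer's history type, e.g., mediumsalary = medium x salary, etc.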
# dplyr provides %>% and mutate(), which are used below
library(dplyr)
# Read in the data frame
mydm <- read.csv("../resource/asnlib/publicdata/direct_marketing.csv", header = TRUE)
# Change headers to lower case (my personal preference for all analysis)
names(mydm) = tolower(names(mydm))
# Create the indicator and interaction columns described in the question
mydm2 <- mydm %>%
  mutate(low = ifelse(history == "Low", 1, 0)) %>%
  mutate(medium = ifelse(history == "Medium", 1, 0)) %>%
  mutate(high = ifelse(history == "High", 1, 0)) %>%
  mutate(lowsalary = salary * low) %>%
  mutate(mediumsalary = salary * medium) %>%
  mutate(highsalary = salary * high)
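A. Fit a multiple linear regression model using 'amountspent' as the response variable and the following predictors: age, ownhome, salary, low, medium, high, lowsalary, mediumsalary, highsalary. Print the summary. (6 points)
B. Based on the model built in part 2(a), what is the predicted amount spent by a middle-aged renter for each history type (None, Low, Medium, and High), assuming a salary of $75,000? (6 points)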
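One possible way to obtain predictions of this kind is `predict()` with a `newdata` frame that contains one row per history type. The sketch below runs on synthetic data (`toy`, made-up numbers); the category labels "Middle" and "Rent" are assumptions about how age and ownhome are coded and should be checked against the actual file.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for the direct marketing data (labels are assumed).
toy <- data.frame(
  age     = sample(c("Young", "Middle", "Old"), n, replace = TRUE),
  ownhome = sample(c("Own", "Rent"), n, replace = TRUE),
  salary  = round(runif(n, 20000, 120000)),
  history = sample(c("None", "Low", "Medium", "High"), n, replace = TRUE)
)

# Indicator variables (base case: None) and their interactions with salary.
toy$low          <- as.numeric(toy$history == "Low")
toy$medium       <- as.numeric(toy$history == "Medium")
toy$high         <- as.numeric(toy$history == "High")
toy$lowsalary    <- toy$salary * toy$low
toy$mediumsalary <- toy$salary * toy$medium
toy$highsalary   <- toy$salary * toy$high
toy$amountspent  <- 200 + 0.01 * toy$salary + 300 * toy$high + rnorm(n, sd = 50)

fit <- lm(amountspent ~ age + ownhome + salary + low + medium + high +
            lowsalary + mediumsalary + highsalary, data = toy)

# One row per history type for a middle-aged renter earning 75,000.
history_types <- c("None", "Low", "Medium", "High")
new_customers <- data.frame(
  age     = "Middle",
  ownhome = "Rent",
  salary  = 75000,
  low     = as.numeric(history_types == "Low"),
  medium  = as.numeric(history_types == "Medium"),
  high    = as.numeric(history_types == "High")
)
new_customers$lowsalary    <- new_customers$salary * new_customers$low
new_customers$mediumsalary <- new_customers$salary * new_customers$medium
new_customers$highsalary   <- new_customers$salary * new_customers$high

preds <- predict(fit, newdata = new_customers)
names(preds) <- history_types
print(preds)
```

The difference between the Low and None predictions equals the low coefficient plus 75,000 times the lowsalary coefficient, which is a convenient hand check on the output.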

