Homework 1 Data Analysis

AI悦创 · Original · September 8, 2024

1. Submit your responses in a single knitted HTML file by Sept 8 at 11:59 p.m. Eastern. Follow the Homework Structure, which requires the model outputs, plots, and code for all questions in that one file.

2. The solution will be made available on Sept 10 at 1:00 a.m. Eastern. You will need to assess three of your peers to complete the peer assessment by Sept 15 at 11:59 p.m. Eastern. Students who do not submit the graded assessments before the deadline will receive a zero on their own Data Analysis.

3. Before you begin, please review the following document: DONT_CHEAT.docx. Cheating does not help you learn or succeed; it does the opposite. While you may collaborate with other students, please submit your own responses to the data analysis questions. Please refrain from consulting prior homework solutions or other materials that provide answers to the data analysis questions. Any case identified as potential plagiarism will result in a zero grade for the assignment and will be reported to the OMSA program.

To maintain the integrity of this course:

1. Do not plagiarize (even on a single question). This is an automatic zero for the entire HW, and your lowest HW grade will not be dropped.

2. Do not use any AI tools such as ChatGPT or Copilot.

3. Non-HTML submissions are not accepted. Submitting the wrong file type is an automatic zero.

Data Analysis (60 points)

Refer to the Homework Structure for Data Analysis and Peer Review expectations.

For this first assignment, you are provided with starter templates: an R Markdown file and Jupyter Notebooks in R and Python.

Fall2024-HW1-Starter-Template-Python.ipynb

Fall2024-HW1-Starter-Template-R.Rmd

Fall2024-HW1-Starter-Template-R.ipynb

Dataset

large_sales_dataset.csv

Question 1: Data Analysis (8 points)

For this question, use the "Sales_dataset"

Q1a) (4 points) Output a table that has both the average and median Purchase_Amount grouped by Product_Category, Customer_Gender, and Customer_Age group. Show the last 15 rows of the table.

Note: For the age group, use the following bins:

'0-18', '19-25', '26-35', '36-45', '46-55', '56-65', '66-75', '76-85', '86-100'
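As a rough illustration of the binning and grouping described above (not a solution), the sketch below uses only the Python standard library on a few made-up rows. The bin edges are an assumption read off the labels (each bin includes its upper bound), and the `rows` data, `age_group`, and `summary` names are hypothetical; with the starter templates you would more likely use the grouping tools of pandas or R.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean, median

# Bin labels from the assignment. The upper edges are an assumption
# inferred from the labels: each bin includes its upper bound.
LABELS = ['0-18', '19-25', '26-35', '36-45', '46-55',
          '56-65', '66-75', '76-85', '86-100']
UPPER_EDGES = [18, 25, 35, 45, 55, 65, 75, 85, 100]


def age_group(age):
    """Return the label of the first bin whose upper edge is >= age."""
    if not 0 <= age <= 100:
        raise ValueError(f"age {age} is outside the 0-100 range")
    return LABELS[bisect_left(UPPER_EDGES, age)]


# Made-up rows: (Product_Category, Customer_Gender, Customer_Age, Purchase_Amount)
rows = [
    ("Electronics", "Female", 23, 120.0),
    ("Electronics", "Female", 25, 80.0),
    ("Electronics", "Male", 40, 200.0),
    ("Clothing", "Male", 18, 30.0),
    ("Clothing", "Male", 17, 50.0),
    ("Clothing", "Male", 12, 100.0),
]

# Collect purchase amounts per (category, gender, age group) key.
groups = defaultdict(list)
for category, gender, age, amount in rows:
    groups[(category, gender, age_group(age))].append(amount)

# One (mean, median) pair per group, sorted by key.
summary = {key: (mean(vals), median(vals)) for key, vals in sorted(groups.items())}

for key, (avg, med) in summary.items():
    print(key, avg, med)
```

Note that an age of exactly 18 falls in '0-18' and 19 falls in '19-25' under this edge convention; a different convention would shift boundary ages into the neighboring bin.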

Q1b)(2 points) Which customer age group has the highest total purchase amount across all categories?

Q1c)(2 points) Which product category has the highest average satisfaction score, and in which location is it most frequently purchased?

Question 2: Time Series Analysis (8 points)

For this question, use "Sales_dataset"

Q2a) (3 points) Calculate the monthly average sales per product category and plot them.

Q2b) (5 points) Calculate the rolling means of the average monthly sales (calculated in Q2a) using a 3-month window. Use center alignment when calculating the rolling means.

Then plot the monthly average sales of each product category again, along with the rolling means.

i) What difference do you observe? How are the rolling means beneficial as compared to simple averages?
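To make "3-month window with center alignment" concrete, here is a minimal sketch in plain Python. The `monthly_avg` values are made up, and `centered_rolling_mean` is a hypothetical helper rather than anything from the starter templates. Each output value averages the month itself with one month on either side, so the first and last months have no complete window and are left empty here.

```python
def centered_rolling_mean(values, window=3):
    """Centered rolling mean; positions without a full window get None."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd window sizes")
    half = window // 2
    out = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            # Not enough neighbors on one side for a complete window.
            out.append(None)
        else:
            chunk = values[i - half:i + half + 1]
            out.append(sum(chunk) / window)
    return out


# Made-up monthly averages for one product category.
monthly_avg = [100.0, 130.0, 90.0, 140.0, 110.0]
rolling = centered_rolling_mean(monthly_avg)
print(rolling)
```

In pandas, `Series.rolling(window=3, center=True).mean()` follows the same convention and reports missing values at both ends by default.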

Question 3: Data Exploration (8 points)

For this question, use trainData_sales

Q3a)(4 points) Create a boxplot of the response variable versus the following predicting variables

i) Store_Location

ii) Product_Category

Explain the relationship between the response and the two variables based on the boxplots.

Q3b)(4 points) Create scatterplots of the response variable against the following predictors:

i) Customer_Age

ii) Purchase_Amount

Describe the general trend of each plot.

Output the R^2 for each plot. Use the following R^2 cut-offs when explaining whether the relationship is weak, moderate, or strong.

R^2<=0.3 (weak)

0.3<R^2<0.7 (moderate)

R^2>=0.7 (strong)
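The cut-offs above can be applied mechanically once R^2 is known. The sketch below computes R^2 for a simple linear regression as the squared Pearson correlation and labels it using the thresholds listed above. The `x` and `y` values are made up, and `r_squared` and `strength` are hypothetical helper names.

```python
def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def strength(r2):
    """Label an R^2 value using the assignment's cut-offs."""
    if r2 <= 0.3:
        return "weak"
    if r2 < 0.7:
        return "moderate"
    return "strong"


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
print(round(r2, 3), strength(r2))
```

The boundary values matter here: exactly 0.3 counts as weak and exactly 0.7 counts as strong, which is why the comparisons use `<=` and `<` in that order.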

Question 4: Simple Linear Regression and ANOVA (12 points)

For this question, use trainData_sales

Q4a) (5 points) Create a simple linear regression model using the predictor "Purchase_Amount". Call it model1. Display the summary.

i) What are the model parameters and their estimates?

ii) Interpret the coefficient of the predictor "Purchase_Amount" in the context of the problem.

iii) Find a 95% confidence interval for the coefficient of "Purchase_Amount". Is the coefficient significant at this level?

Q4b)(3 points) Is the coefficient of Purchase_Amount statistically significant?

i) State the null hypothesis

ii) State the alternative hypothesis

iii) Which test is used for testing the significance of the coefficient?

Q4c) (4 points) Perform an ANOVA F-test on the means of Product_Category.

i) State the null and alternative hypotheses.

ii) Using an α-level of 0.01, do we reject the null hypothesis that the means are equal? Explain your conclusion.

iii) Which means are plausibly similar at the confidence level of 99%?

iv) Compare the satisfaction score of the following pair of product categories:

Electronics-Clothing

Question 5: Model Diagnostics (14 points)

Q5a)(4 points) Perform the following model diagnostics on model1 created in Q4a.

i) Check for linearity assumption

ii) Check for constant variance

iii) Check for normality

Note: Both a histogram and a normal QQ plot with a pointwise confidence envelope must be plotted (tip: qqPlot() from the car package can generate a pointwise confidence envelope).

Explain your conclusion.

Q5b)(1 point) Based on your conclusion in Q5a, would you propose any transformation of the predictor or response variable? Explain with reasoning.

Q5c) (2 points) Use a Box-Cox transformation (boxCox() in the car package or boxcox() in the MASS package) to find the optimal λ value, rounded to the nearest half integer. What transformation of the response, if any, does it suggest?
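The rounding step can be sketched as follows, assuming "nearest half integer" means the nearest multiple of 0.5 (for example 0.37 becomes 0.5). That reading is an assumption, the λ values below are made up, and `nearest_half` is a hypothetical helper.

```python
def nearest_half(lam):
    """Round to the nearest multiple of 0.5 (ties follow Python's round())."""
    return round(lam * 2) / 2


# Made-up lambda estimates, not output from any real Box-Cox fit.
for lam in (0.37, -0.12, 0.9, -0.6):
    print(lam, "->", nearest_half(lam))
```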

Q5d)(2 points) Create a linear regression model, named model2, that uses the log transformed Satisfaction_Score as the response, and the log transformed Purchase_Amount as the predictor. Display the summary.

Q5e)(2 points) Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?

Q5f) (3 points) Perform the same model diagnostics on model2 as you did on model1 in Q5a. Assess and interpret all model assumptions. Based on your interpretation of the model assumptions, is model2 a better fit than model1? A model is considered a good fit only if all the assumptions hold.

Question 6: Prediction (5 points)

For this question, use testData_sales

Q6a)(3 points) Using testData_sales, predict the satisfaction score using model1.

Show the first 10 predictions along with their true values.
Calculate the mean squared prediction error.
Calculate the average and standard deviation of the model1 predictions of the satisfaction score.
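For reference, the mean squared prediction error is the average of the squared differences between the true and predicted values. A minimal sketch with made-up numbers follows; `y_true`, `y_pred`, and `mspe` are hypothetical names, and in the assignment the predictions would come from model1 applied to testData_sales.

```python
from statistics import mean, stdev


def mspe(y_true, y_pred):
    """Mean squared prediction error: average of (true - predicted)^2."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Made-up true values and predictions for illustration only.
y_true = [3.0, 4.0, 5.0, 2.0]
y_pred = [2.5, 4.5, 4.0, 3.0]

error = mspe(y_true, y_pred)
pred_mean = mean(y_pred)
pred_sd = stdev(y_pred)  # sample standard deviation (n - 1 denominator)
print(error, pred_mean, round(pred_sd, 4))
```

`statistics.stdev` uses the n - 1 denominator, the same convention as R's sd().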

Q6b)(2 points) Using the first row of testData_sales, predict the satisfaction score using model1. What is the 99% prediction interval (PI)? Provide an interpretation of your results.

Question 7: Research Question (2 points)

(2 points) Research and explain the arguments that go into the predict function in detail. Please include object, newdata, interval, type, se.fit, and level.

Question 8: Changing the Baseline (3 points)

(3 points) In the pre-processing of the data we ran Sales_dataset$Product_Category = as.factor(Sales_dataset$Product_Category). Using both code and descriptions, explain how you can tell what the baseline is and how you can change the baseline for Product_Category.

Grading Rubric

HW1 Peer Assessment Rubric (2)

Criterion	Ratings	Points
Question 1a	4: both code and table are correct; 3: correct approach but output is wrong; 2: incomplete code but correct approach; 0: all wrong	4 points
Question 1b	2: all correct; 1: correct approach but output is wrong, minor mistake in code; 0: unanswered or incorrect	2 points
Question 1c	2: all correct; 1: correct approach but output is wrong, minor mistake in code; 0: incorrect	2 points
Question 2a	3: all correct; 2: correct monthly avg output but plot is wrong or not present; 1: minor mistake in code affecting the output; 0: none correct	3 points
Question 2b	5: all correct; 4: correct plots, wrong explanation of differences; 2: one plot correct; 0: none correct	5 points
Question 3a	4: all correct; 2: boxplots correct, wrong explanation OR 1 boxplot correct with explanation; 0: all wrong	4 points
Question 3b	4: all correct; 2: correct scatterplot, wrong explanation; 0: none correct	4 points
Question 4a	5: all correct; 2.5: 2/3 parts correct; 1: 1/3 correct; 0: none correct	5 points
Question 4b	3: all correct; 2: 2/3 correct; 1: 1/3 correct; 0: none correct	3 points
Question 4c	4: all correct; 3: 3/4 correct; 2: 2/4 correct; 1: 1/4 correct (if iii and iv are wrong, give 1 point); 0: none correct	4 points
Question 5a	4: full marks; 3: correct but wrong explanation; 2: partially correct output and explanation; 0: no marks (none correct)	4 points
Question 5b	1: full marks; 0.5: partially correct; 0: no marks	1 point
Question 5c	2: full marks (correct code and lambda); 1: partially correct; 0: no marks	2 points
Question 5d	2: full marks; 1: partially correct; 0: no marks	2 points
Question 5e	2: full marks (correct comparison); 0: no marks	2 points
Question 5f	3: full marks; 1.5: partially correct; 0: no marks	3 points
Question 6a	3: full marks; 2: minor mistake affecting the output; 1.5: partially correct; 0: no marks	3 points
Question 6b	2: full marks (all correct); 1: partially correct; 0: no marks	2 points
Question 7	2: full marks (all correct); 1: partially correct; 0: no marks	2 points
Question 8	3: full marks (all correct); 1.5: partially correct; 0: no marks	3 points
Total points: 60 out of 60
