Test 2 Data Science II
Data Science (2)
A data file, icecream.csv, has been provided which contains information on daily ice cream sales for a beach ice cream stall. The dataset contains the following information:
- average temperature of the day in °C
- number of ice creams sold
- whether or not it rained on the day (1=True, 0=False)
- number of hours of sunshine on the day
!pip install pandas scikit-learn seaborn matplotlib
Exercise 1
(a) Load the data from icecream.csv and print out a quick statistical summary of the dataset, making sure that the summary will include all data types. (1 mark)
#### ANSWER HERE
import pandas as pd
# Load the dataset
icecream_data = pd.read_csv("data/icecream.csv")
# describe() returns summary statistics; include='all' covers every data type, not just numeric columns
summary = icecream_data.describe(include='all')
summary
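To see why include='all' matters, here is a minimal sketch on a made-up frame (the column names and values are invented for illustration and are not taken from icecream.csv). By default describe() summarises only the numeric columns, while include='all' reports every column.

```python
import pandas as pd

# Hypothetical toy frame, not the real icecream.csv
toy = pd.DataFrame({
    "temp": [18.5, 22.0, 25.5],
    "weather": ["rain", "sun", "sun"],  # non-numeric column
})

default_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # every column, NaN where a statistic does not apply

print(default_summary.columns.tolist())
print(full_summary.columns.tolist())
```

If every column in a dataset is numeric, the two calls should report the same columns, so the argument mainly guards against silently dropping text or categorical columns.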
(b) Print the correlation matrix and programmatically identify the correlation between the number of ice creams sold and the temperature. (1 mark)
#### ANSWER HERE
# Compute the correlation matrix of the dataset
correlation_matrix = icecream_data.corr()
# Look up the correlation between ice cream sales and temperature with .loc
ice_cream_temp_correlation = correlation_matrix.loc['ice_cream', 'temp']
# Print that correlation
print(ice_cream_temp_correlation)
correlation_matrix
#### ANSWER HERE
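The correlation between ice cream sales and temperature is 0.876, which indicates a strong positive relationship: as the temperature rises, ice cream sales tend to rise too.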
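As a cross-check on what the matrix lookup returns, the sketch below compares the value read with .loc against the Pearson coefficient computed from its definition. The numbers are made up for illustration; only the column names mirror the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; column names mirror the exercise
toy = pd.DataFrame({
    "temp": [15.0, 18.0, 21.0, 24.0, 27.0],
    "ice_cream": [20, 35, 38, 60, 71],
})

# Value read from the correlation matrix, as in the answer above
from_matrix = toy.corr().loc["ice_cream", "temp"]

# The same Pearson coefficient computed from its definition
x = toy["temp"] - toy["temp"].mean()
y = toy["ice_cream"] - toy["ice_cream"].mean()
by_hand = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(from_matrix, by_hand)
```

The matrix is symmetric, so .loc['temp', 'ice_cream'] would give the same number.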
Exercise 2
The correlation matrix indicated a moderate positive correlation between number of hours of sun and number of ice cream sales, in addition to the higher correlation between temperature and ice cream sales. Let's see whether we can use both temperature and number of hours of sun to predict the number of ice cream sales.
(a) To prepare for this, create a testing and a training dataset using the helper method available in scikit-learn. (2 marks)
Hint: Remember that you need to use both temperature and number of hours of sun to predict the number of ice cream sales. This means that your training data will have two features.
#### ANSWER HERE
# conda install scikit-learn
# Import the train_test_split helper from scikit-learn
from sklearn.model_selection import train_test_split
# Define the feature matrix X and the target y
X = icecream_data[['temp', 'sun_hrs']]
y = icecream_data['ice_cream']
# Split into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Show the shapes of the resulting sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# The training set has 80 samples, each with 2 features.
# The test set has 20 samples, each with 2 features.
# The training target has 80 values.
# The test target has 20 values.
(b) Set up a linear regression model and train it on the training dataset that you have just created. (1 mark)
#### ANSWER HERE
# Import the LinearRegression class from scikit-learn
from sklearn.linear_model import LinearRegression
# Create a linear regression model instance
lr_model = LinearRegression()
# Train the model on the training dataset
lr_model.fit(X_train, y_train)
(c) Use the model to predict the number of ice cream sales in the test dataset. (1 mark)
#### ANSWER HERE
# Use the trained model to predict sales for the test set
y_pred = lr_model.predict(X_test)
y_pred
(d) Print out the R2 score for this model. (1 mark)
#### ANSWER HERE
from sklearn.metrics import r2_score
# Compute the R2 score (true values first, predictions second)
r2 = r2_score(y_test, y_pred)
r2
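For reference, the R2 score can also be computed from its definition, 1 - SS_res / SS_tot. The sketch below uses made-up observed and predicted values (not results from the ice cream model) and checks the hand computation against r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values, for illustration only
y_true = np.array([30.0, 42.0, 55.0, 61.0, 78.0])
y_hat = np.array([33.0, 40.0, 52.0, 65.0, 75.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

Because SS_tot is computed from the true values, the argument order of r2_score matters: swapping the two arrays generally gives a different number.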
Exercise 3
The original ice cream dataset contained an outlier, which was subsequently removed from the icecream.csv file.
Run the cell below to load in the unmodified icecream dataset and display the relationship between temperature and number of ice creams.
import seaborn as sns
import pandas as pd
%matplotlib inline
icecream_outlier = pd.read_csv("data/icecream_outlier.csv")
sns.relplot(data=icecream_outlier,x="temp",y="ice_cream")
There appears to be one case where, possibly due to an imputation error, the number of ice creams is too high by an order of magnitude. This was dealt with originally by removing this record from the dataset, but suppose we want to treat this case by modifying the sample weight of this point.
We can down-weight the outlier relative to other records in the dataset, giving this outlier a weight 10 times lower than the remaining data.
(a) Create a new column called weight in this DataFrame. Entries in this new column should have a value of 1, except where the number of ice creams is higher than 300, in which case the value should be 0.1. (1 mark)
#### ANSWER HERE
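# Give every row a weight of 1.0 (a float, so that 0.1 can be stored in the same column)
icecream_outlier['weight'] = 1.0
# Down-weight rows where more than 300 ice creams were sold
icecream_outlier.loc[icecream_outlier['ice_cream'] > 300, 'weight'] = 0.1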
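The same column can be built in one step with numpy.where. This is only an alternative sketch; the toy frame below is a hypothetical stand-in for icecream_outlier with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for icecream_outlier with one extreme value
toy = pd.DataFrame({"temp": [20.0, 25.0, 30.0], "ice_cream": [40, 55, 600]})

# 0.1 where sales exceed 300, otherwise 1.0
toy["weight"] = np.where(toy["ice_cream"] > 300, 0.1, 1.0)

print(toy)
```

Both branches are floats here, so the resulting column has a float dtype from the start.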
(b) Set up and use a new linear regression model to predict ice cream sales using this dataset and print out the resulting R2 score. To do this you can copy your answers to Exercise 2 below, modifying your code to include the new weight column as a sample weight for your data. (2 marks)
#### ANSWER HERE
# Define the feature (temperature only) and the target
X_outlier = icecream_outlier[['temp']]
y_outlier = icecream_outlier['ice_cream']
# Take the sample weights from the dataset
sample_weight = icecream_outlier['weight']
# Split into training and testing sets with train_test_split
X_train_outlier, X_test_outlier, y_train_outlier, y_test_outlier = train_test_split(X_outlier, y_outlier, test_size=0.2, random_state=42)
# Create a LinearRegression instance
regressor_outlier = LinearRegression()
# Fit the model, passing the weights of the training rows (selected by index label) as sample_weight
regressor_outlier.fit(X_train_outlier, y_train_outlier, sample_weight=sample_weight.loc[X_train_outlier.index])
# Predict the target for the test set
y_pred_outlier = regressor_outlier.predict(X_test_outlier)
# Compute the R2 score
r2_outlier = r2_score(y_test_outlier, y_pred_outlier)
r2_outlier
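To illustrate what sample_weight does to the fit, the sketch below uses synthetic data (not the ice cream dataset): points on the line y = 2x + 1, with the last value inflated tenfold. It compares the slope fitted with and without a 0.1 weight on the outlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 exactly, except the last point is 10 times too high
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
y[-1] *= 10

# Weight 1.0 everywhere, 0.1 for the outlier
weights = np.ones_like(y)
weights[-1] = 0.1

unweighted = LinearRegression().fit(X, y)
weighted = LinearRegression().fit(X, y, sample_weight=weights)

print("slope without weights:", unweighted.coef_[0])
print("slope with weights:   ", weighted.coef_[0])
```

Down-weighting reduces the outlier's pull on the line but does not remove it, so the weighted slope moves toward 2 without reaching it. r2_score also accepts a sample_weight argument if the evaluation should be weighted as well.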
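Notes
When splitting a dataset with train_test_split, test_size and random_state are two commonly used parameters:
test_size: the size of the test set.
- If test_size is a float, it is the proportion of the dataset to put in the test set. For example, test_size=0.2 means the test set holds 20% of the original data and the training set holds the remaining 80%.
- If test_size is an integer, it is the exact number of samples in the test set.
random_state: a seed that controls the randomness of the split.
- Using the same random_state gives the same split every time the code is run, which makes experimental results reproducible.
- If random_state is not set, each call to train_test_split may produce a different split.
- The specific value of random_state does not matter, as long as it is kept consistent across experiments. 42 is a conventional choice, but any other integer works.
In short, test_size=0.2 uses 20% of the original data as the test set and the remaining 80% as the training set, and random_state=42 ensures the same split on every run.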
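A short sketch of both parameters on dummy data (the arrays below are arbitrary placeholders, not the ice cream dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy samples with 2 features each (values are arbitrary)
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Float test_size: a fraction of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)

# Integer test_size: an exact number of test samples
_, X_test_int, _, _ = train_test_split(X, y, test_size=15, random_state=42)
print(X_test_int.shape)

# Same random_state gives the same split again
_, _, _, y_test_again = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(y_test, y_test_again))
```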
