House Prices - Advanced Regression Techniques

AI悦创 · Original · 2024/12/15

Overview:

In this experiment report, you'll complete two parts in order:

First, complete a coding task (i.e., House Prices Prediction), and write iPython/Jupyter code based on the tools provided. You'll submit your code as a ZIP to the email address below.

Second, write a report based on your experimental results. The report should include a complete experimental analysis covering data processing, model training, test evaluation, and references. You'll submit this Word report separately. It should be approximately 1,000 words (±10%).

Each student must submit both parts. The task combines theory and programming so that students can appreciate and learn both aspects of machine learning.

Turn in:

Word report turned in to: chxwei@cuc.edu.cn
ZIP file of source code turned in to: chxwei@cuc.edu.cn

Code Task: House Prices Prediction

1. Goal: It is your task to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
2. Dataset Description: The Kaggle House Price Prediction Dataset
   kaggle_house_price_train.csv: the training and test sets.
   data_description.txt: full description of each column.
   Data link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
3. Critically, to properly assess your model's forecasting ability, you will:

(1) Shuffle the data and take the first 70% as the training set and the remaining 30% as the test set.

(2) Use LotArea, BsmtUnfSF, and GarageArea as the model's input features respectively, and SalePrice as the model's output.

(3) On the training set, use the least squares method to solve for the model parameters (must be implemented by yourself, third-party libraries are not allowed).

(4) Calculate the MAE and RMSE metrics for the model on the test set (must be implemented by yourself, third-party libraries are not allowed).

(5) Plot the curves of the models on both the training set and the test set. Particularly, plot the fitt

https://colab.research.google.com/drive/1IquttISLsfYaz3iYKZ8CnBUrDYCwADIE?usp=sharing

# -*- coding: utf-8 -*-
"""肖羡月作业.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1IquttISLsfYaz3iYKZ8CnBUrDYCwADIE

# Step 1: Download Chinese fonts
"""

!apt-get -qq update
!apt-get -qq install fonts-noto-cjk

# !fc-list :lang=zh
!rm -rf ~/.cache/matplotlib

# import matplotlib.pyplot as plt
# from matplotlib.font_manager import FontProperties

# fp = FontProperties(fname='/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc')

# font_name = fp.get_name()
# plt.rcParams['font.family'] = font_name
# plt.rcParams['font.sans-serif'] = [font_name]
# plt.rcParams['axes.unicode_minus'] = False

# plt.title("中文标题测试", fontproperties=fp)
# plt.xlabel("横轴标签", fontproperties=fp)
# plt.ylabel("纵轴标签", fontproperties=fp)
# plt.plot([1, 2, 3], [4, 5, 6])
# plt.show()

"""# Step 2: Load and preprocess the data"""

import pandas as pd
import numpy as np

# Load the data
train_path = 'https://github.com/AndersonHJB/AndersonHJB.github.io/releases/download/V0.05/10-train.csv'

# Read the training data
train_data = pd.read_csv(train_path)

# Inspect the first few rows of the training data
train_data.head()

# Keep only the three features required by the assignment plus the target variable
selected_features = ['LotArea', 'BsmtUnfSF', 'GarageArea', 'SalePrice']
data = train_data[selected_features].dropna()  # drop rows that contain missing values

# Shuffle the rows (fixed seed so the split is reproducible)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Split 70% / 30% into training and test sets
split_index = int(len(data) * 0.7)
train_set = data.iloc[:split_index]  # first 70% as the training set
test_set = data.iloc[split_index:]  # remaining 30% as the test set

# Separate the features and the target variable for both sets
X_train = train_set[['LotArea', 'BsmtUnfSF', 'GarageArea']].values
y_train = train_set['SalePrice'].values
X_test = test_set[['LotArea', 'BsmtUnfSF', 'GarageArea']].values
y_test = test_set['SalePrice'].values

# Check the dimensions of the training and test sets
# (X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# Print the dimensions of the training and test sets
print(f"Training set features: {X_train.shape}, target: {y_train.shape}")
print(f"Test set features: {X_test.shape}, target: {y_test.shape}")

"""# Step 3: Linear regression via least squares"""

# Implement the least squares method
def least_squares(X, y):
    """
    Compute the linear regression parameters θ = (X^T X)^(-1) X^T y
    """
    # Add the bias (intercept) column
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    # Compute the parameters
    theta = np.linalg.inv(X.T @ X) @ X.T @ y  # closed-form least squares formula
    return theta

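# Train the model to obtain the parameters theta
theta = least_squares(X_train, y_train)

# Print the model parameters
print("Model parameters θ obtained from training:")
print(theta)

"""# Step 4: Prediction function"""

# Define the prediction function
def predict(X, theta):
    """
    Predict the target variable y with the linear model θ
    """
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # add the bias (intercept) column
    return X @ theta

# Predict on the training and test sets
y_train_pred = predict(X_train, theta)
y_test_pred = predict(X_test, theta)

"""# Step 5: Compute the performance metrics"""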

# Define the performance metric functions
def mae(y_true, y_pred):
    """
    Mean absolute error (MAE)
    """
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """
    Root mean squared error (RMSE)
    """
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Compute MAE and RMSE on the training and test sets
mae_train = mae(y_train, y_train_pred)
rmse_train = rmse(y_train, y_train_pred)
mae_test = mae(y_test, y_test_pred)
rmse_test = rmse(y_test, y_test_pred)

# (mae_train, rmse_train, mae_test, rmse_test)
# Print the results
print(f"Training set MAE: {mae_train:.2f}, RMSE: {rmse_train:.2f}")
print(f"Test set MAE: {mae_test:.2f}, RMSE: {rmse_test:.2f}")

"""# Step 6: Plot the fit"""

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

fp = FontProperties(fname='/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc')

# Plot predicted against actual values for the training and test sets
plt.figure(figsize=(12, 6))

# Training set fit
plt.subplot(1, 2, 1)
plt.scatter(y_train, y_train_pred, alpha=0.5, label='Predicted')
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', label='Actual = Predicted')
plt.xlabel("Actual", fontproperties=fp)
plt.ylabel("Predicted", fontproperties=fp)
plt.title("Training set fit", fontproperties=fp)
plt.legend(prop=fp)

# Test set fit
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_test_pred, alpha=0.5, label='Predicted')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Actual = Predicted')
plt.xlabel("Actual", fontproperties=fp)
plt.ylabel("Predicted", fontproperties=fp)
plt.title("Test set fit", fontproperties=fp)
plt.legend(prop=fp)

plt.tight_layout()
plt.show()

"""# Step 7: Generate predictions for test.csv"""

# Load the test.csv data
test_path = 'https://github.com/AndersonHJB/AndersonHJB.github.io/releases/download/V0.05/09-test.csv'
test_data = pd.read_csv(test_path)

# Select the features from test.csv
X_submit = test_data[['LotArea', 'BsmtUnfSF', 'GarageArea']].fillna(0).values  # fill any missing values with 0

# Predict on the test.csv data
y_submit_pred = predict(X_submit, theta)

# Assemble the results into a submission table
submission = pd.DataFrame({
    'Id': test_data['Id'],  # house IDs provided by test.csv
    'SalePrice': y_submit_pred  # predicted sale prices
})

# Save as a CSV file
submission_file = 'submission.csv'
submission.to_csv(submission_file, index=False)

print(f"Predictions saved to {submission_file}")

"""# Experiment Report: House Price Prediction

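## 1. Introduction

House price prediction is a classic regression problem with real practical value in the housing market. This experiment aims to predict the sale price of a house (**SalePrice**) to provide a data-driven basis for valuation. It uses the Kaggle house price dataset and a linear regression model. Beyond the model's predictive ability, the experiment is also meant to build an understanding of why data analysis, feature selection, and model evaluation matter.

The goals of the experiment are:

1. Implement a linear regression model that predicts the sale price of a house;
2. Assess the model's predictive ability by computing the mean absolute error (MAE) and the root mean squared error (RMSE);
3. Plot the fit of the predictions to show the model's behavior visually.

## 2. Data Analysis and Preprocessing

### 2.1 Data Source and Field Selection

The data come from the Kaggle house price dataset. The training set has 81 columns. As the assignment requires, `LotArea`, `BsmtUnfSF`, and `GarageArea` are used as the model's input features and `SalePrice` as the target variable. Each feature measures a different area of the property:

- `LotArea`: the lot area;
- `BsmtUnfSF`: the unfinished basement area;
- `GarageArea`: the garage area.

### 2.2 Data Cleaning and Splitting

The dataset may contain missing values, so I dropped every sample with a missing value in any of the selected fields. I then shuffled the data randomly and split it 70% / 30% into a training set and a test set.

The processing steps were:

1. Select the required fields;
2. Drop rows with missing values;
3. Shuffle the data randomly;
4. Split the data into training and test sets by proportion.

The training set is used to compute the model parameters, and the test set is used to evaluate the model.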

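## 3. Model Implementation and Training

### 3.1 The Least Squares Method

The linear regression model computes the target variable as:

$$
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3
$$

To find the parameters $\theta$, the least squares method minimizes the objective function:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}_i - y_i)^2
$$

where $m$ is the number of training samples. Deriving the minimizer gives the closed-form solution for the parameters, with $X$ the feature matrix extended by a leading column of ones for the intercept $\theta_0$:

$$
\theta = (X^T X)^{-1} X^T y
$$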
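
For reference, the derivation can be sketched in matrix form. This is a short outline, assuming $X$ is the $m \times 4$ matrix described above and that $X^T X$ is invertible. Writing the objective with vectors and setting its gradient to zero gives:

$$
J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)
$$

$$
\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0
\;\Longrightarrow\;
X^T X \theta = X^T y
\;\Longrightarrow\;
\theta = (X^T X)^{-1} X^T y
$$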

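### 3.2 Model Implementation

In the experiment I implemented the least squares solution by hand from the closed-form formula instead of calling a ready-made regression routine; NumPy is used only for the matrix operations. Training yields the parameters $\theta$, which are then used to predict on both the training set and the test set.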
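
One way to sanity-check a hand-written least squares solver is to run it on data whose coefficients are known in advance. The sketch below is illustrative only: the synthetic data, the coefficient values, and the variable names are assumptions, not part of the experiment. It also shows that solving the normal equations with `np.linalg.solve` gives the same result without forming the inverse, and uses `np.linalg.lstsq` purely as a cross-check.

```python
import numpy as np

# Synthetic, noise-free data: y = 3 + 2*x1 - 1*x2 + 0.5*x3
# (coefficients chosen for illustration, not taken from the dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(50, 3))
true_theta = np.array([3.0, 2.0, -1.0, 0.5])

# Prepend a column of ones so that theta[0] acts as the intercept
X_b = np.hstack((np.ones((X.shape[0], 1)), X))
y = X_b @ true_theta

# Closed-form solution with an explicit inverse
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Same normal equations solved as a linear system, without the inverse
theta_solve = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Cross-check only: NumPy's built-in least squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(np.allclose(theta_inv, true_theta))
print(np.allclose(theta_solve, theta_lstsq))
```

Solving the linear system is generally preferred over an explicit inverse when the features are strongly correlated, because it is less sensitive to rounding error.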

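## 4. Results and Analysis

### 4.1 Performance Metrics

To quantify the model's predictive ability, I computed the following two metrics:

1. **Mean absolute error (MAE)**:

$$
MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
$$

2. **Root mean squared error (RMSE)**:

$$
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
$$

The results are as follows:

| Dataset      | MAE       | RMSE      |
|--------------|-----------|-----------|
| Training set | 41,338.08 | 62,759.72 |
| Test set     | 39,396.35 | 54,252.67 |

The errors on the training set and the test set are close, which suggests that the model generalizes reasonably well. However, RMSE is roughly 1.4 to 1.5 times MAE on both sets, which indicates that the predictions deviate strongly for some samples, since RMSE penalizes large errors more heavily than MAE does.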
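
The gap between the two metrics can be illustrated with a small worked example. The sketch below uses only the Python standard library; the function names `mae_plain` and `rmse_plain` and all the numbers are made up for illustration and do not come from the dataset. Both prediction sets have the same MAE, but concentrating the error in one sample doubles the RMSE.

```python
import math

def mae_plain(y_true, y_pred):
    # Mean of the absolute errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse_plain(y_true, y_pred):
    # Square root of the mean squared error
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up values for illustration
y_true = [100.0, 200.0, 300.0, 400.0]
y_even = [110.0, 190.0, 310.0, 390.0]  # every prediction is off by 10
y_skew = [100.0, 200.0, 300.0, 440.0]  # a single prediction is off by 40

print(mae_plain(y_true, y_even), rmse_plain(y_true, y_even))  # 10.0 10.0
print(mae_plain(y_true, y_skew), rmse_plain(y_true, y_skew))  # 10.0 20.0
```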

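### 4.2 Fit Plots

To show the model's behavior visually, I plotted the fit on the training set and on the test set. Each plot is a scatter of predicted against actual values, with a dashed reference line where the prediction equals the actual value.

#### 4.2.1 Training Set Fit

The training set plot shows the relationship between predicted and actual values. Most points lie near the ideal line, which indicates that the model fits the training set fairly well.

#### 4.2.2 Test Set Fit

The test set plot looks similar to the training set plot, which indicates that the model predicts reasonably well on the test set. Some points lie away from the ideal line, reflecting the model's error.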

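## 5. Conclusions and Outlook

### 5.1 Summary

This experiment completed the house price prediction task with the least squares method, and both the implementation and the evaluation meet the assignment's requirements. The results show that the model fits the data reasonably well and that its metrics are similar on the training set and the test set, which supports its ability to generalize.

### 5.2 Limitations

1. The features are limited to 3 fields, so other variables with a significant effect on the prediction may have been left out;
2. A linear regression model may not capture complex nonlinear relationships between house prices and the features.

### 5.3 Directions for Improvement

1. Add more features, such as the age of the house and its location, to improve the model's explanatory power;
2. Introduce nonlinear models (such as decision trees or random forests) to further improve prediction accuracy;
3. Use cross-validation to tune the model and avoid the bias introduced by a single data split.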
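
The cross-validation idea in item 3 could be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`fit_least_squares`, `predict_linear`, `k_fold_rmse`), the choice of 5 folds, and the synthetic data are all hypothetical and were not part of the experiment. Because plain least squares has no hyperparameters to tune, the sketch uses the folds only to obtain an error estimate that depends less on one particular split.

```python
import numpy as np

def fit_least_squares(X, y):
    # Normal equations with an intercept column; solve() avoids the explicit inverse
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

def predict_linear(X, theta):
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    return X_b @ theta

def k_fold_rmse(X, y, k=5, seed=42):
    # Shuffle the row indices once, then cut them into k nearly equal folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = fit_least_squares(X[train_idx], y[train_idx])
        residuals = y[val_idx] - predict_linear(X[val_idx], theta)
        scores.append(float(np.sqrt(np.mean(residuals ** 2))))
    return scores

# Synthetic stand-in for three features and a target (not the Kaggle data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 100.0, size=(200, 3))
y_demo = 50.0 + X_demo @ np.array([1.5, -0.5, 2.0]) + rng.normal(0.0, 5.0, size=200)

scores = k_fold_rmse(X_demo, y_demo, k=5)
print(f"RMSE per fold: {[round(s, 2) for s in scores]}")
print(f"Mean RMSE: {np.mean(scores):.2f}")
```

Reporting the mean and the spread of the per-fold RMSE gives a more stable picture than a single 70/30 split.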

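Through this experiment I learned the basic principles and implementation of linear regression, and I gained a deeper understanding of why data analysis and model evaluation matter. This lays a solid foundation for later machine learning practice.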


"""



Contributors

AndersonHJB