Project Documentation
1. Project Documentation
The project is divided into two parts. The basic task is worth 100 points and contributes 10% to the total grade. There is also a bonus section worth 50 points, contributing an additional 5% to the total grade.
This project includes several tasks aimed at cleaning, modifying, and analyzing a news dataset.
Each task has a corresponding subcommand within the script, enabling different functionalities.
Below are details for each task and the corresponding grading criteria.
The final submission deadline for the project is November 24th. Submit a compressed package of your project named like 24012345D.zip; the package should not include the dataset or any CSV files you generated.
2. Tips
Command-Line Script Execution: The project is graded by running the command-line script. To run the script, use the format:
python main.py <command> [options]
Be sure to thoroughly test your script using this format after completing your work to ensure it runs correctly for grading.
Handler Functions: Each required function is provided as a separate file in the handler folder. Do not modify the function names or parameters, but you are free to implement the code as you see fit. You may write code outside the handler functions, but make sure the handler functions themselves work correctly.
File Paths: Use relative paths, and use the os.path module to handle file paths in a platform-adaptive manner.
Third-Party Libraries: You are allowed to use pandas, numpy, sklearn, and Python's standard libraries. If you wish to use any other libraries, please obtain prior approval from the TA.
Example: The example folder contains sample input and output files for each task.
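For instance, os.path.join builds relative paths with the separator appropriate to the current platform, so the same script runs unchanged on Windows, macOS, and Linux:

```python
import os

# Build a relative path without hard-coding '/' or '\\' separators;
# os.path.join inserts the correct one for the current platform.
data_file = os.path.join('data', 'Task1', 'file1.csv')
print(data_file)
```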
3. Task Descriptions and Commands
Task 1: Data Cleaner (25 Points)
Description: Merge all the small files in a specified folder into a single file named CBS_NEWs.csv, removing duplicate entries and entries with missing values.
Command:
python main.py cleaner --input <input_folder> --output output.csv
--input: Path to the folder containing files to be merged.
--output: Path to save the cleaned, merged file.
Example Input/Output for clean_and_merge_files
This example illustrates the input and expected output files for the clean_and_merge_files function.
Assume that some rows are duplicated across files; you need to remove the duplicates in the final merged file.
Rows with any empty cells should also be removed from the final merged file.
Sort the rows of the output file by the date column in ascending order.
Input Files
Suppose there are two CSV files in the input_folder: file1.csv and file2.csv, each containing some overlapping rows.
- file1.csv
date,title,content,publisher
2024-11-01,First Title,This is the first content.,Publisher A
2024-11-02,Second Title,This is the second content.,Publisher B
2024-11-03,Third Title,This is the third content.,Publisher C
2024-11-03,Third Title, ,Publisher C
, , This is a stupid content.,Publisher X
- file2.csv
date,title,content,publisher
2024-11-02, ,This is the second content.,
2024-11-03,Third Title,This is the third content.,Publisher C
2024-11-04,Fourth Title,This is the fourth content.,Publisher D
Expected Output File
After running clean_and_merge_files(<input_folder>, <output_folder>/merged_output.csv) and removing duplicates, the output file merged_output.csv should be as follows:
merged_output.csv
date,title,content,publisher
2024-11-01,First Title,This is the first content.,Publisher A
2024-11-02,Second Title,This is the second content.,Publisher B
2024-11-03,Third Title,This is the third content.,Publisher C
2024-11-04,Fourth Title,This is the fourth content.,Publisher D
Details
import pandas as pd
import numpy as np
import os


def clean_and_merge_files(input_folder, output_file):
    """
    Clean the dataset by removing empty lines and merge multiple files into one.
    :param input_folder: folder containing the files to be merged
    :param output_file: where the merged file is saved
    :return: None
    """
    all_files = [os.path.join(input_folder, f) for f in os.listdir(input_folder) if f.endswith('.csv')]
    # Merge all files
    df_list = [pd.read_csv(file, sep=',') for file in all_files]
    merged_df = pd.concat(df_list, ignore_index=True)
    # Replace empty or whitespace-only cells with NaN
    for col in merged_df.columns:
        merged_df[col] = merged_df[col].map(lambda x: x if pd.notnull(x) and str(x).strip() != '' else np.nan)
    # Drop rows with any missing value
    cleaned_df = merged_df.dropna(how='any').copy()
    # Drop duplicate rows
    cleaned_df = cleaned_df.drop_duplicates()
    # Sort by `date` in ascending order. Parse into a temporary column so the
    # original YYYY-MM-DD strings are written back unchanged.
    if 'date' in cleaned_df.columns:
        cleaned_df['_date'] = pd.to_datetime(cleaned_df['date'], errors='coerce')
        cleaned_df = cleaned_df.dropna(subset=['_date'])  # drop rows with invalid dates
        cleaned_df = cleaned_df.sort_values(by='_date').drop(columns='_date')
    # Save to the output file
    cleaned_df.to_csv(output_file, index=False)
    print(f"File saved successfully to {output_file}")


# Example invocation:
# input_folder = '../data/Task1/'
# output_file = '../data/Task1/out/file.csv'
# clean_and_merge_files(input_folder, output_file)
# python main.py cleaner --input dataset --output output.csv
Task 2: Data Modifier (25 Points)
Description: Retrieve news articles from the merged file for a specific date and publisher,
and prepend (Unverified News)
to their titles.
Command:
python main.py modifier --input CBS_NEWs.csv --date <date> --publisher <publisher> --output <output_file>
--input: Path to the merged news file (CBS_NEWs.csv).
--date: Specific date (format: YYYY-MM-DD) to filter articles.
--publisher: Publisher's name for filtering articles.
--output: Path to save the modified file.
Example Usage of the add_prefix_to_titles Function
The add_prefix_to_titles function adds a prefix to each title in the input CSV file. The prefix in this case is (Unverified News), indicating that the content has not been verified. This example demonstrates the input and expected output files for using this function.
Parameters
- input_file: Path to the CSV file containing titles that need a prefix.
- date: Date (not used in the prefix in this version).
- publisher: Publisher (not used in the prefix in this version).
- output_file: Path where the modified file with prefixed titles will be saved.
Input File Example
Suppose we have the following input file input_file.csv:
input_file.csv
date,title,content,publisher
2024-11-01,First Title,This is the first content.,Publisher A
2024-11-02,Second Title,This is the second content.,Publisher B
2024-11-03,Third Title,This is the third content.,Publisher A
Function Call Example
To add the prefix, we call the function:
add_prefix_to_titles('input_file.csv', '2024-11-01', 'Publisher A', 'output_file.csv')
Expected Output File
After running the function, the output file output_file.csv should look like this:
output_file.csv
date,title,content,publisher
2024-11-01,(Unverified News) First Title,This is the first content.,Publisher A
2024-11-02,Second Title,This is the second content.,Publisher B
2024-11-03,(Unverified News) Third Title,This is the third content.,Publisher A
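A minimal sketch of add_prefix_to_titles consistent with the example above. This is an assumption, not the official solution: it matches rows on publisher only, since the parameter notes mark date as unused in this version.

```python
import pandas as pd


def add_prefix_to_titles(input_file, date, publisher, output_file):
    """Prepend '(Unverified News) ' to titles whose publisher matches.

    Note: `date` is accepted but not used for filtering in this version,
    per the parameter notes above.
    """
    df = pd.read_csv(input_file)
    mask = df['publisher'] == publisher
    df.loc[mask, 'title'] = '(Unverified News) ' + df.loc[mask, 'title']
    df.to_csv(output_file, index=False)
```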
Task 3: Data Splitter (25 Points)
Description: Split the merged file based on a specified attribute (either publisher or year), saving each subset into files named publisher_name.csv or year.csv.
Command:
python main.py splitter --input CBS_NEWs.csv --attributes <attribute> --output <output_folder>
--input: Path to the merged file (CBS_NEWs.csv).
--attributes: Attribute to split by (either publisher or year).
--output: Path to the folder where each subset file will be saved.
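One possible sketch of the splitter handler using pandas groupby. The function name split_by_attribute and the Int64 year handling are assumptions, not part of the official spec:

```python
import os
import pandas as pd


def split_by_attribute(input_file, attribute, output_folder):
    # Group the merged file by publisher or by publication year and write
    # each group to its own CSV (<value>.csv) inside output_folder.
    df = pd.read_csv(input_file)
    os.makedirs(output_folder, exist_ok=True)
    if attribute == 'year':
        # Int64 keeps the year integral even if some dates fail to parse,
        # so filenames come out as '2024.csv' rather than '2024.0.csv'.
        keys = pd.to_datetime(df['date'], errors='coerce').dt.year.astype('Int64')
    else:
        keys = df['publisher']
    for value, group in df.groupby(keys):
        group.to_csv(os.path.join(output_folder, f'{value}.csv'), index=False)
```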
Task 4: Date Counter (25 Points)
Description: Count the number of news articles by date (format: YYYY-MM-DD) and identify the date with the highest number of articles.
Command:
python main.py counter --input <input_file> --output <output_file>
--input: Path to the merged file (e.g., CBS_NEWs.csv).
--output: Path to the file that will contain the result.
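A possible sketch of the counter handler. The function name count_by_date and the exact output layout are assumptions, since the spec only requires that the output file contain the result:

```python
import pandas as pd


def count_by_date(input_file, output_file):
    # Count articles per date (YYYY-MM-DD) and record the busiest date.
    df = pd.read_csv(input_file)
    counts = df['date'].value_counts().sort_index()
    busiest = counts.idxmax()
    with open(output_file, 'w') as f:
        for date, n in counts.items():
            f.write(f'{date},{n}\n')
        f.write(f'Most articles on {busiest} ({counts.max()} articles)\n')
```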
4. Bonus Tasks
Bonus Task 1: TF-IDF Calculator (20 Points)
Description: Perform high-frequency word analysis on news content, returning the top K keywords within a specified date range.
Command:
python main.py tfidf --input CBS_NEWs.csv --start <start_date> --end <end_date> --topk <top_k> --output <output_file>
--input: Path to the merged file.
--start & --end: Date range for analysis (format: YYYY-MM-DD).
--topk: Number of top keywords to retrieve (k ≤ 5).
--output: Path to save the output file.
TF-IDF Explanation:
TF (Term Frequency): Measures how often a term appears in a document.
IDF (Inverse Document Frequency): Measures how rare, and therefore how informative, a term is across the whole collection.
TF-IDF Calculation: TF-IDF(t, d) = TF(t, d) × IDF(t), so a term scores highly when it is frequent within a document but rare across the collection.
Bonus Task 2: Document Retriever (30 Points)
Description: Retrieve the top K most similar news articles to a given query within a specified date range, using TF-IDF for similarity calculation.
Command:
python main.py retriever --input CBS_NEWs.csv --query <query_file> --start <start_date> --end <end_date> --topk <top_k> --output <output_file>
--input: Path to the merged file.
--query: Path to the file containing example queries.
--start & --end: Date range for analysis.
--topk: Number of top similar articles to retrieve (k ≤ 5).
--output: Path to save the retrieved articles.
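A sketch of the retrieval step using sklearn. The function name, the query being passed as a string, and the decision to fit the vectorizer on the filtered articles are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_top_k(input_file, query, start, end, topk):
    # Fit TF-IDF on the articles in the date range, project the query into
    # the same vector space, and rank articles by cosine similarity.
    df = pd.read_csv(input_file)
    dates = pd.to_datetime(df['date'], errors='coerce')
    subset = df[(dates >= start) & (dates <= end)].reset_index(drop=True)
    vectorizer = TfidfVectorizer(stop_words='english')
    doc_matrix = vectorizer.fit_transform(subset['content'])
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, doc_matrix).ravel()
    top_idx = sims.argsort()[::-1][:topk]
    return subset.iloc[top_idx]
```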