
Final Project - IEOR E4501 - Fall 2023


Introduction

Welcome to our final project! The purpose of this project is to prepare you for the kind of research work you might find yourself doing after graduation. Much of analytics work involves handling large amounts of data and trying to make sense of it all.

One overarching goal of this project is to give you something to point to when applying for internships and job opportunities. You should feel welcome to include a link to your project repository in your resumes and CVs.

Heads up!

This project will take about 2 to 3 times longer than the weekly homework assignments. To help compensate, there will be no homework assigned for the last two modules. However, you are advised to start this project early.

There may be small updates, like clarifications, made to this document. Any updates will be highlighted in pale yellow along with the date they were made.

Logistics

Who

Your teams will be made up of 2 people according to this spreadsheet. This will already be set up for you in Courseworks.

What

You will be turning in a URL to the repository where your project code can be found. Provide the HTTPS link, not the clone URL. We suggest using GitHub to host your repository, but you may also use BitBucket or GitLab.

If you make your repository private, please give the accounts “atisaxena”, “ZenanPeng”, and “econchick” access to your repository.

Optional: You may use this template repository (COMING SOON) to help give you an idea of how to start. One person in the group should click on the green button, “Use this template” and select “Create a new repository.”

See Requirements & Constraints for exactly what is needed and expected.

When

The project is due Sunday, December 10th at 11:59pm ET. No late submissions will be accepted.

5% additional extra credit will be awarded if projects are completed and submitted a week earlier, by Sunday, December 3rd at 11:59pm ET. See the extra credit section for more extra credit opportunities.

Problem Overview

Let’s say your apartment lease is ending at the end of the year, and you need to find a new apartment. There are a lot of criteria you can use to help find a neighborhood you’d like to live in. One thing you care a lot about is a quiet neighborhood with a lot of greenery.

Using NYC Open Data datasets and Zillow’s historic monthly rent averages, you will create a single Jupyter notebook to download, clean, and store data, as well as define a set of SQL queries and visualizations to help answer your questions as you search for a great area to live within your budget.

Datasets

You will work with four datasets: NYC zip code boundaries, Zillow’s historic monthly average rents, NYC 311 complaints, and NYC street tree data.

Setup

PostgreSQL & PostGIS

This project includes storing and querying geographic data. To support this, you will use PostgreSQL (“Postgres”) instead of SQLite3. Postgres lets us use a popular extension for geographic data called PostGIS, which supports storing spatial data. Neither Postgres nor PostGIS comes pre-installed on your computer, but both are easy to install.

Install & set up PostgreSQL & PostGIS before starting your project in a notebook. Follow the instructions here.

NYC Open Data Account

Many municipalities publish datasets open to the public, including New York City. “Open data” is content made freely available to use and redistribute, so long as there is attribution to the original source.

While open data is free to download and work with, limitations are put on how much someone can download at one time. Data sources want to provide access to anyone who wants it, so they throttle requests to keep any one person from “hogging” resources. Oftentimes, anonymous requests are restricted more heavily than authenticated ones.

NYC Open Data’s approach to this is to have you create an account and then create what’s called an Application Token. This token is sent along with requests for data to identify who or where the request is coming from.

Note: technically, only one person needs to set up an NYC Open Data account for your group’s project. You may both create accounts, but you only need to use one when executing requests to download data in Part 1.

First, create a free account. Sign up with any email address; it doesn’t have to be your Columbia email.

Once you’ve signed up, log in and go to your developer settings. Click on the button that says “Create New App Token” (not Create new API Key). Fill in the Application Name - it can be anything, like “Final Project Fall 2023”. Then fill in the Description - it can also be anything, like “Project for Fall 2023 for Tools for Analytics”. Do not fill out any other fields. Click Save.

Copy the value of your App Token; it should be roughly a 25-character string of random letters and numbers. Set this string to a variable in your project’s notebook so it’s easy to use when you need it.

When you’re ready to download the required datasets, learn how to use your Application Token here.
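As a quick, hedged sketch of what using the token looks like (the dataset ID here is a placeholder; every NYC Open Data dataset lists its own ID on its API page):

import requests

APP_TOKEN = "YOUR_APP_TOKEN"  # the ~25-character string from your developer settings

# Placeholder dataset endpoint; substitute a real dataset ID
url = "https://data.cityofnewyork.us/resource/DATASET_ID.json"

# Sending the token in the X-App-Token header identifies your requests,
# which relaxes the stricter throttling applied to anonymous traffic
resp = requests.get(url, headers={"X-App-Token": APP_TOKEN}, params={"$limit": 5})
resp.raise_for_status()
print(resp.json())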

Specifications

Read these specifications closely. Failure to follow them will result in points being deducted.

Create a single Jupyter Notebook that is divided into four parts: Data Preprocessing, Storing Data, Understanding Data, and Visualizing Data.

For each part, use Markdown cells to describe what is going on. Construct and write your notebook as if you could hand it to a (not necessarily Python-inclined) friend, and they could understand it without needing to read the actual code.

When implementing your solutions to each part, write separate functions for the specific tasks that help accomplish the goals of each part. This will make it easier to test different parts of the logic independently. Then, in separate cells, execute the functions (so, don’t just define functions; call them as well). I should be able to clear the output, then run all cells successfully to generate everything needed below.

Of course, you’re encouraged to experiment, understand, and play around with the data and your code’s implementation. Be as messy as you need. Create multiple notebooks if you want as you’re exploring. The final result should be clean, readable, and tell a logical story in a single notebook.

Part 1: Data Preprocessing

Overview: For Part 1, you will be downloading Parquet files (both by hand and with Python code), cleaning and filtering for the relevant data, filling in missing data, and generating samples of these datasets.

Downloading: For the zip code data and Zillow data, manually download the data from the links above and save it to a directory called data that lives alongside your notebook. For the 311 and tree data, you will need to programmatically download the datasets in your notebook using the Application Token you set up earlier, as sketched below.
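One possible shape for the programmatic download is to page through the SODA API and save the result as Parquet yourself - a sketch, not the only approach, and the dataset ID, column name, and date filter are assumptions to verify against the real datasets:

import pandas as pd
import requests

APP_TOKEN = "YOUR_APP_TOKEN"

def download_dataset(dataset_id, where=None, page_size=50_000):
    """Page through a NYC Open Data dataset and return it as one DataFrame."""
    url = f"https://data.cityofnewyork.us/resource/{dataset_id}.json"
    frames, offset = [], 0
    while True:
        params = {"$limit": page_size, "$offset": offset}
        if where:
            params["$where"] = where  # a SoQL filter, e.g. a date range
        resp = requests.get(url, headers={"X-App-Token": APP_TOKEN}, params=params)
        resp.raise_for_status()
        chunk = resp.json()
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk))
        offset += page_size
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# e.g. pull only the complaint window Part 3 asks about, then store as Parquet
complaints = download_dataset(
    "erm2-nwe9",  # assumed ID for 311 Service Requests; verify on the dataset page
    where="created_date between '2022-10-01T00:00:00' and '2023-09-30T23:59:59'",
)
complaints.to_parquet("data/311_complaints.parquet")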

Cleaning & filtering: You must use the pandas and geopandas packages to load and clean all the datasets. The process of cleaning & filtering the data should include, at minimum, filtering each dataset down to its relevant rows and columns and filling in missing data.
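A minimal sketch of what that cleaning might look like, assuming hypothetical file and column names (check the real ones in each dataset’s data dictionary):

import geopandas as gpd
import pandas as pd

# Zip code shapefile -> GeoDataFrame, keeping only what later parts need
zipcodes = gpd.read_file("data/nyc_zipcodes.shp")  # assumed filename
zipcodes = zipcodes[["ZIPCODE", "geometry"]].rename(columns={"ZIPCODE": "zipcode"})
zipcodes = zipcodes.to_crs(epsg=4326)  # normalize all spatial data to one CRS

trees = pd.read_parquet("data/trees.parquet")
trees = trees[["tree_id", "spc_common", "health", "status", "latitude", "longitude"]]
trees = trees.dropna(subset=["latitude", "longitude"])  # can't map a tree without coordinates
trees = gpd.GeoDataFrame(
    trees,
    geometry=gpd.points_from_xy(trees.longitude, trees.latitude),
    crs="EPSG:4326",
)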

Tips for Part 1:

Part 2: Storing Data

Overview: For Part 2, you will be taking the datasets downloaded & cleaned from Part 1, and populating a PostgreSQL database with tables generated from the datasets.

First, create a new database in PostgreSQL into which you’ll load your preprocessed datasets, and turn on the PostGIS extension. Do so by running the following commands in a Jupyter Notebook - one command per Notebook cell, filling in YOUR_DATABASE_NAME with the desired name for your database:

!createdb YOUR_DATABASE_NAME
!psql --dbname YOUR_DATABASE_NAME -c 'CREATE EXTENSION postgis;'

Both createdb and psql are command-line tools installed along with PostgreSQL.

Create a schema.sql file that defines each table’s schema. You can use SQLAlchemy within the notebook to help generate this file, use another programmatic approach, or create the schema file by hand.
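If you take the SQLAlchemy route, a sketch like the following can render the DDL to a file. The table and column definitions are illustrative only, and geoalchemy2 is an assumed extra dependency that provides the PostGIS Geometry type:

from sqlalchemy import Column, Integer, MetaData, Table, Text, create_engine
from sqlalchemy.schema import CreateTable
from geoalchemy2 import Geometry  # assumed installed for PostGIS column types

metadata = MetaData()
trees = Table(
    "trees", metadata,
    Column("id", Integer, primary_key=True),
    Column("spc_common", Text),  # species
    Column("health", Text),
    Column("status", Text),
    Column("geometry", Geometry("POINT", srid=4326)),  # PostGIS point column
)

engine = create_engine("postgresql://localhost/YOUR_DATABASE_NAME")
with open("schema.sql", "w") as f:
    # compile() renders a CREATE TABLE statement in the Postgres dialect
    f.write(str(CreateTable(trees).compile(engine)) + ";\n")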

Finally, programmatically in your notebook, create & populate four tables: one for NYC zip codes, one for 311 complaints, one for trees, and one for historical average rents. Use appropriate data types for each column.
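For the populating step, geopandas’ to_postgis and pandas’ to_sql are a natural fit - a sketch assuming the cleaned frames from Part 1 (zipcodes, trees, complaints, rents) are still in memory:

from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/YOUR_DATABASE_NAME")

# Spatial frames go through geopandas so geometry columns land as PostGIS types
zipcodes.to_postgis("zipcodes", engine, if_exists="append", index=False)
trees.to_postgis("trees", engine, if_exists="append", index=False)

# Non-spatial frames can use plain pandas
complaints.to_sql("complaints", engine, if_exists="append", index=False)
rents.to_sql("rents", engine, if_exists="append", index=False)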

Tips for Part 2:

With a database named groupNproject (N being your group number), the setup and schema-loading commands might look like:

!createdb groupNproject
!psql --dbname groupNproject -c 'CREATE EXTENSION postgis;'
!psql --dbname groupNproject -f schema.sql

Part 3: Understanding Data

Overview: In Part 3, you will be crafting a set of SQL queries to develop a better understanding of the datasets we’re working with.

For this part, define a SQL query for each of the following questions - one query per question. Save each query as a .sql file, naming it something illustrative of what the query is for, e.g. top_10_zipcodes_by_population.sql.

For each query, be sure to execute it in the notebook so we can see your answers to the question.

The query results can be either a pandas/geopandas dataframe, or simple Python types (like a list of tuples).

Note: we will not be checking for exactness with the results. We’re only concerned with the construction of your queries.
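One simple way to wire the saved .sql files back into the notebook (a sketch; the helper name is just an example):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/YOUR_DATABASE_NAME")

def run_query(path):
    """Read a saved .sql file and return its result as a DataFrame."""
    with open(path) as f:
        return pd.read_sql(f.read(), engine)

run_query("top_10_zipcodes_by_population.sql")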

Query 1: Which area might be more calm to live in?

Between October 1st, 2022 and September 30th, 2023 (inclusive), find the number of 311 complaints per zip code.

The query result should have two columns, one row per zip code, with the number of complaints in descending order.

Query 2: Where has the most greenery?

Using just the trees table, which 10 zip codes have the most trees?

The query result should have two columns, 10 rows. The rows should be sorted by the total number of trees, descending.

Query 3: Can I afford a place in the areas with the most trees?

Of the 10 zip codes with the most trees, for the month of August 2023, what is the average rent by zip code?

The query should have a JOIN statement. The query result should have two columns (not three) and 10 rows. The rows should be sorted by the total number of trees, descending. “Humanize” the rent numbers, meaning format the results as 2,879.58 instead of 2879.575128.
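If you do the humanizing in SQL, Postgres’ to_char is one option; a tiny sketch (the format pattern is an assumption worth testing):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/YOUR_DATABASE_NAME")

# 'FM9,999,990.00' adds a thousands separator and rounds to two decimals,
# e.g. 2879.575128 -> '2,879.58'
pd.read_sql("SELECT to_char(2879.575128, 'FM9,999,990.00') AS humanized;", engine)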

Query 4: Could there be a correlation between an area’s rent, the number of its trees, and the number of 311 complaints?

For the month of January 2023, return the 5 zip codes with the lowest average rent and the 5 zip codes with the highest average rent, and include the tree count and complaint count for each zip code by using JOIN statements.

The query result should have 4 columns (zip code, average rent, tree count, and complaint count) and 10 rows: five with the highest average rent, and five with the lowest average rent. “Humanize” the rent numbers, meaning format the results as 2,879.58 instead of 2879.575128.

Query 5: Where has the most greenery (take 2)?

Rewrite Query 2 to use both the trees table and the zipcodes table. Join the tables where the coordinate point of the tree falls inside the polygon boundary of the zip code as defined in the zipcodes table.

The query should have a JOIN statement. The query results should match exactly the results of Query 2.
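The join condition here is a PostGIS containment test. A hedged sketch of the shape such a query might take (table and column names are assumptions carried over from the earlier sketches):

point_in_polygon_join = """
SELECT z.zipcode, COUNT(*) AS tree_count
FROM trees AS t
JOIN zipcodes AS z
  ON ST_Contains(z.geometry, t.geometry)  -- tree point inside zip polygon
GROUP BY z.zipcode
ORDER BY tree_count DESC
LIMIT 10;
"""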

Query 6: What is the immediate area like?

Using the following coordinate pair on campus, which trees are within a 1/2-mile radius of this point?

Latitude: 40.80737875669467, Longitude: -73.96253174434912

The result should have 5 columns (ID, species, health, status, and coordinate location of each tree).

You may use regular Python to help construct the geometric object needed for setting the radius in the query.
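For example, shapely can build the point, and PostGIS’ ST_DWithin (with geography casts, which measure distance in meters) can test the radius - a sketch with assumed table and column names:

from shapely.geometry import Point

# The campus point from the prompt; shapely takes (x, y) = (longitude, latitude)
center = Point(-73.96253174434912, 40.80737875669467)

half_mile_in_meters = 0.5 * 1609.344

nearby_trees = f"""
SELECT id, spc_common, health, status, geometry
FROM trees
WHERE ST_DWithin(
    geometry::geography,
    ST_GeomFromText('{center.wkt}', 4326)::geography,
    {half_mile_in_meters}
);
"""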

Tips for Part 3:

These SoQL query parameters (from the Socrata API documentation) can be used when downloading data:

| Parameter | Description | Default | In $query |
| --- | --- | --- | --- |
| $select | The set of columns to be returned, similar to a SELECT in SQL | All columns, equivalent to $select=* | SELECT |
| $where | Filters the rows to be returned, similar to WHERE | No filter | WHERE |
| $order | Column to order results on, similar to ORDER BY in SQL | Unspecified order | ORDER BY |
| $group | Column to group results on, similar to GROUP BY in SQL | No grouping | GROUP BY |
| $having | Filters the rows that result from an aggregation, similar to HAVING | No filter | HAVING |
| $limit | Maximum number of results to return | 1000 (2.0 endpoints: maximum of 50,000; 2.1: unlimited) | LIMIT |
| $offset | Offset count into the results to start at, used for paging | 0 | OFFSET |
| $q | Performs a full text search for a value | No search | N/A |
| $query | A full SoQL query string, all as one parameter | N/A | N/A |
| $$bom | Prepends a UTF-8 Byte Order Mark to the beginning of CSV output | false | N/A |