AirBnB MongoDB Analysis
Instructions
A little assignment to practice importing and analyzing data within a MongoDB database.
The goals of this assignment are to:
- gain experience working with real world data sets
- import the text data file into MongoDB document-oriented database collection
- perform queries in MongoDB to gain insights present in the data
Part 1: Data selection and import
The data source
In this assignment, you will work with AirBnB listings data from a city/region of your choosing.
Save the data
Save the raw data file of your choice into the data
directory.
- Select a city or region of interest to you.
- Download the appropriate data file - it will be named
listings.csv.gz
- this is acsv
file compressed withgzip
compressino.
Upload the data file to i6
In order to import this data into a MongoDB database, we must first upload it to our web server from which we will connect to the MongoDB sever.
- Use a file transfer program, such as Cyberduck, to transfer this compressed file to your account on the i6.cims.nyu.edu web server.
- Log into the i6 server using
ssh
, and unzip this file with the command,gunzip listings.csv.gz
- this will result in a file namedlsitings.csv
in the same directory.
Data scrubbing
Take a look at the data in the data file. If you decide it needs any scrubbing, perform that scrubbing using any tool you wish. Save the scrubbed version in a file named listings_clean.csv
.
If you use any Python or other programs to help scrub the data, include those program files in the main directory of the project.
Keep track of any changes you make to the data while scrubbing - you will include these details in a report.
Import the data into a MongoDB collection
Use the following command to import the data into a MongoDB collection on the database server. This import may take a few minutes, if the file is very large.
mongoimport --headerline --type=csv --db=your_db_name --collection=listings --host=your_db_host --file=listings.csv --username=your_db_username --password=your_db_password
mongoimport --headerline --type=csv --db=zq2076 --collection=listings --host=class-mongodb.cims.nyu.edu --file=listings.csv --username=zq2076 --password=694996797Qzy2140
mongoimport --headerline --type=csv --db=zq2076 --collection=listings --host=127.0.0.1 --file=listings.csv
- replace
your_db_host
with the host name of your database,class-mongodb.cims.nyu.edu
. - replace
your_db_name
with your database name, your NYU Net ID. - replace
your_db_username
with your database username, your NYU Net ID. - replace
your_db_password
with your database password. - if you have scrubbed the data, replace the
listings.csv
file name with your scrubbed version,listings_clean.csv
.
Part 2: Data analysis in MongoDB
Use the mongo
command line client to run the following queries - save both the query and up to the first 3 result documents for each query to include in your report later.
- show exactly two documents from the
listings
collection in any order - show exactly 10 documents in any order, but "prettyprint" in easier to read format, using the
pretty()
function. - choose two hosts (by reffering to their
host_id
values) who are superhosts (available in thehost_is_superhost
field), and show all of the listings offered by both of the two hosts- only show the
name
,price
,neighbourhood
,host_name
, andhost_is_superhost
for each result
- only show the
- find all the unique
host_name
values (see the docs) - find all of the places that have more than 2
beds
in a neighborhood of your choice (referred to as either theneighborhood
orneighbourhood_group_cleansed
fields in the data file), ordered byreview_scores_rating
descending- only show the
name
,beds
,review_scores_rating
, andprice
- if your data set only has blanks for all the neighborhood-related fields, or only one neighborhood value in all documents, you may pick another field to filter by - include an explanation and justification for this in your report.
- if you run out of memory for this query, try filtering
review_scores_rating
that aren't empty ($ne
); and lastly, if there's still an issue, you can set thebeds
to match exactly 2.
- only show the
- show the number of listings per host
- find the average
review_scores_rating
per neighborhood, and only show the ones above a95
, sorted in descending order of rating (see the docs)- if your data set only has blanks in the neighborhood-related fields, or only one neighborhood value in all documents, you may pick another field to break down the listings by - include an explanation and justification for this in your report.
Write a report
Write a report which displays the data and the results in the file named README.md.
This report should be well-written and well-formatted using Markdown code - refer to this guide to using Markdown.
The report must include:
Data set details:
- The origin of your data set - what is it and where does it come from. Include a link to the URL of the source.
- What format the original data file was in (CSV, JSON, or other).
- Display some of the raw data from the original data file (the first 20 rows is enough - feel free to clip the text in fields to prevent line-wrapping). Use Markdown's ability to display tables - see the examples in the Markdown guide linked above.
- Describe any problems that were present in the data and the scrubbing tasks that were necessary to prepare your data set for import - include any scrubbing done in Python, a text editor, or any other tool. Be specific with examples of the problems in the original data and the way in which those were solved. Feel free to show small snippets of relevant code - see the examples of code "syntax highlighting" in the Markdown guide linked above.
Analysis:
- Describe each of the analyses you have performed. For each query, include:
- a description of the query
- the code used to perform it
- up to the first three results in a preformatted text block (feel free to clip the text in fields to prevent line-wrapping)
- describe any insights the analysis shows that may not be obvious to someone just viewing the raw data.
Part 3: Extra credit
For extra credit, you can optionally use Python to connect to the MongoDB database and perform some of the queries in your analysis.
Set up a Python virtual environment
A virtual environment is a sort of clean slate Python programming environment, within which you can install Python modules for your project in a way that does not conflict with other modules you or others have installed for other purposes.
On i6, create and activate a new virtual environment with the name .venv
:
python3 -m venv .venv
source .venv/bin/activate
Install the pymongo
module to allow Python to connect to MongoDB.
pip3 install pymongo
Set up your Python program
The code below shows how to connect from Python to a MongoDB database using pymongo
module. Replace the database credentials below with the correct values for your account:
import pymongo
connection = pymongo.MongoClient("your_db_host", 27017,
username="your_db_username",
password="your_db_password",
authSource="your_db_name")
collection = connection["your_db_name"]]["your_db_collection_name"]
# the collection variable will be a reference to your collection
docs = collection.find({}).limit(10) # get the first 10 documents
print(docs)
Read the pymongo docs to learn more.
Recreate analyses
Reproduce one of your earlier queries:
- find all of the places that have more than 2
beds
in a neighborhood of your choosing, ordered byreview_scores_rating
descending - only show the
name
,beds
,review_scores_rating
, andprice
- note that in
pymongo
, you'll have to quote all of your keys.
Requesting extra credit
If you believe you deserve extra credit, include a sub-heading at the bottom of your README.md
document explaining why you believe you deserve it.
...
### Extra-credit
This assignment deserves extra credit because iste numquam eos et repudiandae sint enim. Rerum enim voluptas voluptatem consequuntur. Sed atque deserunt nihil eius neque et provident aspernatur. Incidunt iusto beatae illo minus vel. Quis sint sunt et facilis doloribus eligendi error est. Ipsum similique.
...
Submit your work
Each student must submit this assignment individually. Use Visual Studio Code to perform git stage
, commit
and push
actions to submit. These actions are all available as menu items in Visual Studio Code's Source Control panel.
- Type a short note about what you have done to the files in the
Message
area, and then typeCommand-Enter
(Mac) orControl-Enter
(Windows) to perform gitstage
andcommit
actions. - Click the
...
icon next to the words, "Source Control" and select "Push" to perform the gitpush
action. This will upload your work to your repository on GitHub.com.
Be sure to include the following:
- The original plain text data file which you downloaded from the Internet. Place this within the
data
directory. - If you performed any scrubbing of the data, include the any Python programs you used for scrubbing the data file placed within the main project directory, as well as the scrubbed data file itself, named
listings_clean.csv
, also within thedata
directory. - your report in the file named README.md.
欢迎关注我公众号:AI悦创,有更多更好玩的等你发现!
公众号:AI悦创【二维码】
AI悦创·编程一对一
AI悦创·推出辅导班啦,包括「Python 语言辅导班、C++ 辅导班、java 辅导班、算法/数据结构辅导班、少儿编程、pygame 游戏开发」,全部都是一对一教学:一对一辅导 + 一对一答疑 + 布置作业 + 项目实践等。当然,还有线下线上摄影课程、Photoshop、Premiere 一对一教学、QQ、微信在线,随时响应!微信:Jiabcdefh
C++ 信息奥赛题解,长期更新!长期招收一对一中小学信息奥赛集训,莆田、厦门地区有机会线下上门,其他地区线上。微信:Jiabcdefh
方法一:QQ
方法二:微信:Jiabcdefh
- 0
- 0
- 0
- 0
- 0
- 0