By Josh Lavitz and Max Nguyen
Note that we hyperlink to additional resources throughout this tutorial; they may be useful for explaining terms and concepts more thoroughly!
Airbnb is a company that provides short-term rentals for people to stay in. Airbnb has helped host over 800 million people, with 5.6 million currently active listings across 220 regions around the world. Despite the challenges to travel and tourism posed by the COVID-19 pandemic, the company has managed to adapt, continuing to make a profit of $219 million. As work transitions to being remote, people are using Airbnbs to get away from home and do their work from anywhere. Some people are also using Airbnbs as temporary places to socially distance or quarantine.
In this tutorial, we will explore what may affect the prices of Airbnb rentals and how Airbnb prices have changed since the pandemic, if at all. The cost of rentals is a major consideration for people, and economic considerations are perhaps even more important during these times. Specifically, we will be looking at data for rentals in Boston, a major city that was recently ranked as one of the three best large cities in America. We will be comparing the listings from October 2020 to the listings from October 2019, as October 2019 was before the pandemic struck the U.S. This data exploration will hopefully be of interest to anyone looking to stay in an Airbnb during this time or in the future, once circumstances are better.
In this tutorial, we use Python 3 and a few helpful Python libraries. We list and briefly describe the primary ones below:
Our data comes from Inside Airbnb, which scrapes data from the public listings on the Airbnb website. Since Inside Airbnb has already done the web scraping for us to produce a dataset of listings, we do not have to do much complicated work. As previously mentioned, we are looking at the rentals for Boston, the capital city of Massachusetts. To compare how prices have changed since the pandemic started, we will compare the dataset that was scraped from Airbnb on October 24, 2020 to the dataset scraped on October 19, 2019, so the two are almost exactly a year apart.
From the CSV files for October 2020 and October 2019 provided by Inside Airbnb, we read the data into two separate Pandas dataframes (which are like tables) so that the data can be easily accessed and manipulated.
import pandas
# Uses the built-in function to read in the 2019 data from the CSV file
df_2019 = pandas.read_csv("https://raw.githubusercontent.com/joshlavitz/joshlavitz.github.io/main/listings2019.csv")
df_2019 # Displays the dataframe
| | id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | instant_bookable | is_business_travel_ready | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3781 | https://www.airbnb.com/rooms/3781 | 20191018230017 | 2019-10-19 | HARBORSIDE-Walk to subway | Fully separate apartment in a two apartment bu... | This is a totally separate apartment located o... | Fully separate apartment in a two apartment bu... | none | Mostly quiet ( no loud music, no crowed sidewa... | ... | f | f | super_strict_30 | f | f | 2 | 2 | 0 | 0 | 0.29 |
1 | 5506 | https://www.airbnb.com/rooms/5506 | 20191018230017 | 2019-10-19 | **$99 Special ** Private! Minutes to center! | Private guest room with private bath, You do n... | **THE BEST Value in BOSTON!!*** PRIVATE GUEST ... | Private guest room with private bath, You do n... | none | Peacful, Architecturally interesting, historic... | ... | t | f | strict_14_with_grace_period | f | f | 6 | 6 | 0 | 0 | 0.80 |
2 | 6695 | https://www.airbnb.com/rooms/6695 | 20191018230017 | 2019-10-19 | $99 Special!! Home Away! Condo | NaN | ** WELCOME *** FULL PRIVATE APARTMENT In a His... | ** WELCOME *** FULL PRIVATE APARTMENT In a His... | none | Peaceful, Architecturally interesting, histori... | ... | t | f | strict_14_with_grace_period | f | f | 6 | 6 | 0 | 0 | 0.89 |
3 | 6976 | https://www.airbnb.com/rooms/6976 | 20191018230017 | 2019-10-19 | Mexican Folk Art Showcase in Boston Neighborhood | Come stay with me in Boston's Roslindale neigh... | This is a well-maintained, two-family house bu... | Come stay with me in Boston's Roslindale neigh... | none | The LOCATION: Roslindale is a safe and diverse... | ... | f | f | moderate | t | f | 1 | 0 | 1 | 0 | 0.66 |
4 | 8789 | https://www.airbnb.com/rooms/8789 | 20191018230017 | 2019-10-18 | Curved Glass Studio/1bd facing Park | Bright, 1 bed with curved glass windows facing... | Fully Furnished studio with enclosed bedroom. ... | Bright, 1 bed with curved glass windows facing... | none | Beacon Hill is a historic neighborhood filled ... | ... | f | f | strict_14_with_grace_period | f | f | 10 | 10 | 0 | 0 | 0.38 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5642 | 39461104 | https://www.airbnb.com/rooms/39461104 | 20191018230017 | 2019-10-19 | Convenient North End Studio w/ W/D + Gym near ... | Show up and start living from day one in Bosto... | Gorgeous furniture, fully-equipped kitchen, sm... | Show up and start living from day one in Bosto... | none | This furnished apartment is located in the Nor... | ... | t | f | flexible | f | f | 92 | 92 | 0 | 0 | NaN |
5643 | 39461138 | https://www.airbnb.com/rooms/39461138 | 20191018230017 | 2019-10-19 | Equipped North End Studio w/ W/D (BOS128) | Show up and start living from day one in Bosto... | Gorgeous furniture, fully-equipped kitchen, sm... | Show up and start living from day one in Bosto... | none | This furnished apartment is located in the Nor... | ... | t | f | flexible | f | f | 92 | 92 | 0 | 0 | NaN |
5644 | 39461190 | https://www.airbnb.com/rooms/39461190 | 20191018230017 | 2019-10-19 | Comfy North End Studio w/ Doorman + W/D near T... | Show up and start living from day one in Bosto... | Thoughtfully designed with bespoke finishes, m... | Show up and start living from day one in Bosto... | none | This furnished apartment is located in the Nor... | ... | t | f | flexible | f | f | 92 | 92 | 0 | 0 | NaN |
5645 | 39461223 | https://www.airbnb.com/rooms/39461223 | 20191018230017 | 2019-10-19 | Bespoke North End Studio w/ Gym + W/D near Nor... | Discover the best of Boston, with this studio ... | Thoughtfully designed with bespoke finishes, m... | Discover the best of Boston, with this studio ... | none | This furnished apartment is located in the Nor... | ... | t | f | flexible | f | f | 92 | 92 | 0 | 0 | NaN |
5646 | 39462969 | https://www.airbnb.com/rooms/39462969 | 20191018230017 | 2019-10-19 | Your Home in Back Bay! | Located on the corner of Gloucester & Newbury ... | The apartment is on the third floor - and ther... | Located on the corner of Gloucester & Newbury ... | none | The neighborhood is just fantastic! Five minut... | ... | f | f | flexible | f | f | 1 | 1 | 0 | 0 | NaN |
5647 rows × 106 columns
# Uses the built-in function to read in the 2020 data from the CSV file
df_2020 = pandas.read_csv("https://raw.githubusercontent.com/joshlavitz/joshlavitz.github.io/main/listings.csv")
df_2020 # Displays the dataframe
| | id | listing_url | scrape_id | last_scraped | name | description | neighborhood_overview | picture_url | host_id | host_url | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3781 | https://www.airbnb.com/rooms/3781 | 20201024170420 | 2020-10-24 | HARBORSIDE-Walk to subway | Fully separate apartment in a two apartment bu... | Mostly quiet ( no loud music, no crowed sidewa... | https://a0.muscache.com/pictures/24670/b2de044... | 4804 | https://www.airbnb.com/users/show/4804 | ... | 10.0 | 10.0 | 10.0 | NaN | f | 1 | 1 | 0 | 0 | 0.26 |
1 | 5506 | https://www.airbnb.com/rooms/5506 | 20201024170420 | 2020-10-24 | **$49 Special ** Private! Minutes to center! | Private guest room with private bath, You do n... | Peacful, Architecturally interesting, historic... | https://a0.muscache.com/pictures/1598e8b6-5a55... | 8229 | https://www.airbnb.com/users/show/8229 | ... | 10.0 | 9.0 | 10.0 | Exempt: This listing is a unit that has contra... | f | 6 | 6 | 0 | 0 | 0.76 |
2 | 6695 | https://www.airbnb.com/rooms/6695 | 20201024170420 | 2020-10-24 | $99 Special!! Home Away! Condo | Comfortable, Fully Equipped private apartment... | Peaceful, Architecturally interesting, histori... | https://a0.muscache.com/pictures/38ac4797-e7a4... | 8229 | https://www.airbnb.com/users/show/8229 | ... | 10.0 | 9.0 | 10.0 | STR-404620 | f | 6 | 6 | 0 | 0 | 0.84 |
3 | 10730 | https://www.airbnb.com/rooms/10730 | 20201024170420 | 2020-10-24 | Bright 1bed facing Golden Dome | Bright, spacious unit, new galley kitchen, new... | Beacon Hill is located downtown and is conveni... | https://a0.muscache.com/pictures/miso/Hosting-... | 26988 | https://www.airbnb.com/users/show/26988 | ... | 10.0 | 10.0 | 9.0 | NaN | f | 7 | 7 | 0 | 0 | 0.24 |
4 | 10813 | https://www.airbnb.com/rooms/10813 | 20201024170420 | 2020-10-24 | Back Bay Apt-blocks to subway, Newbury St, The... | Stunning Back Bay furnished studio apartment. ... | Wander around this quintessential neighborhood... | https://a0.muscache.com/pictures/20b5b9c9-e1f4... | 38997 | https://www.airbnb.com/users/show/38997 | ... | 10.0 | 10.0 | 10.0 | NaN | f | 11 | 11 | 0 | 0 | 0.94 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3249 | 46021420 | https://www.airbnb.com/rooms/46021420 | 20201024170420 | 2020-10-24 | Stunning 1BR in Downtown + 100 WalkScore \| Evo... | Whether you are just getting away for the week... | Downtown’s Theater District bustles with energ... | https://a0.muscache.com/pictures/956c6254-61ea... | 212359760 | https://www.airbnb.com/users/show/212359760 | ... | NaN | NaN | NaN | Exempt: This listing is a unit used for furnis... | t | 43 | 43 | 0 | 0 | NaN |
3250 | 46021809 | https://www.airbnb.com/rooms/46021809 | 20201024170420 | 2020-10-24 | Spacious and Modern 2BD in the Heart of Boston | This is a modern 2 bed in The Heart of Boston<... | NaN | https://a0.muscache.com/pictures/70f28a8d-36e0... | 2356643 | https://www.airbnb.com/users/show/2356643 | ... | NaN | NaN | NaN | NaN | t | 11 | 11 | 0 | 0 | NaN |
3251 | 46022872 | https://www.airbnb.com/rooms/46022872 | 20201024170420 | 2020-10-24 | Room in Large Brookline House, Phenomenal Loca... | Room A in 7 Bed, 3 Bath<br />Extremely spaciou... | Just off Harvard Ave, connecting Packards Corn... | https://a0.muscache.com/pictures/2114bef5-443a... | 373050156 | https://www.airbnb.com/users/show/373050156 | ... | NaN | NaN | NaN | NaN | t | 2 | 0 | 2 | 0 | NaN |
3252 | 46024344 | https://www.airbnb.com/rooms/46024344 | 20201024170420 | 2020-10-24 | Furnished Room, Big Brookline House, Top Location | Room C in 7 Bed, 3 Bath apartment<br />Extreme... | Just off Harvard Ave, connecting Packards Corn... | https://a0.muscache.com/pictures/2114bef5-443a... | 373050156 | https://www.airbnb.com/users/show/373050156 | ... | NaN | NaN | NaN | NaN | t | 2 | 0 | 2 | 0 | NaN |
3253 | 46025053 | https://www.airbnb.com/rooms/46025053 | 20201024170420 | 2020-10-24 | A place of your own \| Studio in Boston | Stay for 30+ nights (minimum nights and rates ... | NaN | https://a0.muscache.com/pictures/8860911a-df51... | 359229620 | https://www.airbnb.com/users/show/359229620 | ... | NaN | NaN | NaN | NaN | t | 177 | 177 | 0 | 0 | NaN |
3254 rows × 74 columns
From the displayed dataframes, we can see that there were 3,254 active listings in Boston on October 24, 2020 and 5,647 active listings on October 19, 2019. This means that there were about 2,400 fewer listings in October 2020, which makes sense considering the challenges of hosting guests in a rental during a pandemic.
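To put that drop in perspective, the two row counts can be turned into an absolute and a relative change. The sketch below hard-codes the counts from the dataframe summaries above (in the notebook, `len(df_2019)` and `len(df_2020)` should give the same numbers); the variable names are our own.

```python
# Row counts taken from the dataframe summaries shown above
listings_2019 = 5647
listings_2020 = 3254

# Absolute and relative change in the number of active listings
difference = listings_2019 - listings_2020
percent_drop = 100 * difference / listings_2019

print(f"{difference} fewer listings ({percent_drop:.1f}% drop)")
```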
Each row corresponds to the information for a specific listing. There are a lot of different columns, some with very long names, and we will not be using all of them for our analysis. Some of the notable columns include:
import json
import requests
# Using the requests library, we can get the data and then parse it as a json file with the json library
url = 'https://raw.githubusercontent.com/joshlavitz/joshlavitz.github.io/main/neighbourhoods.geojson'
county_geo = requests.get(url).json()
Now that we have read our data from the CSV files into two Pandas dataframes, we have to do some cleaning and curation, which will make our later analysis much easier.
First, we drop the unnecessary columns in the 2020 dataframe that contain information we are not using in our analysis. The columns we keep are the same ones as listed above: neighbourhood_cleansed, latitude, longitude, room_type, number_of_reviews, review_scores_rating, amenities, and price, since we are most interested in how those variables may affect rental price. We also keep the id column so that it is easier to differentiate rentals, even though it does not really have a relation to price.
A lot of the other columns concern things like host information or the date of the first review on the listing, which are not as relevant to our analysis. For example, there is likely not much of a relation between the price and the name of the host or the URL of the rental listing, so it does not make sense to keep these columns! We also shorten the 'neighbourhood_cleansed' column name to simply 'neighborhood'.
# Defines the dataframe to only be the columns we want to keep
# (.copy() makes it an independent dataframe, so the in-place rename below does not raise a SettingWithCopyWarning)
df_2020 = df_2020[['id', 'neighbourhood_cleansed',
                   'latitude', 'longitude',
                   'room_type', 'number_of_reviews',
                   'review_scores_rating',
                   'amenities', 'price']].copy()
# Renames the columns to have a shorter name
df_2020.rename(columns={'neighbourhood_cleansed':'neighborhood',
                        'number_of_reviews':'num_reviews',
                        'review_scores_rating':'rating'}, inplace=True)
df_2020
| | id | neighborhood | latitude | longitude | room_type | num_reviews | rating | amenities | price |
---|---|---|---|---|---|---|---|---|---|
0 | 3781 | East Boston | 42.364130 | -71.029910 | Entire home/apt | 17 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $150.00 |
1 | 5506 | Roxbury | 42.329810 | -71.095590 | Entire home/apt | 107 | 95.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $145.00 |
2 | 6695 | Roxbury | 42.329940 | -71.093510 | Entire home/apt | 115 | 96.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $169.00 |
3 | 10730 | Downtown | 42.358400 | -71.061850 | Entire home/apt | 32 | 96.0 | ["Cable TV", "Smoke alarm", "TV", "Bed linens"... | $81.00 |
4 | 10813 | Back Bay | 42.350610 | -71.087870 | Entire home/apt | 10 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $87.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3249 | 46021420 | Beacon Hill | 42.353290 | -71.065380 | Entire home/apt | 0 | NaN | ["Shower gel", "Shampoo", "Smoke alarm", "TV",... | $239.00 |
3250 | 46021809 | Roxbury | 42.330500 | -71.071270 | Entire home/apt | 0 | NaN | ["Air conditioning", "Heating", "Laptop-friend... | $47.00 |
3251 | 46022872 | Allston | 42.347372 | -71.130569 | Private room | 0 | NaN | ["Hangers", "Heating", "Laptop-friendly worksp... | $44.00 |
3252 | 46024344 | Allston | 42.348080 | -71.129930 | Private room | 0 | NaN | ["Hangers", "Heating", "Laptop-friendly worksp... | $44.00 |
3253 | 46025053 | East Boston | 42.371010 | -71.043770 | Entire home/apt | 0 | NaN | ["BBQ grill", "Shampoo", "Smoke alarm", "TV", ... | $147.00 |
3254 rows × 9 columns
Our dataframe is much more readable and manageable now that we have removed the less relevant columns! Now we can focus on the variables that we will actually be analyzing.
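As a small, self-contained illustration of this select-and-rename step, the sketch below uses a made-up three-row table (the values and the `host_name` column are hypothetical, not taken from the Inside Airbnb data). Taking a `.copy()` of the selection is a precaution we assume here so that later changes act on an independent dataframe.

```python
import pandas

# Hypothetical miniature version of the raw listings table
raw = pandas.DataFrame({
    'id': [1, 2, 3],
    'host_name': ['Ann', 'Bo', 'Cy'],  # a column we do not need
    'neighbourhood_cleansed': ['Roxbury', 'Downtown', 'Back Bay'],
    'price': ['$145.00', '$81.00', '$87.00'],
})

# Keep only the relevant columns, as an independent copy
small = raw[['id', 'neighbourhood_cleansed', 'price']].copy()

# rename() returns a new dataframe when inplace is not used
small = small.rename(columns={'neighbourhood_cleansed': 'neighborhood'})

print(list(small.columns))
```

The original `raw` table is left untouched, which makes it easy to go back and keep a different set of columns later.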
Now, we want to merge the two separate dataframes from 2020 and 2019 into one dataframe so that we can have columns showing the rental's price in 2019 and in 2020 in a single dataframe.
We will be doing an inner join, so we will only be keeping the rentals with the same unique id that were active in both 2019 and 2020 and excluding rentals that were only active in 2019 or only active in 2020. That way, we can easily compare how a rental changed its price from 2019 to 2020.
# Drops unnecessary columns from the 2019 dataframe since we are mainly concerned about adding the 2019 prices as a column
df_2019 = df_2019[['id', 'price']]
# Merges based on rentals with the same id
df = df_2020.merge(df_2019, how="inner", on='id')
# Renames the columns appropriately
df.rename(columns={'price_x':'price_2020', 'price_y':'price_2019'}, inplace=True)
df
| | id | neighborhood | latitude | longitude | room_type | num_reviews | rating | amenities | price_2020 | price_2019 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3781 | East Boston | 42.36413 | -71.02991 | Entire home/apt | 17 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $150.00 | $125.00 |
1 | 5506 | Roxbury | 42.32981 | -71.09559 | Entire home/apt | 107 | 95.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $145.00 | $145.00 |
2 | 6695 | Roxbury | 42.32994 | -71.09351 | Entire home/apt | 115 | 96.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $169.00 | $169.00 |
3 | 10730 | Downtown | 42.35840 | -71.06185 | Entire home/apt | 32 | 96.0 | ["Cable TV", "Smoke alarm", "TV", "Bed linens"... | $81.00 | $150.00 |
4 | 10813 | Back Bay | 42.35061 | -71.08787 | Entire home/apt | 10 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | $87.00 | $179.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2053 | 39445807 | Back Bay | 42.34645 | -71.07803 | Entire home/apt | 2 | 100.0 | ["Shampoo", "Smoke alarm", "TV", "Bed linens",... | $125.00 | $200.00 |
2054 | 39446774 | Back Bay | 42.34663 | -71.07915 | Entire home/apt | 1 | 100.0 | ["Shower gel", "Cable TV", "Shampoo", "Smoke a... | $148.00 | $245.00 |
2055 | 39447297 | Back Bay | 42.34635 | -71.07792 | Entire home/apt | 0 | NaN | ["Garden or backyard", "Shampoo", "Smoke alarm... | $148.00 | $245.00 |
2056 | 39447462 | Back Bay | 42.34603 | -71.07920 | Entire home/apt | 0 | NaN | ["Shampoo", "Smoke alarm", "TV", "Private entr... | $148.00 | $245.00 |
2057 | 39447565 | Back Bay | 42.34834 | -71.08152 | Entire home/apt | 1 | 100.0 | ["Shampoo", "Smoke alarm", "TV", "Baking sheet... | $148.00 | $245.00 |
2058 rows × 10 columns
We now have a combined dataframe showing the price for each rental in 2019 and in 2020. We can see that 2,058 rentals were active in both 2019 and 2020.
We convert the price columns into type int so that they can be treated as numbers. Currently, the prices are stored as strings like '$1,000.00', so we remove the dollar sign and comma before converting, which turns that example into 1000.
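To see how many listings an inner join leaves out, one option is an outer join with pandas' `indicator=True` flag, which adds a `_merge` column recording where each row came from. The sketch below uses two tiny made-up tables (the ids and prices are hypothetical) and the `suffixes` parameter to name the price columns directly.

```python
import pandas

# Hypothetical miniature tables: ids 1 and 2 appear in both years
toy_2020 = pandas.DataFrame({'id': [1, 2, 4], 'price': [150, 81, 44]})
toy_2019 = pandas.DataFrame({'id': [1, 2, 3], 'price': [125, 150, 200]})

# Inner join: only listings active in both years survive
both = toy_2020.merge(toy_2019, how='inner', on='id',
                      suffixes=('_2020', '_2019'))

# Outer join with an indicator column to count what the inner join dropped
audit = toy_2020.merge(toy_2019, how='outer', on='id',
                       suffixes=('_2020', '_2019'), indicator=True)
counts = audit['_merge'].value_counts()

print(both)
print(counts)
```

Here `left_only` rows are listings that exist only in the 2020 table and `right_only` rows exist only in the 2019 table.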
# Cleans both price columns
for column in ['price_2020', 'price_2019']:
    # Removes the $ at the beginning of the price and any thousands separators
    cleaned = df[column].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    # Converts the cleaned strings to numbers and stores them as integers
    df[column] = cleaned.astype(float).astype(int)
df.head() # Displays the first 5 listings
| | id | neighborhood | latitude | longitude | room_type | num_reviews | rating | amenities | price_2020 | price_2019 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3781 | East Boston | 42.36413 | -71.02991 | Entire home/apt | 17 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | 150 | 125 |
1 | 5506 | Roxbury | 42.32981 | -71.09559 | Entire home/apt | 107 | 95.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | 145 | 145 |
2 | 6695 | Roxbury | 42.32994 | -71.09351 | Entire home/apt | 115 | 96.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | 169 | 169 |
3 | 10730 | Downtown | 42.35840 | -71.06185 | Entire home/apt | 32 | 96.0 | ["Cable TV", "Smoke alarm", "TV", "Bed linens"... | 81 | 150 |
4 | 10813 | Back Bay | 42.35061 | -71.08787 | Entire home/apt | 10 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV", "... | 87 | 179 |
The price columns are now properly encoded as integers! Note that all of the prices were whole numbers, which is why we did not need to keep them as type float.
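The same cleanup can be checked on a handful of strings before running it on the full columns. This is a minimal sketch with made-up prices; it relies on pandas' vectorized string methods rather than a row-by-row loop.

```python
import pandas

# Hypothetical price strings in the same format as the dataset
prices = pandas.Series(['$150.00', '$1,000.00', '$81.00'])

# Remove the dollar sign and thousands separator, then convert to integers
cleaned = (prices.str.replace('$', '', regex=False)
                 .str.replace(',', '', regex=False)
                 .astype(float)
                 .astype(int))

print(cleaned.tolist())
```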
We want to do some analysis to see if the number of amenities provided by a rental affects the price. Since the dataset has a column listing the amenities provided, we want to add a column that counts the number of amenities provided from that list. However, since the list of amenities is currently stored as a string, we need to do some pre-processing so that it is easier to count the number of amenities.
# Converts each amenities string into an actual list: drops the enclosing brackets,
# splits on commas, then strips the whitespace and quotation marks around each name
df['amenities'] = df['amenities'].apply(lambda x: [a.strip().strip('"') for a in x[1:-1].split(',')])
# Adds a column that contains the length of the list of amenities
df['num_amenities'] = df['amenities'].apply(len)
df.head() # Displays the first 5 listings
| | id | neighborhood | latitude | longitude | room_type | num_reviews | rating | amenities | price_2020 | price_2019 | num_amenities |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3781 | East Boston | 42.36413 | -71.02991 | Entire home/apt | 17 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 150 | 125 | 30 |
1 | 5506 | Roxbury | 42.32981 | -71.09559 | Entire home/apt | 107 | 95.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 145 | 145 | 30 |
2 | 6695 | Roxbury | 42.32994 | -71.09351 | Entire home/apt | 115 | 96.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 169 | 169 | 30 |
3 | 10730 | Downtown | 42.35840 | -71.06185 | Entire home/apt | 32 | 96.0 | ["Cable TV", "Smoke alarm", "TV", "Bed line... | 81 | 150 | 30 |
4 | 10813 | Back Bay | 42.35061 | -71.08787 | Entire home/apt | 10 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 87 | 179 | 23 |
We now have a column at the end of the dataset showing the number of amenities provided by each listing.
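Because the amenities strings shown above look like JSON arrays (square brackets, double-quoted names), another way to parse them, assuming every row really is valid JSON, is `json.loads`. The sketch below compares it with a plain comma split on one made-up string; the amenity containing a comma is hypothetical and only there to show where the two approaches can disagree.

```python
import json

# Hypothetical amenities string in the same format as the dataset
raw = '["Wifi", "Smoke alarm", "Dishes, silverware"]'

# Naive approach: strip the brackets and split on commas
naive = raw[1:-1].split(',')

# JSON approach: the parser handles quotes and embedded commas
parsed = json.loads(raw)

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split counts four pieces and leaves the quotation marks in place, while the JSON parser returns the three amenity names.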
Lastly, we need to remove any rows with missing values to avoid errors later on.
df.dropna(inplace=True)
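Before dropping rows, it can be useful to check which columns actually contain missing values and how many rows will be lost. Below is a minimal sketch on a made-up table; in our data, the `rating` column is the one that visibly contains `NaN`, for listings with no reviews.

```python
import pandas

# Hypothetical miniature table with missing ratings
toy = pandas.DataFrame({
    'id': [1, 2, 3, 4],
    'rating': [99.0, None, 96.0, None],
    'price_2020': [150, 145, 169, 81],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value
complete = toy.dropna()

print(missing_per_column)
print(f"{len(toy) - len(complete)} rows dropped")
```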
Now that we have finished with our Data Cleaning and Curation, we can move on to doing Exploratory Data Analysis. During this phase, we try to gain some more insight into the dataset through visualizations and also determine if there are any outliers that we need to account for. Primarily, we want to see whether price varies with room type, neighborhood, rating, number of amenities, or number of reviews, since any variable that price varies with could be used to predict price.
In order to get a better sense of the location of rentals, we produce a map with Marker Clusters using the Folium package. Here is a helpful walkthrough of using Folium to produce map visualizations. We can plot the rentals as clusters where we can zoom in on a specific cluster to see the individual marker locations. By clicking on a marker, a popup will appear that displays the URL to the listing, which can be typed into a browser to go to the listing's page. (Note that some of the listings may no longer be active, but most of them should be.)
import folium
from folium.plugins import MarkerCluster
from folium import Marker
location_map = folium.Map(location=[42.3, -71.057083], zoom_start=12, tiles='cartodbpositron') # Defines the base map
clusters = MarkerCluster()
# Iterates over every listing in our table
for index, row in df.iterrows():
    # Adds a marker for the listing at the corresponding coordinates and showing the URL when clicked
    popup = 'https://www.airbnb.com/rooms/' + str(row['id'])
    clusters.add_child(Marker([row['latitude'], row['longitude']], popup=popup))
location_map.add_child(clusters)
location_map # Displays the map with the marker clusters
From this, we can see that the largest clusters are near central Boston (where the word "BOSTON" is labeled on the map), which makes sense as that is likely the most populous area. As we move farther away from central Boston, we see that rentals are more sparse, particularly towards the south which has smaller clusters.
Since the data for each rental lists the amenities it provides, we produce a bar plot to display the most common amenities provided to guests across all Boston Airbnbs.
from collections import Counter
import numpy as np
import plotly.express as px
# Defines an empty Counter to keep track of the counts for each amenity
all_amenities = Counter()
# Iterates over every listing, adding to the counts for the respective amenities provided
for amenities in df['amenities']:
    all_amenities.update(amenities)
# Reads the amenities with the top 10 counts into a dataframe
most_amenities = all_amenities.most_common(10)
amenities_df = pandas.DataFrame(most_amenities)
amenities_df.columns = ['Amenities', 'Frequency']
# Produces the bar plot of the top 10 amenities with their frequencies
amenities_df.sort_values(by='Frequency', inplace=True) # Sorts so that the most common amenity is at the top of the plot
fig = px.bar(amenities_df, x='Frequency', y='Amenities', orientation='h', title='10 Most Common Amenities Provided')
fig.show()
As we can see, the top 10 amenities include Wifi, Heating, Smoke alarm, and Carbon monoxide alarm. The "Essentials" amenity refers to basic items like toilet paper, soap, towels, and linens. Since our data contains around 2,000 rentals and the frequency of each of these amenities is over 1,000, more than half of the rentals in our data provide each of them!
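The counting step can be tried on a few made-up listings. This sketch uses `Counter.update`, which adds to the running totals in place, and then converts each count to a share of listings, which is how we read "over half" from the plot.

```python
from collections import Counter

# Hypothetical amenity lists for three listings
listings = [
    ['Wifi', 'Heating', 'Smoke alarm'],
    ['Wifi', 'Heating'],
    ['Wifi', 'Kitchen'],
]

# Accumulate counts across all listings
counts = Counter()
for amenities in listings:
    counts.update(amenities)

# Two most common amenities and the share of listings offering each
top = counts.most_common(2)
shares = {name: n / len(listings) for name, n in top}

print(top)
print(shares)
```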
Now, the rest of our Exploratory Data Analysis will focus on the data in relation to rental prices and how prices may vary based on different variables, which is the focus of our project.
First, we produce a simple box plot of the distribution of rental prices in both years and check for major outliers.
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Box(y=df['price_2019'].values, name='2019')) # Box plot for 2019 prices
fig.add_trace(go.Box(y=df['price_2020'].values, name='2020')) # Box plot for 2020 prices
# Sets appropriate titles
fig.update_layout(showlegend=False,
title='Distribution of Rental Price in 2019 and 2020',
xaxis_title='Year',
yaxis_title='Price ($)')
fig.show()
From these box plots, we can clearly see that there are major outliers: some rentals have very high prices, which distorts the appearance of our plots since the scale of the y-axis is so large. There are some outliers with prices of about \$4000 per night, while the majority of the rentals appear to have prices less than \$500 per night.
Because the box plots above show such extreme outliers, we will trim our dataset to remove them and reduce the impact they would have on our visualizations and our machine learning. We do not expect removing these outliers to be too consequential, as they are luxury properties with very high prices, and we are aiming to help people find cheaper Airbnbs. Furthermore, if we kept these outliers, they would skew our visualizations and predictive analysis and make both much less meaningful.
To remove outliers, we use the common method that is based on the Interquartile Range (IQR). The IQR is the difference between the 75th (Q3) and 25th (Q1) percentile. Using this, outliers are defined as values > Q3 + 1.5*IQR and values < Q1 - 1.5*IQR.
trim_df = df
# Removes outliers for the price columns for both years
for column in ['price_2019', 'price_2020']:
    # Calculates the IQR
    Q1 = trim_df[column].quantile(0.25)
    Q3 = trim_df[column].quantile(0.75)
    IQR = Q3 - Q1
    # Uses the IQR to remove outliers and update the dataframe
    trim_df = trim_df[trim_df[column] >= Q1 - IQR * 1.5] # Removes the lower outliers
    trim_df = trim_df[trim_df[column] <= Q3 + IQR * 1.5] # Removes the upper outliers
trim_df # Outputs the dataset without the major outliers
| | id | neighborhood | latitude | longitude | room_type | num_reviews | rating | amenities | price_2020 | price_2019 | num_amenities |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3781 | East Boston | 42.36413 | -71.02991 | Entire home/apt | 17 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 150 | 125 | 30 |
1 | 5506 | Roxbury | 42.32981 | -71.09559 | Entire home/apt | 107 | 95.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 145 | 145 | 30 |
2 | 6695 | Roxbury | 42.32994 | -71.09351 | Entire home/apt | 115 | 96.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 169 | 169 | 30 |
3 | 10730 | Downtown | 42.35840 | -71.06185 | Entire home/apt | 32 | 96.0 | ["Cable TV", "Smoke alarm", "TV", "Bed line... | 81 | 150 | 30 |
4 | 10813 | Back Bay | 42.35061 | -71.08787 | Entire home/apt | 10 | 99.0 | ["Cable TV", "Shampoo", "Smoke alarm", "TV"... | 87 | 179 | 23 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2051 | 39444375 | Back Bay | 42.34670 | -71.07945 | Entire home/apt | 1 | 100.0 | ["Shampoo", "Smoke alarm", "TV", "Private e... | 130 | 200 | 22 |
2052 | 39444706 | Back Bay | 42.34536 | -71.07852 | Entire home/apt | 7 | 100.0 | ["Garden or backyard", "Cable TV", "Shampoo"... | 148 | 200 | 36 |
2053 | 39445807 | Back Bay | 42.34645 | -71.07803 | Entire home/apt | 2 | 100.0 | ["Shampoo", "Smoke alarm", "TV", "Bed linen... | 125 | 200 | 29 |
2054 | 39446774 | Back Bay | 42.34663 | -71.07915 | Entire home/apt | 1 | 100.0 | ["Shower gel", "Cable TV", "Shampoo", "Smok... | 148 | 245 | 33 |
2057 | 39447565 | Back Bay | 42.34834 | -71.08152 | Entire home/apt | 1 | 100.0 | ["Shampoo", "Smoke alarm", "TV", "Baking sh... | 148 | 245 | 29 |
1692 rows × 11 columns
After removing the outliers, we now have 1692 listings to analyze. Going forward, we will be using this data without outliers for our visualizations and machine learning. We now produce our box plots of rental price again.
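To make the IQR rule concrete, here is the same filter applied to six made-up prices. The helper name `iqr_bounds` is our own, and `quantile` uses pandas' default linear interpolation.

```python
import pandas

def iqr_bounds(values):
    """Returns the (lower, upper) outlier cutoffs based on 1.5 * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical nightly prices with one luxury listing
prices = pandas.Series([50, 80, 100, 120, 150, 4000])

lower, upper = iqr_bounds(prices)
kept = prices[(prices >= lower) & (prices <= upper)]

print(lower, upper)
print(kept.tolist())
```

For these numbers, Q1 is 85 and Q3 is 142.5, so the upper cutoff is 228.75 and only the 4000 listing is removed.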
fig = go.Figure()
fig.add_trace(go.Box(y=trim_df['price_2019'].values, name='2019')) # Box plot for 2019 prices
fig.add_trace(go.Box(y=trim_df['price_2020'].values, name='2020')) # Box plot for 2020 prices
# Sets appropriate titles
fig.update_layout(showlegend=False,
title='Distribution of Rental Price in 2019 and 2020',
xaxis_title='Year',
yaxis_title='Price ($)')
fig.show()
Now our box plots look much better because we have removed the extreme outliers. From these plots, it appears that the median rental price was lower in 2020 than in 2019: it was \$125 in 2019 and \$99 in 2020. Additionally, the range of prices in 2020 is slightly smaller. Both distributions appear to be right skewed, likely due to the occurrence of luxury rentals with high prices that differ from the majority of rentals.
Now, we want to see whether price varies across the different neighborhoods of Boston. To do so, we produce a choropleth map, which will color each neighborhood according to the average rental price for rentals in that neighborhood. Here is a helpful guide to producing choropleth maps with Folium. Darker colors correspond to a higher average rental price, while lighter colors mean a lower average rental price. By using a choropleth map, we will easily be able to visualize the average price for each neighborhood in Boston and how the averages compare to one another. Note that hovering over each region will display the name of the neighborhood!
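The size of that drop is easy to quantify from the two medians read off the box plots. A quick calculation (the variable names are our own):

```python
# Median nightly prices read from the box plots above
median_2019 = 125
median_2020 = 99

# Relative change from 2019 to 2020, in percent
percent_change = 100 * (median_2020 - median_2019) / median_2019

print(f"{percent_change:.1f}%")
```

In other words, the median listing was about a fifth cheaper in October 2020 than a year earlier.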
# Defines a function that draws the choropleth map for the specified year
def choropleth_map(year):
    # Calculates the average rental price for each neighborhood
    means = trim_df.groupby('neighborhood')['price_' + year].mean()
    # Defines the base map
    price_map = folium.Map(location=[42.3, -71.057083], tiles='cartodbpositron', zoom_start=11)
    # Defines the choropleth map's properties
    choropleth = folium.Choropleth(
        geo_data=county_geo,
        data=means,
        key_on='feature.properties.neighbourhood',
        fill_color='YlGnBu',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name='Price ($)').add_to(price_map)
    choropleth.geojson.add_child(folium.features.GeoJsonTooltip(['neighbourhood'], labels=False))
    # Sets the title for the map
    text = 'Choropleth Map of Airbnb Prices in Boston (' + year + ')'
    title_html = '''<p align="center" style="font-size:18px">{}</p>'''.format(text)
    price_map.get_root().html.add_child(folium.Element(title_html))
    display(price_map) # Displays the map
choropleth_map('2019') # Displays the choropleth map for 2019
In 2019, the most expensive neighborhoods were North End, Downtown, West End, Chinatown, Back Bay, and Fenway. We can see that the more expensive neighborhoods appear to be those near central Boston, with the less expensive districts being those like Hyde Park or Brighton which are closer to the outskirts. Note that the Harbor Islands neighborhood is colored black because there were no rentals with that location. This may be because they are quite small and not as residential as the rest of the city.
We now produce the choropleth map of Airbnb prices in 2020.
choropleth_map('2020') # Displays the choropleth map for 2020
In 2020, the Leather District and West End were the most expensive, followed by North End, Chinatown, Fenway, Charlestown, and South Boston Waterfront. Again, the trend of central Boston (meaning the area around Downtown) being more expensive on average remains the same.
In all, from these choropleth maps, we can clearly see that the price of Airbnb rentals does vary by neighborhood since the neighborhoods are colored differently according to the average price of rentals located there. Overall, the most expensive neighborhoods on average tend to be the ones that are closest to central Boston, with the less expensive neighborhoods being the ones further away. However, we can see that the set of most expensive neighborhoods does vary a bit between 2019 and 2020, suggesting that the distribution of prices may have changed from 2019 to 2020.
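One way to make this year-to-year comparison concrete is to rank the neighborhoods by average price in each year and see how the ranks move. The sketch below does this on a small made-up DataFrame. The prices, the neighborhood labels, and the helper name `rank_shift` are illustrative assumptions; only the column names `neighborhood`, `price_2019`, and `price_2020` follow the data used above.

```python
import pandas as pd

# Toy data (made-up prices, NOT taken from the real dataset)
toy_df = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B', 'B', 'C', 'C'],
    'price_2019': [200, 220, 150, 170, 90, 110],
    'price_2020': [140, 160, 180, 200, 80, 100],
})

def rank_shift(df):
    """Average price per neighborhood for each year, plus its rank (1 = most expensive)."""
    means = df.groupby('neighborhood')[['price_2019', 'price_2020']].mean()
    ranks = means.rank(ascending=False).astype(int)
    means['rank_2019'] = ranks['price_2019']
    means['rank_2020'] = ranks['price_2020']
    return means

summary = rank_shift(toy_df)
print(summary)
```

With the tutorial's data, one could pass `trim_df` to the same helper and look for neighborhoods whose rank changes between the two years.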
We now want to explore whether price varies depending on the type of room. To do so, for each type of room, we produce a box plot of the prices for listings that are of that room type.
fig = go.Figure()
# Makes box plots for 2019 and then 2020
fig.add_trace(go.Box(y=trim_df['price_2019'], x=trim_df['room_type'], boxpoints=False, name='2019'))
fig.add_trace(go.Box(y=trim_df['price_2020'], x=trim_df['room_type'], boxpoints=False, name='2020'))
# Sets appropriate titles
fig.update_layout(boxmode='group',
title='Distribution of Rental Price for Different Room Types',
xaxis_title='Room Type',
yaxis_title='Price ($)')
fig.show()
From this, we can see that the listing price does vary depending on the type of room. The distributions for each room type across 2019 and 2020 appear to be largely the same. Shared rooms appear to be the cheapest and hotel rooms appear to be the most expensive on average, which is what we would expect from intuition. Because shared rooms may have less space or less privacy, this may be why they tend to be cheaper. There are some high outliers for entire home/apartment and private room listings, which is understandable considering there may be luxury homes or private rooms, but it is less likely for there to be something like a luxury room that is shared.
Lastly, let's see if the price of a rental varies based on other variables like the number of reviews, number of amenities, or rating. To do so, we will make a simple scatterplot of price against each of these variables.
from plotly.subplots import make_subplots
# Defines that we want a row of 3 subplots
fig = make_subplots(rows=1, cols=3, horizontal_spacing=0.1,
subplot_titles=("Price vs. # of Reviews", "Price vs. # of Amenities", "Price vs. Rating"))
# Plots price vs. # of reviews
fig.add_trace(go.Scatter(x=trim_df['num_reviews'], y=trim_df['price_2020'], mode='markers'),
row=1, col=1)
# Plots price vs. # of amenities
fig.add_trace(go.Scatter(x=trim_df['num_amenities'], y=trim_df['price_2020'], mode='markers'),
row=1, col=2)
# Plots price vs. rating
fig.add_trace(go.Scatter(x=trim_df['rating'], y=trim_df['price_2020'], mode='markers'),
row=1, col=3)
# Sets appropriate titles
fig.update_layout(showlegend=False)
fig.update_yaxes(title_text="Price ($)", row=1, col=1)
fig.update_xaxes(title_text="# of Reviews", row=1, col=1)
fig.update_xaxes(title_text="# of Amenities", row=1, col=2)
fig.update_xaxes(title_text="Rating", row=1, col=3)
fig.show() # Displays the figure
From these plots, we can see that the majority of rentals appear to have fewer than 200 reviews, fewer than 40 amenities, and ratings greater than 80. However, there do not seem to be any clear trends between price and the number of reviews, number of amenities, or rating. For any given number of reviews, number of amenities, or rating, there are rentals at a wide range of prices. Nevertheless, we will include these variables in our predictive analysis to see more closely whether there may actually be a relationship, even if it is slight.
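A scatterplot can hide a weak linear trend, so a quick numerical check is the Pearson correlation between price and each variable. The sketch below runs on randomly generated stand-in data rather than the real listings, so the printed numbers say nothing about Boston; only the column names mirror the tutorial.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Stand-in data: made-up values, only the column names mirror the tutorial
toy_df = pd.DataFrame({
    'num_reviews': rng.integers(0, 200, size=n),
    'num_amenities': rng.integers(5, 40, size=n),
    'rating': rng.uniform(80, 100, size=n),
})
# Price depends weakly on the number of amenities, plus a lot of noise
toy_df['price_2020'] = 80 + 1.3 * toy_df['num_amenities'] + rng.normal(0, 40, size=n)

predictors = ['num_reviews', 'num_amenities', 'rating']
# Pearson correlation of each predictor with price; values lie in [-1, 1]
correlations = toy_df[predictors].corrwith(toy_df['price_2020'])
print(correlations.round(3))
```

Values near 0 indicate little linear association, which is what the scatterplots above suggest for the real data. Running the last two lines on `trim_df` would give the actual figures.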
Now that we have done our Exploratory Data Analysis, we will try to do some hypothesis testing and machine learning to more concretely answer whether prices have changed significantly since 2019 and how well we can predict prices of Airbnb rentals.
Let's determine whether there is a statistically significant difference between pre-pandemic and post-pandemic prices through a paired t-test of the 2019 prices and 2020 prices. In a paired t-test, each subject (here, each listing) is measured twice to determine whether the mean difference between the two sets of measurements is 0.
The null hypothesis and alternative hypothesis that we will be testing are as follows:
$ H_{0} $ = The mean difference between the 2019 and 2020 prices is 0.
$ H_{a} $ = The mean difference between the 2019 and 2020 prices is not 0.
If we can reject the null hypothesis from the results of the t-test, then we can say that the prices in 2020 for Airbnb rentals are significantly different from the prices in 2019.
from scipy import stats
# Performs the t-test and prints the result of the test
result = stats.ttest_rel(trim_df['price_2020'], trim_df['price_2019'])
print('Test Result: \n' +
't-statistic= ' + str(np.round(result.statistic, decimals=3)) + '\n' +
'p-value= ' + str(result.pvalue) + ' ≈ ' + str(np.round(result.pvalue, decimals=3)) + '\n')
# Prints the mean prices in 2020 and 2019 for comparison
print("Mean 2019 Price ($): " + str(np.round(trim_df['price_2019'].mean(), decimals=2)))
print("Mean 2020 Price ($): " + str(np.round(trim_df['price_2020'].mean(), decimals=2)))
Test Result:
t-statistic= -12.708
p-value= 2.1061464045654373e-35 ≈ 0.0

Mean 2019 Price ($): 131.15
Mean 2020 Price ($): 115.44
Because the p-value resulting from the t-test ($2.106 \times 10^{-35}$) is extremely close to 0 and well below our significance level of 0.10, we can reject the null hypothesis. Accordingly, we have sufficient evidence to conclude that the Boston Airbnb prices in 2020 are significantly different from the prices in 2019. Additionally, as we can see, the mean price of Airbnb rentals in 2019 was \$131 per night, which is greater than the mean price in 2020 of \$115 per night, suggesting that rental prices in 2020 were cheaper on average.
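To see what the paired t-test computes, the sketch below reproduces it by hand on five made-up price pairs: it takes the per-listing differences, divides their mean by their standard error, and compares the result with `stats.ttest_rel`. The prices are invented for illustration and are not from the dataset.

```python
import numpy as np
from scipy import stats

# Made-up prices for the same five listings in two years
price_2019 = np.array([120.0, 95.0, 200.0, 150.0, 80.0])
price_2020 = np.array([100.0, 90.0, 185.0, 155.0, 70.0])

# A paired t-test works on the per-listing differences
diff = price_2020 - price_2019
n = len(diff)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_rel(price_2020, price_2019)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual and library values agree. The t-statistic is negative whenever the 2020 prices are lower on average, because the differences are taken as 2020 minus 2019, the same order used in the test above.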
Now we will use machine learning models to try to predict Airbnb rental price based on a variety of variables. We will fit a linear regression model to try to predict prices in 2019 and 2020 based on the variables we have recorded for each rental. For our linear regression models, we will be using Ordinary Least Squares (OLS), which chooses the model coefficients that minimize the sum of the squared differences between the actual values and the values predicted by the model. First, we'll compare the 2019 and 2020 models and then look more closely at the 2020 model to discuss the takeaways. Then, we'll see if we can improve our model.
Using the statsmodels package's built-in function for OLS linear regression models, we will produce a multiple linear regression model that attempts to predict price based on the rental's neighborhood, room type, number of reviews, rating, and number of amenities. After computing these models to predict price in 2019 and 2020, we will output the $R^{2}$ value and the p-value result of the F-test of overall significance, which are computed by statsmodels for us.
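As a minimal illustration of what "least squares" means, the sketch below fits a line to five made-up points with NumPy and computes the sum of squared residuals that OLS minimizes. The data and variable names are assumptions for illustration; the tutorial itself fits its models with statsmodels.

```python
import numpy as np

# Made-up data: price as a roughly linear function of the number of amenities
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([95.0, 100.0, 112.0, 115.0, 128.0])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
# lstsq returns the coefficients that minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

residuals = y - X @ beta
sse = float(np.sum(residuals ** 2))
print(intercept, slope, sse)
```

Any other choice of intercept and slope gives a larger sum of squared residuals on these points, which is the sense in which the fitted line is the "best" one.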
The $R^{2}$ value is the proportion of variation in price that can be explained by the predictor variables. In other words, it is a "goodness-of-fit" measure for linear regression models, meaning that it indicates the strength of our linear model in predicting price, with higher values (closer to 1) indicating a stronger relationship.
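The definition can be checked by hand: $R^{2}$ is one minus the ratio of the unexplained variation to the total variation. The sketch below applies it to made-up actual and predicted prices, which are purely illustrative values.

```python
import numpy as np

# Made-up actual and predicted prices
actual = np.array([100.0, 150.0, 120.0, 90.0, 140.0])
predicted = np.array([110.0, 140.0, 115.0, 95.0, 140.0])

ss_res = np.sum((actual - predicted) ** 2)      # variation left unexplained
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

Here the residual sum of squares is 250 and the total sum of squares is 2600, so about 90% of the variation in these toy prices is explained by the predictions.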
The F-test of overall significance tests whether our model using the independent variables of neighborhood, room type, number of reviews, rating, and number of amenities, is significantly better than a model that does not use any independent variables.
from statsmodels.formula.api import ols
# Defines that we want a multiple linear regression model to predict 2019 price based on the listed variables
model_2019 = ols('price_2019 ~ neighborhood + room_type + num_reviews + rating + num_amenities', data=trim_df)
model_2019 = model_2019.fit() # Fits the model
# Repeats for predicting 2020 price
model_2020 = ols('price_2020 ~ neighborhood + room_type + num_reviews + rating + num_amenities', data=trim_df)
model_2020 = model_2020.fit()
# Prints out the computed R squared value and f-test p value of both models
print('2019 model:\n' +
'R-squared value: ' + str(np.round(model_2019.rsquared, decimals=3)) + '\n'
'F-test p-value: ' + str(model_2019.f_pvalue) + ' ≈ ' + str(np.round(model_2019.f_pvalue, decimals=3)) + '\n')
print('2020 model:\n' +
'R-squared value: ' + str(np.round(model_2020.rsquared, decimals=3)) + '\n'
'F-test p-value: ' + str(model_2020.f_pvalue) + ' ≈ ' + str(np.round(model_2020.f_pvalue, decimals=3)))
2019 model:
R-squared value: 0.539
F-test p-value: 1.570599321055285e-253 ≈ 0.0

2020 model:
R-squared value: 0.385
F-test p-value: 5.208932825861506e-152 ≈ 0.0
The $R^{2}$ values for our 2019 and 2020 linear regression models indicate that the models predict the price moderately well. More notably, the R-squared value for the 2019 model (0.539) is greater than the value for the 2020 model (0.385), which indicates that the linear regression model for predicting prices in 2019 performs better than the model for predicting prices in 2020. We will discuss the possible implications of this further in the Conclusion section.
The p-value resulting from the F-test for both models is extremely close to 0, which provides further support that our models predicting price based on neighborhood, room type, number of reviews, rating, and number of amenities are significantly better than a model that does not predict based on any independent variables. In other words, it means that using these independent variables produced a significant improvement in the models' ability to predict prices, which makes sense considering we saw above in our EDA phase that price appears to vary with neighborhood and room type.
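The overall F statistic can be recovered from $R^{2}$ and the degrees of freedom. The sketch below uses the 2020 values reported in this tutorial ($R^{2}$ = 0.385, 1692 observations, and the 30 model degrees of freedom that statsmodels lists in the model summary). The helper name `f_statistic` is our own, not a statsmodels function.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic for an OLS model that includes an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Values as reported for the 2020 model in this tutorial
f_2020 = f_statistic(r_squared=0.385, n_obs=1692, n_predictors=30)
print(round(f_2020, 2))
```

The result is close to the F-statistic of 34.70 that statsmodels reports for this model; the small gap comes from plugging in the rounded $R^{2}$ of 0.385.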
Now, we will specifically look at the full summary for the linear regression model predicting prices in 2020 to see if we can gain any insight into how to get a cheaper Airbnb in 2020. We will note whether there are any coefficients that have high p-values.
Some coefficients for the predictors may not be statistically significant, even though we have found that our model has significance as a whole. Here, the p-value tests whether a predictor's corresponding coefficient is different from zero. Low p-values mean that the coefficient is significantly different from zero, and high p-values mean that it is not significantly different from zero. In other words, a predictor that has a low p-value is likely to be a meaningful addition to the regression model because changes in the predictor's value are related to changes in the response variable, while predictors with high p-values are likely not significant.
model_2020.summary() # Displays the full summary for the 2020 model
Dep. Variable: | price_2020 | R-squared: | 0.385 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.374 |
Method: | Least Squares | F-statistic: | 34.70 |
Date: | Mon, 21 Dec 2020 | Prob (F-statistic): | 5.21e-152 |
Time: | 09:17:12 | Log-Likelihood: | -9027.8 |
No. Observations: | 1692 | AIC: | 1.812e+04 |
Df Residuals: | 1661 | BIC: | 1.829e+04 |
Df Model: | 30 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2.1291 | 18.099 | 0.118 | 0.906 | -33.371 | 37.629 |
neighborhood[T.Back Bay] | 27.5163 | 7.615 | 3.614 | 0.000 | 12.581 | 42.452 |
neighborhood[T.Bay Village] | 37.3705 | 11.533 | 3.240 | 0.001 | 14.750 | 59.991 |
neighborhood[T.Beacon Hill] | 22.2723 | 7.574 | 2.941 | 0.003 | 7.417 | 37.128 |
neighborhood[T.Brighton] | -1.3180 | 7.051 | -0.187 | 0.852 | -15.149 | 12.513 |
neighborhood[T.Charlestown] | 51.5437 | 9.207 | 5.598 | 0.000 | 33.484 | 69.603 |
neighborhood[T.Chinatown] | 44.1577 | 16.129 | 2.738 | 0.006 | 12.522 | 75.794 |
neighborhood[T.Dorchester] | 13.0525 | 5.998 | 2.176 | 0.030 | 1.287 | 24.818 |
neighborhood[T.Downtown] | 13.8168 | 7.179 | 1.925 | 0.054 | -0.265 | 27.898 |
neighborhood[T.East Boston] | 24.7075 | 7.144 | 3.458 | 0.001 | 10.695 | 38.720 |
neighborhood[T.Fenway] | 53.2351 | 8.691 | 6.126 | 0.000 | 36.190 | 70.281 |
neighborhood[T.Hyde Park] | -20.4330 | 10.023 | -2.039 | 0.042 | -40.091 | -0.775 |
neighborhood[T.Jamaica Plain] | 19.1311 | 6.552 | 2.920 | 0.004 | 6.280 | 31.982 |
neighborhood[T.Leather District] | 54.8785 | 50.976 | 1.077 | 0.282 | -45.106 | 154.863 |
neighborhood[T.Longwood Medical Area] | -9.8361 | 36.202 | -0.272 | 0.786 | -80.842 | 61.170 |
neighborhood[T.Mattapan] | -6.0983 | 10.440 | -0.584 | 0.559 | -26.574 | 14.378 |
neighborhood[T.Mission Hill] | 27.8666 | 10.265 | 2.715 | 0.007 | 7.732 | 48.001 |
neighborhood[T.North End] | 51.1574 | 9.852 | 5.193 | 0.000 | 31.834 | 70.481 |
neighborhood[T.Roslindale] | 6.0916 | 8.562 | 0.711 | 0.477 | -10.702 | 22.885 |
neighborhood[T.Roxbury] | 3.7661 | 6.502 | 0.579 | 0.563 | -8.988 | 16.520 |
neighborhood[T.South Boston] | 34.6359 | 7.453 | 4.648 | 0.000 | 20.019 | 49.253 |
neighborhood[T.South Boston Waterfront] | 64.3980 | 19.859 | 3.243 | 0.001 | 25.446 | 103.350 |
neighborhood[T.South End] | 25.9874 | 6.992 | 3.717 | 0.000 | 12.273 | 39.702 |
neighborhood[T.West End] | 58.1533 | 13.648 | 4.261 | 0.000 | 31.384 | 84.923 |
neighborhood[T.West Roxbury] | 6.2262 | 11.517 | 0.541 | 0.589 | -16.364 | 28.816 |
room_type[T.Hotel room] | 68.6711 | 12.257 | 5.603 | 0.000 | 44.630 | 92.712 |
room_type[T.Private room] | -56.0406 | 2.918 | -19.203 | 0.000 | -61.765 | -50.316 |
room_type[T.Shared room] | -102.2883 | 19.421 | -5.267 | 0.000 | -140.380 | -64.197 |
num_reviews | -0.0603 | 0.016 | -3.693 | 0.000 | -0.092 | -0.028 |
rating | 0.9422 | 0.182 | 5.176 | 0.000 | 0.585 | 1.299 |
num_amenities | 1.3247 | 0.164 | 8.095 | 0.000 | 1.004 | 1.646 |
Omnibus: | 304.363 | Durbin-Watson: | 1.796 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 547.118 |
Skew: | 1.116 | Prob(JB): | 1.57e-119 |
Kurtosis: | 4.668 | Cond. No. | 5.32e+03 |
The p-values for the coefficients are listed under the P>|t| column. From this output, we can see that some coefficients have p-values greater than our significance level of 0.10, including Brighton (p = .852), Longwood Medical Area (p = .786), and West Roxbury (p = .589), amongst a few others. This means that we lack evidence that these coefficients differ from zero, so these specific predictors are not statistically significant predictors of price.
Nevertheless, the majority of predictors have statistically significant p-values that are less than our significance level. In addition, the model yielded p-values that round to 0.000 for rating and number of amenities. This means that even though their coefficients are small, these predictors are still meaningful and correlate with increases in price.
Accordingly, we now filter to only view the predictors with coefficients that are statistically significant, meaning that they have p-values that are less than our significance level of 0.10. Then, we can see what the meaningful takeaways from our model are.
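Each row of the summary can be reproduced from just the coefficient and its standard error. As a sketch, the snippet below recomputes the t-statistic and two-sided p-value for the Downtown row, using the 1661 residual degrees of freedom listed in the summary. The variable names are ours.

```python
from scipy import stats

# Downtown row of the 2020 summary table
coef, std_err = 13.8168, 7.179
df_resid = 1661  # "Df Residuals" in the summary

t_stat = coef / std_err
# Two-sided p-value: probability of a |t| at least this large if the true coefficient were 0
p_value = 2 * stats.t.sf(abs(t_stat), df=df_resid)
print(round(t_stat, 3), round(p_value, 3))
```

This matches the t of 1.925 and p of 0.054 in the table. Downtown is a useful borderline case: it counts as significant at the 0.10 level used here, but it would not at a stricter 0.05 level.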
p_values = model_2020.pvalues # List of the p-values for the coefficients
alpha = 0.10 # Defines significance level
significant_predictors = [] # Will store the list of significant predictors
# Iterates over every predictor
for predictor in p_values.index:
    # Adds predictor to list if it is significant
    if p_values[predictor] < alpha:
        significant_predictors.append(predictor)
# Prints out the rounded coefficients for each variable of the 2020 model in sorted order
model_2020.params[significant_predictors].sort_values().round(decimals=2)
room_type[T.Shared room]                  -102.29
room_type[T.Private room]                  -56.04
neighborhood[T.Hyde Park]                  -20.43
num_reviews                                 -0.06
rating                                       0.94
num_amenities                                1.32
neighborhood[T.Dorchester]                  13.05
neighborhood[T.Downtown]                    13.82
neighborhood[T.Jamaica Plain]               19.13
neighborhood[T.Beacon Hill]                 22.27
neighborhood[T.East Boston]                 24.71
neighborhood[T.South End]                   25.99
neighborhood[T.Back Bay]                    27.52
neighborhood[T.Mission Hill]                27.87
neighborhood[T.South Boston]                34.64
neighborhood[T.Bay Village]                 37.37
neighborhood[T.Chinatown]                   44.16
neighborhood[T.North End]                   51.16
neighborhood[T.Charlestown]                 51.54
neighborhood[T.Fenway]                      53.24
neighborhood[T.West End]                    58.15
neighborhood[T.South Boston Waterfront]     64.40
room_type[T.Hotel room]                     68.67
dtype: float64
From this list of coefficients, we can see the variables associated with negative coefficients that decrease the predicted price of the rental and the variables associated with positive coefficients that increase the predicted price.
When looking at the type of room, we see that shared rooms are least expensive followed by private rooms, since they have the most negative coefficients, which makes sense based on the box plots we did showing how price varies depending on room type. On the other hand, hotel rooms are most expensive, which is also reasonable based on the box plots and intuition.
When looking at the neighborhood, South Boston Waterfront, West End, and Fenway have the three largest significant coefficients, while Hyde Park, Dorchester, and Downtown have the three smallest, making them the most and least expensive neighborhoods among the significant predictors. Keep in mind that each neighborhood coefficient is measured relative to the reference neighborhood that is omitted from the table. Again, this ranking only covers the neighborhoods whose coefficients were statistically significant, so even though Mattapan has a smaller coefficient than Dorchester or Downtown, there was not enough evidence to say that it differs from zero. By focusing on the predictors with significant coefficients, we can more confidently say how they correlate to price.
The coefficient for the number of reviews is quite close to 0, indicating that it does not have much of an effect on the price. The coefficient for the rating variable is 0.94, meaning that a rental with a perfect rating of 100 is expected to be approximately \$94 more expensive to rent than a rental with a 0 rating. Of course, this is a very extreme example and the coefficient is relatively small, so rating does not have too much of an impact either. Lastly, the coefficient for number of amenities is 1.32, meaning that for each additional amenity provided, the predicted rental price increases by about \$1.32. Overall, these three predictors do not have a large impact on price since the coefficients are relatively small, but they are still things to perhaps take note of.
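To see how the coefficients combine into a prediction, the sketch below plugs a hypothetical listing into the basic 2020 model by hand. The coefficients are copied from the summary above; the listing itself (a private room in Fenway with 50 reviews, a rating of 95, and 30 amenities) is invented purely for illustration.

```python
# Coefficients copied from the 2020 model summary above
coefs = {
    'Intercept': 2.1291,
    'neighborhood[T.Fenway]': 53.2351,
    'room_type[T.Private room]': -56.0406,
    'num_reviews': -0.0603,
    'rating': 0.9422,
    'num_amenities': 1.3247,
}

# Hypothetical listing (values chosen only for illustration)
listing = {'num_reviews': 50, 'rating': 95, 'num_amenities': 30}

predicted_price = (
    coefs['Intercept']
    + coefs['neighborhood[T.Fenway]']      # indicator is 1 for a Fenway listing
    + coefs['room_type[T.Private room]']   # indicator is 1 for a private room
    + coefs['num_reviews'] * listing['num_reviews']
    + coefs['rating'] * listing['rating']
    + coefs['num_amenities'] * listing['num_amenities']
)
print(round(predicted_price, 2))
```

The prediction comes out to roughly \$125.56 per night. Most of it comes from the rating term (about \$89.51) and the amenities term (about \$39.74), simply because those variables take large values, even though their per-unit coefficients are small.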
Since the previous linear models only fit moderately well, let's investigate whether a different specification of the linear regression will fit the data better. To do this, we will include interaction terms between the predictors and see if this improves the results.
The previous linear regression model analyzed each predictor separately, without considering the interactions between variables. In this new model, there will be new predictors based on all possible combinations of the variables together instead of just each one individually. This means that the new model will now predict price based on many more factors.
# Defines that we want a multiple linear regression model to predict 2019 price based on the listed variables
# The *'s instead of +'s indicate that we want the variables to interact with each other in the model
model_2019 = ols('price_2019 ~ neighborhood*room_type*num_reviews*rating*num_amenities', data=trim_df)
model_2019 = model_2019.fit() # Fits the model
# Repeats for predicting 2020 price
model_2020 = ols('price_2020 ~ neighborhood*room_type*num_reviews*rating*num_amenities', data=trim_df)
model_2020 = model_2020.fit()
# Prints out the computed R squared value and f-test p value of both models
print('2019 model:\n' +
'R-squared value: ' + str(np.round(model_2019.rsquared, decimals=3)) + '\n'
'F-test p-value: ' + str(model_2019.f_pvalue) + ' ≈ ' + str(np.round(model_2019.f_pvalue, decimals=3)) + '\n')
print('2020 model:\n' +
'R-squared value: ' + str(np.round(model_2020.rsquared, decimals=3)) + '\n'
'F-test p-value: ' + str(model_2020.f_pvalue) + ' ≈ ' + str(np.round(model_2020.f_pvalue, decimals=3)))
2019 model:
R-squared value: 0.668
F-test p-value: 5.255463015621847e-162 ≈ 0.0

2020 model:
R-squared value: 0.572
F-test p-value: 1.0085380063186761e-100 ≈ 0.0
The $R^{2}$ values for both the 2019 and 2020 models increase substantially (from 0.539 to 0.668 for 2019 and from 0.385 to 0.572 for 2020). This shows that this new interaction model accounts for more of the variability in price than the previous basic model did. In general, this means that this model fits the observed prices more closely in both years, especially for 2020.
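One caveat when comparing these numbers: $R^{2}$ can never decrease when terms are added to an OLS model, so a larger model always looks at least as good on this measure. Adjusted $R^{2}$ is one common correction that penalizes the number of predictors. The sketch below defines it (the helper name is ours) and checks it against the basic 2020 model, using the values from that model's summary.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Basic 2020 model: values reported in its summary table
adj_basic_2020 = adjusted_r_squared(r_squared=0.385, n_obs=1692, n_predictors=30)
print(round(adj_basic_2020, 3))
```

This reproduces the Adj. R-squared of 0.374 from the basic model's summary. For the interaction models, fitted statsmodels results expose the same quantity as `rsquared_adj`, which would give a fairer comparison than the raw $R^{2}$ values.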
The F-test p-values are also extremely close to zero again, meaning that this model is statistically significant in predicting price based on the interactions between neighborhood, room type, number of reviews, rating, and number of amenities.
In all, the interaction model accounted for more of the variability in price compared to the basic independent linear regression model. This might be due to the fact that for each variable separately, many of the rental prices are spread across large ranges. For example, many people will give their Airbnb a high rating if they are satisfied with their stay, but this will not always indicate a high price. However, if your Airbnb is a private room in a central neighborhood like West End, you have a high rating, and you offer many amenities, then it is much easier to predict that you will have a high price. In summary, combining factors predicts price better than looking at each factor individually. However, even though the interaction model does have a stronger goodness-of-fit measure, it has so many terms (800 total) from all of the combinations that it is practically impossible to gain any meaningful interpretation from it. Thus, even though the 'basic' model may not be as strong in comparison, it is actually more useful for drawing conclusions.
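The count of 800 terms can be reproduced with a little arithmetic. In a full interaction expansion, each column of the design matrix picks, for every variable, either "absent" or one of that variable's columns. The sketch below assumes the column counts seen in the basic model summary (24 neighborhood indicators, 3 room type indicators, and one column for each numeric variable).

```python
from math import prod

# Columns each variable contributes when it appears in a term.
# Categorical variables contribute (number of levels - 1) indicator columns.
columns_per_variable = {
    'neighborhood': 24,   # 24 indicators in the basic model summary
    'room_type': 3,       # 3 indicators in the basic model summary
    'num_reviews': 1,
    'rating': 1,
    'num_amenities': 1,
}

# For every variable there is 1 way to leave it out plus k ways to include it,
# so the number of columns is the product of (1 + k) over all variables.
total_columns = prod(1 + k for k in columns_per_variable.values())
print(total_columns)  # 800, counting the intercept (all variables left out)
```

That is 25 × 4 × 2 × 2 × 2 = 800 columns for only 1692 listings, which helps explain why the individual coefficients of the interaction model are so hard to interpret.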
In conclusion, we found that prices for Boston Airbnb rentals were cheaper on average in 2020 compared to 2019. We also found that we can predict the prices of Boston Airbnb rentals moderately well with multiple linear regression models based on a rental's neighborhood, room type, number of reviews, number of amenities, and rating. The model for predicting prices in 2019 was better than the model for predicting prices in 2020. This makes sense considering that with the pandemic, 2020 is a year full of uncertainty and variability that we could not account for in our model. This decreased predictive power in 2020 and overall decrease in prices may be due to a variety of reasons. For example, people on average may be lowering rental prices to try to attract more guests to account for reduced travel, but there may also be some hosts who are trying to raise prices to account for greater cleaning costs or lack of revenue.
In the future, more work could be done in trying to improve the model by trying other regression methods besides linear regression or taking into account some of the other variables in the original dataset like rental availability, but it is of course not going to be easy or necessarily even possible to predict Airbnb prices perfectly! It may also be useful to incorporate data from other months of the year so that there are more data points.
Anyhow, if you are looking to stay in an Airbnb in Boston during this time and looking for a cheaper option, we recommend looking for a shared or private room located in Hyde Park. If you want to be located closer to the center of Boston, then Downtown is probably your best bet! Avoid looking for rentals in South Boston Waterfront or West End which tend to be significantly more expensive. While other variables like the number of reviews, rating, and number of amenities do not have too much of a significant impact on price, expect a higher rated rental with lots of amenities to be more expensive.