
Analyzing the Prices of Boston Airbnb Rentals: What Affects Prices and Have Prices Changed Since the Pandemic?

By Josh Lavitz and Max Nguyen

Note that we hyperlink to additional resources throughout this tutorial that may be useful in explaining terms and things more thoroughly!

Introduction

Airbnb is a company that provides short-term rentals for people to stay in. Airbnb has helped host over 800 million people, with 5.6 million currently active listings across 220 regions around the world. Despite the challenges to travel and tourism posed by the COVID-19 pandemic, the company has managed to adapt, continuing to make a profit of $219 million. As work transitions to be remote, people are using Airbnbs to get away from home and do their work from anywhere. Some people are also using Airbnbs as temporary places to socially distance or quarantine.

In this tutorial, we will explore what may affect the prices of Airbnb rentals and how Airbnb prices have changed since the pandemic, if at all. The cost of rentals is a major consideration for people, and economic considerations are perhaps even more important during these times. Specifically, we will be looking at data for rentals in Boston, a major city that was recently ranked as one of the three best large cities in America. We will be comparing the listings from October 2020 to the listings from October 2019, as October 2019 was before the pandemic struck the U.S. This data exploration will hopefully be of interest to anyone looking to stay in an Airbnb during this time or in the future, once circumstances are better.

Python Libraries

In this tutorial, we use Python 3 and a few helpful Python libraries. We list and briefly describe the primary ones below:

Data Collection

Listings Data

Our data comes from Inside Airbnb, which scrapes data from the public listings on the Airbnb website. Since Inside Airbnb has already done the web scraping for us to produce a dataset of listings, we do not have to do much complicated work. As previously mentioned, we are looking at the rentals for Boston, the capital city of Massachusetts. To compare how prices have changed since the pandemic started, we will compare the data set that was scraped from Airbnb on October 24, 2020 to the data scraped on October 19, 2019, which is almost exactly a year earlier.

From the CSV files for October 2020 and October 2019 provided by Inside Airbnb, we read the data into two separate Pandas dataframes (which are like tables) so that the data can be easily accessed and manipulated.
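As a minimal sketch of this loading step, the snippet below assumes the files are read with `pd.read_csv`. The tiny inline CSV text and its values are made up so that the example runs on its own; in practice the argument would be the path to the file downloaded from Inside Airbnb.

```python
import io
import pandas as pd

# Stand-in for a downloaded listings file (hypothetical rows, not real data).
csv_2020 = io.StringIO(
    "id,neighbourhood_cleansed,room_type,price\n"
    '1,Fenway,Private room,"$1,000.00"\n'
    "2,Hyde Park,Shared room,$45.00\n"
)

# read_csv accepts a file path or any file-like object.
listings_2020 = pd.read_csv(csv_2020)

# shape gives (number of rows, number of columns), i.e. the listing count first.
print(listings_2020.shape)
```

The same call would be repeated for the 2019 file to get the second dataframe.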

From the displayed dataframes, we can see that there were 3,254 active listings in Boston on October 24, 2020 and 5,647 active listings on October 19, 2019. This means that there were about 2,400 fewer listings in October 2020, which makes sense considering the challenges with hosting guests in a rental during a pandemic.

Each row corresponds to the information for a specific listing. There are a lot of different columns, some with very long names, and we will not be using all of them for our analysis. Some of the notable columns include:

GeoJSON Data

Inside Airbnb also provides a GeoJSON file, which we can later use to produce visualizations of the specific neighborhoods of Boston. Here is a helpful introduction to working with the GeoJSON file format.

Now that we have read our data from the CSV files into two Pandas dataframes, we have to do some cleaning and curation, which will make our later analysis much easier.

Data Cleaning and Curation

Dropping Unnecessary Columns

First, we drop the unnecessary columns in the 2020 dataframe that contain information we are not using in our analysis. The columns we keep are the same ones as listed above: neighbourhood_cleansed, latitude, longitude, room_type, number_of_reviews, review_scores_rating, amenities, and price, since we are most interested in how those variables may affect rental price. We also keep the id column so that it is easier to differentiate rentals, even though it does not really have a relation to price.

A lot of the other columns concern things like host information or the date of the first review on the listing, which are not as relevant to our analysis. For example, there is likely not much of a relation between the name of the host or the URL of the rental listing to the price, so it does not make sense to keep these columns! We also shorten the 'neighbourhood_cleansed' column to simply 'neighborhood'.
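One way this column selection and renaming could look is sketched below. The two-row dataframe is a made-up stand-in (with only a subset of the kept columns) so that the snippet runs by itself.

```python
import pandas as pd

# Hypothetical frame with a couple of columns we do not need.
df = pd.DataFrame({
    "id": [1, 2],
    "host_name": ["A", "B"],
    "listing_url": ["u1", "u2"],
    "neighbourhood_cleansed": ["Fenway", "Hyde Park"],
    "room_type": ["Private room", "Shared room"],
    "price": ["$100.00", "$45.00"],
})

# Keep only the columns of interest, then shorten the neighborhood column name.
keep = ["id", "neighbourhood_cleansed", "room_type", "price"]
df = df[keep].rename(columns={"neighbourhood_cleansed": "neighborhood"})

print(df.columns.tolist())
```

Selecting the columns to keep, rather than listing every column to drop, stays short even when the original file has many columns.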

Our dataframe is much more readable and manageable now that we have removed the less relevant columns! Now we can focus on the variables that we will actually be analyzing.

Combining the Dataframes into One

Now, we want to merge the two separate dataframes from 2020 and 2019 into one dataframe so that we can have columns showing the rental's price in 2019 and in 2020 in a single dataframe.

We will be doing an inner join, so we will only be keeping the rentals with the same unique id that were active in both 2019 and 2020 and excluding rentals that were only active in 2019 or only active in 2020. That way, we can easily compare how a rental changed its price from 2019 to 2020.
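The inner join described above can be sketched with `pd.merge`. The ids and prices here are invented, and the `_2019`/`_2020` suffixes are an assumed naming choice for telling the two price columns apart.

```python
import pandas as pd

# Hypothetical listings: ids 2 and 3 appear in both years.
df_2019 = pd.DataFrame({"id": [1, 2, 3], "price": ["$120", "$80", "$200"]})
df_2020 = pd.DataFrame({"id": [2, 3, 4], "price": ["$70", "$210", "$95"]})

# how="inner" keeps only ids present in both frames;
# suffixes label the overlapping "price" column by year.
combined = pd.merge(
    df_2019, df_2020, on="id", how="inner", suffixes=("_2019", "_2020")
)

print(combined)
```

Listings 1 and 4 drop out because each was active in only one of the two years.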

We now have a combined dataframe showing the price for each rental in 2019 and in 2020. We can see that 2,058 rentals were active in both 2019 and 2020.

Converting Price Column to Int

We convert the price columns into type int so that they can be treated as numbers. Currently, the prices are stored as strings like '$1,000', so we remove the dollar sign and comma (giving '1000' in this example) before converting.
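A minimal sketch of this conversion, using a made-up series of price strings in the format described above. If the raw strings carried a decimal part such as ".00", an intermediate conversion to float would be needed first.

```python
import pandas as pd

# Hypothetical price strings in the '$1,000' style.
prices = pd.Series(["$1,000", "$85", "$250"])

# Strip dollar signs and commas with a regex character class, then cast to int.
as_int = prices.str.replace(r"[$,]", "", regex=True).astype(int)

print(as_int.tolist())
```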

The price columns are now properly encoded as integers! Note that all of the prices were whole numbers, which is why we did not have to worry about converting to type float.

Adding a Column for Number of Amenities

We want to do some analysis to see if the number of amenities provided by a rental affects the price. Since the dataset has a column listing the amenities provided, we want to add a column that counts the number of amenities provided from that list. However, since the list of amenities is currently stored as a string, we need to do some pre-processing so that it is easier to count the number of amenities.
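The sketch below shows one way to do this pre-processing. It assumes the amenities string is formatted like a JSON list (for example `["Wifi", "Heating"]`); if a file uses a different format, the parsing step would need to change. The column names `amenities_list` and `num_amenities` are our own choices for illustration.

```python
import json
import pandas as pd

# Hypothetical amenities strings, assumed to be JSON-style lists.
df = pd.DataFrame({
    "amenities": ['["Wifi", "Heating", "Smoke alarm"]', '["Wifi"]', "[]"]
})

# Parse each string into a real Python list, then count its elements.
df["amenities_list"] = df["amenities"].apply(json.loads)
df["num_amenities"] = df["amenities_list"].apply(len)

print(df["num_amenities"].tolist())
```

Keeping the parsed list in its own column also makes it easy to count individual amenities later.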

We now have a column at the end of the dataset showing the number of amenities provided by each listing.

Dropping Missing Values

Lastly, we need to remove any rows with missing values to avoid errors later on.
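As a small sketch, assuming the missing values are stored as NaN, `dropna` removes every row that has at least one of them. The dataframe here is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where one listing has no rating.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "review_scores_rating": [95.0, np.nan, 88.0],
})

# Drop rows containing any NaN and renumber the index.
clean = df.dropna().reset_index(drop=True)

print(len(df), "->", len(clean))
```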

Exploratory Data Analysis

Now that we have finished with our Data Cleaning and Curation, we can move on to doing Exploratory Data Analysis. During this phase, we try to gain some more insight into the dataset through visualizations and also determine if there are any outliers that we need to account for. Primarily, we are trying to see whether price varies with room type, neighborhood, rating, number of amenities, or number of reviews, since a variable that price varies with may be useful for predicting price.

General Exploration

Plotting the Locations of Rentals

In order to get a better sense of the location of rentals, we produce a map with Marker Clusters using the Folium package. Here is a helpful walkthrough of using Folium to produce map visualizations. We plot the rentals as clusters, where we can zoom in on a specific cluster to see the individual marker locations. By clicking on a marker, a popup will appear that displays the URL to the listing, which can be typed into a browser to go to the listing's page. (Note that it is possible for some of the listings to no longer be active, but most of them should be.)

From this, we can see that the largest clusters are near central Boston (where the word "BOSTON" is labeled on the map), which makes sense as that is likely the most populous area. As we move farther away from central Boston, we see that rentals are more sparse, particularly towards the south which has smaller clusters.

What Amenities are Most Common?

Since we have the data for each rental listing the amenities it provides, we produce a bar plot to display the most common amenities provided to guests across all Boston Airbnbs.

As we can see, the top 10 amenities include Wifi, Heating, Smoke alarm, and Carbon monoxide alarm. The "Essentials" amenities refer to basic items like toilet paper, soap, towels, and linens. Since our data contained around 2000 rentals and the frequencies of all these amenities are over 1000, over half of the rentals in our data provide these amenities!
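The counting behind a bar plot like this can be sketched as follows, assuming each listing's amenities have already been parsed into a Python list. The three listings are made up.

```python
import pandas as pd

# Hypothetical parsed amenities, one list per listing.
amenities = pd.Series([
    ["Wifi", "Heating"],
    ["Wifi", "Essentials"],
    ["Wifi", "Heating", "Kitchen"],
])

# explode() gives one row per (listing, amenity); value_counts() tallies them
# in descending order of frequency.
counts = amenities.explode().value_counts()
top = counts.head(2)

print(top)
```

With matplotlib installed, something like `counts.head(10).plot(kind="bar")` would turn these counts into a bar plot.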

Exploring Price

Now, the rest of our Exploratory Data Analysis will focus on the data in relation to rental prices and how prices may vary based on different variables, which is the focus of our project.

Visualizing the Distribution of Prices

First, we produce a simple box plot of the distribution of rental prices in both years and check for major outliers.

From these box plots, we can clearly see that there are major outliers: some rentals have very high prices, which distorts the appearance of our plots since the scale of the y-axis is so large. There are some outliers with prices of about \$4000 per night, while the majority of the rentals appear to have prices less than \$500 per night.

Excluding Outliers

Because there are such extreme outliers as shown by the above box plots, we will trim our dataset to remove them and reduce the impact they would have on our visualizations and our machine learning. We do not expect the removal of these outliers to be too consequential, as they are luxury properties with very high prices, and we are aiming to help people find cheaper Airbnbs. Furthermore, if we did not remove these outliers, they would skew our visualizations and predictive analysis and make them much less meaningful.

To remove outliers, we use the common method that is based on the Interquartile Range (IQR). The IQR is the difference between the 75th (Q3) and 25th (Q1) percentile. Using this, outliers are defined as values > Q3 + 1.5*IQR and values < Q1 - 1.5*IQR.
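The IQR rule above can be sketched on a small, made-up series of prices. How the rule is applied across the 2019 and 2020 price columns is not shown here; this only illustrates the rule for a single column.

```python
import pandas as pd

# Hypothetical nightly prices with one extreme value.
prices = pd.Series([50, 80, 100, 120, 150, 4000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
trimmed = prices[(prices >= lower) & (prices <= upper)]

print(trimmed.tolist())
```

Here the 4000 value falls above the upper bound and is removed, while the other five prices are kept.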

After removing the outliers, we now have 1692 listings to analyze. Going forward, we will be using this data without outliers for our visualizations and machine learning. We now produce our box plots of rental price again.

Now our box plots look much better because we have removed the extreme outliers. From these plots, it appears that the median rental price in 2020 is lower than in 2019, since it was \$125 in 2019 and \$99 in 2020. Additionally, the range of prices in 2020 is slightly smaller. Both distributions appear to be right skewed, likely due to the occurrence of luxury rentals with high prices that differ from the majority of rentals.

Does Price Vary Depending on the Neighborhood?

Now, we want to see whether price varies across the different neighborhoods of Boston. To do so, we produce a choropleth map, which will color each neighborhood according to the average rental price for rentals in that neighborhood. Here is a helpful guide to producing choropleth maps with Folium. Darker colors correspond to a higher average rental price, while lighter colors mean a lower average rental price. By using a choropleth map, we will easily be able to visualize the average price for each neighborhood in Boston and how the averages compare to one another. Note that hovering over each region will display the name of the neighborhood!
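The per-neighborhood averages that drive the map colors can be sketched with a groupby. The rows and the `price_2019`/`price_2020` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical listings with a price for each year.
df = pd.DataFrame({
    "neighborhood": ["Fenway", "Fenway", "Hyde Park", "Hyde Park"],
    "price_2019": [150, 170, 60, 80],
    "price_2020": [140, 160, 55, 65],
})

# Average price per neighborhood, one row per neighborhood.
avg = (
    df.groupby("neighborhood")[["price_2019", "price_2020"]]
    .mean()
    .reset_index()
)

print(avg)
```

A table shaped like this (a key column plus a value column) is the kind of input that Folium's `Choropleth` layer takes through its `data`, `columns`, and `key_on` arguments, alongside the GeoJSON file.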

In 2019, the most expensive neighborhoods were North End, Downtown, West End, Chinatown, Back Bay, and Fenway. We can see that the more expensive neighborhoods appear to be those near central Boston, with the less expensive districts being those like Hyde Park or Brighton which are closer to the outskirts. Note that the Harbor Islands neighborhood is colored black because there were no rentals with that location. This may be because they are quite small and not as residential as the rest of the city.

We now produce the choropleth map of Airbnb prices in 2020.

In 2020, the Leather District and West End were the most expensive, followed by North End, Chinatown, Fenway, Charlestown, and South Boston Waterfront. Again, the trend of central Boston (meaning the area around Downtown) being more expensive on average remains the same.

In all, from these choropleth maps, we can clearly see that the price of Airbnb rentals does vary by neighborhood, since the neighborhoods are colored differently according to the average price of rentals located there. Overall, the most expensive neighborhoods on average tend to be the ones that are closest to central Boston, with the less expensive neighborhoods being the ones farther away. However, the most expensive neighborhoods on average do vary a bit between 2019 and 2020, suggesting that the distribution of prices may have changed from 2019 to 2020.

Does Price Vary Depending on Room Type?

We now want to explore whether price varies depending on the type of room. To do so, for each type of room, we produce a box plot of the prices for listings that are of that room type.

From this, we can see that the listing price does vary depending on the type of room. The distributions for each room type across 2019 and 2020 appear to be largely the same. Shared rooms appear to be the cheapest and hotel rooms appear to be the most expensive on average, which is what we would expect from intuition. Because shared rooms may have less space or less privacy, this may be why they tend to be cheaper. There are some high outliers for entire home/apartment and private room listings, which is understandable considering there may be luxury homes or private rooms, but it is less likely for there to be something like a luxury room that is shared.

Does Price Vary Based on Other Variables?

Lastly, let's see if the price of a rental varies based on the other variables: the number of reviews, number of amenities, and rating. To do so, we will produce a simple scatterplot of price against each of the variables.

From these plots, we can see that the majority of rentals appear to have fewer than 200 reviews, fewer than 40 amenities, and ratings greater than 80. However, there do not seem to be any clear trends between price and the number of reviews, number of amenities, or rating. For a given number of reviews, number of amenities, or rating, there are rentals at a wide range of prices. Nevertheless, we will include these variables in our predictive analysis to see more closely whether there may actually be a relationship, even if it is slight.

Hypothesis Testing and Machine Learning

Now that we have done our Exploratory Data Analysis, we will try to do some hypothesis testing and machine learning to more concretely answer whether prices have changed significantly since 2019 and how well we can predict prices of Airbnb rentals.

Are the prices in 2020 significantly different from 2019?

Let's determine whether there is a statistically significant difference between prices before and after the start of the pandemic through a paired t-test of the 2019 prices and 2020 prices. In a paired t-test, each subject is measured twice to determine whether the mean difference between the two sets of measurements is 0.

The null hypothesis and alternative hypothesis that we will be testing are as follows:

$ H_{0} $ = The mean difference between the 2019 and 2020 prices is 0.
$ H_{a} $ = The mean difference between the 2019 and 2020 prices is not 0.

If we can reject the null hypothesis from the results of the t-test, then we can say that the prices in 2020 for Airbnb rentals are significantly different from the prices in 2019.
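A paired t-test of this kind can be sketched with SciPy's `ttest_rel`, which is one common way to run it. The five price pairs below are made up and are only meant to show the mechanics, not to reproduce our results.

```python
from scipy import stats

# Hypothetical prices for the same five listings in each year.
price_2019 = [120, 80, 200, 150, 95]
price_2020 = [100, 75, 180, 140, 90]

# ttest_rel pairs the observations by position and tests whether
# the mean of the differences (2019 minus 2020) is zero.
t_stat, p_value = stats.ttest_rel(price_2019, price_2020)

alpha = 0.10  # significance level used in this tutorial
print(t_stat, p_value, p_value < alpha)
```

A positive test statistic here means the 2019 prices were higher on average than the 2020 prices for the same listings.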

The p-value ($ 2.106 \times 10^{-35}$) resulting from the t-test is extremely close to 0. Since this is less than the common significance level of 0.10, we can reject the null hypothesis. Accordingly, we have sufficient evidence to conclude that the Boston Airbnb prices in 2020 are significantly different from the prices in 2019. Additionally, as we can see, the mean price of Airbnb rentals in 2019 was \$131 per night, which is greater than the mean price in 2020 of \$115 per night, suggesting that rental prices in 2020 were cheaper on average.

How Well Can We Predict Prices?

Now we will use machine learning models to try to predict Airbnb rental price based on a variety of variables. We will fit a linear regression model to try to predict prices in 2019 and 2020 based on the variables we have recorded for each rental. For our linear regression models, we will be using Ordinary Least Squares (OLS), which means that the model is fit by minimizing the sum of the squared differences between the actual values and the values predicted by the model. First, we'll compare the 2019 and 2020 models and then look more closely at the 2020 model to discuss the takeaways. Then, we'll see if we can improve our model.

Basic Model

Comparing Regression Models from 2019 and 2020

Using the statsmodels package's built-in function for OLS linear regression models, we will produce a multiple linear regression model that attempts to predict price based on the rental's neighborhood, room type, number of reviews, rating, and number of amenities. After computing these models to predict price in 2019 and 2020, we will output the $R^{2}$ value and the p-value result of the F-test of overall significance, which are computed by statsmodels for us.

The $R^{2}$ value is the percentage of variation in price that can be explained by the predictor variables. In other words, it is a "goodness-of-fit" measure for linear regression models, meaning that it indicates the strength of our linear model in predicting price, with higher values indicating a stronger relationship.

The F-test of overall significance tests whether our model using the independent variables of neighborhood, room type, number of reviews, rating, and number of amenities, is significantly better than a model that does not use any independent variables.
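The sketch below shows how a model like this could be fit with the statsmodels formula interface. The data is randomly generated (so the numbers it prints mean nothing about Boston), and the column names `num_amenities`, `rating`, and `price_2020` are assumed for illustration. Wrapping a column in `C(...)` tells the formula to treat it as categorical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data, not real listings.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "neighborhood": rng.choice(["Fenway", "Hyde Park", "Downtown"], size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
    "number_of_reviews": rng.integers(0, 200, size=n),
    "rating": rng.integers(80, 101, size=n),
    "num_amenities": rng.integers(5, 40, size=n),
})
# Made-up price: private rooms are cheaper, amenities add a little, plus noise.
df["price_2020"] = (
    150
    - 60 * (df["room_type"] == "Private room")
    + df["num_amenities"]
    + rng.normal(0, 20, size=n)
)

formula = (
    "price_2020 ~ C(neighborhood) + C(room_type) "
    "+ number_of_reviews + rating + num_amenities"
)
model = smf.ols(formula, data=df).fit()

# R-squared and the p-value of the F-test of overall significance.
print(model.rsquared, model.f_pvalue)
```

Calling `model.summary()` on the fitted result prints the full table of coefficients and their p-values.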

The $R^{2}$ values for our 2019 and 2020 linear regression models indicate that the models predict the price moderately well. More notably, the R-squared value for the 2019 model (0.539) is greater than the value for the 2020 model (0.385), which indicates that the linear regression model for predicting prices in 2019 performs better than the model for predicting prices in 2020. We will discuss the possible implications of this further in the Conclusion section.

The p-value resulting from the F-test for both models is extremely close to 0, which provides further support that our models predicting price based on neighborhood, room type, number of reviews, rating, and number of amenities are significantly better than a model that does not predict based on any independent variables. In other words, using these independent variables significantly improved the models' ability to predict prices, which makes sense considering we saw above in our EDA phase that price appears to vary with neighborhood and room type.

Takeaways from 2020 Model for Predicting Prices

Now, we will specifically look at the full summary for the linear regression model predicting prices in 2020 to see if we can gain any insight into how to get a cheaper Airbnb in 2020. We will note whether there are any coefficients that have high p-values.

Some coefficients for the predictors may not be statistically significant, even though we have found that our model has significance as a whole. Here, the p-value tests whether a predictor's corresponding coefficient is different from zero. A low p-value means that the coefficient is significantly different from zero, and a high p-value means it is not significantly different from zero. In other words, a predictor that has a low p-value is likely to be a meaningful addition to the regression model because changes in the predictor's value are related to changes in the response variable, while predictors with high p-values are likely not significant.

The p-values for the coefficients are listed under the P>|t| column. From this output, we can see that there are some coefficients with p-values that are greater than the common significance level of 0.10. Some of the predictors with high p-values greater than the typical significance level of .10 include Brighton (p = .852), Longwood Medical Area (p = .786), and West Roxbury (p = .589), amongst a few others. This means that due to lack of evidence, the coefficients for these specific predictors are not very meaningful to the regression and are not statistically significant predictors for price.

Nevertheless, the majority of predictors have statistically significant p-values that are less than our significance level. In addition, the model yielded p-values of 0.000 for rating and number of amenities. This means that despite their small effect on price in the model, these predictors are still meaningful and correlate with increases in price.

Accordingly, we now filter to only view the predictors with coefficients that are statistically significant, meaning that they have p-values that are less than the common significance level of 0.10. Then, we can see what are the meaningful takeaways from our model.
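Filtering coefficients by p-value can be sketched using the `params` and `pvalues` attributes of a fitted statsmodels result. The data is synthetic, and `noise_feature` is a hypothetical predictor that is unrelated to price by construction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: price depends on rating but not on noise_feature.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "rating": rng.integers(60, 101, size=n),
    "noise_feature": rng.normal(size=n),
})
df["price"] = 20 + 1.0 * df["rating"] + rng.normal(0, 10, size=n)

model = smf.ols("price ~ rating + noise_feature", data=df).fit()

# Keep only coefficients whose p-value is below the 0.10 significance level,
# sorted from most negative to most positive.
alpha = 0.10
significant = model.params[model.pvalues < alpha].sort_values()

print(significant)
```

Sorting the surviving coefficients makes it easy to read off which predictors lower the predicted price the most and which raise it the most.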

From this list of coefficients, we can see the variables associated with negative coefficients that decrease the predicted price of the rental and the variables associated with positive coefficients that increase the predicted price.

When looking at the type of room, we see that shared rooms are least expensive followed by private rooms, since they have the most negative coefficients, which makes sense based on the box plots we did showing how price varies depending on room type. On the other hand, hotel rooms are most expensive, which is also reasonable based on the box plots and intuition.

When looking at the neighborhood, South Boston Waterfront, West End, and Fenway appear to be the three most expensive, while Hyde Park, Dorchester, and Downtown appear to be the three least expensive neighborhoods in predicting price. Again, these are the neighborhoods whose coefficients were statistically significant in our model, so even though Mattapan has a lower coefficient, there was not enough evidence to say it was significant. By focusing on the predictors with significant coefficients, we can more confidently say how they correlate with price.

The coefficient for the number of reviews is quite close to 0, indicating that it does not really have that much of an effect on the price. The coefficient for the rating variable is close to 1, meaning that a rental with a perfect rating is expected to be approximately \$100 more expensive to rent than a rental with a 0 rating. Of course, this is a very extreme example and the coefficient is relatively small, so rating does not have too much of an impact either. Lastly, the coefficient for number of amenities is also close to 1, meaning that for each amenity provided, the predicted rental price increases by about \$1. Overall, these three predictors do not have too much of a large impact on price since the coefficients are relatively small, but they are still things to perhaps take note of.

Can We Improve With an Interaction Model?

Since the previous linear models only fit moderately well, let's investigate whether a different implementation of linear regression will fit the data better. To do this, we will include interaction terms between the predictors and see if this improves the results.

The previous linear regression model analyzed each predictor separately, without considering the interactions between variables. In this new model, there will be new predictors based on all possible combinations of the variables together instead of just each one individually. This means that the new model will now predict price based on many more factors.
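To illustrate what an interaction term adds, the sketch below compares an additive model with an interaction model on synthetic data using only two predictors. In a statsmodels formula, `a * b` expands to `a + b + a:b`, where `a:b` is the interaction term. The data and column names are made up, and the full model with all five predictors would have many more terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data where the effect of amenities depends on room type.
rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
    "num_amenities": rng.integers(5, 40, size=n),
})
is_private = (df["room_type"] == "Private room").astype(int)
df["price"] = (
    100
    + 3.0 * df["num_amenities"] * (1 - is_private)
    + 0.5 * df["num_amenities"] * is_private
    + rng.normal(0, 10, size=n)
)

# Additive model: each predictor contributes separately.
basic = smf.ols("price ~ C(room_type) + num_amenities", data=df).fit()
# Interaction model: '*' adds the room_type-by-amenities interaction term.
interaction = smf.ols("price ~ C(room_type) * num_amenities", data=df).fit()

print(len(basic.params), basic.rsquared)
print(len(interaction.params), interaction.rsquared)
```

Because the additive model is nested inside the interaction model, the interaction model's $R^{2}$ cannot be lower on the same data; the question is whether the gain is worth the extra terms.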

The $R^{2}$ values for both the 2019 and 2020 models increase significantly. This shows that this new interaction model accounts for more of the variability in price than the previous basic model did. In general this means that this model is significantly more accurate for predicting price in both years, especially for 2020.

The F-test p-values are also extremely close to zero again, meaning that this model is statistically significant in predicting price based on the interactions between neighborhood, room type, number of reviews, rating, and number of amenities.

In all, the interaction model accounted for more of the variability in price compared to the basic independent linear regression model. This might be due to the fact that for each variable separately, many of the rental prices are spread across large ranges. For example, many people will give their Airbnb a high rating if they are satisfied with their stay, but this will not always indicate a high price. However, if an Airbnb is a private room in a central neighborhood like West End, has a high rating, and offers many amenities, then it is much easier to predict that it will have a high price. In summary, combining factors predicts price better than looking at each factor individually. However, even though the interaction model does have a stronger goodness-of-fit measure, there are so many terms (800 total) from all of the combinations that it is practically impossible to gain any meaningful interpretation from it. Thus, even though the 'basic' model may not be as strong in comparison, it is actually more useful for drawing conclusions.

Conclusion

In conclusion, we found that prices for Boston Airbnb rentals were cheaper on average in 2020 compared to 2019. We also found that we can predict the prices of Boston Airbnb rentals moderately well with multiple linear regression models based on a rental's neighborhood, room type, number of reviews, number of amenities, and rating. The model for predicting prices in 2019 was better than the model for predicting prices in 2020. This makes sense considering that with the pandemic, 2020 is a year full of uncertainty and variability that we could not account for in our model. This decreased predictive power in 2020 and overall decrease in prices may be due to a variety of reasons. For example, people on average may be lowering rental prices to try to attract more guests to account for reduced travel, but there may also be some hosts who are trying to raise prices to account for greater cleaning costs or lack of revenue.

In the future, more work could be done in trying to improve the model by trying other regression methods besides linear regression or taking into account some of the other variables in the original dataset like rental availability, but it is of course not going to be easy or necessarily even possible to predict Airbnb prices perfectly! It may also be useful to incorporate data from other months of the year so that there are more data points.

Anyhow, if you are looking to stay in an Airbnb in Boston during this time and looking for a cheaper option, we recommend looking for a shared or private room located in Hyde Park. If you want to be located closer to the center of Boston, then Downtown is probably your best bet! Avoid looking for rentals in South Boston Waterfront or West End which tend to be significantly more expensive. While other variables like the number of reviews, rating, and number of amenities do not have too much of a significant impact on price, expect a higher rated rental with lots of amenities to be more expensive.