Inferential Analysis of Real Property Sales in King County in 2019

Business Understanding

“Expert judgments” by Real Estate professionals are sometimes found to be out-of-date, irrelevant to our specific housing market, or even inaccurate. The purpose of this project is to verify and extend on the below claims:

Data Understanding

The repository for this project can be found here. I have been provided with King County House Sales dataset which included 3 tables:

and 2 reference tables:

My understanding of the dataset can be summarized as follows:

Data Preparation

Upon investigating various entries that share the same PIN (PIN = Major + Minor) in the ResBldg dataset, I noticed that their values for the same attributes are usually the same and/or within the similar range.

Utilizing other resources such as Google Maps, as well as Real Estate websites to look up the addresses of these properties, I suspect that it’s more likely these houses/apartments are part of a complex with very similar exterior as well as interior structure. Therefore, I decided to compact entries with duplicated PIN by keeping only 1 row which retains the modes for each attributes. That way, PIN would become unique, and merging the 2 tables RPSale and ResBldg on PIN will not create a large amount of extra sales transactions that did not really happen.

In addition, since the focus of the project is on 2019, I decided to zoom in and only look at transactions that were documented within 2019 and for the specific "Household, single family units" property type since it takes up the majority of the dataset.

I also removed columns that only have 1 unique value because if there is no diversity in a feature’s values in the dataset, that feature would not be taken into consideration during the modeling process.

Besides, table data from the King_County_Home_Sales_Data_Dictionary.pdf has been converted from .pdf to a pandas DataFrame using the tabula-py package for easy mapping and converting numerical values in the master dataset to their text explanation.

The master DataFrame that was used for analysis throughout this project contains 18,320 entries and 60 columns.

Data Engineering

Note: due to my lack of knowledge in Real Estate, my interpretation of some attributes might not be 100% accurate.

Exploratory Data Analysis (EDA)

Looking at SalePrice grouped by Geographical Region based on DistrictName(Reference: https://www.communitiescount.org/king-county-geographies) East, South, Seattle, and North, we can see that different regions do have different SalePrice distributions:

This observation has inspired me to proceed forward with analyzing these 4 regions separately and identifying specific house features that have more impact on SalePrice in some areas than in others.

Modeling

The final version of my model includes 4 sub-models, each of which represents SalePrice as a function of different features specific to each region. Features were selected using stepwise regression by prioritizing feature with statistically significant p-values.

East

The East model includes a total of 31 predictors using 6,459 observations:

Feature                   Coefficient

BldgGrade_12Luxury 7.334610e+05
BldgGrade_11Excellent 6.276256e+05
BldgGrade_10VeryGood 5.139995e+05
Intercept 3.499718e+05
BldgGrade_9Better 3.113092e+05
been_Renovated_1 1.864937e+05
Bath3qtrCount_2 1.596372e+05
BathFullCount_4 1.464185e+05
HeatSystem_HotWater 1.420634e+05
BldgGrade_8Good 1.118976e+05
FpAdditional_2 1.114276e+05
BathFullCount_3 7.285100e+04
Condition_VeryGood 7.128985e+04
Bath3qtrCount_1 7.031951e+04
Bedrooms_2 5.722005e+04
TrafficNoise_HIGH 5.650217e+04
Condition_Good 5.001620e+04
BathHalfCount_1 3.568398e+04
has_OpenPorch_1 2.196033e+04
Yrs_BeforeSold 1.312286e+03
SqFtTotLiving 1.728627e+02
Bedrooms_4 -2.992952e+04
has_TotBasement_1 -3.058062e+04
HeatSystem_ForcedAir -3.324645e+04
BathFullCount_1 -3.751736e+04
HeatSource_Electricity -5.228927e+04
Stories_15 -5.700519e+04
has_Garage_1 -5.707595e+04
Stories_20 -6.806543e+04
BldgGrade_6LowAverage -9.348136e+04
SewerSystem_PRIVATE -1.798429e+05
FpMultiStory_4 -1.302015e+06

South

The South model includes a total of 24 predictors using 5,054 observations:

Feature                  CoefficientBldgGrade_12Luxury       721702.828696
BldgGrade_11Excellent 578963.617759
Bath3qtrCount_4 375935.302588
FpSingleStory_5 375935.302588
BldgGrade_10VeryGood 289932.348898
Stories_30 240020.793238
Intercept 197613.449608
BldgGrade_9Better 162038.281368
Bedrooms_1 138353.756983
been_Renovated_1 88519.986429
BldgGrade_8Good 78985.823195
HeatSystem_HeatPump 72183.967173
HeatSystem_HotWater 52957.971219
HeatSource_Oil 47855.282018
Bath3qtrCount_2 43626.564850
BldgGrade_7Average 29128.533501
Condition_VeryGood 22309.739249
has_OpenPorch_1 13083.558687
Condition_Good 12502.010377
has_Deck_1 11938.616296
SqFtTotLiving 93.877156
FpMultiStory_1 -20053.434843
Bedrooms_6 -48971.410646
BldgGrade_4Low -168706.096122

Seattle

The Seattle model includes a total of 29 predictors with 5,325 observations:

Feature                   CoefficientBldgGrade_12Luxury        1.323574e+06
BldgGrade_11Excellent 1.010147e+06
BathFullCount_5 9.647607e+05
BldgGrade_10VeryGood 7.799366e+05
BldgGrade_9Better 4.964302e+05
BldgGrade_8Good 2.509032e+05
Bedrooms_1 1.563462e+05
Intercept 9.998694e+04
Stories_25 9.501144e+04
BldgGrade_7Average 9.153286e+04
BathHalfCount_2 8.540038e+04
Condition_VeryGood 6.391794e+04
HeatSystem_HeatPump 5.450907e+04
BathFullCount_3 4.423805e+04
FpMultiStory_1 4.390513e+04
Condition_Good 3.832264e+04
has_OpenPorch_1 2.450902e+04
Stories_20 2.257831e+04
has_Deck_1 2.177294e+04
Yrs_BeforeSold 2.139469e+03
SqFtTotLiving 2.148742e+02
HeatSource_Electricity -3.606570e+04
DaylightBasement_Y -3.789555e+04
has_EnclosedPorch_1 -3.856235e+04
TrafficNoise_HIGH -4.452642e+04
FpAdditional_1 -6.583044e+04
TrafficNoise_EXTREME -8.227120e+04
Bedrooms_6 -1.153006e+05
Bedrooms_9 -6.830870e+05

North

The North model includes a total of 12 predictors and significantly less observations (1,177) compared to the other 3 regions:

Feature                  Coefficient
BldgGrade_11Excellent 529635.931261
BldgGrade_10VeryGood 362328.130749
Intercept 301768.878172
BldgGrade_9Better 208055.561190
has_EnclosedPorch_1 138693.658964
been_Renovated_1 131178.994186
BldgGrade_8Good 82155.893172
Condition_VeryGood 77249.593463
HeatSource_Oil 52164.369115
SqFtTotLiving 153.463189
TrafficNoise_MODERATE -60275.450322
TrafficNoise_HIGH -65655.183354

Regionalized model results combined with above EDA visualizations have confirmed that these 4 different regions have different Sale Price distributions with different dominant features. Therefore, homeowners in different regions will most likely need to invest in different types of home improvement projects in order to increase their properties’ values.

Evaluation

Although Adjusted R-Squared values for these 4 models are not as high as some of the other fitted models that include all 4 regions, these sub-models have less predictors, and actually satisfied some of the Linear Regression Assumptions. Although none of these models meets all 4 assumptions, 2 of them (East and Seattle have met the Linearity Assumption, and North has met the Multicollinearity Assumption), which can be considered major improvement compared to the previously fitted 17 models that I have created (If this tickles your fancy, please refer to notebooks in the repository for more details on the other models).

Findings

Due to the above models not having high Adjusted R-Squared values and not meeting Linear Regression Assumptions, I have decided to combine my observations from the Inferential Regression models with Hypothesis Testing to examine inferential capabilities of these models.

Renovation

                         been_Renovated_1          
coefficient

East 186493.650726
South 88519.986429
Seattle
North 131178.994186

Coefficients for been_Renovated_1 (indicating that a property has been renovated at least once) are all positive and was selected as one of the statistically significant features in 3 out of 4 regions (East, South, and North). It can be interpreted from these coefficients that the fact that a property's renovation has been documented could potentially increase its value by an amount between USD 88,000 to USD 186,000 depending on the location region.

Performing a Welch’s t-test on all 4 regions with:

- Null Hypothesis (H_0): There is no difference in average Sale Price between houses that have been renovated and those that have not.- Alternative Hypothesis (H_A): The average Sale Price of houses that have been renovated is greater than the average Sale Price of those that have not.

returns a p-value of 3.3466331975215857e-25. Using a significance level alpha of 0.05, this p-value is small enough (< alpha) to reject the Null Hypothesis, meaning there’s enough evidence to conclude that investing in renovation and/or home improvement projects could increase market value of a property.

Porch (Enclosed and Open)

                has_OpenPorch_1          has_EnclosedPorch_1
coefficient coefficient

East 21960.332911
South 13083.558687
Seattle 24509.02431 -38562.34762
North 138693.658964

Overall, Open Porch seems to be a better choice for home improvement project in King County. Depending on the property’s location, especially in the East, South, and Seattle area, estimated increase in value by adding an Open Porch does vary between USD 13,083 to USD 24,509. However, if the property is in the North region, it might be more beneficial to invest in an Enclosed Porch, which is estimated to increase the Sale Price by USD 138,693 there (this estimate seems to be on the higher side).

Welch’s t-test results do confirm above findings about Porch.

Heat System

Looking at the coefficients for HeatSystem feature, we can identify some of the more financially beneficial for homeowners in based on their location regions are as follows:

Heat System data visualizations have confirmed the inferential findings:

Deck

                has_Deck_1
coefficient

East
South 11938.616296
Seattle 21772.935432
North

Based on the coefficients for has_Deck_1 in the 4 models, including a deck seems to help increase the value of a property in the South and Seattle regions by approximately USD 11,938 and USD 21,772 respectively. There’s not enough evidence from my Linear Regression models to make any conclusion on the impact of a deck on properties in the other 2 regions (East and North).

Welch’s t-test and visualizations both supported the above findings about potential sale increase by including a Deck for properties in South and Seattle. In addition, t-test p-values have indeed confirmed that this should be applicable in the other 2 regions of King County (East and North) as well.

Bedrooms

Bedrooms_1    143115.425173
Bedrooms_3 27458.108473
Bedrooms_2 27408.286476
Bedrooms_6 -68800.299060
Bedrooms_9 -275755.107659

The overall trend that I can notice from these coefficients is that the sale value of a property only increase as we increase number of bedrooms a certain amount (likely in the 4–5 range), and then it starts going down after. This could also partly be due to the lack of entries for the higher amount of bedrooms as compared to lower ones.

Conclusion

Tailoring models based on geographic regions seems to have improved the overall quality and inferential capabilities of my models. Some findings that could be beneficial for home-owners in King County to potentially increase the market value of their properties can be derived from this inferential analysis as follows:

Porch:

Heat System:

Deck:

Bedrooms:

Sale value of a property only increase as we increase number of bedrooms to a certain amount (likely in the 4–5 count range), and then it starts going down after

Next Steps

Overall inferential capabilities of my current models are only capturing about 65% of the impacts of certain attributes on SalePrice.

Data Science Student at Flatiron

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store