Inferential Analysis of Real Property Sales in King County in 2019

Chi Bui
12 min readMar 26, 2021

Business Understanding

“Expert judgments” by Real Estate professionals are sometimes found to be out-of-date, irrelevant to our specific housing market, or even inaccurate. The purpose of this project is to verify and extend on the below claims:

  1. Enclosing a porch will increase the sale price of a home
  2. Converting a garage to a bedroom is a good way to increase the sale price of a home
  3. Upgrading to a forced-air heating system will increase the sale price of a home

Data Understanding

The repository for this project can be found here. I have been provided with King County House Sales dataset which included 3 tables:

  1. Real Property Sales (EXTR_RPSale.csv) (located in the data/raw directory)
  2. Residential Building (EXTR_ResBldg.csv) (located in the data/raw directory)
  3. Parcel (EXTR_Parcel.csv) (located in the data/raw directory)

and 2 reference tables:

  1. Look Up (EXTR_LookUp.csv) (located in the data/raw directory)
  2. King_County_Home_Sales_Data_Dictionary.pdf (located in the references directory)

My understanding of the dataset can be summarized as follows:

  • Parcel table has 81 columns and 205,199 entries, each of which is a registered extended piece of land with a unique identification number (PIN). Each lot's PIN can be obtained by concatenating that parcel's Major and Minor columns together. The information Parcel contains is regarding each parcel/lot's position (Plat, Range, Township, etc.), its Area in square footage, its Water & Sewer System, Street Surface, any Restrictions it might have, different Views that can be observed from there, whether there's any type of Noises, or any type of Problems reported about that specific lot, etc.
  • Residential Building (or ResBldg) table contains 181,510 entries and 50 columns. Each entry is a present/previously registered Residential Property, as well as other information regarding that property, such as Address, number of Stories/Bathrooms/Bedrooms/Fireplaces, the Square Footage of each floor and Basement, Heat System & Heat Source, the Year it was Built and/or Renovated, its Condition, Grade, as well as any Additional Costs that might have been associated with it. Each property also comes with Major and Minor codes, which indicates the parcel that property is on. Please note that there could be multiple properties on a parcel, which means that there could be multiple entries sharing the same Major and Minor.
  • Real Property Sales (or RPSale) contains 351,067 entries and 24 columns. Each entries is a sales transaction documented. Besides the Sale Price, it also has other information about the sale, such as the Date it was documented, Buyer/Seller, Plat, Sale Reason, etc. The ID for this table can be obtained by concatenating the ExciseTaxNbr with Major and Minor to create a string of unique alphanumerical characters. Similar to ResBldg, there can be multiple sales transactions on a parcel, and multiple entries sharing the same PIN (Major + Minor).
  • Look Up table explains the meaning of numerical codes values in the above tables. For example, Street Surface (StreetSurface) attribute in Parcel is a number with lookup code 60, which indicates that 1 means PAVED, 2 means GRAVEL, 3 means DIRT, and 4 means UNDEVELOPED.
  • King_County_Home_Sales_Data_Dictionary.pdf file helps matching each attribute to its Look Up code, as well as any other further clarifications on how the data for that attributed is supposed to be entered.

Data Preparation

Upon investigating various entries that share the same PIN (PIN = Major + Minor) in the ResBldg dataset, I noticed that their values for the same attributes are usually the same and/or within the similar range.

Utilizing other resources such as Google Maps, as well as Real Estate websites to look up the addresses of these properties, I suspect that it’s more likely these houses/apartments are part of a complex with very similar exterior as well as interior structure. Therefore, I decided to compact entries with duplicated PIN by keeping only 1 row which retains the modes for each attributes. That way, PIN would become unique, and merging the 2 tables RPSale and ResBldg on PIN will not create a large amount of extra sales transactions that did not really happen.

In addition, since the focus of the project is on 2019, I decided to zoom in and only look at transactions that were documented within 2019 and for the specific "Household, single family units" property type since it takes up the majority of the dataset.

I also removed columns that only have 1 unique value because if there is no diversity in a feature’s values in the dataset, that feature would not be taken into consideration during the modeling process.

Besides, table data from the King_County_Home_Sales_Data_Dictionary.pdf has been converted from .pdf to a pandas DataFrame using the tabula-py package for easy mapping and converting numerical values in the master dataset to their text explanation.

The master DataFrame that was used for analysis throughout this project contains 18,320 entries and 60 columns.

Data Engineering

  • Among 60 columns of the master dataset, 14 are for the square footage of various parts of the house. In order to potentially avoid some of the problems with Multicollinearity later on, I decided to drop some columns such as SqFt1stFloor, SqFtHalfFloor, SqFt2ndFloor, SqFtUpperFloor, SqFtUnfinFull, SqFtUnfinHalf, as well as SqFtFinBasement. Some columns such as SqFtGarageBasement, SqFtGarageAttached, SqFtOpenPorch, SqFtEnclosedPorch, and SqFtDeck were converted to binary has_Garage, has_OpenPorch, has_EnclosedPorch, and has_Deck to purely record of whether a property has these features.
  • Two columns ViewUtilization, DaylightBasement include empty string values ' ', which actually takes up a good portion of the dataset. I made the assumption that empty string means 'No', and convert them to 'N' instead.
  • I also added Yrs_BeforeSold which stores the amount of time between the year the properties were built and the year their sales transactions were documented (2019).
  • Instead of keeping Yr_Renovated, I converted its data into binary been_Renovated with values 1/0 indicating whether a property has been renovated or not.

Note: due to my lack of knowledge in Real Estate, my interpretation of some attributes might not be 100% accurate.

Exploratory Data Analysis (EDA)

Looking at SalePrice grouped by Geographical Region based on DistrictName(Reference: https://www.communitiescount.org/king-county-geographies) East, South, Seattle, and North, we can see that different regions do have different SalePrice distributions:

  • East, Seattle, and North are overlapping each other more than South.
  • South’s SalePrice seems to have a smaller mean, smaller standard deviation, yet at higher density compared to other regions.

This observation has inspired me to proceed forward with analyzing these 4 regions separately and identifying specific house features that have more impact on SalePrice in some areas than in others.

Modeling

The final version of my model includes 4 sub-models, each of which represents SalePrice as a function of different features specific to each region. Features were selected using stepwise regression by prioritizing feature with statistically significant p-values.

  • Continuous columns included are: 'SalePrice', 'SqFtTotLiving', 'Yrs_BeforeSold'
  • Categorical columns identified are: 'Stories', 'BldgGrade', 'DaylightBasement', 'HeatSystem', 'HeatSource', 'ViewUtilization', 'FpSingleStory', 'FpMultiStory', 'FpFreestanding', 'FpAdditional', 'Condition', 'WaterSystem', 'SewerSystem', 'TrafficNoise', 'WaterProblems', 'been_Renovated', 'has_Garage', 'has_TotBasement', 'has_OpenPorch', 'has_EnclosedPorch', 'has_Deck', 'Bedrooms', 'BathHalfCount', 'Bath3qtrCount', 'BathFullCount' (Categorical features would then be converted to dummies using sklearn's OneHotEncoder()).

East

The East model includes a total of 31 predictors using 6,459 observations:

Feature                   Coefficient

BldgGrade_12Luxury 7.334610e+05
BldgGrade_11Excellent 6.276256e+05
BldgGrade_10VeryGood 5.139995e+05
Intercept 3.499718e+05
BldgGrade_9Better 3.113092e+05
been_Renovated_1 1.864937e+05
Bath3qtrCount_2 1.596372e+05
BathFullCount_4 1.464185e+05
HeatSystem_HotWater 1.420634e+05
BldgGrade_8Good 1.118976e+05
FpAdditional_2 1.114276e+05
BathFullCount_3 7.285100e+04
Condition_VeryGood 7.128985e+04
Bath3qtrCount_1 7.031951e+04
Bedrooms_2 5.722005e+04
TrafficNoise_HIGH 5.650217e+04
Condition_Good 5.001620e+04
BathHalfCount_1 3.568398e+04
has_OpenPorch_1 2.196033e+04
Yrs_BeforeSold 1.312286e+03
SqFtTotLiving 1.728627e+02
Bedrooms_4 -2.992952e+04
has_TotBasement_1 -3.058062e+04
HeatSystem_ForcedAir -3.324645e+04
BathFullCount_1 -3.751736e+04
HeatSource_Electricity -5.228927e+04
Stories_15 -5.700519e+04
has_Garage_1 -5.707595e+04
Stories_20 -6.806543e+04
BldgGrade_6LowAverage -9.348136e+04
SewerSystem_PRIVATE -1.798429e+05
FpMultiStory_4 -1.302015e+06

South

The South model includes a total of 24 predictors using 5,054 observations:

Feature                  CoefficientBldgGrade_12Luxury       721702.828696
BldgGrade_11Excellent 578963.617759
Bath3qtrCount_4 375935.302588
FpSingleStory_5 375935.302588
BldgGrade_10VeryGood 289932.348898
Stories_30 240020.793238
Intercept 197613.449608
BldgGrade_9Better 162038.281368
Bedrooms_1 138353.756983
been_Renovated_1 88519.986429
BldgGrade_8Good 78985.823195
HeatSystem_HeatPump 72183.967173
HeatSystem_HotWater 52957.971219
HeatSource_Oil 47855.282018
Bath3qtrCount_2 43626.564850
BldgGrade_7Average 29128.533501
Condition_VeryGood 22309.739249
has_OpenPorch_1 13083.558687
Condition_Good 12502.010377
has_Deck_1 11938.616296
SqFtTotLiving 93.877156
FpMultiStory_1 -20053.434843
Bedrooms_6 -48971.410646
BldgGrade_4Low -168706.096122

Seattle

The Seattle model includes a total of 29 predictors with 5,325 observations:

Feature                   CoefficientBldgGrade_12Luxury        1.323574e+06
BldgGrade_11Excellent 1.010147e+06
BathFullCount_5 9.647607e+05
BldgGrade_10VeryGood 7.799366e+05
BldgGrade_9Better 4.964302e+05
BldgGrade_8Good 2.509032e+05
Bedrooms_1 1.563462e+05
Intercept 9.998694e+04
Stories_25 9.501144e+04
BldgGrade_7Average 9.153286e+04
BathHalfCount_2 8.540038e+04
Condition_VeryGood 6.391794e+04
HeatSystem_HeatPump 5.450907e+04
BathFullCount_3 4.423805e+04
FpMultiStory_1 4.390513e+04
Condition_Good 3.832264e+04
has_OpenPorch_1 2.450902e+04
Stories_20 2.257831e+04
has_Deck_1 2.177294e+04
Yrs_BeforeSold 2.139469e+03
SqFtTotLiving 2.148742e+02
HeatSource_Electricity -3.606570e+04
DaylightBasement_Y -3.789555e+04
has_EnclosedPorch_1 -3.856235e+04
TrafficNoise_HIGH -4.452642e+04
FpAdditional_1 -6.583044e+04
TrafficNoise_EXTREME -8.227120e+04
Bedrooms_6 -1.153006e+05
Bedrooms_9 -6.830870e+05

North

The North model includes a total of 12 predictors and significantly less observations (1,177) compared to the other 3 regions:

Feature                  Coefficient
BldgGrade_11Excellent 529635.931261
BldgGrade_10VeryGood 362328.130749
Intercept 301768.878172
BldgGrade_9Better 208055.561190
has_EnclosedPorch_1 138693.658964
been_Renovated_1 131178.994186
BldgGrade_8Good 82155.893172
Condition_VeryGood 77249.593463
HeatSource_Oil 52164.369115
SqFtTotLiving 153.463189
TrafficNoise_MODERATE -60275.450322
TrafficNoise_HIGH -65655.183354

Regionalized model results combined with above EDA visualizations have confirmed that these 4 different regions have different Sale Price distributions with different dominant features. Therefore, homeowners in different regions will most likely need to invest in different types of home improvement projects in order to increase their properties’ values.

Evaluation

Although Adjusted R-Squared values for these 4 models are not as high as some of the other fitted models that include all 4 regions, these sub-models have less predictors, and actually satisfied some of the Linear Regression Assumptions. Although none of these models meets all 4 assumptions, 2 of them (East and Seattle have met the Linearity Assumption, and North has met the Multicollinearity Assumption), which can be considered major improvement compared to the previously fitted 17 models that I have created (If this tickles your fancy, please refer to notebooks in the repository for more details on the other models).

Findings

Due to the above models not having high Adjusted R-Squared values and not meeting Linear Regression Assumptions, I have decided to combine my observations from the Inferential Regression models with Hypothesis Testing to examine inferential capabilities of these models.

Renovation

                         been_Renovated_1          
coefficient

East 186493.650726
South 88519.986429
Seattle
North 131178.994186

Coefficients for been_Renovated_1 (indicating that a property has been renovated at least once) are all positive and was selected as one of the statistically significant features in 3 out of 4 regions (East, South, and North). It can be interpreted from these coefficients that the fact that a property's renovation has been documented could potentially increase its value by an amount between USD 88,000 to USD 186,000 depending on the location region.

Performing a Welch’s t-test on all 4 regions with:

- Null Hypothesis (H_0): There is no difference in average Sale Price between houses that have been renovated and those that have not.- Alternative Hypothesis (H_A): The average Sale Price of houses that have been renovated is greater than the average Sale Price of those that have not.

returns a p-value of 3.3466331975215857e-25. Using a significance level alpha of 0.05, this p-value is small enough (< alpha) to reject the Null Hypothesis, meaning there’s enough evidence to conclude that investing in renovation and/or home improvement projects could increase market value of a property.

Porch (Enclosed and Open)

                has_OpenPorch_1          has_EnclosedPorch_1
coefficient coefficient

East 21960.332911
South 13083.558687
Seattle 24509.02431 -38562.34762
North 138693.658964

Overall, Open Porch seems to be a better choice for home improvement project in King County. Depending on the property’s location, especially in the East, South, and Seattle area, estimated increase in value by adding an Open Porch does vary between USD 13,083 to USD 24,509. However, if the property is in the North region, it might be more beneficial to invest in an Enclosed Porch, which is estimated to increase the Sale Price by USD 138,693 there (this estimate seems to be on the higher side).

Welch’s t-test results do confirm above findings about Porch.

Heat System

Looking at the coefficients for HeatSystem feature, we can identify some of the more financially beneficial for homeowners in based on their location regions are as follows:

  • Hot Water seems to be the best option in the East region
  • Heat Pump and Hot Water seem to be the better choices for properties in the South and Seattle regions
  • No specific type of Heat System was selected as predictors in the North

Heat System data visualizations have confirmed the inferential findings:

  • Forced Air may be the most popular choice for homeowners in King County (overall denstity of ForcedAir is higher than that of other systems), but
  • Hot Water seems to be the best option in the East region
  • Heat Pump and Hot Water seem to be the top choices for properties in the South and Seattle regions
  • Heat Pump or Hot Water would probably add more value to a property in the North region as well (although this is derived from visualization only)

Deck

                has_Deck_1
coefficient

East
South 11938.616296
Seattle 21772.935432
North

Based on the coefficients for has_Deck_1 in the 4 models, including a deck seems to help increase the value of a property in the South and Seattle regions by approximately USD 11,938 and USD 21,772 respectively. There’s not enough evidence from my Linear Regression models to make any conclusion on the impact of a deck on properties in the other 2 regions (East and North).

Welch’s t-test and visualizations both supported the above findings about potential sale increase by including a Deck for properties in South and Seattle. In addition, t-test p-values have indeed confirmed that this should be applicable in the other 2 regions of King County (East and North) as well.

Bedrooms

Bedrooms_1    143115.425173
Bedrooms_3 27458.108473
Bedrooms_2 27408.286476
Bedrooms_6 -68800.299060
Bedrooms_9 -275755.107659

The overall trend that I can notice from these coefficients is that the sale value of a property only increase as we increase number of bedrooms a certain amount (likely in the 4–5 range), and then it starts going down after. This could also partly be due to the lack of entries for the higher amount of bedrooms as compared to lower ones.

Conclusion

Tailoring models based on geographic regions seems to have improved the overall quality and inferential capabilities of my models. Some findings that could be beneficial for home-owners in King County to potentially increase the market value of their properties can be derived from this inferential analysis as follows:

  • First and foremost, it is worth investing in a Renovation and/or Home Improvement Projects, as houses that have been documented as “been renovated” have a significantly higher sales price than those that have not. This is highly recommended especially for properties in the Eastern, Southern and Northern part of King County, where investing in a renovation can increase sale price by USD 88,000 to USD 186,000.

Porch:

  • Home-owners in the East, South, and Seattle areas should invest in Open Porch: Estimated increase in value by adding an Open Porch does varies between USD 13,000 to USD 24,000 depending on the specific area.
  • However, for home-owners in the Northern part of King County, an Enclosed Porch might add more value to the house by approximately USD 138,000.

Heat System:

  • Forced Air may be the most popular choices for homowners in King County, but
  • Hot Water seems to be the best option in the East region (estimated sale price increase of USD 142000)
  • Heat Pump and Hot Water seems to be the top choices for properties in the South and Seattle regions (estimated sale price increase of USD 52,000 to USD 72,000), and most likely in the North region of King County as well.

Deck:

  • Houses with Deck have significantly higher average sale price than those that don’t.
  • Including a Deck seems to help increase the value of a property in the South and Seattle regions by approximately USD 11,000 and USD 21,000 respectively.

Bedrooms:

Sale value of a property only increase as we increase number of bedrooms to a certain amount (likely in the 4–5 count range), and then it starts going down after

Next Steps

Overall inferential capabilities of my current models are only capturing about 65% of the impacts of certain attributes on SalePrice.

  • Ideally, I would like to explore the impact of different features’ interactions on the Sale Price of residential properties in each region of King County; however, due to computational limitations, I was not able to include it in this project at this time.
  • In addition, inferential models for this project have been built mainly based on an outliers-removed copy of target variable SalePrice. A data transformation/cleaning combination that outperforms this current version of the model has not yet been identified. In addition, some of the thresholds/metrics used to establish models are still arbitrary and worth experimenting with in the future.

--

--