“Expert judgments” by Real Estate professionals are sometimes found to be out-of-date, irrelevant to our specific housing market, or even inaccurate. The purpose of this project is to verify and extend on the below claims:
- Enclosing a porch will increase the sale price of a home
- Converting a garage to a bedroom is a good way to increase the sale price of a home
- Upgrading to a forced-air heating system will increase the sale price of a home
The repository for this project can be found here. I have been provided with King County House Sales dataset which included 3 tables:
- Real Property Sales (
EXTR_RPSale.csv) (located in the
- Residential Building (
EXTR_ResBldg.csv) (located in the
- Parcel (
EXTR_Parcel.csv) (located in the
and 2 reference tables:
- Look Up (
EXTR_LookUp.csv) (located in the
King_County_Home_Sales_Data_Dictionary.pdf(located in the
My understanding of the dataset can be summarized as follows:
- Parcel table has 81 columns and 205,199 entries, each of which is a registered extended piece of land with a unique identification number (
PIN). Each lot's PIN can be obtained by concatenating that parcel's
Minorcolumns together. The information Parcel contains is regarding each parcel/lot's position (Plat, Range, Township, etc.), its Area in square footage, its Water & Sewer System, Street Surface, any Restrictions it might have, different Views that can be observed from there, whether there's any type of Noises, or any type of Problems reported about that specific lot, etc.
- Residential Building (or ResBldg) table contains 181,510 entries and 50 columns. Each entry is a present/previously registered Residential Property, as well as other information regarding that property, such as Address, number of Stories/Bathrooms/Bedrooms/Fireplaces, the Square Footage of each floor and Basement, Heat System & Heat Source, the Year it was Built and/or Renovated, its Condition, Grade, as well as any Additional Costs that might have been associated with it. Each property also comes with
Minorcodes, which indicates the parcel that property is on. Please note that there could be multiple properties on a parcel, which means that there could be multiple entries sharing the same
- Real Property Sales (or RPSale) contains 351,067 entries and 24 columns. Each entries is a sales transaction documented. Besides the Sale Price, it also has other information about the sale, such as the Date it was documented, Buyer/Seller, Plat, Sale Reason, etc. The ID for this table can be obtained by concatenating the
Minorto create a string of unique alphanumerical characters. Similar to ResBldg, there can be multiple sales transactions on a parcel, and multiple entries sharing the same PIN (
- Look Up table explains the meaning of numerical codes values in the above tables. For example, Street Surface (
StreetSurface) attribute in Parcel is a number with lookup code 60, which indicates that 1 means
PAVED, 2 means
GRAVEL, 3 means
DIRT, and 4 means
King_County_Home_Sales_Data_Dictionary.pdffile helps matching each attribute to its Look Up code, as well as any other further clarifications on how the data for that attributed is supposed to be entered.
Upon investigating various entries that share the same
PIN = Major + Minor) in the ResBldg dataset, I noticed that their values for the same attributes are usually the same and/or within the similar range.
Utilizing other resources such as Google Maps, as well as Real Estate websites to look up the addresses of these properties, I suspect that it’s more likely these houses/apartments are part of a complex with very similar exterior as well as interior structure. Therefore, I decided to compact entries with duplicated PIN by keeping only 1 row which retains the modes for each attributes. That way,
PIN would become unique, and merging the 2 tables
PIN will not create a large amount of extra sales transactions that did not really happen.
In addition, since the focus of the project is on 2019, I decided to zoom in and only look at transactions that were documented within 2019 and for the specific
"Household, single family units" property type since it takes up the majority of the dataset.
I also removed columns that only have 1 unique value because if there is no diversity in a feature’s values in the dataset, that feature would not be taken into consideration during the modeling process.
Besides, table data from the
King_County_Home_Sales_Data_Dictionary.pdf has been converted from
tabula-py package for easy mapping and converting numerical values in the master dataset to their text explanation.
The master DataFrame that was used for analysis throughout this project contains 18,320 entries and 60 columns.
- Among 60 columns of the master dataset, 14 are for the square footage of various parts of the house. In order to potentially avoid some of the problems with Multicollinearity later on, I decided to drop some columns such as
SqFtUnfinHalf, as well as
SqFtFinBasement. Some columns such as
SqFtDeckwere converted to binary
has_Deckto purely record of whether a property has these features.
- Two columns
DaylightBasementinclude empty string values
' ', which actually takes up a good portion of the dataset. I made the assumption that empty string means 'No', and convert them to
- I also added
Yrs_BeforeSoldwhich stores the amount of time between the year the properties were built and the year their sales transactions were documented (2019).
- Instead of keeping
Yr_Renovated, I converted its data into binary
1/0indicating whether a property has been renovated or not.
Note: due to my lack of knowledge in Real Estate, my interpretation of some attributes might not be 100% accurate.
Exploratory Data Analysis (EDA)
SalePrice grouped by Geographical
Region based on
DistrictName(Reference: https://www.communitiescount.org/king-county-geographies) East, South, Seattle, and North, we can see that different regions do have different
- East, Seattle, and North are overlapping each other more than South.
SalePriceseems to have a smaller mean, smaller standard deviation, yet at higher density compared to other regions.
This observation has inspired me to proceed forward with analyzing these 4 regions separately and identifying specific house features that have more impact on
SalePrice in some areas than in others.
The final version of my model includes 4 sub-models, each of which represents
SalePrice as a function of different features specific to each region. Features were selected using stepwise regression by prioritizing feature with statistically significant p-values.
- Continuous columns included are:
'SalePrice', 'SqFtTotLiving', 'Yrs_BeforeSold'
- Categorical columns identified are:
'Stories', 'BldgGrade', 'DaylightBasement', 'HeatSystem', 'HeatSource', 'ViewUtilization', 'FpSingleStory', 'FpMultiStory', 'FpFreestanding', 'FpAdditional', 'Condition', 'WaterSystem', 'SewerSystem', 'TrafficNoise', 'WaterProblems', 'been_Renovated', 'has_Garage', 'has_TotBasement', 'has_OpenPorch', 'has_EnclosedPorch', 'has_Deck', 'Bedrooms', 'BathHalfCount', 'Bath3qtrCount', 'BathFullCount'(Categorical features would then be converted to dummies using
The East model includes a total of 31 predictors using 6,459 observations:
The South model includes a total of 24 predictors using 5,054 observations:
Feature CoefficientBldgGrade_12Luxury 721702.828696
The Seattle model includes a total of 29 predictors with 5,325 observations:
Feature CoefficientBldgGrade_12Luxury 1.323574e+06
The North model includes a total of 12 predictors and significantly less observations (1,177) compared to the other 3 regions:
Regionalized model results combined with above EDA visualizations have confirmed that these 4 different regions have different Sale Price distributions with different dominant features. Therefore, homeowners in different regions will most likely need to invest in different types of home improvement projects in order to increase their properties’ values.
Although Adjusted R-Squared values for these 4 models are not as high as some of the other fitted models that include all 4 regions, these sub-models have less predictors, and actually satisfied some of the Linear Regression Assumptions. Although none of these models meets all 4 assumptions, 2 of them (East and Seattle have met the Linearity Assumption, and North has met the Multicollinearity Assumption), which can be considered major improvement compared to the previously fitted 17 models that I have created (If this tickles your fancy, please refer to notebooks in the repository for more details on the other models).
Due to the above models not having high Adjusted R-Squared values and not meeting Linear Regression Assumptions, I have decided to combine my observations from the Inferential Regression models with Hypothesis Testing to examine inferential capabilities of these models.
been_Renovated_1 (indicating that a property has been renovated at least once) are all positive and was selected as one of the statistically significant features in 3 out of 4 regions (East, South, and North). It can be interpreted from these coefficients that the fact that a property's renovation has been documented could potentially increase its value by an amount between USD 88,000 to USD 186,000 depending on the location region.
Performing a Welch’s t-test on all 4 regions with:
- Null Hypothesis (H_0): There is no difference in average Sale Price between houses that have been renovated and those that have not.- Alternative Hypothesis (H_A): The average Sale Price of houses that have been renovated is greater than the average Sale Price of those that have not.
returns a p-value of 3.3466331975215857e-25. Using a significance level alpha of 0.05, this p-value is small enough (< alpha) to reject the Null Hypothesis, meaning there’s enough evidence to conclude that investing in renovation and/or home improvement projects could increase market value of a property.
Porch (Enclosed and Open)
Seattle 24509.02431 -38562.34762
Overall, Open Porch seems to be a better choice for home improvement project in King County. Depending on the property’s location, especially in the East, South, and Seattle area, estimated increase in value by adding an Open Porch does vary between USD 13,083 to USD 24,509. However, if the property is in the North region, it might be more beneficial to invest in an Enclosed Porch, which is estimated to increase the Sale Price by USD 138,693 there (this estimate seems to be on the higher side).
Welch’s t-test results do confirm above findings about Porch.
Looking at the coefficients for
HeatSystem feature, we can identify some of the more financially beneficial for homeowners in based on their location regions are as follows:
- Hot Water seems to be the best option in the East region
- Heat Pump and Hot Water seem to be the better choices for properties in the South and Seattle regions
- No specific type of Heat System was selected as predictors in the North
Heat System data visualizations have confirmed the inferential findings:
- Forced Air may be the most popular choice for homeowners in King County (overall denstity of ForcedAir is higher than that of other systems), but
- Hot Water seems to be the best option in the East region
- Heat Pump and Hot Water seem to be the top choices for properties in the South and Seattle regions
- Heat Pump or Hot Water would probably add more value to a property in the North region as well (although this is derived from visualization only)
Based on the coefficients for
has_Deck_1 in the 4 models, including a deck seems to help increase the value of a property in the South and Seattle regions by approximately USD 11,938 and USD 21,772 respectively. There’s not enough evidence from my Linear Regression models to make any conclusion on the impact of a deck on properties in the other 2 regions (East and North).
Welch’s t-test and visualizations both supported the above findings about potential sale increase by including a Deck for properties in South and Seattle. In addition, t-test p-values have indeed confirmed that this should be applicable in the other 2 regions of King County (East and North) as well.
The overall trend that I can notice from these coefficients is that the sale value of a property only increase as we increase number of bedrooms a certain amount (likely in the 4–5 range), and then it starts going down after. This could also partly be due to the lack of entries for the higher amount of bedrooms as compared to lower ones.
Tailoring models based on geographic regions seems to have improved the overall quality and inferential capabilities of my models. Some findings that could be beneficial for home-owners in King County to potentially increase the market value of their properties can be derived from this inferential analysis as follows:
- First and foremost, it is worth investing in a Renovation and/or Home Improvement Projects, as houses that have been documented as “been renovated” have a significantly higher sales price than those that have not. This is highly recommended especially for properties in the Eastern, Southern and Northern part of King County, where investing in a renovation can increase sale price by USD 88,000 to USD 186,000.
- Home-owners in the East, South, and Seattle areas should invest in Open Porch: Estimated increase in value by adding an Open Porch does varies between USD 13,000 to USD 24,000 depending on the specific area.
- However, for home-owners in the Northern part of King County, an Enclosed Porch might add more value to the house by approximately USD 138,000.
- Forced Air may be the most popular choices for homowners in King County, but
- Hot Water seems to be the best option in the East region (estimated sale price increase of USD 142000)
- Heat Pump and Hot Water seems to be the top choices for properties in the South and Seattle regions (estimated sale price increase of USD 52,000 to USD 72,000), and most likely in the North region of King County as well.
- Houses with Deck have significantly higher average sale price than those that don’t.
- Including a Deck seems to help increase the value of a property in the South and Seattle regions by approximately USD 11,000 and USD 21,000 respectively.
Sale value of a property only increase as we increase number of bedrooms to a certain amount (likely in the 4–5 count range), and then it starts going down after
Overall inferential capabilities of my current models are only capturing about 65% of the impacts of certain attributes on
- Ideally, I would like to explore the impact of different features’ interactions on the Sale Price of residential properties in each region of King County; however, due to computational limitations, I was not able to include it in this project at this time.
- In addition, inferential models for this project have been built mainly based on an outliers-removed copy of target variable
SalePrice. A data transformation/cleaning combination that outperforms this current version of the model has not yet been identified. In addition, some of the thresholds/metrics used to establish models are still arbitrary and worth experimenting with in the future.