King County House Sales - Linear Regression Model with Python
In this article, I will describe how I went through the typical data analysis and transformation process and how I approached the well-known King County House Sales data set. We will see how to fit and validate a linear regression model based on the house sales data, review and assess the four linear regression assumptions, and predict and interpret the final results!
1. Data Overview and Exploratory Data Analysis (EDA)
Let's take a look at the already cleaned-up dataset that will be used in my linear regression model. Using Pandas, the NaN values and special characters have been replaced with correct numeric values by applying the pd.to_numeric() and isnull() methods. Some data, like yr_renovated, has been binarized to reflect whether the house was renovated or not using the .astype(int) method. A few duplicates based on the 'id' column have been dropped with .drop().
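Below is a minimal sketch of what that cleanup could look like; the file name, the columns touched, and the fill values are my own illustrative assumptions, not the exact original choices.

import pandas as pd

# Load the raw data (file name assumed)
df = pd.read_csv('kc_house_data.csv')

# Replace special characters with NaN, then convert to numeric
df['sqft_basement'] = pd.to_numeric(df['sqft_basement'], errors='coerce')

# Inspect and fill the remaining NaN values (fill values are illustrative)
print(df.isnull().sum())
df['sqft_basement'] = df['sqft_basement'].fillna(0)
df['waterfront'] = df['waterfront'].fillna(0)

# Binarize yr_renovated: 1 if the house was ever renovated, 0 otherwise
df['yr_renovated'] = (df['yr_renovated'].fillna(0) > 0).astype(int)

# Drop duplicate listings based on the 'id' column
df = df.drop_duplicates(subset='id')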
Price distribution - let's use Seaborn and a scatter plot to check the distribution of the target (or dependent) variable - house prices - across King County and see if we can come to any conclusions.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15,10))
ax = sns.scatterplot(data=df,
x="long",
y="lat",
hue="price",
hue_norm=(0, 3000000),
size='price',
sizes=(10, 200))
ax.set_title('Price distribution')
Looking at the scatter plot above, we can say that the distance from Seattle and Bellevue clearly affects the price. Based on those observations, we will perform some feature engineering in the next step and add a column with the distance to a point between those two cities - a single distance feature lets us avoid having two collinear parameters.
Categorize zip codes by mean house price - zip codes are also great elements of the data set that can be used to increase the R-squared value. Again, the scatter plot indicates that prices tend to be higher or lower in different areas, which can be translated to zip codes. We can then categorize the zip codes from the most expensive to the cheapest by checking the mean house price in each one and assigning a price category.
Data type and distribution analysis - Seaborn distplots and scatter plots.
Now, let's explore the distribution of the independent variables with histograms and determine which data is discrete and which is continuous.
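A quick way to do that is shown in the sketch below; the subset of continuous columns is my own hand-picked example.

# Histograms of a few continuous variables to spot skewness (column subset is illustrative)
continuous = ['price', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']
df[continuous].hist(figsize=(12, 8), bins=30);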
You can tell that skewness is an issue for most of our continuous variables and that some features, e.g. 'sqft_basement' or 'price', are much bigger in magnitude than others. The solution will be a log transformation of the data, which helps fit a very skewed distribution into a Gaussian one.
For the 'sqft_basement' values the solution would be to binarize the data to reflect whether the house has a basement or not.
One-hot encoding (dummy variables) - There are three types of categorical variables: binary, nominal, and ordinal variables.
One-hot encoding is a great way to transform nominal categorical variables; however, it doesn't make much sense to apply this method to binary or ordinal variables. In our case, we will keep binary variables like 'waterfront' or 'has_basement' and ordinal variables like 'bedrooms' or 'condition' in their current form, and we won't use dummy variables.
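For reference only, this is roughly how a nominal variable such as 'zipcode' could be one-hot encoded with Pandas - we don't apply this step in this analysis.

# Illustrative one-hot encoding of a nominal variable - not applied here
zip_dummies = pd.get_dummies(df['zipcode'], prefix='zip', drop_first=True)
df_encoded = pd.concat([df.drop(columns='zipcode'), zip_dummies], axis=1)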
2. Feature engineering
Ok, our dataset needs some additional features to better fit the future linear regression model. Let's add the following columns: 'has_basement', 'dist_epicenter', and the zip codes categorized by mean price.
# Adding has_basement parameter - 1 where basement area > 0
df['has_basement'] = (df['sqft_basement'] > 0).astype(int)

# Dropping sqft_basement column
df.drop(columns='sqft_basement', inplace=True)

# Adding distance from epicenter using the haversine library
from haversine import haversine, Unit
epicenter = [47.60558637885264, -122.25954856134759]
df['dist_epicenter'] = df.apply(lambda x: haversine(epicenter, [x['lat'], x['long']], unit='km'), axis=1)

# Categorize zipcodes
zip_cat_df = df.groupby(by='zipcode')['price'].mean().sort_values(ascending=False).reset_index()
zip_cat_df['zip_price_cat'] = pd.cut(zip_cat_df['price'], bins=5, labels=False) + 1
df = df.merge(zip_cat_df.drop(columns='price'), how='left', on='zipcode')
3. Identification of outliers with boxplots
Outliers are especially important because they can have a strong influence on the linear regression fit and on the normal distribution of the residuals, resulting in a loss of accuracy. Let's look for some extreme values and drop them.
Ok, now we can clearly see that there are some strange properties with 11 and 33 bedrooms at suspiciously low prices. We also have some unusually big lots and prices, as well as some properties located far away from Seattle and Bellevue. Let's trim the most extreme of them.
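A sketch of the boxplots and the trimming is shown below; the columns plotted and the cut-off values are my own illustrative choices rather than the exact ones used in the original analysis.

import matplotlib.pyplot as plt
import seaborn as sns

# Boxplots for a few suspicious columns
fig, axs = plt.subplots(ncols=3, figsize=(18, 5))
sns.boxplot(x=df['bedrooms'], ax=axs[0])
sns.boxplot(x=df['sqft_lot'], ax=axs[1])
sns.boxplot(x=df['dist_epicenter'], ax=axs[2])

# Trim the most extreme values (thresholds are illustrative)
df = df[df['bedrooms'] <= 10]
df = df[df['sqft_lot'] < df['sqft_lot'].quantile(0.99)]
df = df[df['dist_epicenter'] < df['dist_epicenter'].quantile(0.99)]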
4. Multicollinearity of Features
Multiple predictor variables generally increase model performance and lead to better predictions. However, there can be negative effects when the predictors are highly correlated with each other. A correlation is usually considered high when its value is around 0.7-0.8 or above. Let's do a quick check for highly correlated features with a Seaborn heatmap.
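A possible version of that check is sketched below; the figure size and color map are arbitrary choices.

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the correlation matrix to spot highly correlated feature pairs
plt.figure(figsize=(12, 10))
sns.heatmap(df.select_dtypes('number').corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()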
In the example above, the number of highly correlated pairs is quite low. But what if there were many more of them? Because the .corr() function returns a DataFrame, we can use stack() and a subset to return only the highly correlated pairs. See the example below:
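Here is a minimal sketch of that approach; the 0.75 threshold is an arbitrary example.

# Stack the correlation matrix and keep only pairs above a chosen threshold
corr = df.select_dtypes('number').corr().abs().stack().reset_index()
corr.columns = ['feature_1', 'feature_2', 'correlation']
high_corr = corr[(corr['feature_1'] != corr['feature_2']) & (corr['correlation'] > 0.75)]
# Note: each pair appears twice (A-B and B-A)
high_corr = high_corr.sort_values(by='correlation', ascending=False)
print(high_corr)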
5. Log transformation and normalization - dealing with skewed data
Real-life data is typically skewed, but too much skewness can really harm the accuracy of a linear regression model. The tail of a skewed distribution may act as an outlier for the statistical model, and outliers can heavily affect the model's performance, especially in regression. A log transformation can help to fit a very skewed distribution into a Gaussian one. After the log transformation, we can easily see how the distribution of the continuous variables has changed. As an example, see the King County House Sales data distribution below.
Usually, our dataset contains features that vary largely in magnitude, e.g. 'price' and 'sqft_living'. If these magnitudes are left unchanged, coefficient sizes will fluctuate largely in magnitude as well. This could give the false impression that variables with higher values are more important than those with lower values.
So, before you start to fit your regression model, make sure your features are not only normally distributed but also on a similar scale. This is because many machine learning algorithms use the Euclidean distance between data points in their computations.
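One way to bring features onto a similar scale is scikit-learn's MinMaxScaler; the snippet below is a sketch of the mechanics on two raw columns (my assumption - in practice the scaling would be applied to the log-transformed features shown further down).

from sklearn.preprocessing import MinMaxScaler

# Min-max scale two features with very different magnitudes into new columns (illustrative)
scaler = MinMaxScaler()
df[['sqft_living_scaled', 'sqft_lot_scaled']] = scaler.fit_transform(df[['sqft_living', 'sqft_lot']])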
Some features before log transformation and scaling:
And after:
# Log transform and normalize the continuous features (subset_cont holds the names of the continuous columns)
df_cont = df[subset_cont]
#Log features
log_names = [f'{column}_log' for column in df_cont.columns]
df_log = np.log(df_cont)
df_log.columns = log_names
df_log.hist(figsize = (8,7), bins=30);
6. Multiple Linear Regression with Statsmodels - Model Fit and Validation
Ok, we got to the point where we are finally ready to fit our multiple linear regression model using the Statsmodels OLS (ordinary least squares) method and the Scikit-Learn library. A good starting point is to fit your model to all of your preprocessed variables and check the R-squared value. Usually, you would need between 5 and 10 predictors to get the best performance and avoid overfitting the model. But how do we find out which predictors are the most statistically significant? The best solution is to perform a step-wise selection and keep the predictors that matter most when predicting the target.
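As a minimal sketch of how such a fit looks with Statsmodels - note that the DataFrame name df_model and the predictor list below are illustrative placeholders, not the final selection reported further down:

import statsmodels.api as sm

# Fit an OLS model on an illustrative set of predictors
# (df_model is assumed to contain the log-transformed and engineered features)
predictors = ['sqft_living_log', 'grade', 'zip_price_cat', 'dist_epicenter', 'has_basement']
X = sm.add_constant(df_model[predictors])
y = df_model['price_log']
model = sm.OLS(y, X).fit()
print(model.summary())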
After testing different configurations of the selected predictors, the R-squared value varied from 0.53 to 0.87. I eventually decided to proceed with the following selection and results:
So, looking at the results above, the R-squared value of 0.717 seems to be appropriate in terms of the model fit.
In general, the higher the R-squared the better; however, a very high R-squared does not necessarily mean the model has a good fit and may indicate overfitting. R-squared also doesn't tell us whether the coefficient estimates and predictions are biased, which is why we need to analyze the residual plots in the next steps.
Skew and kurtosis look fine too, since the normal distribution has a skew of 0. Kurtosis measures the shape of the distribution: it compares the amount of data close to the mean with the amount far away from the mean (in the tails), i.e. the "peakiness" of the curve. The normal distribution has a kurtosis of 3, and the greater the number, the more the curve peaks. All the coefficients are also significant based on their p-values.
7. Model Cross-Validation
When we run a train-test split, random samples of data are created for the training and the test sets. There is a problem with this method, though: the training and test MSE (Mean Squared Error) strongly depend on how the training and test sets were created and divided, which can return inaccurate or misleading results.
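For reference, here is a sketch of that single train-test split (the 80/20 ratio and random_state are my assumptions), reusing the X and y defined earlier; it also produces the linreg, y_train and y_hat_* objects used in the snippets below.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# A single 80/20 split - the resulting MSE depends on this particular split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
linreg = LinearRegression().fit(X_train, y_train)
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
print('Train MSE:', mean_squared_error(y_train, y_hat_train))
print('Test MSE:', mean_squared_error(y_test, y_hat_test))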
Here comes the K-Fold Cross-Validation method, which expands on the idea of train-test splits by splitting the entire dataset into K equal sections of data. K-Fold fits a linear model on each of these K splits and averages the individual results to get a cross-validation MSE. This provides a more reliable estimate of the model's MSE, since the "noisy" results that are higher than average will balance off the "noisy" results that are lower than average.
# Perform 5-fold cross-validation to get the mean squared error through scikit-learn
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score

mse = make_scorer(mean_squared_error)
cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring=mse)
cv_5_results.mean()
0.0779517061955943
cv_5_results
array([0.07984484, 0.07901671, 0.07629246, 0.08107051, 0.07353401])
8. Checking Residuals
As we all know, there are four linear regression assumptions: linearity, homoscedasticity, independence, and normality - all of them pertain to the residuals. Technically, you don't need those assumptions to fit a linear model. However, if you violate them, your parameter estimates could be biased or lack the minimum variance, and it will be more difficult to interpret the regression results.
Let's now take a closer look at the residuals and check whether the normality, homoscedasticity, and linearity assumptions are fulfilled. The independence assumption was already verified when we checked for multicollinearity of the features in section 4.
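For the normality assumption, a common quick check is a Q-Q plot of the residuals; here is a sketch, assuming the test-set predictions from the split above.

import matplotlib.pyplot as plt
import scipy.stats as stats

# Q-Q plot of the test-set residuals against a normal distribution
residuals = y_test - y_hat_test
stats.probplot(residuals, dist='norm', plot=plt)
plt.title('Q-Q plot of residuals')
plt.show()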
A quick check for the homoscedasticity assumption:
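One way to do it, sketched below using the same test-set predictions, is to plot the residuals against the predicted values and look for a roughly constant spread.

import matplotlib.pyplot as plt

# Residuals vs. predicted values - the spread should stay roughly constant
plt.figure(figsize=(10, 6))
plt.scatter(y_hat_test, y_test - y_hat_test, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. predicted values')
plt.show()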
To detect nonlinearity, we can inspect plots of observed vs. predicted values or residuals vs. predicted values. The desired outcome is that the points are symmetrically distributed around a diagonal line in the former plot, or around a horizontal line in the latter, in both cases with a roughly constant variance.
fig, axs = plt.subplots(ncols=2, figsize = (18,6))
#Train set
axs[0].scatter(y_train, y_hat_train, label='Model')
axs[0].plot(y_train, y_train, label='Actual data', color='red')
axs[0].set_title('Model vs data for training set')
axs[0].legend();
# Test set
axs[1].scatter(y_test, y_hat_test, label='Model')
axs[1].plot(y_test, y_test, label='Actual data', color='red')
axs[1].set_title('Model vs data for test set')
axs[1].legend();
9. Calculate bias and variance
def variance(y_hat):
    # Variance of the predictions: mean of the squares minus square of the mean
    return np.mean([yi**2 for yi in y_hat]) - np.mean(y_hat)**2

def bias(y, y_hat):
    # Average difference between the predictions and the actual values
    return np.mean(y_hat - y)
# Bias and variance for training set
b = bias(y_train, y_hat_train)
v = variance(y_hat_train)
print('Train bias: {} \nTrain variance: {}'.format(b, v))
# Bias and variance for test set
b = bias(y_test, y_hat_test)
v = variance(y_hat_test)
print('Test bias: {} \nTest variance: {}'.format(b, v))
The results are:
Train bias: -6.460616181311405e-17
Train variance: 0.7174200043327994
Test bias: -0.01053133908041892
Test variance: 0.7625989736490899
