How to Interpret a Linear Regression Model

Janaki
5 min read · Mar 7, 2021


This blog post introduces linear regression and explains how to interpret the model's coefficient values, along with its RMSE and R-squared values.

Introduction to the Linear Regression Model:

Linear regression is a machine learning algorithm used for prediction. In a world full of uncertainty, we need to predict certain quantities to keep things on track. For example, a sales and marketing manager may want to know the demand for drugs, taking all seasonal variations into account, to help drug manufacturing companies plan ahead. Here is the equation of the simple linear regression model:

y = B0 + B1*x

  • y → dependent variable or predicted value
  • x → independent variable or predictor variable
  • B0 → intercept
  • B1 → coefficient

The above equation is used to predict values in our linear model. If we are building a multiple linear regression model, the additional predictor variables are denoted x1, x2, x3, and so on.

Linear regression identifies the relationship between the variables and represents that relationship with a straight line.
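To make the equation concrete, here is a minimal sketch in Python, with made-up values for B0 and B1 (they are illustrative, not fitted from real data):

```python
# Prediction with the simple linear regression equation: y = B0 + B1 * x
# B0 and B1 are made-up example values, not fitted coefficients.
B0 = 50000.0  # intercept: the predicted price when x is 0
B1 = 150.0    # coefficient: the change in price per one-unit increase in x

def predict(x):
    return B0 + B1 * x

print(predict(2000))  # predicted price for a 2,000 sq ft house: 350000.0
```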

Building a Basic Linear Regression Model:

For this blog, I have taken the King County house sales dataset. The data consist of house sale records in the King County area from 2014 to 2015.

Let us consider the features of the dataset.

The dataset requires data cleaning, as it contains null values and needs some data type conversions. After that processing, let us build a base model. First, let us plot a heat map to check the multicollinearity between the variables.

Our target variable for this model is Price; we are going to predict the price of a house given its features. With the heat map's help, let's take a few columns that are linearly related to the price.
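A minimal sketch of that heat map step, assuming the data has been loaded into a pandas DataFrame (the file name below is a placeholder):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the King County house sales data (file name is a placeholder).
df = pd.read_csv("kc_house_data.csv")

# Correlation heat map over the numeric columns to spot multicollinearity.
plt.figure(figsize=(12, 10))
sns.heatmap(df.select_dtypes("number").corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```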

We have created a basic linear regression model without any feature engineering and calculated the model's RMSE and R-squared values. Now it's time to interpret our model.
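Here is a minimal sketch of that base model, continuing from the DataFrame df above; the feature list is illustrative, not the exact set used in the final model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative predictors chosen from the heat map; 'price' is the target.
features = ["sqft_living", "bedrooms", "bathrooms", "grade"]
X = df[features]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R-squared:", r2_score(y_test, y_pred))
```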

Interpreting the RMSE and R-Squared Values:

R-Squared:

The R-squared value is calculated as

R² = 1 - (RSS / TSS)

where RSS (the residual sum of squares) is the sum of squared differences between the actual and predicted values, and TSS (the total sum of squares) is the sum of squared differences between the actual values and their mean. R-squared can take any value between 0 and 1, with 1 indicating the best-fitting model. In our model, the R-squared value is 0.57; in other words, 57% of the variation in the target variable is explained by the model's predictor variables.
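As a quick check, R-squared can also be computed by hand from the same test-set predictions (continuing from y_test and y_pred above):

```python
# Manual R-squared: 1 - RSS/TSS, which matches sklearn's r2_score.
rss = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
tss = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
print("R-squared:", 1 - rss / tss)
```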

RMSE:

RMSE (Root Mean Squared Error) measures how far, on average, the predicted values fall from the actual values: it is the square root of the mean of the squared residuals. Our model's RMSE is 242081.30, which means our predictions are roughly $242k off the actual sale prices. This is a huge difference, so we need to modify and tune the model to improve its predictions and decrease the RMSE. The lower the RMSE, the better the model's predictions. In the image below, the smaller the distance between the red dots and the blue line, the better the model performs in terms of error.
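RMSE can likewise be computed directly from the residuals:

```python
# Manual RMSE: square root of the mean of the squared residuals.
residuals = y_test - y_pred
print("RMSE:", np.sqrt(np.mean(residuals ** 2)))
```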

Interpreting the Coefficients and P-values:

Let's look at the coefficients of the final, improved model, which I built behind the scenes after some feature engineering. I haven't shown the feature engineering steps, as we are focusing here on the interpretation of the model. Here is a link to my GitHub repository, where you can view the entire code: https://github.com/JanakiGanesh/House-Price-Prediction-in-King-Count

Here is a small piece of code to calculate the P-values of the coefficients. A P-value less than 0.05 means that a coefficient is statistically significant, so we reject the null hypothesis that it is zero. But the size of a coefficient's P-value says nothing about the size of the effect that variable has on your dependent variable.
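The original snippet isn't reproduced here; a minimal sketch using statsmodels, which reports a P-value for every coefficient, might look like this (continuing from X_train and y_train above):

```python
import statsmodels.api as sm

# statsmodels does not add an intercept by default, so add one explicitly.
X_sm = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_sm).fit()

print(ols_model.pvalues)    # P-value for the intercept and each coefficient
print(ols_model.summary())  # full table: coefficients, P-values, intervals
```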

So if you look at the coefficient values below, negative values correlate negatively with the price, and positive values correlate positively with the price. For example, the 'sqft_living_log' coefficient has a value of 0.108; exactly how that translates into a change in price depends on which variables are log-transformed, as explained below.

If your target (Price) and predictor variables are log-transformed, our price values are no longer in dollars. Let us consider three scenarios:

  • The target variable is log-transformed
  • The predictor variable is log-transformed
  • Both the target and predictor variables are log-transformed

The Target variable is log-transformed:

If only the target variable is log-transformed, then the interpretation is (exp(coefficient) − 1) × 100. For example, consider the 'sqft_living_log' coefficient of 0.108: (exp(0.108) − 1) × 100 = (1.11 − 1) × 100 ≈ 11. For each one-unit increase in the predictor variable, the target variable increases by about 11%.

The Predictor variable is log-transformed:

If only the predictor variable is log-transformed, then the interpretation is coefficient / 100: here, 0.108 / 100 ≈ 0.0011. For every 1% increase in the predictor variable, the target variable increases by about 0.0011 units (in the target's original scale, not in percent).

Both the target and predictor variables are log-transformed:

If both the target and the predictor variables are log-transformed, then the 'sqft_living_log' coefficient of 0.108 can be interpreted as follows: for every 1% increase in the predictor variable, the target variable increases by about 0.1%.
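A short sketch summarizing all three interpretation rules for our example coefficient of 0.108:

```python
import math

coef = 0.108  # example coefficient for 'sqft_living_log'

# 1. Only the target is log-transformed:
#    one unit more of x changes y by (exp(coef) - 1) * 100 percent.
print((math.exp(coef) - 1) * 100)  # ~11.4% per unit increase in x

# 2. Only the predictor is log-transformed:
#    a 1% increase in x changes y by roughly coef / 100 units.
print(coef / 100)                  # ~0.0011 units per 1% increase in x

# 3. Both are log-transformed (an elasticity):
#    a 1% increase in x changes y by roughly coef percent.
print(coef)                        # ~0.108% per 1% increase in x
```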

Conclusion:

In this blog post, we discussed how to interpret the values of a linear regression model. The interpretation varies depending on the business problem and the dataset we are handling. As I strengthen my data science skills, I would like to discuss various machine learning models in future posts.
