StatsNotebook

Residual plots and assumption checking

16 Oct, 2020 | 2 minutes read

Tag: Analysis


Follow our Facebook page or our developer’s Twitter for more tutorials and future updates.

The tutorial is based on R and StatsNotebook, a graphical interface for R.

A residual plot is an essential tool for checking the assumptions of linearity and homoscedasticity. The following are examples of residual plots when (1) both assumptions are met, (2) the homoscedasticity assumption is violated and (3) the linearity assumption is violated.

When both the linearity and homoscedasticity assumptions are met, the points in the residual plot (standardised residuals plotted against predicted values) will be randomly scattered around zero, with no systematic pattern and a roughly constant spread.
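As a rough illustration, a residual plot of this kind can be drawn in base R. The sketch below uses simulated data (the variable names x and y and all parameter values are assumptions made up for this example, not part of any StatsNotebook dataset) in which the relationship is linear and the error variance is constant.

```r
# Simulated data that satisfy both assumptions (illustrative values only)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)  # linear mean, constant error variance
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)

# Standardised residuals and predicted (fitted) values
std_res <- rstandard(fit)
pred <- fitted(fit)

# Residual plot: the points should show no pattern around the zero line
plot(pred, std_res,
     xlab = "Predicted values",
     ylab = "Standardised residuals",
     main = "Residual plot")
abline(h = 0, lty = 2)  # reference line at zero
```

With data like these, the points should form a shapeless band around the dashed line.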

[Figure: residual plot when both assumptions are met]

When the homoscedasticity assumption is violated, the “spread” of the points is not the same across predicted values. The following two plots indicate a violation of this assumption.

In the first plot, the variance (i.e. spread) of the residuals increases as the predicted values increase.

[Figure: residual plot with residuals fanning out]

In the second plot, the variance (i.e. spread) of the residuals decreases as the predicted values increase.

[Figure: residual plot with residuals fanning in]

Heteroscedasticity usually does not bias the model estimates (i.e. regression coefficients), but it reduces their precision. The usual standard errors are also no longer valid and are often underestimated, leading to incorrect p-values and inferences.

There is no bullet-proof way to fix heteroscedasticity, but two common solutions may reduce the problem.

  1. Conduct a robust regression with robust standard errors.
  2. Transform the dependent variable. For example, when the residual plot shows a fanning-out pattern, log-transforming the dependent variable may mitigate the problem.
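Both options can be sketched in base R. The example below is only an illustration on simulated data: the variable names, the simulated model and the choice of the HC3 estimator are assumptions for this sketch, and it is not necessarily what StatsNotebook runs for its robust regression. The robust standard errors are computed by hand with the usual sandwich formula; if the sandwich package is installed, sandwich::vcovHC(fit, type = "HC3") should give the same matrix.

```r
# Simulated data with a fanning-out pattern (illustrative values only)
set.seed(2)
n <- 300
x <- runif(n, 1, 10)
# Multiplicative error: the spread of y grows with its mean
y <- exp(0.5 + 0.2 * x + rnorm(n, sd = 0.4))
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)

# Option 1: heteroscedasticity-consistent (HC3) standard errors
X <- model.matrix(fit)                # design matrix
e <- residuals(fit)                   # raw residuals
h <- hatvalues(fit)                   # leverage values
bread <- solve(crossprod(X))          # (X'X)^-1
meat <- crossprod(X * (e / (1 - h)))  # X' diag(e^2 / (1 - h)^2) X
vcov_hc3 <- bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc3))
classical_se <- sqrt(diag(vcov(fit)))
print(cbind(estimate = coef(fit), classical_se, robust_se))

# Option 2: log-transform the dependent variable and re-check the plot
fit_log <- lm(log(y) ~ x, data = dat)
plot(fitted(fit_log), rstandard(fit_log),
     xlab = "Predicted values (log scale)",
     ylab = "Standardised residuals")
abline(h = 0, lty = 2)
```

Note that after a log-transformation the coefficients describe changes in log(y), so they need to be interpreted on that scale.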

When the linearity assumption is violated, the points in the residual plot will not be randomly scattered. Instead, the points will often show some “curvature”.

[Figure: residual plot showing non-linearity]

There is no bullet-proof way to fix non-linearity. If there are multiple independent variables in a regression analysis, the first step is to identify which independent variable has a non-linear relationship with the dependent variable. The transgressing variable can usually be identified using the curvature test after a regression analysis. Once it is identified, its quadratic term (i.e. the square of the original variable) can be entered into the regression model.
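One way to picture a curvature test is to refit the model once per independent variable, each time adding that variable's squared term, and look at the p-value of the added term. The sketch below does this in base R on simulated data. The variable names x1 and x2 and the simulated model are assumptions for this example, and the procedure is a simplified stand-in rather than the exact test StatsNotebook reports (the residualPlots function in the car package provides a similar per-variable test).

```r
# Simulated data: y is curved in x1 but linear in x2 (illustrative only)
set.seed(3)
n <- 250
dat <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
dat$y <- 1 + 0.3 * dat$x1 + 0.15 * dat$x1^2 + 0.8 * dat$x2 + rnorm(n)

# Model that (wrongly) assumes linearity in both variables
fit <- lm(y ~ x1 + x2, data = dat)

# For each variable, add its squared term and extract that term's p-value
curvature_p <- sapply(c("x1", "x2"), function(v) {
  sq_term <- paste0("I(", v, "^2)")
  fit_sq <- update(fit, as.formula(paste(". ~ . +", sq_term)))
  summary(fit_sq)$coefficients[sq_term, "Pr(>|t|)"]
})
print(curvature_p)
```

A small p-value for a variable suggests curvature in its relationship with the dependent variable. Here x1 should be flagged, while x2 typically is not.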

Suppose the transgressing variable is x. Its quadratic term can be created using the following line of code.

currentDataset$x_sq <- currentDataset$x^2

The quadratic term can also be created using StatsNotebook’s Compute menu.
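Once the quadratic term exists, it can be added to the regression alongside the original variable. The following is a minimal sketch on simulated data: currentDataset here is a made-up data frame standing in for the dataset loaded in StatsNotebook, and the dependent variable name y is an assumption.

```r
# Stand-in for the loaded dataset (simulated, illustrative values only)
set.seed(4)
n <- 250
currentDataset <- data.frame(x = runif(n, 0, 10))
currentDataset$y <- 1 + 0.3 * currentDataset$x +
  0.15 * currentDataset$x^2 + rnorm(n)

# Create the quadratic term as in the tutorial
currentDataset$x_sq <- currentDataset$x^2

# Fit the model without and with the quadratic term
fit_linear <- lm(y ~ x, data = currentDataset)
fit_quad <- lm(y ~ x + x_sq, data = currentDataset)

# F-test comparing the two nested models
model_comparison <- anova(fit_linear, fit_quad)
print(model_comparison)

# Re-check the residual plot after adding the quadratic term
plot(fitted(fit_quad), rstandard(fit_quad),
     xlab = "Predicted values",
     ylab = "Standardised residuals")
abline(h = 0, lty = 2)
```

If the quadratic term captures the non-linearity, the curvature in the residual plot should largely disappear.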

