Tags: Data management, R code
Follow our Facebook page or our developer’s Twitter for more tutorials and future updates.
The tutorial is based on R and StatsNotebook, a graphical interface for R.
StatsNotebook provides a simple menu for creating new variables. We will give two examples below to demonstrate using StatsNotebook to compute new variables.
In these two examples, we will use the built-in Personality dataset. This dataset can be loaded into StatsNotebook using the instructions here, or it can be downloaded from here.
Violations of distributional assumptions are common. For example, linear regression requires the residuals to be normally distributed. In our linear regression example, we regress depression on the Big Five personality factors and sex. The normality plot (QQ plot) from that regression model, shown below, indicates that the residuals are not normally distributed.
This is largely because depression is positively skewed (it has a long tail on the right-hand side).
One way to improve the model is to perform a log-transformation of the dependent variable, depression.
To create a log-transformed version of depression,
The following code will be generated.
currentDataset$log_depression <- log(currentDataset$Depression)
A histogram of this new variable indicates that it is much less skewed than the original depression variable.
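The comparison can also be reproduced outside the menu. The following is a minimal sketch, not the code StatsNotebook generates: it assumes a simulated, strictly positive, right-skewed Depression variable as a stand-in for the Personality dataset, and the `skewness` helper is a hypothetical function written for this illustration.

```r
# Simulated stand-in for the Personality dataset (hypothetical values,
# drawn from a log-normal distribution so the variable is right-skewed).
set.seed(123)
currentDataset <- data.frame(
  Depression = rlnorm(500, meanlog = 1, sdlog = 0.6)
)

# Same transformation as in the tutorial.
currentDataset$log_depression <- log(currentDataset$Depression)

# Histograms of the original and the log-transformed variable, side by side.
par(mfrow = c(1, 2))
hist(currentDataset$Depression, main = "Depression", xlab = "Depression")
hist(currentDataset$log_depression, main = "log(Depression)",
     xlab = "log_depression")
par(mfrow = c(1, 1))

# Hypothetical helper: simple moment-based sample skewness.
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

skew_raw <- skewness(currentDataset$Depression)
skew_log <- skewness(currentDataset$log_depression)
print(c(raw = skew_raw, log = skew_log))
```

A skewness value closer to zero for the log-transformed variable is consistent with a more symmetric histogram. Note that `log()` returns `-Inf` for zeros and `NaN` for negative values, so this sketch assumes strictly positive scores.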
The residual plot from the linear regression using the log-transformed depression variable shows little evidence of violating the normality assumption.
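As a rough illustration of this check, the sketch below fits the regression with and without the log transformation and draws a QQ plot of the residuals for each model. It is a simplified example under stated assumptions: the data are simulated, and only Neuroticism is used as a predictor, whereas the tutorial's model includes all Big Five factors and sex.

```r
# Simulated data (hypothetical): Depression is generated so that its
# logarithm is linear in Neuroticism with normally distributed errors.
set.seed(42)
n <- 300
Neuroticism <- rnorm(n, mean = 3, sd = 0.8)
Depression <- exp(0.5 + 0.4 * Neuroticism + rnorm(n, sd = 0.5))
currentDataset <- data.frame(Depression, Neuroticism)
currentDataset$log_depression <- log(currentDataset$Depression)

# Fit the model on the raw and on the log-transformed outcome.
fit_raw <- lm(Depression ~ Neuroticism, data = currentDataset)
fit_log <- lm(log_depression ~ Neuroticism, data = currentDataset)

# QQ plots of the residuals; points close to the line suggest normality.
par(mfrow = c(1, 2))
qqnorm(resid(fit_raw), main = "Raw Depression")
qqline(resid(fit_raw))
qqnorm(resid(fit_log), main = "log(Depression)")
qqline(resid(fit_log))
par(mfrow = c(1, 1))

# Shapiro-Wilk W statistic as a numeric companion to the plots
# (values closer to 1 indicate residuals closer to normal).
w_raw <- unname(shapiro.test(resid(fit_raw))$statistic)
w_log <- unname(shapiro.test(resid(fit_log))$statistic)
print(c(raw = w_raw, log = w_log))
```

Keep in mind that coefficients from the log-transformed model describe changes in the logarithm of depression, not in depression itself.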
A quadratic term of an independent variable is often added to a regression model to test for a curvilinear relationship. Suppose that we want to test whether there is a curvilinear relationship between Depression and Neuroticism. We can create a quadratic term of Neuroticism and enter it into a linear regression model.
To create a quadratic term,
The following code will be generated.
currentDataset$Neuroticism_sq <- currentDataset$Neuroticism^2
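Once the quadratic term exists, it can be entered into the regression alongside the linear term. The sketch below shows one way to do this with `lm()`; it assumes simulated data with a built-in curvilinear effect rather than the actual Personality dataset, and it omits the other predictors for brevity.

```r
# Simulated data (hypothetical): Depression depends on both Neuroticism
# and its square, so a curvilinear relationship is present by design.
set.seed(2024)
n <- 300
Neuroticism <- runif(n, min = 1, max = 5)
Depression <- 2 + 0.5 * Neuroticism + 0.3 * Neuroticism^2 + rnorm(n, sd = 1)
currentDataset <- data.frame(Depression, Neuroticism)

# Same computation as in the tutorial.
currentDataset$Neuroticism_sq <- currentDataset$Neuroticism^2

# Enter both the linear and the quadratic term into the model.
fit_quad <- lm(Depression ~ Neuroticism + Neuroticism_sq,
               data = currentDataset)
coef_table <- summary(fit_quad)$coefficients
print(coef_table)

# Estimate and p-value for the quadratic term.
quad_est <- coef_table["Neuroticism_sq", "Estimate"]
quad_p <- coef_table["Neuroticism_sq", "Pr(>|t|)"]
```

A statistically significant coefficient for `Neuroticism_sq` is evidence of a curvilinear relationship; its sign indicates whether the curve bends upwards (positive) or downwards (negative).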