Tags: Analysis, R code, Propensity score, Causal inference
Follow our Facebook page or our developer’s Twitter for more tutorials and future updates.
The tutorial is based on R and StatsNotebook, a graphical interface for R.
Randomised controlled trials (RCTs) are the gold standard for causal inference. In RCTs, randomisation removes all confoundings between the treatment (exposure) and control group. However, RCTs are not always feasible due to ethical or logistic issues (e.g., testing the causal impact of smoking on lung cancer). Propensity score matching can be used to emulate the balance between treatment and control group in an observational study. At its simplest, propensity score matching matches each individual in the treatment group to an individual in the control group based on their propensity score. For each individual, the propensity score can be intuitively considered as the probability of recentiving treatment, calcuated from a range of covariates (and potential confounders). Two individuals, one from treatment and one from control group, is considered to be a match if the difference between their propensity score is small. Unmatched participants are discarded.
After matching, the treatment group and the control group should have very similar characteristics. A simple regression model can be used to estimate the treatment effect on the outcome. Cluster-robust standard errors are required for correct inference.
In this setup, the treatment variable is smoking; the outcome variable is psychological distress. This research question will be addressed by propensity score matching. We will first match each smoker in the dataset to a non-smoker based on a range of covariates, including sex (male/ female), indigenous status (indigenous or not), education level (high school completion), marital status (partnered or not), region of residence (major cities, inner regional or outer regional), language background (English speaking or not), age and risky alcohol use (Yes/ No).
Propensity score matching analysis involves two steps.
We will first show the R code for this analysis below, and we will provide a step-by-step guide on how to complete this analysis. We use the Psychological distress and smoking dataset, which can be loaded using the following codes (internet connection is required)
library(tidyverse)
currentDataset <- read_csv("https://raw.githubusercontent.com/gckc123/ExampleData/main/smoking_psyc_distress.csv")
There are ten variables in this dataset.
The following is the compete codes for our propensity score matching example.
#Since remoteness is a categorical variable with more than two categories. It is necessary to convert
#it into a factor variable.
#For other categorical variable with only 2 levels, this is optional if the variable is coded as 0 and 1.
currentDataset$remoteness <- factor(currentDataset$remoteness, exclude = c("", NA))
#The MatchIt, lmtest and sandwich libraries are used.
library(MatchIt)
library(lmtest)
library(sandwich)
#Using the mathcit function from MatchIt to match each smoker with a non-smoker (1 to 1 matching) based on
#sex, indigeneity status, high school completion, marital status (partnered or not),
#region of residence (major cities, inner regional, outer regional), language background (English speaking Yes/No)
#and risky alcohol drinking (Yes/No)
match_obj <- matchit(smoker ~ sex + indigeneity + high_school + partnered + remoteness + language + risky_alcohol + age,
data = currentDataset, method = "nearest", distance ="glm",
ratio = 1,
replace = FALSE)
summary(match_obj)
#plotting the balance between smokers and non-smokers
plot(match_obj, type = "jitter", interactive = FALSE)
plot(summary(match_obj), abs = FALSE)
Prior to running the propensity score matching, it is always a good practice to visualise the data and conduct a descriptive analysis.
To conduct propensity score matching,
The following is a key section of the generated codes.
match_obj <- matchit(smoker ~ sex + indigeneity + high_school + partnered + remoteness + language + risky_alcohol + age,
data = currentDataset, method = "nearest", distance ="glm",
ratio = 1,
replace = FALSE)
summary(match_obj)
This code tells R to run a propensity score matching using the matchit function from the MatchIt library. The left side of the “~” symbol specifies the exposure variable; the right side specifies the covariates. The results from this propensity score matching is then printed out using the summary function.
These settings generally produce satisfactory results and will be sufficient in many scenarios.
This code produces the result summary below.
######################################################
Call:
matchit(formula = smoker ~ sex + indigeneity + high_school +
partnered + remoteness + language + risky_alcohol + age,
data = currentDataset, method = "nearest", distance = "glm",
replace = FALSE, ratio = 1)
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.1852 0.1130 0.6304 2.0778 0.2033
sex 0.4938 0.4421 0.1035 . 0.0518
indigeneity 0.0524 0.0175 0.1565 . 0.0349
high_school 0.4220 0.6378 -0.4370 . 0.2158
partnered 0.4630 0.6913 -0.4578 . 0.2283
remoteness0 0.5852 0.6773 -0.1870 . 0.0921
remoteness1 0.2177 0.1900 0.0670 . 0.0277
remoteness2 0.1971 0.1327 0.1621 . 0.0645
language 0.9579 0.9130 0.2234 . 0.0449
risky_alcohol 0.6427 0.5411 0.2120 . 0.1016
age 51.6057 53.7824 -0.1676 0.8214 0.0441
eCDF Max
distance 0.3185
sex 0.0518
indigeneity 0.0349
high_school 0.2158
partnered 0.2283
remoteness0 0.0921
remoteness1 0.0277
remoteness2 0.0645
language 0.0449
risky_alcohol 0.1016
age 0.1020
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance 0.1852 0.1847 0.0042 1.0209 0.0002
sex 0.4938 0.5010 -0.0144 . 0.0072
indigeneity 0.0524 0.0370 0.0691 . 0.0154
high_school 0.4220 0.4107 0.0229 . 0.0113
partnered 0.4630 0.4764 -0.0268 . 0.0133
remoteness0 0.5852 0.5986 -0.0271 . 0.0133
remoteness1 0.2177 0.2146 0.0075 . 0.0031
remoteness2 0.1971 0.1869 0.0258 . 0.0103
language 0.9579 0.9620 -0.0205 . 0.0041
risky_alcohol 0.6427 0.6407 0.0043 . 0.0021
age 51.6057 51.2064 0.0307 0.9171 0.0150
eCDF Max Std. Pair Dist.
distance 0.0113 0.0052
sex 0.0072 0.2854
indigeneity 0.0154 0.2627
high_school 0.0113 0.1725
partnered 0.0133 0.1874
remoteness0 0.0133 0.3230
remoteness1 0.0031 0.2463
remoteness2 0.0103 0.3407
language 0.0041 0.1329
risky_alcohol 0.0021 0.2957
age 0.0390 0.3717
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance 99.3 97.2 99.9 96.5
sex 86.1 . 86.1 86.1
indigeneity 55.8 . 55.8 55.8
high_school 94.8 . 94.8 94.8
partnered 94.2 . 94.2 94.2
remoteness0 85.5 . 85.5 85.5
remoteness1 88.9 . 88.9 88.9
remoteness2 84.1 . 84.1 84.1
language 90.8 . 90.8 90.8
risky_alcohol 98.0 . 98.0 98.0
age 81.7 56.0 66.0 61.8
Sample Sizes:
Control Treated
All 7026 974
Matched 974 974
Unmatched 6052 0
Discarded 0 0
######################################################
In the top section of the results, the balance between smoker and non-smoker group in each of the covariates are shown. The column on Std. Mean Diff. shows that there are substantial difference in nearly all covariates between smokers and non-smokers (a standardised mean difference great than 0.1 can be considered as substantial difference).
In the middle section, the balance between smoker and non-smoker group after matching are shown. The standardised mean differences are now all close to zero for all covariates, indicating that a good balance is achieved.
The last section shows that there are 7026 non-smokers and 974 smokers in the original sample. All smokers are matched to a corresponding non-smokers. The remaining 6052 non-smokers are not matched to any smokers and can be discarded in subsequent analysis. Applied researchers sometimes are concerned about discarding a large number of unmatched participants. However, this is usually not an issue if all treated participants are matched, and good balance is archived.
The above codes also produce the following two figures. The first one visualises the distribution of propensity score in the matched treated group, matched control group and the unmatched control group. Propensity score matching will not be appropriate if there is not a satisfactory overlap in the propensity score distribution between the matched treated group and the matched untreated group. The second figure visualises the balance in propensity score and all covariates before and after matching. Similar to the results from the table, there are substantial difference in all covariates between smokers and non-smokers before matching, but the difference is close to zero after matching, indicating a very good balance.
Once we achieve a good balance between the treatment and control group after matching, we can proceed to the next step to analyse the difference in outcome between the two groups. We can analyse the data as if the data are from an RCT using a regression analysis. However, for correct inference, cluster-robust standard error is needed.
#Extract the matched data and save the data into the variable matched_data
matched_data <- match.data(match_obj)
#Run regression model with psychological distress as the outcome, and smoker as the only predictor
#We need to specify the weights - Matched participants have a weight of 1, unmatched participants
res <- lm(psyc_distress ~ smoker, data = matched_data, weights = weights)
#Test the coefficient using cluster robust standard error
coeftest(res, vcov. = vcovCL, cluster = ~subclass)
#Calculate the confidence intervals based on cluster robust standard error
coefci(res, vcov. = vcovCL, cluster = ~subclass, level = 0.95)
The following steps will generate the R codes for both matching and outcome analysis. It should be noted that we should only conduct outcome analysis after we achieve a satisfactory balance between treatment and control group.
The followings are the output from the outcome analysis. The results indicate that psychological distress among smokers is 1.66 point higher than non-smokers (95% CI [1.08, 2.24]).
######################################################
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.65298 0.20093 77.904 < 2.2e-16 ***
smoker 1.66016 0.29657 5.598 2.477e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
######################################################
2.5 % 97.5 %
(Intercept) 15.258922 16.047032
smoker 1.078545 2.241783
######################################################
Propensity score matching was used to estimate the effect of smoking on psychological distress. The propensity score were estimated using logistic regression based on sex, indigenous status, education level (completed high school or not), marital status (partnered or not), region of residence (major cities, inner regional or outer regional), language background (English speaking Yes/No), risky alcohol use (Yes/ No) and age. One-to-one nearest neighbour matching was used. All smokers (N = 974) were matched to a non-smoker. Good balance was achieved between the smoker and non-smoker group, with all standardized mean differences below 0.1 after matching.
To estimate the effect of smoking on psychological distress, a linear regression with psychological distress as the outcome and smoking status (smoker/ non-smoker) as the exposure was fitted. Cluster-robust variance was used to estimate the standard error. It is estimated that smoking increases psychological distress by 1.66 point (95% CI [1.08, 2.24], p < .001>).
All analyses were conducted in StatsNotebook (Chan and StatsNotebook team, 2020) using R 4.1 and the MatchIt package (Stuart, King, Imai and Ho, 2011).
Chan, G. and StatsNotebook Team (2020). StatsNotebook. [Computer Software]. Retrieved from https://www.statsnotebook.io
Stuart, King, Imai and Ho. (2011). MatchIt: nonparametric preprocessing for parametric causal inference. Journal of statistical software.
Follow our Facebook page or our developer’s Twitter for more tutorials and future updates.