Follow our Facebook page or our developer’s Twitter for more tutorials and future updates.
The tutorial is based on R and StatsNotebook, a graphical interface for R.
A scatterplot can be used to visualise the association between two numeric variables. StatsNotebook uses the geom_jitter function from the ggplot2 library to build scatterplots.
We use the built-in UNDP dataset in this example. This dataset can be loaded into StatsNotebook using the instructions here or can be downloaded from here. It is a dataset of 199 countries compiled from the United Nations Development Programme.
We will use the following three variables from this dataset
This dataset can also be loaded using the following code:
library(tidyverse)
currentDataset <- read_csv("https://statsnotebook.io/blog/data_management/example_data/HDI_countries.csv")
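In this example, we will build a simple scatterplot, a bubble plot, scatterplots with lines of best fit, and scatterplots in separate facets.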
To build a simple scatterplot visualising the association between two numeric variables (e.g. HDI and GDP),
currentDataset %>%
ggplot(aes(y = HDI, x = GDP)) +
geom_jitter(alpha = 0.6, na.rm = TRUE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Gross domestic product and Human development index")+
xlab("Gross domestic product (GDP, 2018 US$)")+
ylab("Human development index")+
theme(legend.position = "bottom")
The variables for the vertical (y) and horizontal (x) axes are specified in the ggplot function. For scatterplots, we use geom_jitter instead of geom_point to add a small amount of random noise to each point so that the points do not overlap with each other.
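The difference between geom_point and geom_jitter can be sketched with a small comparison. This is a minimal illustration, not StatsNotebook's generated code: it assumes the mpg dataset that ships with ggplot2 as a stand-in for the UNDP data, and the width and height values are arbitrary choices that set the maximum shift in data units.

```r
library(ggplot2)

# mpg ships with ggplot2; it is used here only as a stand-in dataset (assumption)

# geom_point: identical (cty, hwy) pairs are drawn exactly on top of each other
p_point <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.6)

# geom_jitter: each point is shifted by a small random amount;
# width and height set the maximum shift in x and y (illustrative values)
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(alpha = 0.6, width = 0.2, height = 0.2)

# The jitter is drawn when the plot is built, so fix the seed just before that
set.seed(1)

# Count the distinct plotted positions in each version
n_point  <- nrow(unique(layer_data(p_point, 1)[, c("x", "y")]))
n_jitter <- nrow(unique(layer_data(p_jitter, 1)[, c("x", "y")]))
```

Because many cars in mpg share the same cty and hwy values, n_point is smaller than the number of rows, while the jittered version separates those overlapping points. Larger width and height values spread the points further but also distort their true positions more.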
To build a bubble plot visualising the association between two numeric variables (e.g. HDI and GDP), with point size determined by a third continuous variable (e.g. population size) and points color-coded by a categorical variable (e.g. continent),
currentDataset %>%
drop_na(Continent, Pop) %>%
ggplot(aes(y = HDI, x = GDP, size = Pop)) +
geom_jitter(alpha = 0.5, aes(color = Continent), na.rm = TRUE)+
scale_size(range = c(0.1, 8))+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Gross domestic product and Human development index")+
xlab("Gross domestic product (GDP, 2018 US$)")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
labs(size = "Population (millions)")+
theme(legend.position = "bottom")
Part of the legend is cut off (bottom left) because we are running out of space. We can place the legend on the right-hand side of the plot by changing the last line of code.
theme(legend.position = "right")
The size of the bubbles can be adjusted by changing the following line:
scale_size(range = c(0.1, 8))
The minimum point size is 0.1 and the maximum is 8. The following code changes the minimum to 0.3 and the maximum to 15.
scale_size(range = c(0.3, 15))
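To see how the range argument of scale_size maps data values to point sizes, the sketch below inspects the sizes ggplot2 assigns after scaling. The toy data frame and its column names (x, y, pop) are hypothetical and chosen only for illustration; they are not part of the UNDP dataset.

```r
library(ggplot2)

# Hypothetical toy data: three points with increasing "population"
toy <- data.frame(x = c(1, 2, 3), y = c(1, 2, 3), pop = c(1, 50, 1000))

p <- ggplot(toy, aes(x = x, y = y, size = pop)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(0.3, 15))

# Sizes actually assigned to the three points after scaling
sizes <- layer_data(p, 1)$size
range(sizes)
```

The smallest pop value should receive the minimum size (0.3) and the largest the maximum (15), with values in between scaled so that the point area, rather than the radius, grows with the data.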
To build a scatterplot visualising the association between two numeric variables (e.g. HDI and Schooling) with points color-coded by a categorical variable (e.g. continent) and an overall line of best fit,
currentDataset %>%
drop_na(Continent) %>%
ggplot(aes(y = HDI, x = Schooling)) +
geom_jitter(alpha = 0.6, aes(color = Continent), na.rm = TRUE)+
geom_smooth(method = "lm", se = TRUE, level = 0.95, na.rm = TRUE, show.legend = FALSE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Years of schooling and Human development index")+
xlab("Years of schooling")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
theme(legend.position = "bottom")
In the R code above, we specified the vertical axis (y-axis) to be HDI and the horizontal axis (x-axis) to be Schooling globally in the line
ggplot(aes(y = HDI, x = Schooling))
and then requested the points to be colored by Continent in the geom_jitter function.
geom_jitter(alpha = 0.6, aes(color = Continent), na.rm = TRUE)
To add one line of best fit for each Continent, we can specify the color globally as shown in the following example.
To build a scatterplot visualising the association between two numeric variables (e.g. HDI and Schooling) with a line of best fit and points color-coded by a categorical variable (e.g. continent),
currentDataset %>%
drop_na(Continent) %>%
ggplot(aes(y = HDI, x = Schooling, color = Continent)) +
geom_jitter(alpha = 0.6, na.rm = TRUE)+
geom_smooth(method = "lm", se = TRUE, level = 0.95, na.rm = TRUE, show.legend = FALSE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Years of schooling and Human development index")+
xlab("Years of schooling")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
labs(size = "Population (millions)")+
theme(legend.position = "bottom")
In this example, the color of the points is specified globally in the ggplot function
ggplot(aes(y = HDI, x = Schooling, color = Continent))
As a result, separate lines of best fit will be added for each continent.
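The effect of setting color globally versus only inside geom_jitter can be checked by counting how many lines geom_smooth fits in each case. This is a rough sketch that assumes the mpg dataset shipped with ggplot2 as a stand-in, with drv (three drive types) playing the role of Continent.

```r
library(ggplot2)

# color set globally: inherited by both geom_jitter and geom_smooth
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_jitter(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

# color set only inside geom_jitter: geom_smooth sees no grouping variable
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter(alpha = 0.6, aes(color = drv)) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

# Layer 2 is geom_smooth; count the groups for which a line was fitted
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
```

With the global mapping, one line is fitted per drv level; with the local mapping, a single overall line is fitted because the smoothing layer does not inherit the color aesthetic.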
The above plot of years of schooling and human development index is a bit busy. We can use facets to plot a separate scatterplot for each continent by adding an extra line of code.
facet_wrap( ~ Continent)
Below is the complete code for generating scatterplots by continent in different facets.
currentDataset %>%
drop_na(Continent) %>%
ggplot(aes(y = HDI, x = Schooling, color = Continent)) +
geom_jitter(alpha = 0.6, na.rm = TRUE)+
geom_smooth(method = "lm", se = TRUE, level = 0.95, na.rm = TRUE, show.legend = FALSE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Years of schooling and Human development index")+
xlab("Years of schooling")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
labs(size = "Population (millions)")+
theme(legend.position = "bottom") +
facet_wrap( ~ Continent)
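The facet layout can be adjusted further. As a hedged sketch (again assuming the mpg dataset from ggplot2 as a stand-in, with drv in place of Continent), the ncol argument of facet_wrap controls how many panels appear per row, and the color legend can be hidden because each panel strip already names its group.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_jitter(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, show.legend = FALSE) +
  facet_wrap(~ drv, ncol = 3) +       # one panel per category, three per row
  theme_bw(base_family = "sans") +
  theme(legend.position = "none")     # panel strips already label each group

# Number of panels produced by the facet specification
n_panels <- length(unique(layer_data(p, 1)$PANEL))
```

Whether to drop the legend is a design choice: keeping it is harmless, but removing it leaves more room for the panels.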