# StatsNotebook

## Scatterplot

23 Oct, 2020 | 8 minutes read

Tags: DataViz, R code

The tutorial is based on R and StatsNotebook, a graphical interface for R.

Scatterplot can be used to visualise the association between two numeric variables. StatsNotebook uses the `geom_jitter` function from the `ggplot2` library to build scatterplot.

We use the built-in UNDP dataset in this example. This dataset can be loaded into StatsNotebook using instruction here or can be downloaded from here . This is a dataset of 199 countries compiled from the United Nations Development Programme.

We will use the following three variables from this dataset

1. HDI - Human Development Index in 2018
2. GDP - Gross Domestic Product per capital in 2018
3. Pop - Population (in millions) in 2018
4. Continent - Continent
5. Schooling - Expected years of schooling

This dataset can also be loaded using the following codes

``````library(tidyverse)
``````

In this example, we will build

To build a simple scatterplot visualising association between two numeric variables (e.g. HDI and GDP),

1. Click DataViz at the top
2. Click Correlation
3. Select Scatterplot from the menu
4. In the Scatterplot panel, select HDI to Vertical Axis and GDP to Horizontal Axis.
5. Expand the Scatterplot Setting panel and uncheck Add a fitted line (A line of best fit will be added to the plot by default).
6. Expand the Label, Settings and Theme panel
• Click Switch to Advanced Mode for customisable setting (e.g. font size, color schemes, etc).
• For Title, we used “Scatterplot of Gross domestic product and Human development index”.
• For Horizontal axis, we used “Gross domestic product (GDP, 2018 US\$)”.
• For Vertical axis, we used “Human development index”.
7. Click Code and Run
``````currentDataset %>%
ggplot(aes(y = HDI, x = GDP)) +
geom_jitter(alpha = 0.6, na.rm = TRUE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Gross domestic product and Human development index")+
xlab("Gross domestic product (GDP, 2018 US\$)")+
ylab("Human development index")+
theme(legend.position = "bottom")
``````

The variables for the vertial (y) and horizontal (x) are specified in the function `ggplot`. For scatterplot, we use `geom_jitter` instead of `geom_point` to add a small amount of noise to each point so that the points will not be overlapping with each other. To build a bubble plot visualising the association between two numeric variables (e.g. HDI and GDP), with point size determined by a third continuous variable (e.g. Population size) and points color-coded by a categorical variable (e.g. continent)

1. Click DataViz at the top
2. Click Correlation
3. Select Scatterplot from the menu
4. In the Scatterplot panel, select HDI to Vertical Axis, GDP to Horizontal Axis, Continent to Fill color, and Pop to Size.
5. Expand the Scatterplot Setting panel and uncheck Add a fitted line (A line of best fit will be added to the plot by default).
6. Expand the Label, Settings and Theme panel
• Click Switch to Advanced Mode for customisable setting (e.g. font size, color schemes, etc).
• For Title, we used “Scatterplot of Gross domestic product and Human development index”.
• For Horizontal axis, we used “Gross domestic product (GDP, 2010 US\$)”.
• For Vertical axis, we used “Human development index”.
• For Color fill label, we used “Continent”.
• For Size label, we used “Population (millions)”.
• Since we have two legends, one for Color and one for Size, we set legend position to “Right”.
7. Click Code and Run
``````currentDataset %>%
drop_na(Continent, Pop) %>%
ggplot(aes(y = HDI, x = GDP, size = Pop)) +
geom_jitter(alpha = 0.5, aes(color = Continent), na.rm = TRUE)+
scale_size(range = c(0.1, 8))+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Gross domestic product and Human development index")+
xlab("Gross domestic product (GDP, 2018 US\$)")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
labs(size = "Population (millions)")+
theme(legend.position = "bottom")

`````` Part of the legends is cut-off (bottom left) because we are running out of space. We can place the legend to the right hand size of the plot by changing the last line of codes.

``````theme(legend.position = "right")
`````` The size of the bubble can be changed by changing the following line

``````scale_size(range = c(0.1, 8))
``````

The minimum point size is 0.1 and the maximum is 8. The following code changes the minimum to 0.3 and the maximum to 15.

``````scale_size(range = c(0.3, 15))
`````` To build a scatterplot visualising the association between two numeric variables (e.g. HDI and Schooling) with points color-coded by a categorical variable (e.g. continent) and an overall line of best fit,

1. Click DataViz at the top
2. Click Correlation
3. Select Scatterplot from the menu
4. In the Scatterplot panel, select HDI to Vertical Axis, Schooling to Horizontal Axis, and Continent to Fill color.
5. Expand the Scatterplot Setting panel
• check Add a fitted line (A line of best fit will be added to the plot by default).
• check By fill color.
6. Expand the Label, Settings and Theme panel
• Click Switch to Advanced Mode for customisable setting (e.g. font size, color schemes, etc).
• For Title, we used “Scatterplot of Years of schooling and Human development index”.
• For Horizontal axis, we used “Years of schooling”.
• For Vertical axis, we used “Human development index”.
• For Color fill label, we used “Continent”.
• For Size label, we used “Population (millions)”.
• Since we have two legends, one for Color and one for Size, we set legend position to “Right”.
7. Click Code and Run
``````currentDataset %>%
drop_na(Continent) %>%
ggplot(aes(y = HDI, x = Schooling)) +
geom_jitter(alpha = 0.6, aes(color = Continent), na.rm = TRUE)+
geom_smooth(method = "lm", se = TRUE, level = 0.95, na.rm = TRUE, show.legend = FALSE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Years of schooling and Human development index")+
xlab("Years of schooling")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
theme(legend.position = "bottom")
``````

In the above R codes, we have specified the vertial axis (y-axis) to be HDI and horizontal to be Schooling globally in the line

``````ggplot(aes(y = HDI, x = Schooling))
``````

and then requested the points to be colored by Continent in the `geom_jitter` function.

``````geom_jitter(alpha = 0.6, aes(color = Continent), na.rm = TRUE)
``````

To specify one line of best fit for each Continent, we can specified the color globally as shown in the following example. To build a scatterplot visualising the association between two numeric variables (e.g. HDI and Schooling) with a line of best fit and points color-coded by a categorical variable (e.g. continent),

1. Click DataViz at the top
2. Click Correlation
3. Select Scatterplot from the menu
4. In the Scatterplot panel, select HDI to Vertical Axis, Schooling to Horizontal Axis, and Continent to Fill color.
5. Expand the Scatterplot Setting panel, check Add a fitted line (A line of best fit will be added to the plot by default).
6. Expand the Label, Settings and Theme panel
• Click Switch to Advanced Mode for customisable setting (e.g. font size, color schemes, etc).
• For Title, we used “Scatterplot of Years of schooling and Human development index”.
• For Horizontal axis, we used “Years of schooling”.
• For Vertical axis, we used “Human development index”.
• For Color fill label, we used “Continent”.
• For Size label, we used “Population (millions)”.
• Since we have two legends, one for Color and one for Size, we set legend position to “Right”.
7. Click Code and Run
``````currentDataset %>%
drop_na(Continent) %>%
ggplot(aes(y = HDI, x = Schooling, color = Continent)) +
geom_jitter(alpha = 0.6, na.rm = TRUE)+
geom_smooth(method = "lm", se = TRUE, level = 0.95, na.rm = TRUE, show.legend = FALSE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Years of schooling and Human development index")+
xlab("Years of schooling")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
labs(size = "Population (millions)")+
theme(legend.position = "bottom")
``````

In this example, the color of the point is specified globally in the `ggplot` function

``````ggplot(aes(y = HDI, x = Schooling, color = Continent))
``````

As a result, separate lines of best fit will be added for each continent. The above plot of years of schooling and human development index is a bit busy. We can use the `facet` to plot separate scatterplots by continent by adding an extra line of codes.

``````    facet_wrap( ~ Continent)
``````

Below is the complete code for generating scatterplots by continent in different facets.

``````currentDataset %>%
drop_na(Continent) %>%
ggplot(aes(y = HDI, x = Schooling, color = Continent)) +
geom_jitter(alpha = 0.6, na.rm = TRUE)+
geom_smooth(method = "lm", se = TRUE, level = 0.95, na.rm = TRUE, show.legend = FALSE)+
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set2")+
theme_bw(base_family = "sans")+
ggtitle("Scatterplot of Years of schooling and Human development index")+
xlab("Years of schooling")+
ylab("Human development index")+
labs(color = "Continent", fill = "Continent")+
labs(size = "Population (millions)")+
theme(legend.position = "bottom") +
facet_wrap( ~ Continent)
`````` 