# Load the required libraries, suppressing annoying startup messages
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tibble, quietly = TRUE, warn.conflicts = FALSE)
library(ggplot2, quietly = TRUE, warn.conflicts = FALSE) # For data visualization
library(ggpubr, quietly = TRUE, warn.conflicts = FALSE) # For data visualization
# Read the mtcars dataset into a tibble called tb
data(mtcars)
tb <- as_tibble(mtcars)
# Convert relevant columns into factor variables
tb$cyl <- as.factor(tb$cyl) # cyl = {4,6,8}, number of cylinders
tb$am <- as.factor(tb$am) # am = {0,1}, 0:automatic, 1: manual transmission
tb$vs <- as.factor(tb$vs) # vs = {0,1}, engine shape, 0: V-shaped, 1: straight
tb$gear <- as.factor(tb$gear) # gear = {3,4,5}, number of gears
tb
Continuous x Continuous data (2 of 2)
Sep 17, 2023.
Exploring bivariate Continuous x Continuous data, using ggplot2
This chapter demonstrates the use of the popular ggplot2 and ggpubr packages to further explore the relationship between two continuous variables.
Data: The code at the top of this chapter prepares the mtcars data for subsequent analysis and saves it in a tibble called tb.
Scatterplot using ggplot2
ggplot(tb,
aes(x = hp, y = mpg)) +
geom_point() +
xlab("Horsepower of Engine") +
ylab("Miles per gallon") +
ggtitle("Scatter Plot of mpg vs. hp")
Discussion:
The ggplot2 package uses a layering approach: a plot is built incrementally from data, aesthetic mappings, and geometric objects.
- ggplot() initializes the plot. It takes a dataset and an aesthetic mapping that determines how the variables are plotted. Here, the dataset is tb.
- Inside aes(), short for aesthetics, the code maps hp from tb to the x-axis and mpg to the y-axis, so the plot shows the relationship between horsepower (hp) and miles per gallon (mpg).
- geom_point() adds a layer that draws each observation as a point, which produces the scatter plot.
- xlab() and ylab() set the axis labels: “Horsepower of Engine” for the x-axis and “Miles per gallon” for the y-axis.
- ggtitle() sets the plot title, here “Scatter Plot of mpg vs. hp”.
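The layering idea can be made explicit by storing the plot in an object and adding one piece at a time. This is a minimal sketch rather than part of the original analysis: the object names base and p are illustrative, and mtcars is used directly so the block runs on its own.

```r
library(ggplot2)

# Step 1: data and aesthetic mapping only; printing this shows empty axes
base <- ggplot(mtcars, aes(x = hp, y = mpg))

# Step 2: add a geometric layer so each observation is drawn as a point
p <- base + geom_point()

# Step 3: add axis labels and a title
p <- p +
  xlab("Horsepower of Engine") +
  ylab("Miles per gallon") +
  ggtitle("Scatter Plot of mpg vs. hp")

length(base$layers) # number of layers before geom_point() is added
length(p$layers)    # number of layers after geom_point() is added
p                   # draw the finished plot
```

Keeping the intermediate object around makes it easy to reuse the same data and mapping with different geoms.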
ggscatter(tb,
x = "hp", y = "mpg",
shape = 17)
# Color by continuous variable
ggscatter(tb,
x = "hp", y = "mpg",
color = "mpg") +
gradient_color(c("blue", "gray", "red"))
# Add marginal boxplots using the ggExtra package
library("ggExtra")
p <- ggscatter(tb,
x = "hp", y = "mpg",
shape = 17)
# type sets the marginal plot type; the default is "density"
ggMarginal(p, type = "boxplot")
# Same marginal boxplots, with points colored by cyl
p <- ggscatter(tb,
x = "hp", y = "mpg",
color = "cyl")
ggMarginal(p, type = "boxplot")
Scatterplot with Regression line using ggplot2
ggplot(tb, aes(x = hp, y = mpg)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
color = "blue") + # Added this line for the regression
xlab("Horsepower of Engine") +
ylab("Miles per gallon") +
ggtitle("Scatter Plot of mpg vs. hp with Regression Line")
ggplot(tb, aes(x = hp, y = mpg)) +
geom_point() +
geom_smooth(method = "lm",
se = TRUE,
color = "red") + # Added this line for the regression
xlab("Horsepower of Engine") +
ylab("Miles per gallon") +
ggtitle("Scatter Plot of mpg vs. hp with Regression Line")
Discussion:
geom_smooth(method = "lm", se = FALSE, color = "blue") adds a smoothed conditional mean to the plot.
- The method = "lm" argument fits a linear model, so the smoother is a straight regression line that depicts the overall trend in the data.
- The se argument controls whether a confidence band is drawn around the line. With se = FALSE (first plot) no band is drawn; with se = TRUE (second plot) it is. For method = "lm", the band is by default a pointwise 95% confidence interval for the mean response. If you repeatedly sampled from the population and refitted the model each time, about 95% of the intervals computed at a given hp value would contain the true mean mpg at that value.
- The color argument sets the color of the regression line: blue in the first plot, red in the second.
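To see what the line and its band correspond to numerically, the same model can be fitted with lm() and queried with predict(). This is a sketch in base R under the assumption that geom_smooth() is using its default formula (y ~ x) and level (0.95); the hp values 100, 200 and 300 and the names fit, new_hp and band are chosen only for illustration.

```r
# Fit the straight line that geom_smooth(method = "lm") draws for mpg vs. hp
fit <- lm(mpg ~ hp, data = mtcars)
coef(fit) # intercept and slope of the regression line

# Pointwise 95% confidence interval for the mean mpg at a few hp values
new_hp <- data.frame(hp = c(100, 200, 300))
band <- predict(fit, newdata = new_hp, interval = "confidence", level = 0.95)
cbind(new_hp, band) # columns: hp, fit (line), lwr and upr (band limits)
```

The lwr and upr columns are the lower and upper edges of the shaded band at each hp value.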
ggscatter(tb,
x = "hp", y = "mpg",
add = "reg.line", # Add regression line
conf.int = TRUE, # Add confidence interval
add.params = list(color = "blue",
fill = "lightgray")
) +
stat_cor(method = "pearson", label.x = 3, label.y = 30) # Add correlation coefficient
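The correlation coefficient and p-value that stat_cor() prints on the plot can be checked directly with cor.test() from base R. The sketch below uses mtcars so that it runs on its own; the name ct is illustrative.

```r
# Pearson correlation between horsepower and fuel economy
ct <- cor.test(mtcars$hp, mtcars$mpg, method = "pearson")
ct$estimate # correlation coefficient R
ct$p.value  # p-value for the test that the true correlation is zero
```

A negative estimate matches the downward slope of the regression line in the plot.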
Scatterplots with Categorical Variables
Scatterplot colored by a Categorical variable, using ggplot()
This creates a scatterplot of miles per gallon (mpg) against horsepower (hp), with each point colored according to the number of cylinders (cyl) in the engine.
# Create a Scatterplot of mpg vs. hp, colored by cyl
ggplot(tb, aes(x = hp,
y = mpg,
color = factor(cyl))) +
geom_point() +
labs(x = "Horsepower", y = "Miles per gallon") +
scale_color_discrete(name = "Cylinders") +
ggtitle("Scatter Plot of mpg vs. hp, for cyl={4,6,8}")
Discussion:
The aes() function, short for aesthetics, designates the variables and their roles in the plot. In this code:
- The hp variable is mapped to the x-axis.
- The mpg variable is mapped to the y-axis.
- The color aesthetic is mapped to cyl, the number of cylinders in the engine. Wrapping it in factor(cyl) ensures that cyl is treated as a discrete variable, so each cylinder count gets its own color instead of a shade on a continuous gradient. Because cyl was already converted to a factor in tb, the call is redundant here but harmless.
geom_point() adds a scatter plot layer: the relationship between hp and mpg is drawn as individual points, with each point’s color reflecting the number of cylinders specified in the aesthetic mapping.
labs() provides a convenient way to label the axes. Here, the x-axis is labeled “Horsepower” and the y-axis “Miles per gallon”.
scale_color_discrete() customizes the color scale for discrete variables. Setting the name argument to “Cylinders” gives the legend that title, making it clear that the point colors represent cylinder counts.
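The effect of factor() is easiest to see on the raw mtcars data, where cyl is still stored as a number. The following sketch contrasts the two mappings; the object names p_numeric and p_factor are illustrative only.

```r
library(ggplot2)

# cyl is numeric in mtcars, so mapping it directly gives a continuous color gradient
p_numeric <- ggplot(mtcars, aes(x = hp, y = mpg, color = cyl)) +
  geom_point()

# Wrapping cyl in factor() gives one distinct color per level
p_factor <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_discrete(name = "Cylinders")

levels(factor(mtcars$cyl)) # the groups that each receive their own color
p_numeric
p_factor
```

With the tibble tb from this chapter, cyl is already a factor, so both mappings would produce discrete colors.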
ggscatter(tb,
x = "hp", y = "mpg",
color = "cyl", # Color by groups "cyl"
shape = "cyl", # Change point shape by groups "cyl"
rug = TRUE # Add marginal rug
)
# Extending the regression line --> fullrange = TRUE
# Add marginal rug (marginal density) ---> rug = TRUE
ggscatter(tb,
x = "hp", y = "mpg",
add = "reg.line", # Add regression line
color = "cyl", # Color by groups "cyl"
shape = "cyl", # Change point shape by groups "cyl"
fullrange = TRUE, # Extending the regression line
rug = TRUE # Add marginal rug
) +
stat_cor(aes(color = cyl),
label.x = 3) # Add correlation coefficient
ggscatter(tb,
x = "hp", y = "mpg",
add = "reg.line", # Add regression line
conf.int = TRUE, # Add confidence interval
color = "cyl", # Color by groups "cyl"
shape = "cyl" # Change point shape by groups "cyl"
)
Scatterplot faceted by a Categorical variable, using ggplot()
This creates a scatterplot of miles per gallon (mpg) against horsepower (hp), with one panel for each number of cylinders in the engine (cyl).
# Create a Scatterplot of mpg vs. hp, faceted by cyl
ggplot(tb,
aes(x = hp,
y = mpg,
color = cyl)) +
geom_point() +
facet_grid(. ~ cyl) +
ggtitle("Scatter Plot of mpg vs. hp, for cyl={4,6,8}")
Discussion:
The foundational layer is initialized with ggplot(), which takes the dataset tb and the aesthetic mappings that determine how variables are displayed. In this code:
- hp is plotted on the x-axis.
- mpg is plotted on the y-axis.
- The color of the points is determined by the cyl variable.
The geom_point() layer draws the relationship between hp and mpg as a scatter plot. Each point’s color corresponds to the value of the cyl variable.
facet_grid() introduces faceting, which divides a plot into multiple panels based on the levels of one or more factors. Its formula has the form rows ~ columns. Here, . ~ cyl places the levels of cyl in columns, so the panels sit side by side, one for each unique value of cyl. The . before the ~ indicates that no variable is used for rows.
Finally, ggtitle() gives the plot the title “Scatter Plot of mpg vs. hp, for cyl={4,6,8}”, which indicates that the plot shows the relationship for cars with 4, 6, or 8 cylinders.
# Color by continuous variable
ggscatter(tb,
x = "hp", y = "mpg",
color = "mpg",
facet.by = "cyl"
)
ggplot(tb,
aes(x = hp,
y = mpg,
color = cyl)) +
geom_point() +
facet_grid(cyl ~ .) +
ggtitle("Scatter Plot of mpg vs. hp, Faceted Vertically by cyl")
Discussion:
The primary difference between the two ggplot() snippets lies in the formula passed to facet_grid().
- In the earlier code, facet_grid(. ~ cyl) facets horizontally: each unique cylinder count gets its own column.
- In this code, facet_grid(cyl ~ .) facets vertically: each unique cylinder count gets its own row.
ggplot(tb,
aes(x = hp,
y = mpg,
color = cyl)) +
geom_point() +
facet_wrap(~ cyl, ncol = 3) +
ggtitle("Scatter Plot of mpg vs. hp, Wrapped Facets by cyl")
Discussion:
This approach creates a wrapped grid of facets based on cyl. The ncol = 3 argument specifies that up to three facets are placed in a row before wrapping to the next row. Because cyl has three levels, all panels fit in a single row here. You can adjust ncol based on the number of levels in the faceting variable and the desired layout.
ggplot(tb,
aes(x = hp,
y = mpg,
color = cyl)) +
geom_point() +
facet_grid(cyl ~ gear) +
ggtitle("Scatter Plot of mpg vs. hp, Faceted by cyl and gear")
Discussion:
In this code, within the aes() aesthetics function:
- The variable hp is mapped to the x-axis.
- The variable mpg is mapped to the y-axis.
- The color of individual points is determined by the cyl variable, the number of cylinders in the engine.
The geom_point() function draws the relationship between hp and mpg as a scatter plot. The colors of the individual points correspond to the values of the cyl variable.
The facet_grid(cyl ~ gear) call is the standout feature in this code. Here, the plots are faceted by two categorical variables:
- cyl is mapped to rows. Each unique value of cyl generates a new row of panels.
- gear is mapped to columns. Each unique value of gear generates a new column of panels.
- The resulting grid represents combinations of cyl and gear values, with each cell showing the relationship between hp and mpg for a specific combination of cyl and gear.
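Before reading a two-way facet grid, it can help to count how many cars fall into each cell, since combinations with no observations appear as empty panels. The sketch below uses base R on mtcars; the name counts is illustrative.

```r
# Count cars in each cyl x gear combination (rows = cyl, columns = gear)
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)
counts

# Combinations with zero cars correspond to empty panels in facet_grid(cyl ~ gear)
which(counts == 0, arr.ind = TRUE)
```

Cells with only one or two cars are also worth noting, because a pattern in such a panel rests on very little data.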
# Color by cyl and facet by cyl and gear
ggscatter(tb,
x = "hp", y = "mpg",
color = "cyl",
facet.by = c("cyl","gear")
)
Scatterplot colored by a Categorical variable, with textual annotation, using ggpubr
# Textual annotation
ggscatter(tb,
x = "hp",
y = "mpg",
color = "cyl",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
label = "mpg",
repel = TRUE)
Bubble Chart
ggscatter(tb,
x = "hp", y = "mpg",
color = "cyl",
size = "wt", alpha = 0.5) +
scale_size(range = c(0.5, 15)) # Adjust the range of points size