# Load the required libraries, suppressing annoying startup messages
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tibble, quietly = TRUE, warn.conflicts = FALSE)
library(knitr) # For formatting tables
# Read the mtcars dataset into a tibble called tb
data(mtcars)
<- as_tibble(mtcars)
tb
# Convert relevant columns into factor variables
$cyl <- as.factor(tb$cyl) # cyl = {4,6,8}, number of cylinders
tb$am <- as.factor(tb$am) # am = {0,1}, 0:automatic, 1: manual transmission
tb$vs <- as.factor(tb$vs) # vs = {0,1}, v-shaped engine, 0:no, 1:yes
tb$gear <- as.factor(tb$gear) # gear = {3,4,5}, number of gears
tb
# Directly access the data columns of tb, without tb$mpg
attach(tb)
Bivariate Continuous data (Part 1 of 4)
Chapter 12.
This chapter explores Categorical x Continuous data. It explains how to summarize and visualize bivariate continuous data across categories. Here, we delve into the intersection of continuous data and categorical variables, examining how continuous data can be split, summarized, and compared across different levels of one or more categorical variables.
We bring to light methods for generating statistics per group and data manipulation techniques. This includes processes like grouping, filtering, and summarizing continuous data, contingent on categorical variables. We visualize such data by creating juxtaposed box plots, segmented histograms, and density plots that reveal the distribution of continuous data across varied categories.
Data: Suppose we run the following code to prepare the
mtcars
data for subsequent analysis and save it in a tibble calledtb
. [1]
Summarizing Continuous Data
Across one Category
- We review the use of the inbuilt functions (i)
aggregate()
; (ii)tapply()
; and the function (iii)describeBy()
from packagepysch
, to summarize continuous data split across a category.
- Using
aggregate()
- We use the
aggregate()
function to investigate the bivariate relationship between mileage (mpg
) and number of cylinders (cyl
). The following code displays a summary table showing the average mileage of the cars broken down by number of cylinders (cyl
= 4, 6, 8) usingaggregate()
. [1] [2]
# Aggregating 'mpg' column by the 'cyl' column.
# The aggregation function used is 'mean' to calculate the average.
<- aggregate(tb$mpg,
A0 by = list(tb$cyl),
FUN = mean)
# Renaming the column names of the aggregated data frame 'A0'.
# 'Cylinders' for the first column and 'Mean_mpg' for the second.
names(A0) <- c("Cylinders", "Mean_mpg")
# Displaying the aggregated data using 'kable' from the knitr package.
# Setting the number of digits to 2 for formatting and adding a caption.
kable(A0, digits=2,
caption = "Mean of Mileage (mpg) by Cylinder (cyl=4,6,8)")
Cylinders | Mean_mpg |
---|---|
4 | 26.66 |
6 | 19.74 |
8 | 15.10 |
- Discussion:
The first argument in
aggregate()
is the data vectortb$mpg
.The second argument,
by
, denotes a list of variables to group by. Here, we have suppliedtb$cyl
, since we wish to partition our data based on the unique values ofcyl
.The third argument,
FUN
, is the function we want to apply to each subset of data. We are usingmean
here, calculating the average mpg for each uniquecyl
value. We can alternately aggregate based on a variety of statistical functions includingsum
,median
,min
,max
,sd
,var
,length
,IQR
.The output of
aggregate()
is saved in a new tibble namedagg
. We utilize thenames()
function to rename the columns and displayagg
. [1]
- Using
tapply()
- The
tapply()
function is another convenient tool to apply a function to subsets of a vector, grouped by some factors.
# Applying the 'mean' function to the 'mpg' column,
# grouping the data by the 'cyl' column
# The 'tapply' function applies a function to each cell.
<- tapply(tb$mpg,
A1 $cyl,
tb
mean)print(A1)
4 6 8
26.66364 19.74286 15.10000
- Discussion:
In this code,
tapply(tb$mpg, tb$cyl, mean)
calculates the average miles per gallon (mpg
) for each unique number of cylinders (cyl
) within thetb
tibble.tb$mpg
represents the vector to which we want to apply the function.tb$cyl
serves as our grouping factor.mean
is the function that we’re applying to each subset of our data.The result will be a vector where each element is the average
mpg
for a unique number of cylinders (cyl
), as determined by the unique values oftb$cyl
. [1]
- Using
describeBy()
from packagepsych
- The
describeBy()
function, part of thepsych
package, can be used to compute descriptive statistics of a numeric variable, broken down by levels of a grouping variable. [3]
# Loading the 'psych' package
library(psych)
# Using the 'describeBy' from the 'psych' package for descriptive
# statistics of 'mpg' grouped by 'cyl'.
<- describeBy(mpg, cyl)
A2 print(A2)
Descriptive statistics by group
group: 4
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 11 26.66 4.51 26 26.44 6.52 21.4 33.9 12.5 0.26 -1.65 1.36
------------------------------------------------------------
group: 6
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 7 19.74 1.45 19.7 19.74 1.93 17.8 21.4 3.6 -0.16 -1.91 0.55
------------------------------------------------------------
group: 8
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 14 15.1 2.56 15.2 15.15 1.56 10.4 19.2 8.8 -0.36 -0.57 0.68
- Discussion:
describeBy(mpg, cyl)
computes descriptive statistics of miles per gallonmpg
variable, broken down by the unique values in the number of cylinders (cyl
).It calculates statistics such as the mean, sd, median, for
mpg
, separately for each unique number of cylinders (cyl
). [3]
Across two Categories
We extend the above discussion and study how to summarize continuous data split across two categories.
We review the use of the inbuilt functions (i)
aggregate()
and the function (ii)describeBy()
from packagepysch
. While thetapply()
function can theoretically be employed for this task, the resulting code tends to be long and lacks efficiency. Therefore, we opt to exclude it from practical use.
- Using
aggregate()
- Distribution of Mileage (
mpg
) by Cylinders (cyl
= {4,6,8}) and Transmisson Type (am
= {0,1}) [1]
# Aggregating 'mpg' column by both 'cyl' and 'am' columns.
# The aggregation function used is 'mean'.
<- aggregate(tb$mpg,
B0 by = list(tb$cyl, tb$am),
FUN = mean)
# Renaming the column names of the aggregated data frame 'B0'.
# 'Cylinders' 'Transmission' 'Mean_mpg' .
names(B0) <- c("Cylinders", "Transmission", "Mean_mpg")
# Displaying the aggregated data using 'kable' from the knitr package.
# Setting the number of digits to 2 for formatting and adding a caption.
kable(B0,
digits=2,
caption = "Mean of Mileage (mpg) by Cylinders and Transmission")
Cylinders | Transmission | Mean_mpg |
---|---|---|
4 | 0 | 22.90 |
6 | 0 | 19.12 |
8 | 0 | 15.05 |
4 | 1 | 28.08 |
6 | 1 | 20.57 |
8 | 1 | 15.40 |
- Discussion:
In our code, the first argument of
aggregate()
istb$mpg
, indicating that we want to perform computations on thempg
variable.The by argument is a list of variables by which we want to group our data, specified as
list(tb$cyl, tb$am)
. This means that separate computations are done for each unique combination ofcyl
andam
.The
FUN
argument indicates the function to be applied to each subset of our data. Here, we use mean, meaning that we compute the mean mpg for each group. [1]
- Using
aggregate()
for multiple continuous variables: Consider this extension of the above code for calculating the mean of three variables -mpg
,wt
, andhp
, grouped by both am and cyl variables:
- Distribution of Mileage (
mpg
), Weight (wt
), Horsepower (hp
) by Cylinders (cyl
= {4,6,8}) and Transmisson Type (am
= {0,1})
# Aggregating mpg, wt, hp by 'am' and 'cyl' columns.
# The 'mean' function is used to calculate the average
<- aggregate(list(mpg, wt, hp),
B1 by = list(am, cyl),
FUN = mean)
# Renaming the column names of the aggregated data frame
names(B1) <- c("Transmission", "Cylinders",
"Mean_mpg", "Mean_wt", "Mean_hp")
# Displaying the aggregated data using 'kable' from the knitr package.
# Formatting numbers to two decimal places and adding a descriptive caption.
kable(B1,
digits=2,
caption = "Mean of mpg, wt, hp by am, cyl")
Transmission | Cylinders | Mean_mpg | Mean_wt | Mean_hp |
---|---|---|---|---|
0 | 4 | 22.90 | 2.94 | 84.67 |
1 | 4 | 28.08 | 2.04 | 81.88 |
0 | 6 | 19.12 | 3.39 | 115.25 |
1 | 6 | 20.57 | 2.76 | 131.67 |
0 | 8 | 15.05 | 4.10 | 194.17 |
1 | 8 | 15.40 | 3.37 | 299.50 |
- Discussion:
In this code, the
aggregate()
function takes a list of the three variables as its first argument, indicating that the mean should be calculated for each of these variables separately within each combination ofam
andcyl
.The sequence of the categorizing variables also varies - initially, the data is grouped by
cyl
, followed by a subdivision based onam
.
- Using
aggregate()
with multiple functions: Consider an extension of the above code for calculating the mean and the SD ofmpg
, grouped by botham
andcyl
factor variables:
- Distribution of Mileage (
mpg
), by Cylinders (cyl
= {4,6,8}) and Transmission Type (am
= {0,1}) [1]
# Calculating the mean of 'mpg' grouped by 'cyl' and 'am'.
<- aggregate(tb$mpg,
agg_mean by = list(tb$cyl, tb$am),
FUN = mean)
# Calculating the standard deviation (sd) of 'mpg' by 'cyl' and 'am'.
<- aggregate(tb$mpg,
agg_sd by = list(tb$cyl, tb$am),
FUN = sd)
# Calculating the median of 'mpg' for each group of 'cyl' and 'am'.
<- aggregate(tb$mpg,
agg_median by = list(tb$cyl, tb$am),
FUN = median)
# Merging the mean and sd data frames based on 'cyl' and 'am'.
<- merge(agg_mean, agg_sd,
B2 by = c("Group.1", "Group.2"))
# Merging the above result
<- merge(B2, agg_median,
B2 by = c("Group.1", "Group.2"))
# Renaming the columns of the merged data
names(B2) <- c("Cylinders", "Transmission",
"Mean_mpg", "SD_mpg", "Median_mpg")
# Displaying the merged data
kable(B2,
digits=2, caption = "Mean, SD, Median of Mileage (mpg)
by Cylinders (cyl=4,6,8) and Transmission (am=0,1)")
Cylinders | Transmission | Mean_mpg | SD_mpg | Median_mpg |
---|---|---|---|---|
4 | 0 | 22.90 | 1.45 | 22.80 |
4 | 1 | 28.08 | 4.48 | 28.85 |
6 | 0 | 19.12 | 1.63 | 18.65 |
6 | 1 | 20.57 | 0.75 | 21.00 |
8 | 0 | 15.05 | 2.77 | 15.20 |
8 | 1 | 15.40 | 0.57 | 15.40 |
- Discussion:
We analyze our dataset to comprehend the relationships between vehicle miles per gallon (
mpg
), number of cylinders (cyl
), and type of transmission (am
).Initially, we computed the mean, standard deviation, and median of
mpg
for every unique combination ofcyl
andam
.After individual computations, we combined these results into a single, comprehensive data frame called
merged_data
. This structured dataset now clearly presents the average, variability, and median of fuel efficiency segmented by cylinder count and transmission type. [1]
- Using
describeBy()
from packagepsych
- The
describeBy()
function, part of thepsych
package, can be used to compute descriptive statistics of continuous variable, broken down by levels of a two categorical varaibles. Consider the following code:
# Selecting specific columns ('mpg', 'wt', 'hp')
<- tb[c("mpg", "wt", "hp")]
tb_columns
# Creating a list of factors ('am' and 'cyl') for grouping in describeBy.
<- list(tb$am, tb$cyl)
tb_factors
# Using the 'describeBy' function to provide descriptive statistics
# grouped by 'am' and 'cyl'.
<- describeBy(tb_columns,
B3
tb_factors)
print(B3)
Descriptive statistics by group
: 0
: 4
vars n mean sd median trimmed mad min max range skew kurtosis
mpg 1 3 22.90 1.45 22.80 22.90 1.93 21.50 24.40 2.90 0.07 -2.33
wt 2 3 2.94 0.41 3.15 2.94 0.06 2.46 3.19 0.73 -0.38 -2.33
hp 3 3 84.67 19.66 95.00 84.67 2.97 62.00 97.00 35.00 -0.38 -2.33
se
mpg 0.84
wt 0.24
hp 11.35
------------------------------------------------------------
: 1
: 4
vars n mean sd median trimmed mad min max range skew kurtosis
mpg 1 8 28.08 4.48 28.85 28.08 4.74 21.40 33.90 12.50 -0.21 -1.66
wt 2 8 2.04 0.41 2.04 2.04 0.36 1.51 2.78 1.27 0.35 -1.15
hp 3 8 81.88 22.66 78.50 81.88 20.76 52.00 113.00 61.00 0.14 -1.81
se
mpg 1.59
wt 0.14
hp 8.01
------------------------------------------------------------
: 0
: 6
vars n mean sd median trimmed mad min max range skew kurtosis
mpg 1 4 19.12 1.63 18.65 19.12 1.04 17.80 21.40 3.60 0.48 -1.91
wt 2 4 3.39 0.12 3.44 3.39 0.01 3.21 3.46 0.25 -0.73 -1.70
hp 3 4 115.25 9.18 116.50 115.25 9.64 105.00 123.00 18.00 -0.09 -2.33
se
mpg 0.82
wt 0.06
hp 4.59
------------------------------------------------------------
: 1
: 6
vars n mean sd median trimmed mad min max range skew kurtosis
mpg 1 3 20.57 0.75 21.00 20.57 0.00 19.70 21.00 1.30 -0.38 -2.33
wt 2 3 2.76 0.13 2.77 2.76 0.16 2.62 2.88 0.25 -0.12 -2.33
hp 3 3 131.67 37.53 110.00 131.67 0.00 110.00 175.00 65.00 0.38 -2.33
se
mpg 0.43
wt 0.07
hp 21.67
------------------------------------------------------------
: 0
: 8
vars n mean sd median trimmed mad min max range skew
mpg 1 12 15.05 2.77 15.20 15.10 2.30 10.40 19.20 8.80 -0.28
wt 2 12 4.10 0.77 3.81 4.04 0.41 3.44 5.42 1.99 0.85
hp 3 12 194.17 33.36 180.00 193.50 40.77 150.00 245.00 95.00 0.28
kurtosis se
mpg -0.96 0.80
wt -1.14 0.22
hp -1.44 9.63
------------------------------------------------------------
: 1
: 8
vars n mean sd median trimmed mad min max range skew kurtosis
mpg 1 2 15.40 0.57 15.40 15.40 0.59 15.00 15.80 0.8 0 -2.75
wt 2 2 3.37 0.28 3.37 3.37 0.30 3.17 3.57 0.4 0 -2.75
hp 3 2 299.50 50.20 299.50 299.50 52.63 264.00 335.00 71.0 0 -2.75
se
mpg 0.4
wt 0.2
hp 35.5
- Discussion:
We specify a subset of the dataframe
tb
that includes only the columns of interest –mpg
,wt
, andhp
and save it into a variabletb_columns
.Next, we create a list,
tb_factors
, that contains the factorsam
andcyl
.After that, we call the
describeBy()
function from thepsych
package. This function calculates descriptive statistics for each combination of levels of the factorsam
andcyl
and for each of the continuous variablesmpg
,wt
, andhp
. [3]
Visualizing Continuous Data
Let’s take a closer look at some of the most effective ways of visualizing univariate continuous data, including
Bee Swarm plots;
Stem-and-Leaf plots;
Histograms;
PDF and CDF Density plots;
Box plots;
Violin plots;
Q-Q plots.
Bee Swarm Plot
We extend a Bee Swarm plot of a one-dimensional scatter plot for a continuous variable, split by a categorical variable. [4]
Consider the following code, which generates a beeswarm plot displaying vehicle weights (
wt
) segmented by their number of cylinders (cyl
):
# Loading the 'beeswarm' package
library(beeswarm)
# Creating a bee swarm plot of the 'wt' (weight) column,
# grouped by the 'cyl' (cylinders) column.
beeswarm(tb$wt ~ tb$cyl,
main="Bee Swarm Plot of Weight (wt) by Cylinders (cyl=4,6,8)",
xlab="Number of Cylinders", # Label for the x-axis
ylab="Weight ('000 kg)", # Label for the y-axis
pch=16, # Type of points used in the plot (solid circle)
cex=0.8, # Size of the points
col=c("blue","red","darkgreen"), # Colors for different groups
horizontal = TRUE) # Orientation of the plot (horizontal)
- Discussion:
Data: We use tb\(wt ~ tb\)cyl to specify that we want a beeswarm plot for Weight (
wt
), split by no of cylinders (cyl
),Title: It is labeled “Bee Swarm Plot of Weight (wt) by Number of Cylinders”.
Axes Labels: The x-axis shows “Number of Cylinders”, while the y-axis denotes “Weight (’000 kg)”.
Data Points: Using
pch=16
, data points appear as solid circles.Size of Points: With
cex=0.8
, these circles are slightly smaller than default.Colors: The
col
parameter assigns colors (“blue”, “red”, and “dark green”) based on cylinder counts.Orientation: Set as horizontal with
horizontal=TRUE
.To summarize, this visual distinguishes vehicle weights across cylinder counts and highlights data point densities for each group. [4]
Stem-and-Leaf Plot across one Category
Suppose we wanted to visualize the distribution of a continuous variable across different levels of a categorical variable, using stem-and-leaf plots.
To illustrate, let us display vehicle weights (
wt
) separately for each transmission type (am
) using stem-and-leaf plots.
# Selecting 'wt' and 'am' columns from the 'tb' dataframe
<- tb[, c("wt", "am")]
tb3
# Splitting the 'tb3' dataframe into subsets based on the 'am' column.
<- split(tb3, tb3$am)
tb_split
# Applying a function to each subset in 'tb_split' using 'lapply'.
# Create a stem-and-leaf plot for the 'wt' column in each subset.
lapply(tb_split,
function(x) stem(x$wt))
The decimal point is at the |
2 | 5
3 | 22244445567888
4 | 1
5 | 334
The decimal point is at the |
1 | 5689
2 | 123
2 | 6889
3 | 2
3 | 6
$`0`
NULL
$`1`
NULL
- Discussion:
Column Selection: The code extracts the
wt
(weight) andam
(transmission type) columns fromtb
and saves them intb3
.Data Splitting: It then divides
tb3
into subsets based onam
values, resulting in separate groups for each transmission type.Visualization: Using
lapply()
, the code generates stem-and-leaf plots for thewt
values in each subset, showcasing weight distributions for different transmission types. In this context, it shows the distribution of vehicle weights for each transmission type (automatic and manual). [5]
Histograms across one Category
- Visualizing histograms of car mileage (
mpg
) broken down by transmission (am
=0,1)
<- split(tb$mpg, tb$am) # Split the data by 'am' variable
split_data # Splitting the 'mpg' data from the 'tb' dataframe by the 'am' variable.
<- split(tb$mpg, tb$am)
split_data
# Setting up the graphics layout for two plots side by side.
par(mfrow = c(1, 2))
# Defining a color vector for use in the histograms.
<- c("skyblue", "gold")
color_vector
# Creating a histogram for the subset with 'am' equals 0.
hist(split_data[[1]],
main = "Histogram of mpg (am = 0)",
breaks = seq(10, 35, by = 5), # bin ranges (10-15, 15-20, etc.)
xlab = "Miles per gallon",
col = color_vector[1], # Applying the first color
border = "black", # Setting the border color of the bins.
ylim = c(0, 10)) # Setting y-axis limits.
# Creating a histogram for the subset with 'am' equals 1.
hist(split_data[[2]],
main = "Histogram of mpg (am = 1)",
breaks = seq(10, 35, by = 5), # bin ranges (10-15, 15-20, etc.)
xlab = "Miles per gallon",
col = color_vector[2], # Applying the second color
border = "black", # Setting the border color of the bins.
ylim = c(0, 10)) # Setting y-axis limits.
In Appendix A1, we have alternative code written using a for loop
- Discussion:
We aim to visualize the distribution of the
mpg
values from thetb
dataset based on theam
variable, which can be either 0 or 1.Data Splitting: We segregate
mpg
values into two subsets using thesplit
function, depending on theam
values. In R, the double brackets [[ ]] are used to access the elements of a list or a specific column of a data frame.split_data[[1]]
accesses the first element of the listsplit_data
.Layout Setting: The
par
function is configured to display two plots side by side in a single row and two columns format.
Color Vector: We introduce a
color_vector
to assign distinct colors to each histogram for differentiation.Histogram: Two histograms are generated, one for each
am
value (0 and 1). These histograms use various parameters like title, x-axis label, color, and y-axis limits to provide a clear representation of the data’s distribution. [1]
Probability Density Function (PDF) across one Category
- Visualizing Probability Density Functions (PDF) of car mileage (
mpg
) broken down by transmission (am
=0,1) [1]
# Splitting 'mpg' data from the 'tb' dataframe by 'am' values.
<- split(tb$mpg, tb$am)
split_data
# Setting up the graphical layout for two plots side by side.
par(mfrow = c(1, 2))
# Defining color scheme for the plots.
<- c("skyblue", "gold")
color_vector
# Calculating the density for the subset with 'am' equals 0 and plotting it.
<- density(split_data[[1]])
dens_0 plot(dens_0,
main = "PDF of mpg for am = 0", # Title for the first plot.
xlab = "Miles per gallon", # X-axis label.
col = color_vector[1], # Applying the first color.
border = "black",
ylim = c(0, 0.15), # Setting y-axis limits.
lwd = 2) # Line width for the density plot.
polygon(dens_0, col = color_vector[1], border = "black")
# Filling under the density curve.
# Calculating the density for the subset with 'am' equals 1 and plotting it.
<- density(split_data[[2]])
dens_1 plot(dens_1,
main = "PDF of mpg for am = 1", # Title for the second plot.
xlab = "Miles per gallon", # X-axis label.
col = color_vector[2], # Applying the second color.
border = "black",
ylim = c(0, 0.15), # Setting y-axis limits.
lwd = 2) # Line width for the density plot.
polygon(dens_1, col = color_vector[2], border = "black")
# Filling under the density curve.
In Appendix A2, we have alternative code written using a for loop
- Discussion:
dens_0 <- density(split_data[[1]])
calculates the density values for the subset wheream
is 0.The subsequent plot function visualizes the density curve, setting various parameters like the title, x-axis label, color, and line width.
The polygon function fills the area under the density curve with the specified color, giving a shaded appearance to the plot.
The process is repeated for the subset where
am
is 1. The code calculates the density, plots it, and then uses the polygon function to shade the area under the curve. [1]
In Appendix A3, we demonstrate how to draw overlapping PDFs on the same plot, using base R functions.
Cumulative Density Function (CDF) across one Category
In Appendix A4, we demonstrate how to draw a CDF, using base R functions
Box Plots across one Category
- Visualizing Median using Box Plot – median weight of the cars broken down by cylinders (
cyl
=4,6,8)
# Creating a boxplot for 'mpg' values grouped by 'cyl'.
boxplot(mpg ~ cyl,
main = "Boxplot of Mileage (mpg) by Cylinders (cyl=4,6,8)",
xlab = "Mileage (mpg)", # Label for the x-axis.
ylab = "Cylinders (cyl)", # Label for the y-axis.
col = c("skyblue"), # Color of the boxplots.
horizontal = TRUE # Setting the boxplot to be horizontal.
)
- Discussion:
This code creates a visual representation of the distribution of miles per gallon (
mpg
) based on the number of cylinders (cyl
), using theboxplot
function from the base graphics package in R. down:Data Input: The formula
mpg ~ cyl
instructs R to create separate boxplots for each unique value ofcyl
, with each boxplot representing the distribution ofmpg
values for that particular cylinder count.Title Configuration:
main
specifies the title of the plot as “Boxplot of Miles Per Gallon (mpg) by Cylinders.”Axis Labels: The labels for the x-axis and y-axis are set using
xlab
andylab
, respectively. Here,xlab
labels the mileage (ormpg
), whileylab
labels the number of cylinders (cyl
).Color Choice: The
col
argument is set to “skyblue,” which colors the body of the boxplots in a light blue shade.Orientation: By setting
horizontal
toTRUE
, the boxplots are displayed in a horizontal orientation rather than the default vertical orientation.In essence, we’re visualizing the variations in car mileage based on the number of cylinders using horizontal boxplots. This type of visualization helps in understanding the central tendency, spread, and potential outliers of mileage for different cylinder counts.
- Visualizing Median using Box Plot – median weight of the cars broken down by cylinders (
cyl
=4,6,8) and Transmission (am
=0,1). [1] [7]
# Creating a boxplot for 'mpg' values grouped by both 'cyl' and 'am'.
boxplot(mpg ~ cyl * am,
main = "Boxplot of Mileage (mpg) by Cylinders and Transmission",
xlab = "Mileage (mpg)", # Label for the x-axis.
ylab = "Cylinders & Transmission (cyl:am)", # Label for the y-axis.
col = c("skyblue", "gold"), # Colors of the boxplots
horizontal = TRUE # Setting the boxplot to be horizontal.
)
# Adding a legend to the plot.
legend("topright", # Position of the legend in the plot.
legend = c("Automatic", "Manual"), # Text for the legend.
fill = c("skyblue", "gold"), # Color filling
title = "Transmission" # Title of the legend.
)
- Discussion:
This R code presents a horizontal boxplot showcasing the distribution of mileage (
mpg
) based on the interaction between the number of car cylinders (cyl
) and the type of transmission (am
).Boxplot Creation: The
boxplot
function is used to generate the visualization. With the formulampg ~ cyl * am
, we plot the distribution ofmpg
for every combination ofcyl
andam
.Color Configuration: The
col
argument specifies the colors for the boxplots. We’ve opted for “skyblue” and “gold” for differentiation. Depending on the order of factor levels in your data, one color typically represents one level ofam
(e.g., automatic) and the other color represents the second level (e.g., manual).Orientation: The
horizontal
argument, set toTRUE
, orients the boxplots horizontally.Legend Addition: Following the boxplot, we add a legend using the
legend
function. Placed at the “topright” position, this legend differentiates between “Automatic” and “Manual” transmissions using the designated colors.In essence, our code generates a detailed visualization that elucidates the mileage distribution for various combinations of cylinder counts and transmission types in cars. [7]
Means Plot across one Category
Visualizing Means – mean plot showing the average weight of the cars, broken down by transmission (am
= 0 or 1). [8]
# Loading the 'gplots' package
library(gplots)
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
# Creating a plot of mean 'mpg' values for each 'cyl' category
plotmeans(data = tb,
~ cyl, # Formula indicating 'mpg' plotted against 'cyl'.
mpg mean.labels = TRUE, # Option to display the mean value labels.
digits = 3, # Number of digits for the mean value labels.
main = "Mean(mpg) by cyl = {4,6,8}" # Main title of the plot.
)
Means Plot across two Categories
We show a mean plot showing the mean weight of the cars broken down by Transmission Type (am
= 0 or 1) & cylinders (cyl
= 4,6,8). [8]
# Loading the 'gplots' package for plotting data.
library(gplots)
# The plot will show mean 'mpg' values
# for each combination of 'am' and 'cyl' categories
plotmeans(mpg ~ interaction(am, cyl, sep = ", "),
data = tb, # Specifying the dataframe.
mean.labels = TRUE, # display mean value labels on the plot.
digits = 2, # Setting the number of digits to 2.
connect = FALSE, # Disabling the connection lines between means.
main = "Mean (mpg) by cyl = {4,6,8} & am = {0,1}",
xlab = "cyl = {4,6,8} & am = {0,1}", # Label for the x-axis.
ylab = "Average MPG" # Label for the y-axis.
)
Summary of Chapter 12 – Bivariate Continuous data (Part 1 of 4)
This chapter on R programming skillfully navigates the analysis of bivariate continuous data across categorical variables. It begins with preparing the mtcars
dataset, transforming key variables like cylinders and transmission type into categorical factors for in-depth analysis.
The chapter primarily focuses on summarizing continuous data using R functions like aggregate()
, tapply()
, and describeBy()
from the psych
package. These functions are adept at calculating and presenting summary statistics such as means and standard deviations for different categorical groups.
In terms of visualization, the chapter showcases a variety of methods to graphically represent continuous data across categories. Techniques such as Bee Swarm plots, histograms, density plots, and box plots are highlighted for their effectiveness in illustrating data distributions and variations across categorical groups.
Overall, the chapter provides a comprehensive yet concise framework for analyzing and visualizing the interplay between continuous and categorical data in R, offering valuable insights for data analysis and interpretation.
References
Basic R Programming:
[1] Chambers, J. M. (2008). Software for Data Analysis: Programming with R (Vol. 2, No. 1). Springer.
Crawley, M. J. (2012). The R Book. John Wiley & Sons.
Gardener, M. (2012). Beginning R: The Statistical Programming Language. John Wiley & Sons.
Grolemund, G. (2014). Hands-On Programming with R: Write Your Own Functions and Simulations. O’Reilly Media, Inc.
Kabacoff, R. (2022). R in Action: Data Analysis and Graphics with R and Tidyverse. Simon and Schuster.
Peng, R. D. (2016). R Programming for Data Science (pp. 86-181). Leanpub.
R Core Team. (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/.
Tippmann, S. (2015). Programming Tools: Adventures with R. Nature, 517(7532), 109-110.
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science. O’Reilly Media, Inc.
knitr:
[2] Xie, Y. (2018). knitr: A Comprehensive Tool for Reproducible Research in R. In Implementing Reproducible Research (pp. 3-31). Chapman and Hall/CRC.
psych:
[3] Revelle, W. (2020). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R Package Version 2.0.12. Retrieved from https://CRAN.R-project.org/package=psych.
beeswarm:
[4] Eklund, A. (2020). ggbeeswarm: Categorical Scatter (Violin Point) Plots. R Package Version 0.6.0. Retrieved from https://CRAN.R-project.org/package=ggbeeswarm.
Stem and Leaf:
[5] Lovie, S. (2005). Stem and Leaf Plot. Encyclopedia of Statistics in Behavioral Science.
Histograms:
[6] Scott, D. W. (1979). On Optimal and Data-Based Histograms. Biometrika, 66(3), 605-610.
Wand, M. P., & Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall/CRC.
Box Plots:
[7] McGill, R., Tukey, J. W., & Larsen, W. A. (1978). Variations of Box Plots. The American Statistician, 32(1), 12-16.
gplots:
[8] Warnes, M. G. R., Bolker, B., Bonebakker, L., Gentleman, R., Huber, W., & Liaw, A. (2016). Package ‘gplots’: Various R Programming Tools for Plotting Data.
Statistics using R:
[9] Braun, W. J., & Murdoch, D. J. (2021). A First Course in Statistical Programming with R. Cambridge University Press.
Cohen, Y., & Cohen, J. Y. (2008). Statistics and Data with R: An Applied Approach Through Examples. John Wiley & Sons.
Dalgaard, P. (2008). Introductory Statistics with R. Springer.
Davies, T. M. (2016). The Book of R: A First Course in Programming and Statistics. No Starch Press.
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage Publications.
Fox, J., & Weisberg, S. (2018). An R Companion to Applied Regression. Sage Publications.
Hyndman, R. J., & Fan, Y. (1996). Sample Quantiles in Statistical Packages. The American Statistician, 50(4), 361-365.
Matloff, N. (2011). The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
R Core Team. (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/.
Schumacker, R. E. (2014). Learning Statistics Using R. Sage Publications.
Schumacker, R., & Tomek, S. (2013). Understanding Statistics Using R. Springer Science & Business Media.
Appendix
Appendix A1
Visualizing histograms of car mileage (mpg
) broken down by transmission (am
=0,1)
Code written using a for loop
# Split the data by 'am' variable
<- split(tb$mpg, tb$am)
split_data
# Create a 1-row 2-column layout
par(mfrow = c(1, 2))
# Define the color vector
<- c("skyblue", "gold")
color_vector
# Create a histogram for each subset
for (i in 1:length(split_data)) {
hist(split_data[[i]],
main = paste("Histogram of mpg for am =", i - 1),
breaks = seq(10, 35, by = 5), # bins with ranges 10-15, 15-20
xlab = "Miles per gallon",
col = color_vector[i], # Use the color vector,
border = "black",
ylim = c(0, 10))
}
Appendix A2
Visualizing Probability Density Function (PDF) of car milegage (mpg
) broken down by transmission (am
=0,1), using for loop
# Split the data by 'am' variable
<- split(tb$mpg, tb$am)
split_data
# Create a 1-row 2-column layout
par(mfrow = c(1, 2))
# Define the color vector
<- c("skyblue", "gold")
color_vector
# Create a density plot for each subset
for (i in 1:length(split_data)) {
# Calculate density
<- density(split_data[[i]])
dens
# Plot density
plot(dens,
main = paste("PDF of mpg for am =", i - 1),
xlab = "Miles per gallon",
col = color_vector[i],
border = "black",
ylim = c(0, 0.15), # Adjust this value if necessary
lwd = 2) # line width
# Add a polygon to fill under the density curve
polygon(dens, col = color_vector[i], border = "black")
}
Appendix A3
Visualizing Probability Density Function (PDF) of car milegage (mpg
) broken down by transmission (am
=0,1), overlapping PDFs on the same plot
# Split the data by 'am' variable
<- split(tb$mpg, tb$am)
split_data
# Define the color vector
<- c("skyblue", "gold")
color_vector
# Define the legend labels
<- c("am = 0", "am = 1")
legend_labels
# Create a density plot for each subset
# Start with an empty plot with ranges accommodating both data sets
plot(0, 0,
xlim = range(tb$mpg), ylim = c(0, 0.15),
type = "n",
xlab = "Miles per gallon", ylab = "Density",
main = "PDFs of mpg for automatic, manual transmissions")
for (i in 1:length(split_data)) {
# Calculate density
<- density(split_data[[i]])
dens
# Add density plot
lines(dens,
col = color_vector[i],
lwd = 2) # line width
# Add a polygon to fill under the density curve
polygon(dens,
col = adjustcolor(color_vector[i],
alpha.f = 0.8),
border = "black")
}
# Add legend to the plot
legend("topright",
legend = legend_labels,
fill = color_vector,
border = "black")
Appendix A4
# Split the data by 'am' variable
<- split(tb$mpg, tb$am)
split_data
# Define the color vector
<- c("blue", "black")
color_vector
# Define the legend labels
<- c("am = 0", "am = 1")
legend_labels
# Create a cumulative density plot for each subset
# Start with an empty plot with ranges accommodating both data sets
plot(0, 0, xlim = range(mtcars$mpg), ylim = c(0, 1), type = "n",
xlab = "Miles per gallon", ylab = "Cumulative Density",
main = "CDFs of Mileage (mpg) for automatic, manual transmissions")
for (i in 1:length(split_data)) {
# Calculate empirical cumulative density function
<- ecdf(split_data[[i]])
ecdf_func
# Add CDF plot using curve function
curve(ecdf_func(x),
from = min(split_data[[i]]), to = max(split_data[[i]]),
col = color_vector[i],
add = TRUE,
lwd = 2) # line width
}# Add legend to the plot
legend("bottomright",
legend = legend_labels,
fill = color_vector,
border = "black")