Exploring Dataframes (Part 1 of 2)
Chapter 5.
This chapter examines dataframes in R in depth, using the mtcars dataset throughout.
- The mtcars dataset ships with R and was originally sourced from the 1974 Motor Trend US magazine. It records fuel consumption and 10 other aspects of car design and performance for 32 vehicles from the 1973-74 model years. [1]
Next, we walk through R code for exploring a dataframe, step by step. We begin with eight basic functions for reviewing a dataframe. [2] [7]
- To load the mtcars dataset in R, use this command:
data(mtcars)
Reviewing a dataframe
View(): This function opens the dataset in a spreadsheet-style data viewer.
View(mtcars)
head(): This function prints the first six rows of the dataframe.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(): This function prints the last six rows of the dataframe.
tail(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
dim(): This function retrieves the dimensions of a dataframe, i.e., the number of rows and columns.
dim(mtcars)
[1] 32 11
nrow(): This function retrieves the number of rows in the dataframe.
nrow(mtcars)
[1] 32
ncol(): This function retrieves the number of columns in the dataframe.
ncol(mtcars)
[1] 11
names(): This function retrieves the column names of a dataframe.
colnames(): This function also retrieves the column names of a dataframe.
names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
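The review functions above are closely related, which a short check can make concrete. The sketch below is illustrative only; it works on a copy named `cars` (a hypothetical name, chosen so the built-in dataset is left untouched).

```r
# Work on a copy so the built-in dataset is not modified
cars <- datasets::mtcars

# dim() is the pair (number of rows, number of columns)
identical(dim(cars), c(nrow(cars), ncol(cars)))   # TRUE

# For a dataframe, names() and colnames() return the same vector
identical(names(cars), colnames(cars))            # TRUE

# Row labels (here, the car models) are stored separately from the columns
head(rownames(cars), 3)
```

Note that the car models are row names, not a column, which is why they do not appear in the output of names().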
Accessing data within a dataframe
$
- In R, the dollar sign $ is an operator that lets us retrieve a specific column from a dataframe or a named element from a list. [2]
- For instance, consider the dataframe mtcars. If we wish to fetch the data from the mpg (miles per gallon) column, we would use the code mtcars$mpg. This will yield a vector containing the data from the mpg column. [2] [7]
# Extract the mpg column in mtcars dataframe as a vector
mpg_vector <- mtcars$mpg
# Print the mpg vector
print(mpg_vector)
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
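[[ or [
The usage of $ is limited because it takes the name after it literally: it does not substitute the value of a character variable, so it cannot be used for dynamic column access inside functions. In such cases, we can use the double square brackets [[ or single square brackets [.
As an example, suppose we have a character string stored in a variable var as var <- "mpg".
Here, the code mtcars$var will not return the mpg column. It returns NULL, because mtcars has no column named var.
However, if we instead use the code mtcars[[var]] or mtcars[, var], we will get the mpg column as a vector. (The form mtcars[var], without the comma, returns a one-column dataframe rather than a vector.)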
# Let's say we have a variable var
var <- "mpg"
# Now we can access the mpg column in mtcars dataframe using [[
mpg_data1 <- mtcars[[var]]
print(mpg_data1)
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
# Alternatively, we can use [
mpg_data2 <- mtcars[, var]
print(mpg_data2)
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
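The access forms discussed in this section return different kinds of objects, which matters when the result is passed on to another function. The following sketch compares them side by side; `cars` and the `by_*` names are hypothetical and used only for illustration.

```r
cars <- datasets::mtcars   # working copy
var <- "mpg"

by_dollar <- cars$var      # looks for a column literally named "var"
by_double <- cars[[var]]   # numeric vector
by_single <- cars[var]     # one-column dataframe
by_comma  <- cars[, var]   # numeric vector (a single column is simplified)

is.null(by_dollar)               # TRUE
class(by_double)                 # "numeric"
class(by_single)                 # "data.frame"
identical(by_double, by_comma)   # TRUE
```

When a plain vector is needed, [[ is usually the safer choice because its result does not depend on how many columns are selected.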
Data Structures
In R, the str() and class() functions are essential for understanding data structures. str() reveals the detailed structure of an object, such as the mtcars dataset, providing a clear view of how the data is composed. The class() function identifies an object's class, which determines the methods R applies to it. It distinguishes, for example, numeric vectors, character vectors, and data frames.
str(): This function displays the internal structure of an R object. [2] [7]
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
class(): This function is used to determine the class or data type of an object. It returns a character vector specifying the class or classes of the object.
x <- c(1, 2, 3) # Create a numeric vector
class(x) # Output: "numeric"
[1] "numeric"
y <- "Hello, My name is Sameer Mathur!" # Create a character vector
class(y) # Output: "character"
[1] "character"
class(x) returns “numeric” because x is a numeric vector. Similarly, class(y) returns “character” because y is a character vector.
z <- data.frame(a = 1:5, b = letters[1:5]) # Create a data frame
class(z) # Output: "data.frame"
[1] "data.frame"
class(z) returns “data.frame” because z is a data frame.
sapply(mtcars, class)
mpg cyl disp hp drat wt qsec vs
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
am gear carb
"numeric" "numeric" "numeric"
Factors
In R, factors are a specific data type used for representing categorical variables or data with discrete levels or categories. They are employed to store data that has a limited number of distinct values, such as “male” or “female,” “red,” “green,” or “blue,” or “low,” “medium,” or “high.” [3]
Factors in R consist of both values and levels. The values represent the actual data, while the levels correspond to the distinct categories or levels within the factor. Factors are particularly useful for statistical analysis as they facilitate the representation and analysis of categorical data efficiently.
For example, to change the data type of the am, cyl, vs, and gear variables in the mtcars dataset to factors, we can use the factor() function. Here’s an example demonstrating how to achieve this:
# Convert variables to factors
mtcars$am <- factor(mtcars$am)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
The code above applies the factor() function to each variable and assigns the result back to that variable, which changes its data type to factor. This conversion retains the original values while establishing levels based on the distinct values present in each variable.
After executing this code, the am, cyl, vs, and gear variables in the mtcars dataset will be of the factor data type. We can verify this by re-running the str() function:
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
$ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
$ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
- Levels of a factor variable:
The levels() function can be used to extract the distinct levels or categories of a factor variable. [3]
For example, after the cyl variable is converted to a factor, executing levels(mtcars$cyl) returns a character vector of its levels, here “4”, “6”, and “8”:
levels(mtcars$cyl)
[1] "4" "6" "8"
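It is important to note that, by default, factor() orders the levels by sorting the distinct values, not by their order of appearance in the data. This is why the output reads “4”, “6”, “8” even though the first car in the dataset has 6 cylinders.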
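A small sketch of how factor() chooses the level order, and how to override it, may help here. The `sizes` vector is made-up data used only for illustration.

```r
sizes <- c("low", "high", "medium", "low", "high")   # hypothetical data

# Default: levels are the sorted distinct values
f_default <- factor(sizes)
levels(f_default)       # "high" "low" "medium"
as.integer(f_default)   # 2 1 3 2 1  (position of each value among the levels)

# Explicit level order, supplied through the levels argument
f_explicit <- factor(sizes, levels = c("low", "medium", "high"))
levels(f_explicit)      # "low" "medium" "high"
table(f_explicit)       # counts reported in the chosen order
```

Supplying levels explicitly is the usual way to get a meaningful order for categories such as low, medium, and high, where alphabetical sorting would be misleading.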
To change the base level of a factor variable in R, we can use the relevel() function. It moves the chosen level to the first position, making it the new base (reference) level.
# Assuming `cyl` is a factor variable with levels "4", "6", and "8"
mtcars$cyl <- relevel(mtcars$cyl, ref = "6")
In the code above, we apply the relevel() function to the cyl variable, specifying ref = "6" to set “6” as the new base level.
After executing this code, the levels of the mtcars$cyl factor variable will be reordered to “6”, “4”, and “8”, with “6” as the new base level.
For convenience, we will change the base level back to “4”.
# `cyl` is now a factor variable with levels ordered "6", "4", and "8"
mtcars$cyl <- relevel(mtcars$cyl, ref = "4")
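To confirm that relevel() changes only the order of the levels and not the data, the following sketch compares a factor before and after releveling. It uses standalone vectors `cyl` and `cyl6` (hypothetical names) rather than modifying mtcars.

```r
cyl <- factor(datasets::mtcars$cyl)
levels(cyl)                        # "4" "6" "8"

cyl6 <- relevel(cyl, ref = "6")
levels(cyl6)                       # "6" "4" "8"

# The underlying values are unchanged; only the level order differs
identical(as.character(cyl), as.character(cyl6))   # TRUE
```

The base level matters mostly in modeling functions, where it serves as the reference category against which the other levels are compared.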
droplevels(): This function removes unused factor levels, that is, levels that do not appear in the data, so that the factor only includes relevant levels.
# Remove unused levels from `cyl`
mtcars$cyl <- droplevels(mtcars$cyl)
# Check the levels of `cyl` after removing unused levels
levels(mtcars$cyl)
[1] "4" "6" "8"
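A case in which droplevels() has a visible effect arises after filtering rows. The sketch below, which uses a hypothetical copy `cars` and subset `small`, removes the 8-cylinder cars and then drops the level that is left unused.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)

# Keep only cars that do not have 8 cylinders
small <- cars[cars$cyl != "8", ]
levels(small$cyl)   # "4" "6" "8"  (level "8" survives the filtering)
table(small$cyl)    # the count for "8" is 0

small$cyl <- droplevels(small$cyl)
levels(small$cyl)   # "4" "6"
```

Filtering alone never removes levels, so an empty category would otherwise keep appearing in tables and plots.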
We can apply droplevels() to mtcars$cyl to remove any unused levels from the factor variable. In this case all three levels were present in the data, and therefore nothing was removed.
cut(): This function allows us to convert a continuous variable into a factor variable by dividing it into intervals or bins. This is useful when we want to group numeric data into categories or levels. [3]
# Create a new factor variable `mpg_category` by cutting `mpg` into intervals
mtcars$mpg_category <- cut(mtcars$mpg,
                           breaks = c(0, 20, 30, Inf),
                           labels = c("Low", "Medium", "High"))
# Summarize the resulting `mpg_category` variable
summary(mtcars$mpg_category)
Low Medium High
18 10 4
In the code above, a new factor variable called mpg_category is generated from the mpg (miles per gallon) variable in the mtcars dataset. This is achieved using the cut() function, which segments the mpg values into distinct intervals and assigns factor labels.
The cut() function takes several arguments:
- mtcars$mpg is the variable to be divided.
- breaks specifies the cutoff points for interval creation. Here, breaks is defined as c(0, 20, 30, Inf), which creates three intervals: values above 0 and up to 20, values above 20 and up to 30, and values above 30. By default, each interval includes its upper break and excludes its lower one.
- labels assigns labels to the resulting factor levels. In this instance, the labels “Low”, “Medium”, and “High” correspond to the respective intervals.
Having demonstrated how to create the new column mpg_category, we will now drop this column from the dataframe.
# Drop the column `mpg_category`
mtcars$mpg_category <- NULL
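How cut() treats a value that falls exactly on a break depends on its right argument. The sketch below uses made-up values placed on and just above the breaks to show both settings; `x`, `brk`, and `lab` are hypothetical names.

```r
x   <- c(20, 20.1, 30, 30.1)   # hypothetical values on and near the breaks
brk <- c(0, 20, 30, Inf)
lab <- c("Low", "Medium", "High")

# Default (right = TRUE): intervals are (0,20], (20,30], (30,Inf]
right_closed <- cut(x, breaks = brk, labels = lab)

# right = FALSE: intervals are [0,20), [20,30), [30,Inf)
left_closed <- cut(x, breaks = brk, labels = lab, right = FALSE)

as.character(right_closed)   # "Low" "Medium" "Medium" "High"
as.character(left_closed)    # "Medium" "Medium" "High" "High"
```

With the default setting, a car rated at exactly 20 mpg would be labeled “Low”, so the choice of right is worth stating whenever the breaks coincide with plausible data values.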
Logical operations
Here are some functions for logical operations in R. [4] [7]
subset(): This function returns a subset of a data frame according to condition(s).
# Find cars that have cyl = 4 and mpg < 22
subset(mtcars, cyl == 4 & mpg < 22)
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# Find cars that have wt > 5 or mpg < 15
subset(mtcars, wt > 5 | mpg < 15)
mpg cyl disp hp drat wt qsec vs am gear carb
Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
which(): This function returns the indices of the elements of a vector that satisfy a condition.
# Find the indices of rows where mpg > 20
indices <- which(mtcars$mpg > 20)
indices
[1] 1 2 3 4 8 9 18 19 20 21 26 27 28 32
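Indices returned by which() are most useful when mapped back to something readable. The sketch below, using a hypothetical copy `cars`, looks up the row names of the cars above 30 mpg and also shows which.max(), a related base R function that returns the position of the largest value.

```r
cars <- datasets::mtcars

idx <- which(cars$mpg > 30)
idx                    # 18 19 20 28
rownames(cars)[idx]    # "Fiat 128" "Honda Civic" "Toyota Corolla" "Lotus Europa"

# Position of the single highest mpg value
which.max(cars$mpg)    # 20
```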
ifelse(): This function applies a logical condition to a vector and returns a new vector with values depending on whether the condition is TRUE or FALSE.
# Create a new column "high_mpg" based on mpg > 20
mtcars$high_mpg <- ifelse(mtcars$mpg > 20, "Yes", "No")
- Dropping a column: We can drop a column by setting it to NULL. [7]
# Drop the column "high_mpg"
mtcars$high_mpg <- NULL
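Because ifelse() works element by element, calls can be nested to produce more than two outcomes. The sketch below uses a short made-up vector `mpg_sample` and the same thresholds of 20 and 30 used elsewhere in this chapter.

```r
mpg_sample <- c(14.3, 21.0, 33.9)   # hypothetical sample values

category <- ifelse(mpg_sample > 30, "High",
                   ifelse(mpg_sample > 20, "Medium", "Low"))
category   # "Low" "Medium" "High"
```

For more than a few categories, cut() is usually easier to read than deeply nested ifelse() calls.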
all(): If every element in a vector satisfies a logical criterion, this function returns TRUE; otherwise, it returns FALSE.
# Check if all values in mpg column are greater than 20
all(mtcars$mpg > 20)
[1] FALSE
any(): If at least one element in a vector satisfies a logical criterion, this function returns TRUE; otherwise, it returns FALSE.
# Check if any of the values in the mpg column are greater than 20
any(mtcars$mpg > 20)
[1] TRUE
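Between the two extremes reported by all() and any(), it is often useful to know how many elements satisfy a condition. Since R treats TRUE as 1 and FALSE as 0 in arithmetic, sum() and mean() give the count and the proportion. A minimal sketch, with `mpg` and `above_20` as hypothetical names:

```r
mpg <- datasets::mtcars$mpg
above_20 <- mpg > 20     # logical vector

sum(above_20)            # 14 cars exceed 20 mpg
mean(above_20)           # 0.4375, the proportion of such cars

# Some but not all cars exceed 20 mpg
any(above_20) && !all(above_20)   # TRUE
```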
- Subsetting based on a condition:
A logical expression inside square brackets [] can be used to subset the mtcars dataset according to one or more conditions. [4] [7]
# Subset mtcars based on mpg > 20
mtcars_subset <- mtcars[mtcars$mpg > 20, ]
mtcars_subset
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
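Square brackets can combine several conditions and select columns in the same step. The sketch below, using hypothetical names `cars`, `picked`, and `via_subset`, keeps cars with mpg above 20 and hp above 100, and checks that subset() gives the same result.

```r
cars <- datasets::mtcars

# Rows: both conditions must hold; columns: only mpg and hp
picked <- cars[cars$mpg > 20 & cars$hp > 100, c("mpg", "hp")]
nrow(picked)   # 5

# The same selection written with subset()
via_subset <- subset(cars, mpg > 20 & hp > 100, select = c(mpg, hp))
isTRUE(all.equal(picked, via_subset))   # TRUE
```

The bracket form requires the dataframe name in front of each column, while subset() looks column names up inside the dataframe, which makes it shorter for interactive use.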
sort(): This function arranges a vector in increasing or decreasing order.
sort(mtcars$mpg) # increasing order
[1] 10.4 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7
[16] 19.2 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3 30.4 30.4
[31] 32.4 33.9
sort(mtcars$mpg, decreasing = TRUE) # decreasing order
[1] 33.9 32.4 30.4 30.4 27.3 26.0 24.4 22.8 22.8 21.5 21.4 21.4 21.0 21.0 19.7
[16] 19.2 19.2 18.7 18.1 17.8 17.3 16.4 15.8 15.5 15.2 15.2 15.0 14.7 14.3 13.3
[31] 10.4 10.4
order(): This function returns the permutation of indices that sorts its first argument into ascending or descending order. Using these indices inside square brackets sorts the rows of a dataframe.
mtcars[order(mtcars$mpg), ] # ascending order
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
For descending order, we can instead write the following code: mtcars[order(-mtcars$mpg), ]
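order() also accepts several sort keys, with later keys breaking ties in earlier ones. The sketch below sorts a hypothetical copy `cars` by number of cylinders in ascending order and, within each cylinder group, by mpg in descending order.

```r
cars <- datasets::mtcars   # fresh copy, so cyl is still numeric here

by_cyl_then_mpg <- cars[order(cars$cyl, -cars$mpg), c("cyl", "mpg")]
head(by_cyl_then_mpg, 2)   # Toyota Corolla (33.9), then Fiat 128 (32.4)
```

Negating a column works only for numeric keys. For other types, order() has a decreasing argument that can be used instead.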
Statistical functions
Statistical functions in R, such as mean(), median(), sd(), var(), cor(), and unique(), provide fundamental tools for data analysis. mean() calculates the arithmetic mean, offering an average value. median() determines the middle value in a dataset, providing another measure of central tendency. sd() calculates the standard deviation, indicating data variability. var() computes the variance, which is the square of the standard deviation. cor() assesses the correlation between variables, essential for understanding relationships in data. Lastly, unique() extracts the distinct elements of a vector, which is useful for identifying variety within a dataset. These functions, demonstrated here using the mtcars dataset, are key in statistical analysis and data exploration. [5] [7]
mean(mtcars$mpg)
[1] 20.09062
median(mtcars$mpg)
[1] 19.2
sd(mtcars$mpg)
[1] 6.026948
var(mtcars$mpg)
[1] 36.3241
cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594
unique(mtcars$mpg)
[1] 21.0 22.8 21.4 18.7 18.1 14.3 24.4 19.2 17.8 16.4 17.3 15.2 10.4 14.7 32.4
[16] 30.4 33.9 21.5 15.5 13.3 27.3 26.0 15.8 19.7 15.0
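Two relationships among these functions are worth checking directly: the standard deviation is the square root of the variance, and cor() can be applied to several columns at once to produce a correlation matrix. A brief sketch using a hypothetical copy `cars`:

```r
cars <- datasets::mtcars

# sd() is the square root of var()
isTRUE(all.equal(sd(cars$mpg), sqrt(var(cars$mpg))))   # TRUE

# Correlation matrix for three numeric columns, rounded for readability
cor_matrix <- round(cor(cars[, c("mpg", "wt", "hp")]), 2)
cor_matrix

# Number of distinct mpg values
length(unique(cars$mpg))   # 25
```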
Summarizing a dataframe
Summarizing a continuous data column
summary(): This function is a convenient tool to generate basic descriptive statistics for our dataset. It provides a succinct snapshot of the distribution characteristics of our data. [5] [7]
summary(mtcars$mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
- When applied to a vector or a specific column in a dataframe, it generates the following:
Min: This represents the smallest recorded value in the mpg column.
1st Qu: This indicates the first quartile or the 25th percentile of the mpg column. It implies that 25% of all mpg values fall below this threshold.
Median: This value signifies the median or the middle value of the mpg column, also known as the 50th percentile. Half of the mpg values are less than this value.
Mean: This denotes the average value of the mpg column.
3rd Qu: This represents the third quartile or the 75th percentile of the mpg column. It shows that 75% of all mpg values are less than this value.
Max: This indicates the highest value observed in the mpg column.
When we use summary(mtcars$mpg), it returns these six statistics for the mpg (miles per gallon) column in the mtcars dataset.
When used with an entire dataframe, summary() applies to each column individually and provides a quick overview of the data.
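The quartiles reported by summary() can be reproduced with quantile(), which also makes the percentile interpretation easy to verify. A minimal sketch, with `mpg` and `q` as hypothetical names:

```r
mpg <- datasets::mtcars$mpg

q <- quantile(mpg, probs = c(0.25, 0.5, 0.75))
q                    # 15.425, 19.2, 22.8 (summary() shows these rounded)

# Share of cars below the first quartile
mean(mpg < q[1])     # 0.25
```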
Summarizing a categorical data column
summary(mtcars$cyl)
4 6 8
11 7 14
- The output of summary(mtcars$cyl) displays the frequency distribution of the levels within the cyl factor variable. It shows the count of each level, which in this case are “4”, “6”, and “8”.
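The same counts can be obtained with table(), and prop.table() converts them to proportions. The sketch below uses a standalone factor `cyl` (a hypothetical name) built from the original numeric column.

```r
cyl <- factor(datasets::mtcars$cyl)

counts <- table(cyl)
counts               # 4: 11, 6: 7, 8: 14

shares <- prop.table(counts)
round(shares, 3)     # 0.344, 0.219, 0.438
```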
summary(mtcars)
mpg cyl disp hp drat
Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am gear carb
Min. :1.513 Min. :14.50 0:18 0:19 3:15 Min. :1.000
1st Qu.:2.581 1st Qu.:16.89 1:14 1:13 4:12 1st Qu.:2.000
Median :3.325 Median :17.71 5: 5 Median :2.000
Mean :3.217 Mean :17.85 Mean :2.812
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:4.000
Max. :5.424 Max. :22.90 Max. :8.000
Creating new functions in R
- We illustrate how to create a custom function in R that computes the mean of any given numeric column in the mtcars dataframe [6] [7]
# Function creation
compute_average <- function(df, column) {
  # Compute the average of the specified column
  average_val <- mean(df[[column]], na.rm = TRUE)

  # Return the computed average
  return(average_val)
}
# Utilize the created function
average_mpg <- compute_average(mtcars, "mpg")
print(average_mpg)
[1] 20.09062
average_hp <- compute_average(mtcars, "hp")
print(average_hp)
[1] 146.6875
In the above code, compute_average is a custom function which takes two arguments: a dataframe (df) and a column name as a string. The function computes the mean of the specified column in the provided dataframe, with na.rm = TRUE ensuring that NA values (if any) are removed before the mean calculation.
After defining the function, we utilize it to calculate the average values of the mpg and hp columns in the mtcars dataframe. These computed averages are then printed.
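A custom function becomes more robust when it checks its inputs before computing. The following sketch is one possible variant, not part of the original chapter code: the name `compute_average_checked` and its error messages are hypothetical.

```r
# Variant of compute_average that validates its inputs first
compute_average_checked <- function(df, column) {
  stopifnot(is.data.frame(df), is.character(column), length(column) == 1)
  if (!column %in% names(df)) {
    stop("Column not found: ", column)
  }
  if (!is.numeric(df[[column]])) {
    stop("Column is not numeric: ", column)
  }
  mean(df[[column]], na.rm = TRUE)
}

cars <- datasets::mtcars

# Apply the function to several columns at once
averages <- sapply(c("mpg", "hp"), compute_average_checked, df = cars)
averages   # mpg 20.09062, hp 146.6875
```

Because the function takes the column name as a string and uses [[ internally, it can be applied over a vector of names with sapply(), which would not be possible with $.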
Summary of Chapter 5 – Exploring Dataframes
Chapter 5 offers an in-depth exploration of dataframes in R, using the mtcars dataset. It begins by introducing essential functions for examining dataframes, like View(), head(), tail(), and dim(), and progresses to accessing data with $ and square brackets. The chapter also covers data structures, emphasizing factors in R and their relevance in statistical modeling. Logical operations in R are explored through functions like subset(), which(), and ifelse(). Statistical analysis is addressed through functions like mean(), median(), and cor(). The chapter concludes with custom function creation, which extends R’s functionality for specific tasks.
References
[1] Krasser, R. (2023, October 11). Explore mtcars. The Comprehensive R Archive Network. Retrieved from https://cran.r-project.org/web/packages/explore/vignettes/explore_mtcars.html
RDocumentation. (n.d.). mtcars: Motor Trend Car Road Tests. Retrieved from https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars
[2] W3Schools. (n.d.). R Data Frames. Retrieved from https://www.w3schools.com/r/r_data_frames.asp
RDocumentation. (n.d.). data.frame function. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame
Programiz. (n.d.). R Data Frame (with Examples). Retrieved from https://www.programiz.com/r-programming/data-frame
Dataquest. (2023). How to Create a Dataframe in R with 30 Code Examples. Retrieved from https://www.dataquest.io/blog/tutorial-dataframe-in-r
[3] University of California, Berkeley. (n.d.). Factors in R. Retrieved from https://www.stat.berkeley.edu/~s133/factors.html
[4] R Core Team. (n.d.). subset: Subsetting Vectors, Matrices, and Data Frames. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/subset
R Core Team. (n.d.). ifelse: Conditional Element Selection. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse
[5] R Core Team. (n.d.). mean: Arithmetic Mean. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean
R Core Team. (n.d.). median: Median Value. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/median
R Core Team. (n.d.). sd: Standard Deviation. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sd
R Core Team. (n.d.). var: Variance. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/var
R Core Team. (n.d.). cor: Correlation. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cor
R Core Team. (n.d.). summary: Object Summaries. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/summary
[6] R Core Team. (n.d.). function: Function Definition. Retrieved from https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/function
Basic R Programming
[7] Chambers, J. M. (2008). Software for Data Analysis: Programming with R (Vol. 2, No. 1). New York: Springer.
Crawley, M. J. (2012). The R Book. John Wiley & Sons.
Gardener, M. (2012). Beginning R: The Statistical Programming Language. John Wiley & Sons.
Grolemund, G. (2014). Hands-On Programming with R: Write Your Own Functions and Simulations. O’Reilly Media, Inc.
Kabacoff, R. (2022). R in Action: Data Analysis and Graphics with R and Tidyverse. Simon and Schuster.
Peng, R. D. (2016). R Programming for Data Science (pp. 86-181). Victoria, BC, Canada: Leanpub.
R Core Team. (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from https://www.R-project.org/.
Tippmann, S. (2015). Programming Tools: Adventures with R. Nature, 517(7532), 109-110.
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science. O’Reilly Media, Inc.