Basic Data Types and Data Structures in R
Overview
Class Date: 8/29/2024 -- In Class
Teaching: 90 min
Exercises: 30 minQuestions
What are the most common data types in R?
What are the basic data structures in R?
How do I access data within the basic data structures?
Objectives
Understand the most commonly encountered data types in R and how these data types are used in data structures.
Create and manipulate vectors and matrices of different types.
Check the data type of a variable, vector, or matrix.
Understand the structure and properties of basic data structures (vectors and matrices).
In Class
Basic Data Types in R
R uses a variety of data types, which define the properties of the value stored in a variable. The three data types that you will use most commonly are character (text strings), logical (TRUE/FALSE values), and numeric (decimal or “double” numeric values) objects. For the most part, the data type of a variable is detected by the format of the value assigned:
char1 <- "hello!" # "character" is your basic text string data type
num1 <- 20.5 # "numeric" is your most general data type for real, decimal numbers
logic1 <- TRUE # "logical" data type is a simply binary: TRUE or FALSE
logic2 <- F # T or F also work
Use the class()
function to determine the type of a variable:
class(char1)
[1] "character"
class(num1)
[1] "numeric"
class(logic1)
[1] "logical"
Character Variables
The character data type is used to store basic textual information. A character vector is defined by text in quotes (""
). You can include most forms of text:
greeting <- "How are you today?"
equation <- "1 + 3 - 2 = 2"
The major exception is the backslash (\
), which R uses as an escape character. In computing, an escape character is a metacharacter that is not interpreted literally, but instead causes the computer to interpret the character following the escape character in a different way.
escape <- "This string \does not compute."
Error: '\d' is an unrecognized escape in character string (<text>:1:25)
Note that the error message is complaining about the \d
, not just the \
. Many characters invoke specific behavior when preceded by the \
which can be useful depending on your goals. For example, \n
is interpreted as a line break:
one.line <- "Line 1 Line 2"
two.lines <- "Line 1\nLine 2"
writeLines(one.line)
Line 1 Line 2
writeLines(two.lines)
Line 1
Line 2
This can be useful when you are trying to add text to a chart but want the text to appear on separate lines. Some functions just display the string verbatim, instead of trying to interpret the escape characters:
print(two.lines)
[1] "Line 1\nLine 2"
As you might expect, you can’t perform numeric operations on characters:
2 * two.lines
Error in 2 * two.lines: non-numeric argument to binary operator
So what if you actually want your string to include a backslash (\
)? You can do this by typing a double backslash (\\
), which is effectively using the first \
to “escape” the second \
:
writeLines("Single\\backslash.")
Single\backslash.
There are many functions for manipulating character data types. Two examples are paste()
, which combines text strings into a single variable separated by spaces. paste0()
does the same without the spaces):
paste("Hello","world!")
[1] "Hello world!"
paste0("Hello-","world!")
[1] "Hello-world!"
sub()
allows you to replace a defined portion of a text string:
cat.person <- "I love cats!"
cat.person
[1] "I love cats!"
dog.person <- sub("cats", "dogs", cat.person)
dog.person
[1] "I love dogs!"
Numeric Variables
The numeric data type is your most common tool for storing quantitative real numbers:
dozen <- 12
dozen
[1] 12
e <- exp(1) # exp is a function that defines the constant e to a given power
e
[1] 2.718282
negatives <- -3.5
negatives
[1] -3.5
We have already looked at two of many, many functions that manipulate numeric variables:
sum(dozen, e, negatives)
[1] 11.21828
sqrt(e)
[1] 1.648721
Some character functions automatically treat numbers (e.g. 12
) as the character equivalent ("12"
):
sub(2,"4",dozen)
[1] "14"
While others do not:
writeLines(dozen)
Error in writeLines(dozen): can only write character objects
Just try them to find out!
Logical Variables
The logical data type is used to store basic TRUE vs. FALSE data:
this.is.true <- TRUE
this.is.true
[1] TRUE
this.is.false <- F
this.is.false
[1] FALSE
These are useful for asking R questions about your data. For example, we can compare relative values:
x <- 4
y <- 5
x < y
[1] TRUE
This is useful for making decisions, with an if
statement for example:
test1 <- T
test2 <- F
if(test1) "Test 1 is TRUE!"
[1] "Test 1 is TRUE!"
if(test2) "Test 2 is TRUE!"
The code following the if statement will only execute if the variable or statement entered in the ()
is TRUE
. We will talk more about how to use if
statements later in the course.
Data Structures
Elements of these data types may be combined to form data structures–collections of individual datum. There are many data structure in R, for example:
- factors
- atomic vectors (aka vectors)
- lists
- matrices
- data frames
Here we will discuss the two most basic data structures: vectors and matrices. In later lessons we will discuss two advanced data structure, data frames (the most common data storage structure in R) and lists. Factors are a type of data structure, but function more like an advanced data type. You will explore factors in detail On Your Own.
Vectors
A vector is the most common and basic data structure in R. Vectors are also the major workhorse data structure of R. Technically, vectors can be one of two types:
- atomic vectors
- lists
However, the term “vector” most commonly refers to the atomic types and not to lists. Here we will examine atomic vectors (hereafter just called “vectors”). Lists have a critical place R as well, and will be the topic of a future lesson.
The Different Vector Modes
A vector is a collection of elements that are most commonly of mode character
, logical
, integer
or numeric
.
You can create an empty vector with the vector()
function. By default, the mode is logical
, but you can be more explicit using additional arguments, as shown in the examples below. A simpler solution is to just directly construct vectors of the desired mode using on of several available functions, such as character()
, numeric()
, etc.
vector() # an empty 'logical' (the default) vector
logical(0)
vector("character", length = 5) # a vector of mode 'character' with 5 elements
[1] "" "" "" "" ""
character(5) # the same thing, but using the direct constructor function
[1] "" "" "" "" ""
numeric(5) # a numeric vector with 5 elements
[1] 0 0 0 0 0
logical(5) # a logical vector with 5 elements
[1] FALSE FALSE FALSE FALSE FALSE
You can also create vectors by directly specifying their content. R will then guess the appropriate mode of storage for the vector based on your input data. To do this, you use the function c()
(which stands for “combine”):
# numeric vector
x <- c(1, 2, 3)
class(x)
[1] "numeric"
The c()
function is the most common way to define a vector of values for manipulation. We will use it frequently throughout this course, and it will be one of the functions that you use the most in your own analyses.
Directly specifying TRUE
and FALSE
will create a vector of mode logical
:
y <- c(TRUE, TRUE, FALSE, FALSE)
class(y)
[1] "logical"
While quoted text will create a vector of mode character
:
z <- c("Sarah", "Tracy", "Jon")
class(z)
[1] "character"
# adding quotes to numbers forces a character vector
x.char <- c("1", "2", "3")
class(x.char)
[1] "character"
Adding Elements
The function c()
can also be used to add elements to a vector:
z <- c(z, "Annette")
z
[1] "Sarah" "Tracy" "Jon" "Annette"
z <- c("Greg", z)
z
[1] "Greg" "Sarah" "Tracy" "Jon" "Annette"
Note that order matters and defines the order in the output vector. c()
treats any argument that is a vector as just another set of elements in the vector.
Vectors from a Sequence of Numbers
Use the seq()
function or the :
operator to create a vector as a sequence of numbers.
seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
1:10
[1] 1 2 3 4 5 6 7 8 9 10
Check out the help documentation for the seq()
function (?seq
) to see what arguments we are providing and what arguments are being set to defaults. By specifying from
, to
, and by
we can customize our output vector:
seq(from = 1, to = 10, by = 0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
[16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
[31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
[46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
[61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
[76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
[91] 10.0
You can assign these sequences directly to a variable:
series1 <- 5:15
series1
[1] 5 6 7 8 9 10 11 12 13 14 15
series2 <- seq(from = 3, to = 8, by = 0.2)
series2
[1] 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6
[20] 6.8 7.0 7.2 7.4 7.6 7.8 8.0
What happens when you mix data types inside a vector?
R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”.
Mixing data types in vectors
What the do you think the following will do (without running them first)?
z1 <- c(1.7, "a") z2 <- c(TRUE, 2) z3 <- c("a", TRUE)
Solution
z1 <- c(1.7, "a") class(z1)
[1] "character"
z1
is forced to be a character vector."1.7"
can be a character, while"a"
cannot be a number.z2 <- c(TRUE, 2) class(z2)
[1] "numeric"
z2
is forced to be a numeric vector.TRUE
can be interpreted as a number (TRUE
=1
,FALSE
=0
), while2
cannot be interpreted as a logical value (or can it?)z3 <- c("a", TRUE) class(z3)
[1] "character"
z3
is forced to be a character vector."TRUE"
can be a character, while"a"
cannot be interpreted as a logical value
Finding commonalities
What two properties are common to all of the vectors above?
Solution
Properties of vectors:
- All vectors are one-dimensional
- Each vector element is of the same type.
Indexing vectors
In R, []
are used to index vectors and other objects. For vectors, the number entered in the [n]
will return the nth element of the vector.
x <- c("a","b","c","d","e","f")
x
[1] "a" "b" "c" "d" "e" "f"
x[5]
[1] "e"
You can also use pre-defined variables or even other vectors to index different parts of the vector:
n <- 6
range <- 2:4
x[n] # returns the 6th element, as defined by n = 6
[1] "f"
x[range] # returns the range of values specified by "range", in this case elements 2 to 4.
[1] "b" "c" "d"
Subsetting data
Let’s look at a different subsetting option using a character vectors:
animal <- c("m", "o", "n", "k", "e", "y") # first three characters animal[1:3]
[1] "m" "o" "n"
# last three characters animal[4:6]
[1] "k" "e" "y"
Consider the following questions:
If the first four characters are selected using the subset
animal[1:4]
, how can we obtain the first four characters in reverse order?What output results from
animal[-1]
?What ouptut results from
animal[-4]
?Given 1-3, what do you expect
animal[-1:-4]
to produce?Use a subset of the
animal
vector to create a new character vector that spells the word “yoke”, i.e.c("y", "o", "k", "e")
.Solutions
animal[4:1]
animal[4:1]
[1] "k" "n" "o" "m"
"o" "n" "k" "e" "y"
"m" "o" "n" "e" "y"
, which means that a single-
removes the element at the given index position.
animal[-1:-4]
remove the subset at indexes 1 to 4, returning"e" "y"
, which is equivalent toanimal[5:6]
.animal[-1:-4]
[1] "e" "y"
animal[5:6]
[1] "e" "y"
animal[c(6,2,4,5)]
combines indexing with thec
ombine function to spell the word “yoke” in a new vector:animal[c(6,2,4,5)]
[1] "y" "o" "k" "e"
We will talk about more advanced indexing strategies later in the course.
Vectorized operations
R has a special way of dealing with vectors when dealing with operations. We know what to expect from 1 + 1
, but what happens if you try to add two vectors?
x <- c(1,2,3)
y <- c(4,5,6)
z <- x + y
z
[1] 5 7 9
R creates a new vector in which each element the sum of the elements with the same index from x and y. In essence, R is performing 3 separate “addition” operations and combining the results into a new vector. We can mimic this behavior manually:
z1 <- x[1] + y[1]
z2 <- x[2] + y[2]
z3 <- x[3] + y[3]
z <- c(z1, z2, z3)
z
[1] 5 7 9
This process is called “vectorization”. It works for most mathematical operations:
x - y
[1] -3 -3 -3
x * y
[1] 4 10 18
x / y
[1] 0.25 0.40 0.50
x ^ y
[1] 1 32 729
Many functions also behave in a vectorized manner. Take the paste()
function, for example, which combines two or more character variables into a single variable:
# Here is the output with two strings
paste("I like", "dogs.")
[1] "I like dogs."
# Now let's try pasting two character vectors together:
attitude <- c("I like","I dislike","I am indifferent to")
animal <- c("dogs.","fish.","cats.")
paste(attitude,animal)
[1] "I like dogs." "I dislike fish."
[3] "I am indifferent to cats."
Vectorization is not universal
Vectorization is one of the reasons that R is so powerful, and it is employed by a wide range of functions. However, it is not universal. Depending on how a particular function is written, it may act on the vector or on the list of values in the vector.
Compare the function
sum()
to the operator+
:x <- 1:3 y <- 4:6 sum(x,y) # sums the individual elements to produce a single number
[1] 21
x + y # sums the values in each index to produce a new vector
[1] 5 7 9
The easiest way to find out is to just give it a try and see what output it produces.
Object Attributes
R objects can have attributes. Attributes are metadata and part of the object. Each attribute describes a different aspect of the object. These include:
- names – some objects (e.g. lists) have named elements
- dim – the number of rows and columns in a matrix
- class – the data type of an object
- attributes – a list containing other forms of metadata of more complex objects
While technically not attributes, you can also glean other attribute-like metadata information from objects such as length (works on vectors and lists) or number of characters (for character strings).
length(1:10)
[1] 10
nchar("MCB 585")
[1] 7
We will periodically use object attributes to manipulate objects throughout the course, including the next topic: matrices.
Matrices
In R, matrices are an extension of vectors. They are not a separate type of object but simply an atomic vector with an attribute called “dimensions”, i.e. a specified number of rows and columns. As with vectors, the elements of a matrix must be of the same data type. We can use the generic matrix()
function to build a matrix. Unlike vectors, there is no direct equivalent for each data type (e.g. character()
). However, because matrices are really just vectors, we can use a predefined vector to build a matrix:
# first create a vector, then coerce that vector into a matrix:
v <- 1:4
m <- matrix(data = v, nrow = 2, ncol = 2)
# note the difference in structure
v
[1] 1 2 3 4
m
[,1] [,2]
[1,] 1 3
[2,] 2 4
We can now examine the attributes of our new matrix m
:
dim(m)
[1] 2 2
attributes(m)
$dim
[1] 2 2
Note that under the surface, R still treats m
as a vector, so the length()
of v
and m
are the same (i.e. both contain 4 elements):
length(v)
[1] 4
length(m)
[1] 4
Matrices are a higher-order object (a vector with additional attributes like dimensions dim()
). Thus the class()
function no longer tells you the data type for each element, but rather the data structure type of the entire object:
class(m)
[1] "matrix" "array"
You can check the data type of the elements of the matrix using typeof()
or mode()
, which give slightly different information:
typeof(m)
[1] "integer"
mode(m)
[1] "numeric"
While class()
shows that m is a matrix, and mode()
returns the higher-order data type numeric, typeof()
shows that fundamentally the matrix is an integer vector.
Note that one difference between vectors and matrices is that an otherwise identical vector will return the data type of each element when you use class()
, while the matrix is a new type of object with class()
“matrix”.
Data types of matrix elements
Consider the following matrix:
FOURS <- matrix( c(4, 4, 4, 4), nrow = 2, ncol = 2)
Given that
typeof(FOURS[1])
returns"double"
, what would you expecttypeof(FOURS)
to return? How do you know this is the case even without running this code?Solution
We know that
typeof(FOURS)
will also return"double"
since matrices are just vectors, and vectors must be made of elements of the same data type.In contrast,
class(FOURS)
returns"matrix"
whileclass(FOURS[1])
returns the class of that single element,"numeric"
.
By default, matrices in R are filled column-wise:
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
… unless you tell it to fill by row explicitly using the byrow
argument:
m2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m2
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Another way to construct a matrix is to assign values to the dim
attribute:
m <- 1:10
# so far m is just a vector!
m
[1] 1 2 3 4 5 6 7 8 9 10
class(m)
[1] "integer"
# defining the "dimensions" attribute automatically converts m to a matrix
dim(m) <- c(2, 5)
m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
class(m)
[1] "matrix" "array"
This takes a vector and transforms it into a matrix with 2 rows and 5 columns. A third way is to bind columns or rows using rbind()
and cbind()
(“row bind” and “column bind”, respectively).
x <- 1:3
y <- 10:12
cbind(x, y)
x y
[1,] 1 10
[2,] 2 11
[3,] 3 12
rbind(x, y)
[,1] [,2] [,3]
x 1 2 3
y 10 11 12
Note that the vectors being bound are of the same length in this case. If the vectors are of different length, the elements of the shorter vector will be repeated to fill in the missing space:
z <- 1:10
rbind(x,z)
Warning in rbind(x, z): number of columns of result is not a multiple of vector
length (arg 1)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x 1 2 3 1 2 3 1 2 3 1
z 1 2 3 4 5 6 7 8 9 10
Indexing matrices
Like vectors, []
are used to index matrices. Since matrices are, by definition, two-dimensional, use [m,n]
to index the mth row and nth column of a matrix. The row is always specified before the ,
and the column after.
m <- matrix(1:10, nrow = 2, ncol = 5)
m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
m[2, 3]
[1] 6
If you are only interested in indexing a specific row, but do not want to change the columns, you can just leave the column index blank (but don’t forget the ,
!):
m[2,]
[1] 2 4 6 8 10
Subsetting data
Let’s look at a different ways to manipulate matrices. First, let’s define a simple matrix to play with:
m <- matrix(1:12, nrow=3) m
[,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12
Keeping in mind what you know about the behavior of vectors, consider the
following questions:
How can you use indexing to extract the middle of the matrix (e.g.
5
and8
)?What output do you expect from
m[-2,]
?What output do you expect from
m[,2:3]
?What result do expect when you try including
m
in simple multiplication:2*m
?Can you predict what will happen if you try
m[,c(1,3)]
?There is a useful function
t()
. Tryt(m)
. Based on the output, what doest()
do? What do you think thet
stands for?What happens if we only ask for a single index (
m[4]
)?Solutions
The numbers
5
and8
appear in the second row, and the second and third columns, respectively. To extract this matrix subset, we use the index:m[2,2:3]
[1] 5 8
Like in vectors, the
-
tells R to exclude the index that follows. In this case, exclude row 2. The column index is blank, so all columns are returned. Thus we end up with a smaller matrix with only rows 1 and 3, and all elements in row 2 removed:[,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 3 6 9 12
As with vectors, including a set of sequential values in either index will return all of the indexes in that range. In this case rows are blank (so include all rows), and columns 2-3 are requested, so we will get the following 2x2 matrix:
[,1] [,2] [1,] 4 7 [2,] 5 8 [3,] 6 9
Because matrices are essentially vectors with attributes, all standard operations are “vectorized”. We thus expect a new matrix with the same dimensions as m, but with each element equal to twice the corresponding element in m:
[,1] [,2] [,3] [,4] [1,] 2 8 14 20 [2,] 4 10 16 22 [3,] 6 12 18 24
While we haven’t covered it explicity, entering
2:3
into an index (like in question 3) is equivalent to entering an list of values (e.g.c(2,3)
). Any list of this sort can be used to index a specific (and not necessarily sequential) set of rows or columns. Thus we expect the listc(1,3)
entered into the column index to return columns 1 and 3:[,1] [,2] [1,] 1 7 [2,] 2 8 [3,] 3 9
Let’s see what happens if we use the
t()
function onm
:t(m)
[,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 [4,] 10 11 12
The
t
stands for “transpose”, and as we can see from the result,t(m)
simply “flips” the matrix so that rows are now columns and columns are now rows.
Since a matrix is just a fancy vector, requesting R to return the 4th index (
m[4]
) will simply return the 4th value in the underlying vector:[1] 4
What if we ask for an index outside the range of a matrix?
m[3,15]
Error in m[3, 15]: subscript out of bounds
This error universally occurs whenever you request an invalid index to an object in R (vector, matrix, list, data frame, etc.)
Key Points
The most commonly encountered data types in R are character, numeric, and logical.
R’s basic data structures are vectors and matrices.
Objects may have attributes, such as name, dimension, and class.
Use
object[x]
andobject[x, y]
to select a single element from a 2- and 3-dimensional data structure, respectively.Use
from:to
to specify a sequence that includes the indices fromfrom
toto
.