Basic Data Types and Data Structures in R

Overview

Class Date: 8/29/2024 -- In Class
Teaching: 90 min
Exercises: 30 min
Questions
  • What are the most common data types in R?

  • What are the basic data structures in R?

  • How do I access data within the basic data structures?

Objectives
  • Understand the most commonly encountered data types in R and how these data types are used in data structures.

  • Create and manipulate vectors and matrices of different types.

  • Check the data type of a variable, vector, or matrix.

  • Understand the structure and properties of basic data structures (vectors and matrices).

In Class

Basic Data Types in R

R uses a variety of data types, which define the properties of the value stored in a variable. The three data types that you will use most commonly are character (text strings), logical (TRUE/FALSE values), and numeric (decimal or “double” numeric values) objects. For the most part, the data type of a variable is detected by the format of the value assigned:

char1 <- "hello!" # "character" is your basic text string data type
num1 <- 20.5 # "numeric" is your most general data type for real, decimal numbers
logic1 <- TRUE # "logical" data type is simply binary: TRUE or FALSE
logic2 <- F # the shorthand T and F also work, but they can be overwritten, so TRUE/FALSE are safer

 

Use the class() function to determine the type of a variable:

class(char1)
[1] "character"
class(num1)
[1] "numeric"
class(logic1)
[1] "logical"

 

Character Variables

The character data type is used to store basic textual information. A character vector is defined by text in quotes (""). You can include most forms of text:

greeting <- "How are you today?"
equation <- "1 + 3 - 2 = 2"

 

The major exception is the backslash (\), which R uses as an escape character. In computing, an escape character is a metacharacter that is not interpreted literally, but instead causes the computer to interpret the character following the escape character in a different way.

escape <- "This string \does not compute."
Error: '\d' is an unrecognized escape in character string (<text>:1:25)

 

Note that the error message is complaining about the \d, not just the \. Many characters invoke specific behavior when preceded by the \ which can be useful depending on your goals. For example, \n is interpreted as a line break:

one.line <- "Line 1 Line 2"
two.lines <- "Line 1\nLine 2"

writeLines(one.line)
Line 1 Line 2
writeLines(two.lines)
Line 1
Line 2

 

This can be useful when you are trying to add text to a chart but want the text to appear on separate lines. Some functions just display the string verbatim, instead of trying to interpret the escape characters:

print(two.lines)
[1] "Line 1\nLine 2"

 

As you might expect, you can’t perform numeric operations on characters:

2 * two.lines
Error in 2 * two.lines: non-numeric argument to binary operator

 

So what if you actually want your string to include a backslash (\)? You can do this by typing a double backslash (\\), which is effectively using the first \ to “escape” the second \:

writeLines("Single\\backslash.")
Single\backslash.
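The same idea can be illustrated with a file path. This is only a sketch: the path below is made up for the example, and the point is that each \\ typed in the code is stored as a single backslash character.

```r
# Hypothetical Windows-style path, used only to illustrate escaping
win.path <- "C:\\data\\results.txt"

# writeLines() interprets the escapes and shows the text as it is stored
writeLines(win.path)

# nchar() counts stored characters, so each "\\" in the code counts once
n.chars <- nchar(win.path)
n.chars
```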

 

There are many functions for manipulating character data types. Two examples are paste(), which combines text strings into a single string separated by spaces, and paste0(), which does the same without the spaces:

paste("Hello","world!")
[1] "Hello world!"
paste0("Hello-","world!")
[1] "Hello-world!"

 

sub() allows you to replace a defined portion of a text string:

cat.person <- "I love cats!"
cat.person
[1] "I love cats!"
dog.person <- sub("cats", "dogs", cat.person)
dog.person
[1] "I love dogs!"
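One detail worth knowing is that sub() replaces only the first match it finds, while the related base R function gsub() replaces every match. The short sketch below uses a made-up string to show the difference:

```r
# A made-up string that contains the pattern more than once
pets <- "cats, cats, and more cats"

# sub() changes only the first occurrence of "cats"
first.only <- sub("cats", "dogs", pets)
first.only

# gsub() changes every occurrence of "cats"
every.match <- gsub("cats", "dogs", pets)
every.match
```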

 

Numeric Variables

The numeric data type is your most common tool for storing quantitative real numbers:

dozen <- 12
dozen
[1] 12
e <- exp(1) # exp() raises the constant e to a given power, so exp(1) is e itself
e
[1] 2.718282
negatives <- -3.5
negatives
[1] -3.5

 

We have already looked at two of many, many functions that manipulate numeric variables:

sum(dozen, e, negatives)
[1] 11.21828
sqrt(e)
[1] 1.648721

 

Some character functions automatically treat numbers (e.g. 12) as the character equivalent ("12"):

sub(2,"4",dozen)
[1] "14"

 

While others do not:

writeLines(dozen)
Error in writeLines(dozen): can only write character objects

 

Just try them to find out!
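When a function refuses a number, one option is to convert it yourself. The sketch below assumes you want explicit control over the conversion and uses the base R functions as.character() and as.numeric():

```r
dozen <- 12

# Convert the number to text so that writeLines() will accept it
dozen.text <- as.character(dozen)
writeLines(dozen.text)

# Convert text that looks like a number back into a numeric value
back.to.number <- as.numeric("12") + 1
back.to.number
```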

 

Logical Variables

The logical data type is used to store basic TRUE vs. FALSE data:

this.is.true <- TRUE
this.is.true
[1] TRUE
this.is.false <- F
this.is.false
[1] FALSE

 

These are useful for asking R questions about your data. For example, we can compare relative values:

x <- 4
y <- 5
x < y
[1] TRUE

 

This is useful for making decisions, with an if statement for example:

test1 <- T
test2 <- F
if(test1) "Test 1 is TRUE!"
[1] "Test 1 is TRUE!"
if(test2) "Test 2 is TRUE!"

 

The code following the if statement will only execute if the variable or statement entered in the () is TRUE. We will talk more about how to use if statements later in the course.
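As a brief preview, a comparison can be placed directly inside the (), and an optional else branch runs when the condition is FALSE. This is a minimal sketch; the variable name result is just for illustration.

```r
x <- 4
y <- 5

# The comparison x < y evaluates to TRUE, so the first branch runs
if (x < y) {
  result <- "x is smaller than y"
} else {
  result <- "x is not smaller than y"
}
result
```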


Data Structures

Elements of these data types may be combined to form data structures, which are collections of individual data values. There are many data structures in R, including vectors, matrices, data frames, lists, and factors.

Here we will discuss the two most basic data structures: vectors and matrices. In later lessons we will discuss two more advanced data structures: data frames (the most common data storage structure in R) and lists. Factors are a type of data structure, but function more like an advanced data type. You will explore factors in detail On Your Own.

 

Vectors

A vector is the most common and basic data structure in R, and the major workhorse of the language. Technically, vectors can be one of two types: atomic vectors and lists.

However, the term “vector” most commonly refers to the atomic types and not to lists. Here we will examine atomic vectors (hereafter just called “vectors”). Lists have a critical place in R as well, and will be the topic of a future lesson.

 

The Different Vector Modes

A vector is a collection of elements that are most commonly of mode character, logical, integer or numeric.

You can create an empty vector with the vector() function. By default, the mode is logical, but you can be more explicit using additional arguments, as shown in the examples below. A simpler solution is to directly construct vectors of the desired mode using one of several available functions, such as character(), numeric(), etc.

vector() # an empty 'logical' (the default) vector
logical(0)
vector("character", length = 5) # a vector of mode 'character' with 5 elements
[1] "" "" "" "" ""
character(5) # the same thing, but using the direct constructor function
[1] "" "" "" "" ""
numeric(5)   # a numeric vector with 5 elements
[1] 0 0 0 0 0
logical(5)   # a logical vector with 5 elements
[1] FALSE FALSE FALSE FALSE FALSE

 

You can also create vectors by directly specifying their content. R will then guess the appropriate mode of storage for the vector based on your input data. To do this, you use the function c() (which stands for “combine”):

# numeric vector
x <- c(1, 2, 3)
class(x)
[1] "numeric"

 

The c() function is the most common way to define a vector of values for manipulation. We will use it frequently throughout this course, and it will be one of the functions that you use the most in your own analyses.

Directly specifying TRUE and FALSE will create a vector of mode logical:

y <- c(TRUE, TRUE, FALSE, FALSE)
class(y)
[1] "logical"

 

While quoted text will create a vector of mode character:

z <- c("Sarah", "Tracy", "Jon")
class(z)
[1] "character"
# adding quotes to numbers forces a character vector
x.char <- c("1", "2", "3")
class(x.char)
[1] "character"

 

Adding Elements

The function c() can also be used to add elements to a vector:

z <- c(z, "Annette")
z
[1] "Sarah"   "Tracy"   "Jon"     "Annette"
z <- c("Greg", z)
z
[1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"

 

Note that order matters and defines the order in the output vector. c() treats any argument that is a vector as just another set of elements in the vector.

 

Vectors from a Sequence of Numbers

Use the seq() function or the : operator to create a vector as a sequence of numbers.

seq(10)
 [1]  1  2  3  4  5  6  7  8  9 10
1:10
 [1]  1  2  3  4  5  6  7  8  9 10

 

Check out the help documentation for the seq() function (?seq) to see what arguments we are providing and what arguments are being set to defaults. By specifying from, to, and by we can customize our output vector:

seq(from = 1, to = 10, by = 0.1)
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0

 

You can assign these sequences directly to a variable:

series1 <- 5:15
series1
 [1]  5  6  7  8  9 10 11 12 13 14 15
series2 <- seq(from = 3, to = 8, by = 0.2)
series2
 [1] 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6
[20] 6.8 7.0 7.2 7.4 7.6 7.8 8.0
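seq() accepts other arguments as well. As a sketch, length.out asks for a fixed number of evenly spaced values instead of a step size, and a negative by counts downwards:

```r
# Five evenly spaced values between 0 and 1
quarters <- seq(from = 0, to = 1, length.out = 5)
quarters

# A negative step produces a decreasing sequence
countdown <- seq(from = 10, to = 1, by = -3)
countdown
```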

 

What happens when you mix data types inside a vector?

R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”.

Mixing data types in vectors

What do you think the following will do (without running them first)?

z1 <- c(1.7, "a")
z2 <- c(TRUE, 2)
z3 <- c("a", TRUE)

Solution

z1 <- c(1.7, "a") 
class(z1)
[1] "character"

z1 is forced to be a character vector. "1.7" can be a character, while "a" cannot be a number.

z2 <- c(TRUE, 2) 
class(z2)
[1] "numeric"

z2 is forced to be a numeric vector. TRUE can be interpreted as a number (TRUE = 1, FALSE = 0), while 2 cannot be interpreted as a logical value (or can it?).

z3 <- c("a", TRUE)
class(z3)
[1] "character"

z3 is forced to be a character vector. "TRUE" can be a character, while "a" cannot be interpreted as a logical value.
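You can also request a conversion yourself with the as.*() family of functions (explicit coercion). The sketch below uses made-up values and shows that a value that cannot be converted becomes NA; suppressWarnings() is used here only to keep the output tidy.

```r
# Non-zero numbers convert to TRUE and zero converts to FALSE
two.as.logical <- as.logical(2)
zero.as.logical <- as.logical(0)

# Logical values count as 1 (TRUE) and 0 (FALSE) in arithmetic
n.true <- sum(c(TRUE, TRUE, FALSE))

# Text that is not a number cannot be converted and becomes NA
converted <- suppressWarnings(as.numeric(c("1", "2", "three")))
converted
```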

 

Finding commonalities

What two properties are common to all of the vectors above?

Solution

Properties of vectors:

  • All vectors are one-dimensional
  • Each vector element is of the same type.

 

Indexing vectors

In R, [] are used to index vectors and other objects. For vectors, the number entered in the [n] will return the nth element of the vector.

x <- c("a","b","c","d","e","f")
x
[1] "a" "b" "c" "d" "e" "f"
x[5]
[1] "e"

 

You can also use pre-defined variables or even other vectors to index different parts of the vector:

n <- 6
range <- 2:4

x[n] # returns the 6th element, as defined by n = 6
[1] "f"
x[range] # returns the range of values specified by "range", in this case elements 2 to 4.
[1] "b" "c" "d"
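Because any expression that evaluates to a number can serve as an index, length() offers a simple way to reach the end of a vector without counting its elements by hand. A minimal sketch:

```r
x <- c("a", "b", "c", "d", "e", "f")

# length(x) is 6, so this returns the last element
last <- x[length(x)]
last

# Combine positions with c() to pick the first and last elements
first.and.last <- x[c(1, length(x))]
first.and.last
```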

Subsetting data

Let’s look at a different subsetting option using a character vector:

animal <- c("m", "o", "n", "k", "e", "y")

# first three characters
animal[1:3]
[1] "m" "o" "n"
# last three characters
animal[4:6]
[1] "k" "e" "y"

 

Consider the following questions:

  1. If the first four characters are selected using the subset animal[1:4], how can we obtain the first four characters in reverse order?

  2. What output results from animal[-1]?

  3. What output results from animal[-4]?

  4. Given 1-3, what do you expect animal[-1:-4] to produce?

  5. Use a subset of the animal vector to create a new character vector that spells the word “yoke”, i.e. c("y", "o", "k", "e").

Solutions

  1. animal[4:1]

     animal[4:1]
    
     [1] "k" "n" "o" "m"
    
  2. "o" "n" "k" "e" "y"

  3. "m" "o" "n" "e" "y", which means that a single - removes the element at the given index position.

  4. animal[-1:-4] removes the elements at indexes 1 to 4, returning "e" "y", which is equivalent to animal[5:6].

     animal[-1:-4]
    
     [1] "e" "y"
    
     animal[5:6]
    
     [1] "e" "y"
    
  5. animal[c(6,2,4,5)] combines indexing with the combine function to spell the word “yoke” in a new vector:

     animal[c(6,2,4,5)]
    
     [1] "y" "o" "k" "e"
    

 

We will talk about more advanced indexing strategies later in the course.

 

Vectorized operations

R applies arithmetic operations to vectors element by element. We know what to expect from 1 + 1, but what happens if you try to add two vectors?

x <- c(1,2,3)
y <- c(4,5,6)

z <- x + y
z
[1] 5 7 9

 

R creates a new vector in which each element is the sum of the elements with the same index from x and y. In essence, R is performing 3 separate “addition” operations and combining the results into a new vector. We can mimic this behavior manually:

z1 <- x[1] + y[1]
z2 <- x[2] + y[2]
z3 <- x[3] + y[3]

z <- c(z1, z2, z3)
z
[1] 5 7 9

 

This process is called “vectorization”. It works for most mathematical operations:

x - y
[1] -3 -3 -3
x * y
[1]  4 10 18
x / y
[1] 0.25 0.40 0.50
x ^ y
[1]   1  32 729

 

Many functions also behave in a vectorized manner. Take the paste() function, for example, which combines two or more character variables into a single variable:

# Here is the output with two strings
paste("I like", "dogs.")
[1] "I like dogs."
# Now let's try pasting two character vectors together:
attitude <- c("I like","I dislike","I am indifferent to")
animal <-  c("dogs.","fish.","cats.")

paste(attitude,animal)
[1] "I like dogs."              "I dislike fish."          
[3] "I am indifferent to cats."
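Vectorized operations also work when the inputs differ in length: R repeats ("recycles") the shorter one to match the longer. The sketch below uses made-up values to show this for both paste() and +.

```r
# The single string "Sample" is reused for each of the three numbers
labels <- paste("Sample", 1:3)
labels

# The two-element vector is repeated to match the four-element vector,
# so the sums are 1 + 10, 2 + 20, 3 + 10, and 4 + 20
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled
```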

 

Vectorization is not universal

Vectorization is one of the reasons that R is so powerful, and it is employed by a wide range of functions. However, it is not universal. Depending on how a particular function is written, it may operate element by element and return a vector of the same length, or it may combine all of the values in the vector into a single result.

Compare the function sum() to the operator +:

x <- 1:3
y <- 4:6
sum(x,y) # sums the individual elements to produce a single number
[1] 21
x + y # sums the values in each index to produce a new vector
[1] 5 7 9

 

The easiest way to find out is to just give it a try and see what output it produces.
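One quick check, sketched below with made-up values, is to compare the length of the output with the length of the input. Functions that work element by element return one result per element, while summarizing functions return a single value.

```r
values <- c(1, 4, 9)

# sqrt() works element by element: three inputs give three outputs
roots <- sqrt(values)
length(roots)

# max() summarizes the whole vector: three inputs give one output
largest <- max(values)
length(largest)

# nchar() is also element-wise, returning one count per string
word.lengths <- nchar(c("a", "bb", "ccc"))
word.lengths
```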


Object Attributes

R objects can have attributes. Attributes are metadata and part of the object. Each attribute describes a different aspect of the object. These include names, dimensions (dim), and class.

While technically not attributes, you can also glean other attribute-like metadata information from objects such as length (works on vectors and lists) or number of characters (for character strings).

length(1:10)
[1] 10
nchar("MCB 585")
[1] 7

 

We will periodically use object attributes to manipulate objects throughout the course, including the next topic: matrices.


Matrices

In R, matrices are an extension of vectors. They are not a separate type of object but simply an atomic vector with an attribute called “dimensions”, i.e. a specified number of rows and columns. As with vectors, the elements of a matrix must be of the same data type. We can use the generic matrix() function to build a matrix. Unlike vectors, matrices have no type-specific constructor functions (there is no matrix equivalent of character(), for example). However, because matrices are really just vectors, we can use a predefined vector to build a matrix:

# first create a vector, then coerce that vector into a matrix:
v <- 1:4
m <- matrix(data = v, nrow = 2, ncol = 2)

# note the difference in structure
v
[1] 1 2 3 4
m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

 

We can now examine the attributes of our new matrix m:

dim(m)
[1] 2 2
attributes(m)
$dim
[1] 2 2

 

Note that under the surface, R still treats m as a vector, so the length() of v and m are the same (i.e. both contain 4 elements):

length(v)
[1] 4
length(m)
[1] 4

 

Matrices are a higher-order object (a vector with additional attributes like dimensions dim()). Thus the class() function no longer tells you the data type for each element, but rather the data structure type of the entire object:

class(m)
[1] "matrix" "array" 

 

You can check the data type of the elements of the matrix using typeof() or mode(), which give slightly different information:

typeof(m)
[1] "integer"
mode(m)
[1] "numeric"

 

While class() shows that m is a matrix, and mode() returns the higher-order data type numeric, typeof() shows that fundamentally the matrix is an integer vector.

Note that one difference between vectors and matrices is that an otherwise identical vector will return the data type of each element when you use class(), while the matrix is a new type of object with class() “matrix”.

Data types of matrix elements

Consider the following matrix:

FOURS <- matrix(
  c(4, 4, 4, 4),
  nrow = 2,
  ncol = 2)

Given that typeof(FOURS[1]) returns "double", what would you expect typeof(FOURS) to return? How do you know this is the case even without running this code?

Solution

We know that typeof(FOURS) will also return "double" since matrices are just vectors, and vectors must be made of elements of the same data type.

In contrast, class(FOURS) returns "matrix" "array", while class(FOURS[1]) returns the class of that single element, "numeric".

 

By default, matrices in R are filled column-wise:

m1 <- matrix(1:6, nrow = 2, ncol = 3)
m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

… unless you tell it to fill by row explicitly using the byrow argument:

m2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m2
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

 

Another way to construct a matrix is to assign values to the dim attribute:

m <- 1:10

# so far m is just a vector!
m
 [1]  1  2  3  4  5  6  7  8  9 10
class(m)
[1] "integer"
# defining the "dimensions" attribute automatically converts m to a matrix
dim(m) <- c(2, 5)
m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
class(m)
[1] "matrix" "array" 

 

This takes a vector and transforms it into a matrix with 2 rows and 5 columns. A third way is to bind rows or columns using rbind() and cbind() (“row bind” and “column bind”, respectively).

x <- 1:3
y <- 10:12
cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

 

Note that the vectors being bound are of the same length in this case. If the vectors are of different lengths, the elements of the shorter vector will be repeated to fill in the missing space, with a warning if the longer length is not a multiple of the shorter:

z <- 1:10
rbind(x,z)
Warning in rbind(x, z): number of columns of result is not a multiple of vector
length (arg 1)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    1    2    3    1    2    3     1
z    1    2    3    4    5    6    7    8    9    10
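When the longer length is an exact multiple of the shorter, the same recycling happens without a warning. A minimal sketch with made-up vectors:

```r
short <- 1:2
long <- 1:4

# "short" is repeated twice to fill four rows; no warning is issued
combined <- cbind(short, long)
combined

# The recycled column reads 1 2 1 2
combined[, "short"]
```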

 

Indexing matrices

Like vectors, [] are used to index matrices. Since matrices are, by definition, two-dimensional, use [m,n] to index the mth row and nth column of a matrix. The row is always specified before the , and the column after.

m <- matrix(1:10, nrow = 2, ncol = 5)
m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
m[2, 3]
[1] 6

 

If you are only interested in indexing a specific row, but do not want to change the columns, you can just leave the column index blank (but don’t forget the ,!):

m[2,]
[1]  2  4  6  8 10
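The same pattern works for columns by leaving the row index blank. Note that selecting a single row or column returns a plain vector; the sketch below also shows the drop = FALSE argument of [, which keeps the result as a matrix.

```r
m <- matrix(1:10, nrow = 2, ncol = 5)

# Leave the row index blank to select an entire column
third.col <- m[, 3]
third.col

# A single row normally comes back as a plain vector with no dimensions
dim(m[2, ])

# drop = FALSE keeps the result as a 1 x 5 matrix
row.as.matrix <- m[2, , drop = FALSE]
dim(row.as.matrix)
```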

 

Subsetting data

Let’s look at different ways to manipulate matrices. First, let’s define a simple matrix to play with:

m <- matrix(1:12, nrow=3)
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

 

Keeping in mind what you know about the behavior of vectors, consider the
following questions:

  1. How can you use indexing to extract the middle of the matrix (i.e. 5 and 8)?

  2. What output do you expect from m[-2,]?

  3. What output do you expect from m[,2:3]?

  4. What result do you expect when you try including m in simple multiplication: 2*m?

  5. Can you predict what will happen if you try m[,c(1,3)]?

  6. There is a useful function t(). Try t(m). Based on the output, what does t() do? What do you think the t stands for?

  7. What happens if we only ask for a single index (m[4])?

Solutions

  1. The numbers 5 and 8 appear in the second row, and the second and third columns, respectively. To extract this matrix subset, we use the index:

     m[2,2:3]
    
     [1] 5 8
    

 

  2. Like in vectors, the - tells R to exclude the index that follows. In this case, exclude row 2. The column index is blank, so all columns are returned. Thus we end up with a smaller matrix with only rows 1 and 3, and all elements in row 2 removed:

          [,1] [,2] [,3] [,4]
     [1,]    1    4    7   10
     [2,]    3    6    9   12
    

 

  3. As with vectors, including a set of sequential values in either index will return all of the indexes in that range. In this case rows are blank (so include all rows), and columns 2-3 are requested, so we will get the following 3x2 matrix:

          [,1] [,2]
     [1,]    4    7
     [2,]    5    8
     [3,]    6    9
    

 

  4. Because matrices are essentially vectors with attributes, all standard operations are “vectorized”. We thus expect a new matrix with the same dimensions as m, but with each element equal to twice the corresponding element in m:

          [,1] [,2] [,3] [,4]
     [1,]    2    8   14   20
     [2,]    4   10   16   22
     [3,]    6   12   18   24
    

 

  5. While we haven’t covered it explicitly, entering 2:3 into an index (like in question 3) is equivalent to entering a vector of values (e.g. c(2,3)). Any vector of this sort can be used to index a specific (and not necessarily sequential) set of rows or columns. Thus we expect the vector c(1,3) entered into the column index to return columns 1 and 3:

          [,1] [,2]
     [1,]    1    7
     [2,]    2    8
     [3,]    3    9
    

 

  6. Let’s see what happens if we use the t() function on m:

     t(m)
    
          [,1] [,2] [,3]
     [1,]    1    2    3
     [2,]    4    5    6
     [3,]    7    8    9
     [4,]   10   11   12
    

 

The t stands for “transpose”, and as we can see from the result, t(m) simply “flips” the matrix so that rows are now columns and columns are now rows.

 

  7. Since a matrix is just a fancy vector, requesting R to return the 4th index (m[4]) will simply return the 4th value in the underlying vector:

     [1] 4
    

 

What if we ask for an index outside the range of a matrix?

m[3,15]
Error in m[3, 15]: subscript out of bounds

 

This error occurs whenever you request a row or column that lies outside the dimensions of a matrix. Other objects respond differently to an invalid index: a plain vector, for example, returns NA rather than an error.
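It is worth checking how a given type of object responds to an out-of-range index. The sketch below compares a vector with a matrix, using made-up values; tryCatch() is used only so that the error message can be captured instead of stopping the script.

```r
letters.vec <- c("a", "b", "c")

# Indexing a vector beyond its length returns NA, not an error
beyond <- letters.vec[5]
beyond

m <- matrix(1:12, nrow = 3)

# Indexing a matrix outside its dimensions raises an error;
# tryCatch() captures the message so the script can continue
outcome <- tryCatch(m[3, 15], error = function(e) conditionMessage(e))
outcome
```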


Key Points

  • The most commonly encountered data types in R are character, numeric, and logical.

  • R’s basic data structures are vectors and matrices.

  • Objects may have attributes, such as names, dimensions, and class.

  • Use object[x] and object[x, y] to select a single element from a 1- and 2-dimensional data structure (a vector and a matrix), respectively.

  • Use from:to to specify a sequence that includes the indices from from to to.