R Data Types -- In-Depth
Overview
Class Date: 8/29/2024 -- On Your Own
Teaching: 90 min
Exercises: 30 min
Questions
What are the basic data types in R?
What are factors and how do they differ from other data types?
How is missing data represented in R?
How is infinity represented in R?
Objectives
Understand basic data types in R and how these data types are used in data structures.
Understand the structure and properties of factors.
Be able to explain the difference between ordered and unordered factors.
Be aware of some of the problems encountered when using factors.
Understand missing data and other special values (e.g. infinity).
On Your Own
A more in-depth look at Data Types in R
To make the best of the R language, you’ll need a thorough understanding of the basic data types and data structures and how to operate on them.
Data structures are important to understand because these are the primary objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.
All variables in R–basic data types, vectors, matrices, data frames, lists–are objects. Objects have a variety of attributes that define the way they interact with functions, other objects, and operators. Understanding how different classes of objects and their attributes are organized is essential to working within the R environment.
R has 6 basic data types. In addition to the five listed below, there is also raw which is rarely used and will not be discussed in this course.
- character: "a", "swc", "MCB 585" (the “quotes” define a character string and allow spaces to be included)
- numeric: 2, 15.5 (real or decimal numbers; aka “doubles”)
- integer: -14L, 2L (the L tells R to store this as an integer rather than a numeric value)
- logical: TRUE, FALSE
- complex: 1+4i (complex numbers with real and imaginary parts)
We discussed the three most common data types In Class–characters, numeric, and logical. Now let’s take a closer look at the remaining two.
Integers
During the In Class segment, we discussed three primary data types: character, numeric, and logical, because these are the most commonly used. Technically, numeric is a data type category that contains both doubles (double precision, or decimal, numbers) and integers, though often numeric is used as short-hand to refer to doubles. If you want to explicitly create integers (0, the positive whole numbers 1, 2, 3, …, and their negatives -1, -2, -3, …), you need to add an L to the numeric value:
x1 <- 1L
class(x1)
[1] "integer"
Note that if you are not explicit with the L, R will convert your integer into a numeric:
x2 <- x1 + 4L
class(x2)
[1] "integer"
x3 <- x1 + 4
class(x3)
[1] "numeric"
x4 <- x1 + 4.4
class(x4)
[1] "numeric"
x5 <- 2 * x1
class(x5)
[1] "numeric"
x6 <- 2L * x1
class(x6)
[1] "integer"
R will also convert to numeric if you try to force a decimal value to be an integer by including the L:
not.an.integer <- 1.2L
class(not.an.integer)
[1] "numeric"
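If you are unsure which numeric type you ended up with, you can inspect the value directly. The following is a minimal sketch using base R's typeof() and is.integer(); the variable names count and measure are illustrative and not part of the lesson's examples.

```r
# Illustrative variables (assumed names, not from the lesson)
count <- 5L      # integer literal
measure <- 5     # double, even though it looks like a whole number

typeof(count)        # "integer"
typeof(measure)      # "double"
is.integer(measure)  # FALSE: is.integer() tests the storage type, not the value
count == measure     # TRUE: the values compare equal despite different types

# Ordinary division returns a double; integer division keeps the integer type
typeof(count / 2L)    # "double"
typeof(count %/% 2L)  # "integer"
```

The practical point is that a value that prints as a whole number is not necessarily stored as an integer.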
Complex numbers
Complex numbers are numbers that consist of two parts, \(a + bi\), in which \(i^2 = -1\). Because the solution to this equation, \(i = \sqrt{-1}\), is not a real number, \(i\) is referred to as an imaginary number. In complex numbers, \(a\) is the real part and \(b\) is the imaginary part.
Complex numbers are represented in R using the complex data type. These can be created directly or using the complex() function:
c1 <- 4+3i
c1
[1] 4+3i
class(c1)
[1] "complex"
c2 <- complex(real = 3, imaginary = 2.5)
c2
[1] 3+2.5i
class(c2)
[1] "complex"
Note that you have to define \(b\) explicitly, and that it must be right next to the i (bi), not multiplied by the i (b*i):
c3 <- 1+i # incorrect
Error in eval(expr, envir, enclos): object 'i' not found
c3 <- 1+1*i # incorrect
Error in eval(expr, envir, enclos): object 'i' not found
c3 <- 1+1i # correct
Complex numbers can be manipulated with operators and functions:
2*c3
[1] 2+2i
c1 + c3
[1] 5+4i
c2^2
[1] 2.75+15i
sum(c1, c3)
[1] 5+4i
Adding numeric values to a complex variable results in a complex variable:
c4 <- c1 + 10
c4
[1] 14+3i
class(c4)
[1] "complex"
Both integer and complex data types are necessary for certain types of analysis. Know that they exist and how to deal with them if they come up in one of your applications.
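If you do need to work with complex values, base R provides functions for pulling them apart. The sketch below uses Re(), Im(), Mod(), and Conj(), all of which are part of base R; the variable name z1 is illustrative.

```r
z1 <- 4+3i   # illustrative value

Re(z1)    # 4, the real part
Im(z1)    # 3, the imaginary part
Mod(z1)   # 5, the modulus sqrt(4^2 + 3^2)
Conj(z1)  # 4-3i, the complex conjugate

# sqrt() of a negative double returns NaN with a warning;
# converting to complex first gives the imaginary result
sqrt(as.complex(-1))  # 0+1i
```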
Converting between data types
You can shift between data types using the as. functions (as.character(), as.numeric(), and so on):
n1 <- 42
n1.as.char <- as.character(n1)
n1.as.char
[1] "42"
# ... now check the class of the new variable:
class(n1.as.char)
[1] "character"
# you can't add numbers to characters!
n1.as.char + 4
Error in n1.as.char + 4: non-numeric argument to binary operator
It only works if the conversion makes sense in context. R also does not understand non-numeric references to numbers (e.g. using "two" to refer to the number 2).
# this one works:
as.numeric("44")
[1] 44
# these don't
as.numeric("hello!")
Warning: NAs introduced by coercion
[1] NA
as.numeric("forty-four")
Warning: NAs introduced by coercion
[1] NA
Note that when one of the as. functions cannot make sense of its input, it does not stop with an error. Instead it issues a warning message and returns NA. NA is one of several special values and represents missing data, or “Not Available”. We will discuss these special values in more detail later in this lesson.
Sometimes these functions can have unintended consequences. When we apply the as.integer() function to a numeric, it does not round to the nearest integer; it drops the decimal part, so values always move “toward 0”:
as.integer(1)
[1] 1
as.integer(0.1)
[1] 0
as.integer(0.9)
[1] 0
as.integer(1.1)
[1] 1
as.integer(-1.1)
[1] -1
as.integer(-0.9)
[1] 0
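To make the difference between truncation and rounding concrete, the following sketch compares as.integer() with base R's trunc(), round(), floor(), and ceiling() on the same input. The vector vals is an illustrative example chosen to avoid exact halves.

```r
vals <- c(-1.7, -0.2, 0.2, 1.7)   # illustrative values

as.integer(vals)  # -1  0  0  1  (drops the fractional part, returns integers)
trunc(vals)       # -1  0  0  1  (same values, but the result stays a double)
round(vals)       # -2  0  0  2  (nearest whole number)
floor(vals)       # -2 -1  0  1  (always toward -Inf)
ceiling(vals)     # -1  0  1  2  (always toward +Inf)

# If the nearest integer is what you want, round first, then convert
nearest <- as.integer(round(vals))
nearest           # -2  0  0  2
```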
The as.logical() function will take 0 as FALSE and any non-zero numeric as TRUE. For character input, it returns NA (without an error or warning) for anything that is not one of the accepted spellings of TRUE or FALSE. Note that capitalization matters here: T works, but t does not.
as.logical(0)
[1] FALSE
as.logical(1)
[1] TRUE
as.logical(10)
[1] TRUE
as.logical(0.01)
[1] TRUE
as.logical(-4)
[1] TRUE
as.logical("TRUE")
[1] TRUE
as.logical("True")
[1] TRUE
as.logical("tRUE")
[1] NA
as.logical("T")
[1] TRUE
as.logical("t")
[1] NA
as.logical("false")
[1] FALSE
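Because as.logical() returns NA silently for strings it does not recognize, it can be worth checking the result with is.na() after converting. The sketch below uses an illustrative vector named answers; note that the string "1" is not converted, even though the number 1 is.

```r
answers <- c("TRUE", "true", "T", "yes", "1")   # illustrative input
flags <- as.logical(answers)

flags                   # TRUE TRUE TRUE NA NA
answers[is.na(flags)]   # "yes" "1": the inputs that could not be converted
```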
There is a basic underlying hierarchy to data types that runs from more general (character) to more specific (logical), and conversions are only guaranteed to work in the more general direction. For additional discussion of data types in R (and many other topics), check out the Vectors chapter in the book R for Data Science, by Garrett Grolemund and Hadley Wickham.
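One place this hierarchy shows up is when values of different types are combined with c(): R converts everything to the most general type present, in the order logical, integer, double, complex, character. The following sketch illustrates this with typeof(); the variable name mixed is illustrative.

```r
typeof(c(TRUE, 2L))      # "integer"
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, 1+2i))     # "complex"
typeof(c(1+2i, "text"))  # "character"

# A single character element turns the whole vector into characters
mixed <- c(TRUE, 2L, 3.5, "text")
mixed                    # "TRUE" "2" "3.5" "text"
```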
You can also coerce vectors and matrices using the same functions, for example as.integer():
num.vec <- seq(0.1,1,0.1)
num.vec
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
class(num.vec)
[1] "numeric"
int.vec <- as.integer(num.vec)
int.vec
[1] 0 0 0 0 0 0 0 0 0 1
class(int.vec)
[1] "integer"
Factors
Factors are technically a data structure but function as a special data type in R. Factors are primarily used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
Factors look (and often behave) like character vectors, but assuming that they are character vectors can lead to unexpected behavior. Factors are actually an odd hybrid of integers and characters under the hood, and you need to be careful when treating them like character strings.
Factors have three essential properties:
- A vector of integers.
- A set of labels defining a name corresponding to each integer.
- A defined order for the labels (for ordered factors).
The integer defines the value of each element in the factor, the label indicates what that value means, and the order defines the relationship between the values.
Once created, each element of a factor can only contain one of a pre-defined set of values, known as levels. Labels and levels essentially refer to the same thing and the terms can be used interchangeably, for the most part. The labels argument in the factor() function defines the levels attribute in the created factor. By default, R sorts levels in alphabetical order. For instance, let’s use the factor() command to create a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
sex
[1] male female female male
Levels: female male
Now compare this to a similar character vector:
sex.char <- c("male", "female", "female", "male")
sex.char
[1] "male" "female" "female" "male"
Note that the elements of sex.char have quotation marks, while the object sex had a list of levels.
R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m in the alphabet, even though the first element in this vector is "male"). You can check the current order using the function levels(), and check the number of levels using nlevels():
levels(sex)
[1] "female" "male"
nlevels(sex)
[1] 2
Use the str() function to see both the levels ("female", "male") and the underlying integer representation of the factor:
str(sex)
Factor w/ 2 levels "female","male": 2 1 1 2
Note that the order in which the levels appear defines which level corresponds to which integer number.
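You can also pull out the integer codes on their own. The sketch below rebuilds the sex factor from above and uses as.integer() to show the codes, then uses them to look the labels back up; table() is included as a common way to count elements per level.

```r
sex <- factor(c("male", "female", "female", "male"))

codes <- as.integer(sex)
codes                 # 2 1 1 2: the underlying integer codes

# Indexing the levels by the codes recovers the original labels
levels(sex)[codes]    # "male" "female" "female" "male"

# Count how many elements fall into each level
table(sex)            # female: 2, male: 2
```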
The major functional difference between character and factor objects is that the elements of the character vector only have the inherent order defined by their values (e.g. alphabetical). Sometimes, the order of the factor elements does not matter; other times you might want to specify the order because it is meaningful. For instance, "low", "medium", "high" as elements of a character vector have the implicit alphabetical order:
"high" < "low" < "medium"
while the more meaningful conceptual ordering is:
"low" < "medium" < "high"
By default, factor levels take on the alphabetical order:
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
[1] "high" "low" "medium"
Adding the levels argument to the factor() function defines the level order if you do not want it to be alphabetical:
food <- factor(food, levels = c("low", "medium", "high"))
levels(food)
[1] "low" "medium" "high"
Note that “relative” operations do not work with factors unless they are ordered. The function min() returns the minimum value in an integer vector, but not for either the food or sex factor vectors:
n1 <- 1:10
min(n1)
[1] 1
min(food)
Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), levels = c("low", : 'min' not meaningful for factors
min(sex)
Error in Summary.factor(structure(c(2L, 1L, 1L, 2L), levels = c("female", : 'min' not meaningful for factors
Even though we entered the levels of food in a specific order, R does not assume that the order in which we entered the strings implies relative value unless we make it explicit. We can specify relative values for levels using the ordered argument in the factor() function (which defaults to FALSE if unspecified):
food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)
levels(food)
[1] "low" "medium" "high"
min(food)
[1] low
Levels: low < medium < high
Now class() reflects that food is an ordered factor and str() shows the relative order of its levels, while sex remains an unordered factor:
class(sex)
[1] "factor"
str(sex)
Factor w/ 2 levels "female","male": 2 1 1 2
class(food)
[1] "ordered" "factor"
str(food)
Ord.factor w/ 3 levels "low"<"medium"<..: 1 3 2 3 1 2 3
Note that numeric operations still do not work, so you can’t assume factors behave like integers either:
food[1] + food[2]
Warning in Ops.ordered(food[1], food[2]): '+' is not meaningful for ordered
factors
[1] NA
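Comparison operators, in contrast, do work on ordered factors, because the level order gives them a meaning. The sketch below rebuilds the ordered food factor from above and compares it against a level name.

```r
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"), ordered = TRUE)

# Element-wise comparison against a level
at_least_medium <- food >= "medium"
at_least_medium          # FALSE TRUE TRUE TRUE FALSE TRUE TRUE
sum(at_least_medium)     # 5 elements are "medium" or "high"

# max() and sort() follow the level order, not the alphabet
as.character(max(food))  # "high"
as.character(sort(food)) # "low" "low" "medium" "medium" "high" "high" "high"
```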
In R’s memory, these factors are represented by numbers (1, 2, 3). This is better than using simple integer labels because factors are self-describing: "low", "medium", and "high" is more descriptive than 1, 2, 3. Which is low? You would not necessarily be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels, such as a factor vector containing the unique patient identifiers for a data set containing several thousand patients.
Adding elements to factors
For the most part, adding elements to a factor works similarly to adding elements to other types of vectors:
food[8] <- "low"
food
[1] low high medium high low medium high low
Levels: low < medium < high
Note that skipping an element introduces NAs in the intermediate elements. More on this later…
food[10] <- "high"
food
[1] low high medium high low medium high low <NA> high
Levels: low < medium < high
The exception occurs when you try to add a new level that is not defined within the factor:
food[11] <- "very high"
Warning in `[<-.factor`(`*tmp*`, 11, value = "very high"): invalid factor
level, NA generated
food
[1] low high medium high low medium high low <NA> high
[11] <NA>
Levels: low < medium < high
Since the requested entry is not a valid level, R generates an NA instead. To add this element, we first must redefine the valid list of levels. You could use the factor() function with "very high" included in the levels argument, but a simpler way is to assign the new value using the levels() function:
levels(food) <- c(levels(food), "very high")
levels(food)
[1] "low" "medium" "high" "very high"
Here we just used the c() function to append a new value onto the existing list of levels. Now we can assign the new element:
food[11] <- "very high"
food
[1] low high medium high low medium high
[8] low <NA> high very high
Levels: low < medium < high < very high
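For comparison, the other route mentioned above (passing an extended levels argument to factor()) might look like the following sketch. It uses a shortened, illustrative version of the food factor, and the name food2 is an assumption made for this example.

```r
# Shortened illustrative factor (not the full vector from the lesson)
food <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)

# Rebuild the factor with one extra level appended to the existing ones
food2 <- factor(food, levels = c(levels(food), "very high"))
levels(food2)        # "low" "medium" "high" "very high"
is.ordered(food2)    # TRUE: the factor stays ordered

# The new level can now be assigned without generating NA
food2[4] <- "very high"
as.character(food2)  # "low" "high" "medium" "very high"
```

Both routes give the same set of levels; assigning to levels() changes the factor in place, while factor() builds a new object.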
Converting Factors
Converting from a factor to a number can cause problems:
f <- factor(c(3.4, 1.2, 5))
as.numeric(f)
[1] 2 1 3
This does not behave as expected (and there is no warning). The reason is that the apparent numeric values are actually stored as integers (2, 1, 3) with labels ("3.4", "1.2", "5"). R uses the integer values when performing the as.numeric() conversion.
The recommended way is to use the integer vector to index the factor levels:
levels(f)[f]
[1] "3.4" "1.2" "5"
Remember that the factor really consists of two elements:
- The ordered integer list: 2, 1, 3
- The “key” indicating which integer corresponds to which level: 1 = "1.2", 2 = "3.4", 3 = "5"
To break down levels(f)[f]:
1. First we grab the list of levels using levels(f), which outputs a character vector: "1.2" "3.4" "5".
2. Next we index this list with [f]. Because the index requests a numeric representation of the factor f, R replaces the [f] with [c(2,1,3)] (the integer portion of the factor object).
3. R returns the elements of the character list in (1) with the order indicated by the integer list in (2).
Note that the output from levels(f)[f] is actually a character vector, because we indexed the list of levels (which are stored as characters). To convert the values of f to a basic numeric type, we still need to convert the output above with as.numeric() and assign the result using <-:
f <- levels(f)[f]
f <- as.numeric(f)
f
[1] 3.4 1.2 5.0
class(f)
[1] "numeric"
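The same conversion is often written in a single line. The sketch below shows two common one-line forms using only base R functions; the names via_char and via_levels are illustrative.

```r
f <- factor(c(3.4, 1.2, 5))

# Convert the labels to character first, then to numeric
via_char <- as.numeric(as.character(f))
via_char     # 3.4 1.2 5.0

# Convert the (few) levels to numeric once, then index by the factor
via_levels <- as.numeric(levels(f))[f]
via_levels   # 3.4 1.2 5.0
```

The second form converts each level only once, which can matter for long factors with few levels.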
Examining objects
R provides many functions to examine features of vectors, matrices, and other objects. A few of the most common and useful are as follows:
- class() - what kind of object is it (high-level)?
- typeof() - what is the object’s data type (low-level)?
- mode() - what is the object’s mode (a coarser grouping of data types; for a vector or matrix it describes the elements, with integers and doubles both reported as "numeric")?
- length() - how long is it/how many elements does it contain (one-dimensional objects)?
- dim() - what are its dimensions (two-dimensional objects)?
- attributes() - does it have metadata?
- str() - what is the structure of the object?
- head() and tail() - what does the data look like (provide a sample from the beginning or end of the object)?
Here are a couple of examples:
#--------------------------------------------------
# Example 1 -- a character variable object
x <- "dataset"
# These return the same information for character variables
typeof(x)
[1] "character"
class(x)
[1] "character"
mode(x)
[1] "character"
# Simple objects do not have attributes by default
attributes(x)
NULL
#--------------------------------------------------
# Example 2 -- an integer vector object
y <- 1:100
y
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
# These provide slightly different information about numeric objects
class(y)
[1] "integer"
typeof(y)
[1] "integer"
mode(y)
[1] "numeric"
# Vectors have length (number of elements) but not dimensions (reserved for 2-dimensional objects) or attributes (more complex objects)
length(y)
[1] 100
dim(y)
NULL
attributes(y)
NULL
#--------------------------------------------------
# Example 3 -- a numeric vector object
z <- as.numeric(y)
z
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
# These provide slightly different information about numeric objects
class(z)
[1] "numeric"
typeof(z)
[1] "double"
mode(z)
[1] "numeric"
# Setting dimensions converts z to a matrix, which changes the output of the functions
dim(z) <- c(10,10)
z
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 11 21 31 41 51 61 71 81 91
[2,] 2 12 22 32 42 52 62 72 82 92
[3,] 3 13 23 33 43 53 63 73 83 93
[4,] 4 14 24 34 44 54 64 74 84 94
[5,] 5 15 25 35 45 55 65 75 85 95
[6,] 6 16 26 36 46 56 66 76 86 96
[7,] 7 17 27 37 47 57 67 77 87 97
[8,] 8 18 28 38 48 58 68 78 88 98
[9,] 9 19 29 39 49 59 69 79 89 99
[10,] 10 20 30 40 50 60 70 80 90 100
# These provide slightly different information about numeric objects
class(z)
[1] "matrix" "array"
typeof(z)
[1] "double"
mode(z)
[1] "numeric"
# Now z is a more complex object, with dimensions and attributes
length(z)
[1] 100
dim(z)
[1] 10 10
attributes(z)
$dim
[1] 10 10
# Head and tail produce the requested number of elements (vectors) or rows (2-dimensional objects) from the beginning or end, respectively (6 rows by default)
head(z)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 11 21 31 41 51 61 71 81 91
[2,] 2 12 22 32 42 52 62 72 82 92
[3,] 3 13 23 33 43 53 63 73 83 93
[4,] 4 14 24 34 44 54 64 74 84 94
[5,] 5 15 25 35 45 55 65 75 85 95
[6,] 6 16 26 36 46 56 66 76 86 96
tail(z,3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[8,] 8 18 28 38 48 58 68 78 88 98
[9,] 9 19 29 39 49 59 69 79 89 99
[10,] 10 20 30 40 50 60 70 80 90 100
typeof(z)
[1] "double"
length(z)
[1] 100
class(z)
[1] "matrix" "array"
str(z) # stands for "structure" of an object
num [1:10, 1:10] 1 2 3 4 5 6 7 8 9 10 ...
head(z) # returns the first 6 rows of a matrix (or the first 6 elements of a vector)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 11 21 31 41 51 61 71 81 91
[2,] 2 12 22 32 42 52 62 72 82 92
[3,] 3 13 23 33 43 53 63 73 83 93
[4,] 4 14 24 34 44 54 64 74 84 94
[5,] 5 15 25 35 45 55 65 75 85 95
[6,] 6 16 26 36 46 56 66 76 86 96
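The same functions help explain why factors behave as they do. In the sketch below, grp is an illustrative factor; the output shows that a factor is stored as integers, with its levels and class kept as attributes.

```r
grp <- factor(c("b", "a", "b"))   # illustrative factor

class(grp)       # "factor"  (how R functions treat the object)
typeof(grp)      # "integer" (how the values are stored)
mode(grp)        # "numeric"
attributes(grp)  # a list holding $levels ("a" "b") and $class ("factor")
```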
Representing Data in R
You have a vector representing levels of exercise undertaken by 5 subjects
“l”, “n”, “n”, “i”, “l”
where n = none, l = light, i = intense
What is the best way to represent this in R?
Here are some options:
exercise <- c("l", "n", "n", "i", "l") # (a)
exercise <- factor(c("l", "n", "n", "i", "l"), ordered = TRUE) # (b)
exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = FALSE) # (c)
exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = TRUE) # (d)
Solution
The correct solution is (d). The data is presented as a categorical variable with one of three values that have a clear order. Thus we want to store the data as an ordered factor:
exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = TRUE)
exercise
[1] l n n i l
Levels: n < l < i
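As an optional extension (not part of the exercise's answer), the labels argument of factor() can replace the one-letter codes with readable names while keeping the same order. The label strings below are taken from the key given in the exercise.

```r
exercise <- factor(c("l", "n", "n", "i", "l"),
                   levels = c("n", "l", "i"),               # order of the raw codes
                   labels = c("none", "light", "intense"),  # readable names, same order
                   ordered = TRUE)

as.character(exercise)  # "light" "none" "none" "intense" "light"
levels(exercise)        # "none" "light" "intense"
exercise > "none"       # TRUE FALSE FALSE TRUE TRUE
```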
Missing data and special values
R supports both missing data and special values in data structures.
Missing Data
Missing data is represented as NA (Not Available) and can be used for all the vector types that we have covered, though the NA is displayed differently for factors:
# numeric
y1 <- c(0.5, NA, 0.7)
y1
[1] 0.5 NA 0.7
# integer
y2 <- c(1L, 2L, NA)
y2
[1] 1 2 NA
# logical
y3 <- c(TRUE, FALSE, NA)
y3
[1] TRUE FALSE NA
# character
y4 <- c("a", NA, "c", "d", "e")
y4
[1] "a" NA "c" "d" "e"
# complex
y5 <- c(1+5i, 2-3i, NA)
y5
[1] 1+5i 2-3i NA
# factor
y6 <- factor(y4)
y6
[1] a <NA> c d e
Levels: a c d e
The function is.na() indicates which elements in a vector contain missing data by returning a logical vector with the same number of elements (TRUE for NA, FALSE for other values):
x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
[1] FALSE TRUE FALSE FALSE TRUE
is.na(y)
[1] FALSE FALSE FALSE FALSE FALSE
The function anyNA() returns TRUE if the vector contains any missing values:
anyNA(x)
[1] TRUE
anyNA(y)
[1] FALSE
By default, many functions will not return a useful result when their input contains NAs. Take sum() for example:
z <- c(1,1,2,3,5,NA,13)
sum(z)
[1] NA
The presence of any NA values in the input results in the function returning NA. If you get this result from a function, it is worth checking the help file (e.g. ?sum). Often the function will include an argument na.rm that can be used to exclude NA values from the analysis. sum() has this argument, but it is set to FALSE by default:
sum(z, na.rm = TRUE)
[1] 25
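When a function has no na.rm argument, or you want to know how much data is missing before dropping it, is.na() can be used directly. The following sketch reuses the vector z from above; z_complete is an illustrative name.

```r
z <- c(1, 1, 2, 3, 5, NA, 13)

sum(is.na(z))     # 1: how many values are missing
which(is.na(z))   # 6: the position of the missing value

# Keep only the non-missing elements
z_complete <- z[!is.na(z)]
z_complete        # 1 1 2 3 5 13

# mean() behaves like sum(): NA by default, a number with na.rm = TRUE
mean(z)                 # NA
mean(z, na.rm = TRUE)   # 25/6, about 4.17
```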
Other Special Values
Inf is how R represents infinite values. You can have either positive or negative infinity.
1/0
[1] Inf
-1/0
[1] -Inf
10 * Inf
[1] Inf
1/Inf
[1] 0
Inf is generally treated as a real value, however, and is not removed by arguments like na.rm:
m <- c(1,2,Inf)
sum(m)
[1] Inf
sum(m, na.rm = TRUE)
[1] Inf
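If infinite values need to be excluded, one option is to filter them explicitly with base R's is.finite(), which is FALSE for Inf, -Inf, NA, and NaN. The vector below is an illustrative extension of m.

```r
m <- c(1, 2, Inf, -Inf, NA)   # illustrative values

is.finite(m)     # TRUE TRUE FALSE FALSE FALSE
is.infinite(m)   # FALSE FALSE TRUE TRUE FALSE

# Keep only ordinary finite numbers before summing
finite_total <- sum(m[is.finite(m)])
finite_total     # 3
```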
NaN means “Not a Number”. It is an undefined value and used to represent the result of mathematical operations that are undefined. However, it can still be a placeholder in a numeric vector.
0/0
[1] NaN
2 * NaN
[1] NaN
Inf * NaN
[1] NaN
n <- c(1, 2, NaN)
typeof(n)
[1] "double"
Functions like sum() treat NaN values essentially the same way they treat NA values in most situations, and the na.rm argument will also filter out these values:
sum(n)
[1] NaN
sum(n, na.rm = TRUE)
[1] 3
You will run into the occasional function that differentiates between NA and NaN, so be aware that they may produce different behavior under some circumstances.
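The two test functions in base R are a simple example of this difference: is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN. The vector vals below is illustrative.

```r
vals <- c(1, NA, NaN)   # illustrative values

is.na(vals)    # FALSE TRUE TRUE   (NaN also counts as missing)
is.nan(vals)   # FALSE FALSE TRUE  (NA is not NaN)

# Elements that are NA but not NaN
only_na <- is.na(vals) & !is.nan(vals)
only_na        # FALSE TRUE FALSE
```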
Key Points
R’s basic data types are character, numeric, integer, complex, and logical.
R’s data structures include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames).
Factors are used to represent categorical data.
Factors can be ordered or unordered.
Some R functions have special methods for handling factors.
The function dim() gives the dimensions of a data structure.