R Data Types -- In-Depth

Overview

Class Date: 8/29/2024 -- On Your Own
Teaching: 90 min
Exercises: 30 min

Questions

What are the basic data types in R?

What are factors and how do they differ from other data types?

How is missing data represented in R?

How is infinity represented in R?

Objectives

Understand basic data types in R and how these data types are used in data structures.

Understand the structure and properties of factors.

Be able to explain the difference between ordered and unordered factors.

Be aware of some of the problems encountered when using factors.

Understand missing data and other special values (e.g. infinity).

On Your Own

A more in-depth look at Data Types in R

To make the best of the R language, you’ll need a thorough understanding of the basic data types and data structures and how to operate on them.

Data structures are important to understand because these are the primary objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

All variables in R–basic data types, vectors, matrices, data frames, lists–are objects. Objects have a variety of attributes that define the way they interact with functions, other objects, and operators. Understanding how different classes of objects and their attributes are organized is essential to working within the R environment.

R has 6 basic data types. In addition to the five listed below, there is also raw which is rarely used and will not be discussed in this course.

character: "a", "swc", "MCB 585" (the “quotes” define a character string and allow spaces to be included)
numeric: 2, 15.5 (real or decimal numbers; aka “doubles”)
integer: -14L, 2L (the L tells R to store this as an integer rather than a numeric value)
logical: TRUE, FALSE
complex: 1+4i (complex numbers with real and imaginary parts)

We discussed the three most common data types (character, numeric, and logical) In Class. Now let’s take a closer look at the remaining two.

Integers

During the In Class segment, we focused on three primary data types (character, numeric, and logical) because these are the most commonly used. Technically, numeric is a data type category that contains both doubles (double precision, or decimal, numbers) and integers, though numeric is often used as shorthand for doubles. If you want to explicitly create integers (0, the whole numbers 1, 2, 3, …, and their negatives -1, -2, -3, …), you need to add an L to the numeric value:

x1 <- 1L
class(x1)

[1] "integer"

Note that if you are not explicit with the L on every value in an operation, R will convert the result into a numeric:

x2 <- x1 + 4L
class(x2)

[1] "integer"

x3 <- x1 + 4
class(x3)

[1] "numeric"

x4 <- x1 + 4.4
class(x4)

[1] "numeric"

x5 <- 2 * x1
class(x5)

[1] "numeric"

x6 <- 2L * x1
class(x6)

[1] "integer"

R will also fall back to numeric (and issue a warning) if you try to force a decimal value to be an integer by including the L:

not.an.integer <- 1.2L
class(not.an.integer)

[1] "numeric"
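As a small supplementary sketch (the variable names x_int and x_dbl are illustrative and not part of the lesson), typeof() and identical() make the integer/double distinction visible, and the integer-division operators show one way a result can stay an integer:

```r
x_int <- 5L   # integer literal
x_dbl <- 5    # stored as a double

typeof(x_int)            # "integer"
typeof(x_dbl)            # "double"
x_int == x_dbl           # TRUE: the values compare equal
identical(x_int, x_dbl)  # FALSE: the storage types differ

# Ordinary division always returns a double; integer division (%/%)
# and the remainder operator (%%) keep the integer type.
typeof(x_int / 1L)    # "double"
typeof(x_int %/% 2L)  # "integer"
typeof(x_int %% 2L)   # "integer"
```

This matters mostly when a function insists on integer input, or when you compare objects with identical() rather than ==.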

Complex numbers

Complex numbers are numbers that consist of two parts, \(a + bi\), in which \(i^2 = -1\). Because the solution to this equation, \(i = \sqrt{-1}\), is not a real number, \(i\) is referred to as an imaginary number. In complex numbers, \(a\) is the real part and \(b\) is the imaginary part.

Complex numbers are represented in R using the complex data type. These can be created directly or using the complex() function:

c1 <- 4+3i
c1

[1] 4+3i

class(c1)

[1] "complex"

c2 <- complex(real = 3, imaginary = 2.5)
c2

[1] 3+2.5i

class(c2)

[1] "complex"

Note that you have to define \(b\) explicitly, and that it must be right next to the i (bi), not multiplied by the i (b*i):

c3 <- 1+i # incorrect

Error in eval(expr, envir, enclos): object 'i' not found

c3 <- 1+1*i # incorrect

Error in eval(expr, envir, enclos): object 'i' not found

c3 <- 1+1i # correct

Complex numbers can be manipulated with operators and functions:

2*c3

[1] 2+2i

c1 + c3

[1] 5+4i

c2^2

[1] 2.75+15i

sum(c1, c3)

[1] 5+4i

Adding numeric values to a complex variable results in a complex variable:

c4 <- c1 + 10
c4

[1] 14+3i

class(c4)

[1] "complex"
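Base R also provides functions for pulling a complex number apart. The following is a minimal sketch using the same value as c1 above; Re(), Im(), Mod(), and Conj() are standard base R functions:

```r
c1 <- 4+3i

Re(c1)    # real part: 4
Im(c1)    # imaginary part: 3
Mod(c1)   # modulus sqrt(4^2 + 3^2): 5
Conj(c1)  # complex conjugate: 4-3i

# sqrt() of a negative *numeric* value returns NaN with a warning,
# but the same value stored as complex has a well-defined square root.
sqrt(as.complex(-1))  # 0+1i
```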

Both integer and complex data types are necessary for certain types of analysis. Know they exist and how to deal with them if they come up in one of your applications.

Converting between data types

You can shift between data types using the as. functions:

n1 <- 42
n1.as.char <- as.character(n1)
n1.as.char

[1] "42"

# ... now check the class of the new variable:
class(n1.as.char)

[1] "character"

# you can't add numbers to characters!
n1.as.char + 4

Error in n1.as.char + 4: non-numeric argument to binary operator

It only works if the conversion makes sense in context. R also does not understand non-numeric references to numbers (e.g. using "two" to refer to the number 2).

# this one works:
as.numeric("44")

[1] 44

# these don't
as.numeric("hello!")

Warning: NAs introduced by coercion

[1] NA

as.numeric("forty-four")

Warning: NAs introduced by coercion

[1] NA

Note that when one of the as. functions cannot convert a value, it does not stop with an error. Instead it issues a warning message and returns NA in place of that value. NA is one of several special values in R; it represents missing data, or “Not Available”. We will discuss these special values in more detail later in this lesson.
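Because a failed conversion yields NA rather than stopping your script, it can be useful to check which inputs failed. The sketch below is one possible approach; the names raw_values, converted, and failed are hypothetical:

```r
raw_values <- c("44", "3.5", "forty-four", "hello!")

# suppressWarnings() hides the "NAs introduced by coercion" warning,
# since we are going to inspect the NAs ourselves
converted <- suppressWarnings(as.numeric(raw_values))

# An element failed if it is NA after conversion but was not NA before
failed <- is.na(converted) & !is.na(raw_values)

converted           # 44.0  3.5   NA   NA
raw_values[failed]  # "forty-four" "hello!"
```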

Sometimes these functions can have unintended consequences. When we apply the as.integer() function to a numeric, it does not round to the nearest integer; it truncates the decimal part, i.e. rounds “toward 0”:

as.integer(1)

[1] 1

as.integer(0.1)

[1] 0

as.integer(0.9)

[1] 0

as.integer(1.1)

[1] 1

as.integer(-1.1)

[1] -1

as.integer(-0.9)

[1] 0
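If truncation toward 0 is not what you want, base R offers several other rounding functions. The sketch below (the vector name vals is hypothetical) compares them on the same inputs:

```r
vals <- c(-1.5, -0.9, 0.9, 1.5)

as.integer(vals)  # -1  0  0  1  (truncates toward 0, returns integers)
trunc(vals)       # -1  0  0  1  (same values, but still doubles)
floor(vals)       # -2 -1  0  1  (always rounds down)
ceiling(vals)     # -1  0  1  2  (always rounds up)
round(vals)       # -2 -1  1  2  (rounds to the nearest whole number)
```

Only as.integer() changes the data type; the others return doubles, so wrap them in as.integer() if you also need integer storage.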

The as.logical() function will take 0 as FALSE and any non-zero numeric as TRUE. For character input, it returns NA (without a warning) for anything that is not one of the accepted spellings of TRUE or FALSE. Note that capitalization matters here: T works, but t does not.

as.logical(0)

[1] FALSE

as.logical(1)

[1] TRUE

as.logical(10)

[1] TRUE

as.logical(0.01)

[1] TRUE

as.logical(-4)

[1] TRUE

as.logical("TRUE")

[1] TRUE

as.logical("True")

[1] TRUE

as.logical("tRUE")

[1] NA

as.logical("T")

[1] TRUE

as.logical("t")

[1] NA

as.logical("false")

[1] FALSE

There is a basic underlying hierarchy to data types that runs from more general (character) to more specific (logical), and conversions tend to work reliably only in the more general direction. For additional discussion of data types in R (and many other topics), check out the Vectors chapter in the book R for Data Science, by Garrett Grolemund and Hadley Wickham.
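One way to see this hierarchy in action is to combine different types with c(), which silently converts everything to the most general type present. This is a minimal sketch rather than an exhaustive list of coercion rules:

```r
typeof(c(TRUE, 2L))     # "integer":   logical is promoted to integer
typeof(c(2L, 3.5))      # "double":    integer is promoted to double
typeof(c(3.5, 1+2i))    # "complex":   double is promoted to complex
typeof(c(3.5, "a"))     # "character": everything can become character

c(TRUE, 2L)   # 1 2
c(TRUE, "a")  # "TRUE" "a"
```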

You can also coerce vectors and matrices using the same functions, such as as.integer():

num.vec <- seq(0.1,1,0.1)
num.vec

 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

class(num.vec)

[1] "numeric"

int.vec <- as.integer(num.vec)
int.vec

 [1] 0 0 0 0 0 0 0 0 0 1

class(int.vec)

[1] "integer"
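One caveat when coercing a matrix: as.integer() strips attributes, including the dimensions, so the result is a plain vector. The sketch below (with the hypothetical names num_mat, flat, and int_mat) shows this, along with storage.mode()<- as one way to change the type while keeping the matrix shape:

```r
num_mat <- matrix(c(0.2, 1.7, 2.5, 3.9), nrow = 2)

# as.integer() returns a plain vector: the dim attribute is dropped
flat <- as.integer(num_mat)
flat       # 0 1 2 3
dim(flat)  # NULL

# Changing the storage mode in place keeps the dimensions
int_mat <- num_mat
storage.mode(int_mat) <- "integer"
dim(int_mat)     # 2 2
typeof(int_mat)  # "integer"
```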

Factors

Factors are technically a data structure but function as a special data type in R. Factors are primarily used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors look (and often behave) like character vectors, but assuming that they are character vectors can lead to unexpected behavior. Factors are actually an odd hybrid of integers and characters under the hood, and you need to be careful when treating them like character strings.

Factors have three essential properties:

A vector of integers.
A set of labels defining a name corresponding to each integer.
A defined order for the labels (for ordered factors).

The integer defines the value of each element in the factor, the label indicates what that value means, and the order defines the relationship between the values.

Once created, each element of a factor can only contain one of a pre-defined set of values, known as levels. Labels and levels essentially refer to the same thing and the terms can be used interchangeably, for the most part. The labels argument in the factor() function defines the levels attribute in the created factor. By default, R sorts levels in alphabetical order. For instance, let’s use the factor() command to create a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male"))
sex

[1] male   female female male  
Levels: female male

Now compare this to a similar character vector:

sex.char <- c("male", "female", "female", "male")
sex.char

[1] "male"   "female" "female" "male"

Note that the elements of sex.char have quotation marks, while the object sex had a list of levels.

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m in the alphabet, even though the first element in this vector is "male"). You can check the current order using the function levels(), and check the number of levels using nlevels():

levels(sex)

[1] "female" "male"

nlevels(sex)

[1] 2

Use the str() function to see both the levels (“female”, “male”) and the underlying integer representation of the factor:

str(sex)

 Factor w/ 2 levels "female","male": 2 1 1 2

Note that the order in which the levels appear defines which level corresponds to which integer number.
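To look at the two halves of a factor separately, you can extract the integer codes with as.integer() and use them to look up the labels. A minimal sketch, recreating the sex factor from above so the block runs on its own:

```r
sex <- factor(c("male", "female", "female", "male"))

as.integer(sex)               # 2 1 1 2  (the underlying integer codes)
levels(sex)                   # "female" "male"  (the lookup table)
levels(sex)[as.integer(sex)]  # "male" "female" "female" "male"

# table() counts how many elements fall into each level
table(sex)
```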

The major functional difference between character and factor objects is that the elements of the character vector only have the inherent order defined by their values (e.g. alphabetical). Sometimes, the order of the factor elements does not matter; other times you might want to specify the order because it is meaningful. For instance, "low", "medium", "high" as elements of a character vector have the implicit alphabetical order:

"high" < "low" < "medium"

while the more meaningful conceptual ordering is:

"low" < "medium" < "high"

By default, factor levels take on the alphabetical order:

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)

[1] "high"   "low"    "medium"

Adding the levels argument to the factor() function defines the level order if you do not want it to be alphabetical:

food <- factor(food, levels = c("low", "medium", "high"))
levels(food)

[1] "low"    "medium" "high"

Note that “relative” operations do not work with factors unless they are ordered. The function min() returns the minimum value in an integer vector, but not for either the food or sex factor vectors:

n1 <- 1:10
min(n1)

[1] 1

min(food)

Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), levels = c("low", : 'min' not meaningful for factors

min(sex)

Error in Summary.factor(structure(c(2L, 1L, 1L, 2L), levels = c("female", : 'min' not meaningful for factors

Even though we entered the levels of food in a specific order, R does not assume that the order in which we entered the strings implies relative value unless we make it explicit. We can specify relative values for levels using the ordered argument in the factor() function (which defaults to FALSE if unspecified):

food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)
levels(food)

[1] "low"    "medium" "high"

min(food)

[1] low
Levels: low < medium < high
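Once a factor is ordered, other relative operations work as well. The sketch below rebuilds the ordered food factor so that it is self-contained, then compares elements against a level name and sorts by level order:

```r
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"), ordered = TRUE)

# Comparisons use the level order, not alphabetical order
food >= "medium"  # FALSE TRUE TRUE TRUE FALSE TRUE TRUE

as.character(max(food))   # "high"
as.character(sort(food))  # "low" "low" "medium" "medium" "high" "high" "high"
```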

Now class() reports that food is an ordered factor, and str() shows the relative order of its levels. Compare the output for sex (unordered) and food (ordered):

class(sex)

[1] "factor"

str(sex)

 Factor w/ 2 levels "female","male": 2 1 1 2

class(food)

[1] "ordered" "factor"

str(food)

 Ord.factor w/ 3 levels "low"<"medium"<..: 1 3 2 3 1 2 3

Note that numeric operations still do not work, so you can’t assume factors behave like integers either:

food[1] + food[2]

Warning in Ops.ordered(food[1], food[2]): '+' is not meaningful for ordered
factors

[1] NA

In R’s memory, these factors are represented by numbers (1, 2, 3). Factors are better than plain integer codes because they are self-describing:

"low", "medium", and "high" are more descriptive than 1, 2, 3. Which is low? You would not necessarily be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels, such as a factor vector containing the unique patient identifiers for a data set containing several thousand patients.

Adding elements to factors

For the most part, adding elements to a factor works similarly to adding elements to other types of vectors:

food[8] <- "low" 
food

[1] low    high   medium high   low    medium high   low   
Levels: low < medium < high

Note that skipping an element introduces NAs in the intermediate elements. More on this later…

food[10] <- "high" 
food

 [1] low    high   medium high   low    medium high   low    <NA>   high  
Levels: low < medium < high

The exception occurs when you try to add a new level that is not defined within the factor:

food[11] <- "very high"

Warning in `[<-.factor`(`*tmp*`, 11, value = "very high"): invalid factor
level, NA generated

food

 [1] low    high   medium high   low    medium high   low    <NA>   high  
[11] <NA>  
Levels: low < medium < high

Since the requested entry is not a valid level, R generates an NA instead. To add this element, we must first redefine the valid list of levels. You could use the factor() function with “very high” included in the levels argument, but a simpler way is to assign the new value using the levels() function:

levels(food) <- c(levels(food), "very high")
levels(food)

[1] "low"       "medium"    "high"      "very high"

Here we just used the c() function to append a new value onto the existing list of levels. Now we can assign the new element:

food[11] <- "very high"
food

 [1] low       high      medium    high      low       medium    high     
 [8] low       <NA>      high      very high
Levels: low < medium < high < very high
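For comparison, here is a sketch of the factor() alternative mentioned above, using a small hypothetical factor named sizes. It also illustrates a caution that goes slightly beyond the lesson text: levels()<- replaces labels by position, so appending a level is safe, but using it to reorder levels relabels the data instead.

```r
# Hypothetical example factor, not part of the lesson data
sizes <- factor(c("low", "high", "medium"), levels = c("low", "medium", "high"))

# Rebuild the factor with an extended set of levels
extended <- factor(sizes, levels = c(levels(sizes), "very high"))
extended[4] <- "very high"   # now a valid level, so no NA is generated
as.character(extended)       # "low" "high" "medium" "very high"

# Caution: assigning reordered levels renames the existing values
relabeled <- sizes
levels(relabeled) <- c("high", "medium", "low")
as.character(relabeled)      # "high" "low" "medium"  (the data changed!)
```

To change the order of levels without touching the data, pass the desired order to the levels argument of factor(), as was done for food earlier.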

Converting Factors

Converting from a factor to a number can cause problems:

f <- factor(c(3.4, 1.2, 5))
as.numeric(f)

[1] 2 1 3

This does not behave as expected (and there is no warning). The reason is that the apparent numeric values are actually stored as integers (2, 1, 3) with labels ("3.4", "1.2", "5"). R uses the integer value when trying to perform the as.numeric() function.

The recommended way is to use the integer vector to index the factor levels:

levels(f)[f]

[1] "3.4" "1.2" "5"

Remember that the factor really consists of two elements:

The ordered integer list: 2, 1, 3
The “key” indicating which integer corresponds to which level: 1 = "1.2", 2 = "3.4", 3 = "5"

To break down the levels(f)[f]:

First we grab the list of levels using levels(f), which outputs a character vector: "1.2" "3.4" "5".
Next we index this list with [f]. Because the index requests a numeric representation of the factor f, R replaces the [f] with [c(2,1,3)] (the integer portion of the factor object).
R returns the elements of the character list in (1) with the order indicated by the integer list in (2).

Note that the output from levels(f)[f] is actually a character vector, because we indexed the list of levels (which are stored as characters). To convert the values of f to a basic numeric type, we still need to pass the output above to as.numeric() and assign the result using <-:

f <- levels(f)[f]
f <- as.numeric(f)
f

[1] 3.4 1.2 5.0

class(f)

[1] "numeric"
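The same conversion can be written in a single line. The sketch below shows two commonly used equivalents; the name readings is hypothetical and stands in for the original factor f:

```r
readings <- factor(c(3.4, 1.2, 5))

as.numeric(readings)                # 2 1 3        (integer codes: not what we want)
as.numeric(as.character(readings))  # 3.4 1.2 5.0  (convert labels, then parse)
as.numeric(levels(readings))[readings]  # 3.4 1.2 5.0  (parse levels, then index)
```

The last form converts each distinct level only once, which can be a little faster for long factors with few levels.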

Examining objects

R provides many functions to examine features of vectors, matrices, and other objects. A few of the most common and useful are as follows:

class() - what kind of object is it (high-level)?
typeof() - what is the object’s data type (low-level)?
mode() - what is the mode of the object? This is a coarser grouping than typeof(); for example, integer and double vectors both report "numeric".
length() - how long is it/how many elements does it contain (one-dimensional objects)?
dim() - what are its dimensions (two-dimensional objects)?
attributes() - does it have metadata?
str() - what is the structure of the object?
head() and tail() - what does the data look like (provide a sample from the beginning or end of the object)?

Here are a couple of examples:

#--------------------------------------------------
# Example 1 -- a character variable object
x <- "dataset"

# These return the same information for character variables
typeof(x)

[1] "character"

class(x)

[1] "character"

mode(x)

[1] "character"

# Simple objects do not have attributes by default
attributes(x) 

NULL

#--------------------------------------------------
# Example 2 -- an integer vector object
y <- 1:100
y

  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

# These provide slightly different information about numeric objects
class(y)

[1] "integer"

typeof(y)

[1] "integer"

mode(y)

[1] "numeric"

# Vectors have length (number of elements) but not dimensions (reserved for 2-dimensional objects) or attributes (more complex objects)
length(y)

[1] 100

dim(y)

NULL

attributes(y)

NULL

#--------------------------------------------------
# Example 3 -- a numeric vector object
z <- as.numeric(y)
z

  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

# These provide slightly different information about numeric objects
class(z)

[1] "numeric"

typeof(z)

[1] "double"

mode(z)

[1] "numeric"

# Setting dimensions converts z to a matrix, which changes the output of the functions
dim(z) <- c(10,10)
z

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1   11   21   31   41   51   61   71   81    91
 [2,]    2   12   22   32   42   52   62   72   82    92
 [3,]    3   13   23   33   43   53   63   73   83    93
 [4,]    4   14   24   34   44   54   64   74   84    94
 [5,]    5   15   25   35   45   55   65   75   85    95
 [6,]    6   16   26   36   46   56   66   76   86    96
 [7,]    7   17   27   37   47   57   67   77   87    97
 [8,]    8   18   28   38   48   58   68   78   88    98
 [9,]    9   19   29   39   49   59   69   79   89    99
[10,]   10   20   30   40   50   60   70   80   90   100

# These provide slightly different information about numeric objects
class(z)

[1] "matrix" "array"

typeof(z)

[1] "double"

mode(z)

[1] "numeric"

# Now z is a more complex object, with dimensions and attributes
length(z)

[1] 100

dim(z)

[1] 10 10

attributes(z)

$dim
[1] 10 10

# Head and tail produce the requested number of elements (vectors) or rows (2-dimensional objects) from the beginning or end, respectively (6 rows by default)
head(z)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1   11   21   31   41   51   61   71   81    91
[2,]    2   12   22   32   42   52   62   72   82    92
[3,]    3   13   23   33   43   53   63   73   83    93
[4,]    4   14   24   34   44   54   64   74   84    94
[5,]    5   15   25   35   45   55   65   75   85    95
[6,]    6   16   26   36   46   56   66   76   86    96

tail(z,3)

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [8,]    8   18   28   38   48   58   68   78   88    98
 [9,]    9   19   29   39   49   59   69   79   89    99
[10,]   10   20   30   40   50   60   70   80   90   100

typeof(z)

[1] "double"

length(z)

[1] 100

class(z)

[1] "matrix" "array"

str(z) # stands for "structure" of an object

 num [1:10, 1:10] 1 2 3 4 5 6 7 8 9 10 ...

head(z) # returns the first 6 rows of a matrix (or the first 6 elements of a vector)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1   11   21   31   41   51   61   71   81    91
[2,]    2   12   22   32   42   52   62   72   82    92
[3,]    3   13   23   33   43   53   63   73   83    93
[4,]    4   14   24   34   44   54   64   74   84    94
[5,]    5   15   25   35   45   55   65   75   85    95
[6,]    6   16   26   36   46   56   66   76   86    96
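The same inspection functions are informative for factors, and they tie back to the integer-plus-labels design described earlier. A minimal sketch with a hypothetical factor named sizes:

```r
sizes <- factor(c("low", "high", "low"))

class(sizes)   # "factor"   (how R functions will treat the object)
typeof(sizes)  # "integer"  (how the values are actually stored)
mode(sizes)    # "numeric"

# Unlike a plain vector, a factor carries attributes:
# the levels (labels) and the class
names(attributes(sizes))  # "levels" "class"
levels(sizes)             # "high" "low"
```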

Representing Data in R

You have a vector representing levels of exercise undertaken by 5 subjects:

“l”, “n”, “n”, “i”, “l”

where n = none, l = light, i = intense

What is the best way to represent this in R?

Here are some options:

exercise <- c("l", "n", "n", "i", "l") # (a)
exercise <- factor(c("l", "n", "n", "i", "l"), ordered = TRUE) # (b)
exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = FALSE) # (c)
exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = TRUE) # (d)

Solution

The correct solution is (d). The data are categorical, with three possible values that have a clear order. Thus we want to store the data as an ordered factor with the levels listed from least to most intense:

exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = TRUE)
exercise

[1] l n n i l
Levels: n < l < i
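To see why option (b) falls short even though it is also an ordered factor, compare how (b) and (d) answer the question "who exercised more than none?". This is a sketch; the names values, option_b, and option_d are hypothetical:

```r
values <- c("l", "n", "n", "i", "l")

option_b <- factor(values, ordered = TRUE)
option_d <- factor(values, levels = c("n", "l", "i"), ordered = TRUE)

# Without explicit levels, the order is alphabetical: i < l < n
levels(option_b)  # "i" "l" "n"
levels(option_d)  # "n" "l" "i"

option_b > "n"  # FALSE FALSE FALSE FALSE FALSE  ("n" is wrongly the highest level)
option_d > "n"  # TRUE FALSE FALSE TRUE TRUE
```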

Missing data and special values

R supports both missing data and special values in data structures.

Missing Data

Missing data is represented as NA (Not Available) and can be used for all the vector types that we have covered, though the NA is displayed differently for factors:

# numeric
y1 <- c(0.5, NA, 0.7)
y1

[1] 0.5  NA 0.7

# integer
y2 <- c(1L, 2L, NA)
y2

[1]  1  2 NA

# logical
y3 <- c(TRUE, FALSE, NA)
y3

[1]  TRUE FALSE    NA

# character
y4 <- c("a", NA, "c", "d", "e")
y4

[1] "a" NA  "c" "d" "e"

# complex
y5 <- c(1+5i, 2-3i, NA)
y5

[1] 1+5i 2-3i   NA

# factor
y6 <- factor(y4)
y6

[1] a    <NA> c    d    e   
Levels: a c d e

The function is.na() indicates which elements in a vector contain missing data by returning a logical vector with the same number of elements (TRUE for NA, FALSE for other values):

x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)

[1] FALSE  TRUE FALSE FALSE  TRUE

is.na(y)

[1] FALSE FALSE FALSE FALSE FALSE

The function anyNA() returns TRUE if the vector contains any missing values:

anyNA(x)

[1] TRUE

anyNA(y)

[1] FALSE

By default, many functions will not return a useful result if their input contains NAs. Take sum() for example:

z <- c(1,1,2,3,5,NA,13)
sum(z)

[1] NA

The presence of any NA values in the input results in the function returning NA. If you get this result from a function, it is worth checking the help file (e.g. ?sum). Often the function will include an argument na.rm that can be used to exclude NA values from the analysis. sum() has this argument, but it is set to FALSE by default:

sum(z, na.rm = TRUE)

[1] 25
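For functions that lack an na.rm argument, or when you want to know how much data is missing, is.na() can be used directly. A minimal sketch using the same vector z (the name z_complete is hypothetical):

```r
z <- c(1, 1, 2, 3, 5, NA, 13)

sum(is.na(z))    # 1  (number of missing values)
which(is.na(z))  # 6  (position of the missing value)

# Keep only the non-missing elements
z_complete <- z[!is.na(z)]
z_complete       # 1 1 2 3 5 13

# Both approaches give the same answer
mean(z, na.rm = TRUE) == mean(z_complete)  # TRUE
```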

Other Special Values

Inf is how R represents infinite values. You can have either positive or negative infinity.

1/0

[1] Inf

-1/0

[1] -Inf

10 * Inf

[1] Inf

1/Inf

[1] 0

Inf is generally treated as a real value, however, and is not removed by arguments like na.rm:

m <- c(1,2,Inf)
sum(m)

[1] Inf

sum(m, na.rm = TRUE)

[1] Inf
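If you do need to exclude infinite values, one option is to filter with is.finite(), which is FALSE for Inf, -Inf, NA, and NaN alike. A minimal sketch (the vector vals is hypothetical):

```r
vals <- c(1, 2, Inf, -Inf, NA)

is.finite(vals)    # TRUE TRUE FALSE FALSE FALSE
is.infinite(vals)  # FALSE FALSE TRUE TRUE FALSE  (NA is neither finite nor infinite)

# Sum only the ordinary, finite numbers
sum(vals[is.finite(vals)])  # 3
```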

NaN means “Not a Number”. It is an undefined value and used to represent the result of mathematical operations that are undefined. However, it can still be a placeholder in a numeric vector.

0/0

[1] NaN

2 * NaN

[1] NaN

Inf * NaN

[1] NaN

n <- c(1, 2, NaN)
typeof(n)

[1] "double"

Functions like sum() treat NaN values essentially the same way they treat NA values in most situations, and the na.rm argument will also filter out these values:

sum(n)

[1] NaN

sum(n, na.rm = TRUE)

[1] 3

You will run into the occasional function that differentiates between NA and NaN, so be aware that they may produce different behavior under some circumstances.
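The two test functions is.na() and is.nan() are a simple example of this difference: is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN. A minimal sketch (the vector v is hypothetical):

```r
v <- c(1, NA, NaN)

is.na(v)   # FALSE  TRUE  TRUE
is.nan(v)  # FALSE FALSE  TRUE

# Elements that are missing (NA) but not undefined (NaN)
which(is.na(v) & !is.nan(v))  # 2
```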

Key Points

R’s basic data types are character, numeric, integer, complex, and logical.

R’s data structures include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames).

Factors are used to represent categorical data.

Factors can be ordered or unordered.

Some R functions have special methods for handling factors.

The function dim gives the dimensions of a data structure.

MCB 585: Multidisciplinary/Quantitative Approaches to Solving Biological Problems