R Data Types -- In-Depth

Overview

Class Date: 8/29/2024 -- On Your Own
Teaching: 90 min
Exercises: 30 min
Questions
  • What are the basic data types in R?

  • What are factors and how do they differ from other data types?

  • How is missing data represented in R?

  • How is infinity represented in R?

Objectives
  • Understand basic data types in R and how these data types are used in data structures.

  • Understand the structure and properties of factors.

  • Be able to explain the difference between ordered and unordered factors.

  • Be aware of some of the problems encountered when using factors.

  • Understand missing data and other special values (e.g. infinity).


On Your Own

A more in-depth look at Data Types in R

To make the best of the R language, you’ll need a thorough understanding of the basic data types and data structures and how to operate on them.

Data structures are important to understand because these are the primary objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

All variables in R–basic data types, vectors, matrices, data frames, lists–are objects. Objects have a variety of attributes that define the way they interact with functions, other objects, and operators. Understanding how different classes of objects and their attributes are organized is essential to working within the R environment.

R has 6 basic data types: character, numeric (double), integer, complex, logical, and raw. The raw type is rarely used and will not be discussed in this course.

We discussed the three most common data types In Class: character, numeric, and logical. Now let’s take a closer look at the remaining two.

 

Integers

During the In Class segment, we discussed three primary data types: character, numeric, and logical, because these are the most commonly used. Technically, numeric is a data type category that contains both doubles (double precision, or decimal, numbers) and integers, though numeric is often used as shorthand for doubles. If you want to explicitly create integers (0, the positive whole numbers 1, 2, 3, …, and their negatives -1, -2, -3, …), you need to add an L to the numeric value:

x1 <- 1L
class(x1)
[1] "integer"

 

Note that if you are not explicit with the L, R will convert your integer into a numeric:

x2 <- x1 + 4L
class(x2)
[1] "integer"
x3 <- x1 + 4
class(x3)
[1] "numeric"
x4 <- x1 + 4.4
class(x4)
[1] "numeric"
x5 <- 2 * x1
class(x5)
[1] "numeric"
x6 <- 2L * x1
class(x6)
[1] "integer"

 

R will also convert to numeric if you try and force a decimal value to be an integer by including the L:

not.an.integer <- 1.2L
class(not.an.integer)
[1] "numeric"
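If you are unsure whether a value is stored as an integer or a double, you can ask R directly. The following is a minimal sketch using base R only; the variable names x.double and x.int are illustrative and not part of the lesson:

```r
x.double <- 1     # no L suffix: stored as a double
x.int <- 1L       # L suffix: stored as an integer

is.integer(x.double)   # FALSE
is.integer(x.int)      # TRUE
typeof(1:3)            # "integer" (the colon operator creates integers here)
typeof(5L / 2L)        # "double" (ordinary division returns a double)
typeof(5L %/% 2L)      # "integer" (integer division keeps the integer type)
```

Note that dividing two integers with / produces a double even when the result is a whole number, so integer status can be lost without any L being left out.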

 

Complex numbers

Complex numbers are numbers that consist of two parts, \(a + bi\), in which \(i^2 = -1\). Because the solution to this equation, \(i = \sqrt{-1}\), is not a real number, \(i\) is referred to as an imaginary number. In complex numbers, \(a\) is the real part and \(b\) is the imaginary part.

Complex numbers are represented in R using the complex data type. These can be created directly or using the complex() function:

c1 <- 4+3i
c1
[1] 4+3i
class(c1)
[1] "complex"
c2 <- complex(real = 3, imaginary = 2.5)
c2
[1] 3+2.5i
class(c2)
[1] "complex"

 

Note that you have to define \(b\) explicitly, and that it must be right next to the i (bi), not multiplied by the i (b*i):

c3 <- 1+i # incorrect
Error in eval(expr, envir, enclos): object 'i' not found
c3 <- 1+1*i # incorrect
Error in eval(expr, envir, enclos): object 'i' not found
c3 <- 1+1i # correct

 

Complex numbers can be manipulated with operators and functions:

2*c3
[1] 2+2i
c1 + c3
[1] 5+4i
c2^2
[1] 2.75+15i
sum(c1, c3)
[1] 5+4i

 

Adding numeric values to a complex variable results in a complex variable:

c4 <- c1 + 10
c4
[1] 14+3i
class(c4)
[1] "complex"
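Base R also provides functions for pulling a complex number apart. The sketch below reuses the value of c1 from above and shows a few of them; it is an illustration, not an exhaustive list:

```r
c1 <- 4+3i

Re(c1)    # real part: 4
Im(c1)    # imaginary part: 3
Mod(c1)   # modulus, sqrt(4^2 + 3^2): 5
Conj(c1)  # complex conjugate: 4-3i

# sqrt() of a negative double returns NaN with a warning,
# but it works once the input is explicitly complex
sqrt(as.complex(-1))  # 0+1i
```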

 

Both integer and complex data types are necessary for certain types of analysis. Know that they exist and how to deal with them if they come up in one of your applications.

 

Converting between data types

You can shift between data types using the as. functions:

n1 <- 42
n1.as.char <- as.character(n1)
n1.as.char
[1] "42"
# ... now check the class of the new variable:
class(n1.as.char)
[1] "character"
# you can't add numbers to characters!
n1.as.char + 4
Error in n1.as.char + 4: non-numeric argument to binary operator

 

A conversion only works if it makes sense in context. R also does not understand non-numeric references to numbers (e.g. using "two" to refer to the number 2).

# this one works:
as.numeric("44")
[1] 44
# these don't
as.numeric("hello!")
Warning: NAs introduced by coercion
[1] NA
as.numeric("forty-four")
Warning: NAs introduced by coercion
[1] NA

 

Note that when one of the as. functions cannot convert a value, it does not stop with an error. Instead it issues a warning message and returns NA in place of the value. NA is one of several special values and represents missing data, or “Not Available”. We will discuss these special values in more detail later in this lesson.
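Because a failed conversion yields NA rather than stopping, you can use the NAs to find out which inputs did not convert. This is a small sketch; the vector raw.values and its contents are made up for illustration:

```r
raw.values <- c("44", "forty-four", "3.5", "hello!")

# suppressWarnings() silences the coercion warning; the NAs are still produced
converted <- suppressWarnings(as.numeric(raw.values))
converted                      # 44, NA, 3.5, NA

# Use is.na() to list the inputs that failed to convert
raw.values[is.na(converted)]   # "forty-four" "hello!"
```

Silencing the warning is only reasonable when you check for the NAs yourself afterwards, as done here.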

Sometimes these functions can have unintended consequences. When we apply the as.integer() function to a numeric, it does not round to the nearest integer. Instead it drops the decimal part, which moves the value toward 0:

as.integer(1)
[1] 1
as.integer(0.1)
[1] 0
as.integer(0.9)
[1] 0
as.integer(1.1)
[1] 1
as.integer(-1.1)
[1] -1
as.integer(-0.9)
[1] 0
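If truncation is not what you want, base R has separate functions for each rounding rule. The sketch below compares them on a small made-up vector (vals is an illustrative name):

```r
vals <- c(0.9, 1.1, -1.1, -0.9)

as.integer(vals)  # 0 1 -1 0   (drops the decimal part, result is integer)
trunc(vals)       # same values, but the result stays a double
round(vals)       # 1 1 -1 -1  (nearest whole number)
floor(vals)       # 0 1 -2 -1  (always rounds down)
ceiling(vals)     # 1 2 -1 0   (always rounds up)

# To get integers rounded to the nearest whole number, round first
as.integer(round(vals))
```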

 

The as.logical() function will take 0 as FALSE and any non-zero numeric as TRUE. It returns NA for any character input that is not an accepted spelling of TRUE or FALSE. Note that capitalization matters here: T works, but t does not.

as.logical(0)
[1] FALSE
as.logical(1)
[1] TRUE
as.logical(10)
[1] TRUE
as.logical(0.01)
[1] TRUE
as.logical(-4)
[1] TRUE
as.logical("TRUE")
[1] TRUE
as.logical("True")
[1] TRUE
as.logical("tRUE")
[1] NA
as.logical("T")
[1] TRUE
as.logical("t")
[1] NA
as.logical("false")
[1] FALSE

 

There is a basic underlying hierarchy to data types that runs from more general (character) to more specific (integer), and conversions only tend to work in the more general direction. For additional discussion of data types in R (and many other topics), check out the Vectors chapter in the book R for Data Science, by Garrett Grolemund and Hadley Wickham.
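One place where this hierarchy shows up is when values of different types are combined with c(): R silently converts everything to the most general type present. The following sketch illustrates the usual order (logical, integer, double, complex, character):

```r
class(c(TRUE, 2L))     # "integer": the logical is promoted to integer
class(c(1L, 2.5))      # "numeric": the integer is promoted to double
class(c(1.5, 2i))      # "complex"
class(c(1.5, "a"))     # "character"

# With all types mixed, everything ends up as character
c(TRUE, 2L, 3.5, "a")  # "TRUE" "2" "3.5" "a"
```

No warning is issued in any of these cases, which is one reason unexpected character vectors are a common source of confusion.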

You can also coerce vectors and matrices using the same functions, such as as.integer():

num.vec <- seq(0.1,1,0.1)
num.vec
 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
class(num.vec)
[1] "numeric"
int.vec <- as.integer(num.vec)
int.vec
 [1] 0 0 0 0 0 0 0 0 0 1
class(int.vec)
[1] "integer"

Factors

Factors are technically a data structure but function as a special data type in R. Factors are primarily used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors look (and often behave) like character vectors, but assuming that they are character vectors can lead to unexpected behavior. Factors are actually an odd hybrid of integers and characters under the hood, and you need to be careful when treating them like character strings.

Factors have three essential properties: an integer, a label, and an order.

The integer defines the value of each element in the factor, the label indicates what that value means, and the order defines the relationship between the values.

Once created, each element of a factor can only contain one of a pre-defined set of values, known as levels. Labels and levels essentially refer to the same thing and the terms can be used interchangeably, for the most part. The labels argument of the factor() function defines the levels attribute of the created factor. By default, R sorts levels in alphabetical order. For instance, let’s use the factor() command to create a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male"))
sex
[1] male   female female male  
Levels: female male

 

Now compare this to a similar character vector:

sex.char <- c("male", "female", "female", "male")
sex.char
[1] "male"   "female" "female" "male"  

 

Note that the elements of sex.char are printed with quotation marks, while the object sex is printed without them and with a list of its levels.

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m in the alphabet, even though the first element in this vector is "male"). You can check the current order using the function levels(), and check the number of levels using nlevels():

levels(sex)
[1] "female" "male"  
nlevels(sex)
[1] 2

 

Use the str() function to see both the levels (“female”, “male”) and the underlying integer representation of the factor:

str(sex)
 Factor w/ 2 levels "female","male": 2 1 1 2

 

Note that the order in which the levels appear defines which level corresponds to which integer number.
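The two parts of the factor can also be extracted separately, which makes the relationship between integers and levels explicit. A minimal sketch using the sex factor from above:

```r
sex <- factor(c("male", "female", "female", "male"))

as.integer(sex)   # 2 1 1 2: the underlying integer codes
levels(sex)       # "female" "male": the labels, in level order

# Looking up each code in the levels recovers the original strings
levels(sex)[as.integer(sex)]   # "male" "female" "female" "male"
as.character(sex)              # the same result, more directly

table(sex)        # counts per level, reported in level order
```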

The major functional difference between character and factor objects is that the elements of the character vector only have the inherent order defined by their values (e.g. alphabetical). Sometimes, the order of the factor elements does not matter; other times you might want to specify the order because it is meaningful. For instance, "low", "medium", "high" as elements of a character vector have the implicit alphabetical order:

      "high" < "low" < "medium"

while the more meaningful conceptual ordering is:

      "low" < "medium" < "high"

By default, factor levels take on the alphabetical order:

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
[1] "high"   "low"    "medium"

 

Adding the levels argument to the factor() function defines the level order if you do not want it to be alphabetical:

food <- factor(food, levels = c("low", "medium", "high"))
levels(food)
[1] "low"    "medium" "high"  

 

Note that “relative” operations do not work with factors unless they are ordered. The function min() returns the minimum value of an integer vector, but throws an error for both the food and sex factors:

n1 <- 1:10
min(n1)
[1] 1
min(food)
Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), levels = c("low", : 'min' not meaningful for factors
min(sex)
Error in Summary.factor(structure(c(2L, 1L, 1L, 2L), levels = c("female", : 'min' not meaningful for factors

 

Even though we entered the levels of food in a specific order, R does not assume that the order in which we entered the strings implies relative value unless we make it explicit. We can specify relative values for levels using the ordered argument of the factor() function (which defaults to FALSE if unspecified):

food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)
levels(food)
[1] "low"    "medium" "high"  
min(food)
[1] low
Levels: low < medium < high

 

Now class() and str() both reflect the order and relative values of the levels, respectively:

class(sex)
[1] "factor"
str(sex)
 Factor w/ 2 levels "female","male": 2 1 1 2
class(food)
[1] "ordered" "factor" 
str(food)
 Ord.factor w/ 3 levels "low"<"medium"<..: 1 3 2 3 1 2 3
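Once a factor is ordered, comparison operators and sorting follow the level order rather than the alphabet. The sketch below rebuilds the food factor so that it runs on its own:

```r
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"), ordered = TRUE)

food >= "medium"      # FALSE TRUE TRUE TRUE FALSE TRUE TRUE
food[food > "low"]    # keeps the five "medium" and "high" elements
max(food)             # high
sort(food)            # low low medium medium high high high
```

The same comparisons on an unordered factor return NA with a warning, so ordering is what makes this kind of filtering possible.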

 

Note that numeric operations still do not work, so you can’t assume factors behave like integers either:

food[1] + food[2]
Warning in Ops.ordered(food[1], food[2]): '+' is not meaningful for ordered
factors
[1] NA

 

In R’s memory, these factors are represented by numbers (1, 2, 3). This is better than using simple integer labels because factors are self-describing:

"low", "medium", and "high" are more descriptive than 1, 2, 3. Which is low? You would not necessarily be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels, such as a factor vector containing the unique patient identifiers for a data set containing several thousand patients.

 

Adding elements to factors

For the most part, adding elements to a factor works similarly to adding elements to other types of vectors:

food[8] <- "low" 
food
[1] low    high   medium high   low    medium high   low   
Levels: low < medium < high

 

Note that skipping an element introduces NAs in the intermediate elements. More on this later…

food[10] <- "high" 
food
 [1] low    high   medium high   low    medium high   low    <NA>   high  
Levels: low < medium < high

 

The exception occurs when you try to add a new level that is not defined within the factor:

food[11] <- "very high"
Warning in `[<-.factor`(`*tmp*`, 11, value = "very high"): invalid factor
level, NA generated
food
 [1] low    high   medium high   low    medium high   low    <NA>   high  
[11] <NA>  
Levels: low < medium < high

 

Since the requested entry is not a valid level, R generates an NA instead. To add this element, we first must redefine the valid list of levels. You could use the factor() function with “very high” included in the levels argument, but a simpler way is to assign the new value using the levels() function:

levels(food) <- c(levels(food), "very high")
levels(food)
[1] "low"       "medium"    "high"      "very high"

 

Here we just used the c() function to append a new value onto the existing list of levels. Now we can assign the new element:

food[11] <- "very high"
food
 [1] low       high      medium    high      low       medium    high     
 [8] low       <NA>      high      very high
Levels: low < medium < high < very high
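For comparison, here is a sketch of the factor() alternative mentioned above, in which the factor is rebuilt with an extended set of levels. It uses a shortened three-element version of food so that it runs on its own:

```r
food <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)

# Rebuild the factor, appending the new level to the existing ones
food <- factor(food, levels = c(levels(food), "very high"), ordered = TRUE)

food[4] <- "very high"   # now a valid level, so no NA is generated
food
levels(food)             # "low" "medium" "high" "very high"
```

Both approaches give the same result here. Rebuilding with factor() is more useful when you also want to reorder the existing levels at the same time.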

 

Converting Factors

Converting from a factor to a number can cause problems:

f <- factor(c(3.4, 1.2, 5))
as.numeric(f)
[1] 2 1 3

 

This does not behave as expected (and there is no warning). The reason is that the apparent numeric values are actually stored as integers (2, 1, 3) with labels ("3.4", "1.2", "5"). R uses the integer value when trying to perform the as.numeric() function.

The recommended way is to use the integer vector to index the factor levels:

levels(f)[f]
[1] "3.4" "1.2" "5"  

 

Remember that the factor really consists of two elements: the integer codes and the character vector of levels.

To break down levels(f)[f]:

  1. First we grab the levels using levels(f), which outputs a character vector: "1.2" "3.4" "5".
  2. Next we index this vector with [f]. Because the index requests a numeric representation of the factor f, R replaces the [f] with [c(2,1,3)] (the integer portion of the factor object).
  3. R returns the elements of the character vector in (1) in the order indicated by the integer vector in (2).

Note that the output from levels(f)[f] is actually a character vector, because we indexed the levels (which are stored as characters). To convert the values of f to a basic numeric type, we still need to apply as.numeric() and assign the result using <-:

f <- levels(f)[f]
f <- as.numeric(f)
f
[1] 3.4 1.2 5.0
class(f)
[1] "numeric"
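The two steps can also be written as a single expression. The sketch below shows two common one-line forms; both return the original numbers rather than the integer codes:

```r
f <- factor(c(3.4, 1.2, 5))

# Convert to character first, then to numeric
as.numeric(as.character(f))   # 3.4 1.2 5.0

# Convert the levels once, then index them with the factor
as.numeric(levels(f))[f]      # 3.4 1.2 5.0
```

The second form converts each level only once, which can matter for long factors with few levels.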

Examining objects

R provides many functions to examine features of vectors, matrices, and other objects. A few of the most common and useful are class(), typeof(), mode(), length(), dim(), attributes(), str(), head(), and tail().

Here are a few examples:

#--------------------------------------------------
# Example 1 -- a character variable object
x <- "dataset"

# These return the same information for character variables
typeof(x)
[1] "character"
class(x)
[1] "character"
mode(x)
[1] "character"
# Simple objects do not have attributes by default
attributes(x) 
NULL
#--------------------------------------------------
# Example 2 -- an integer vector object
y <- 1:100
y
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
# These provide slightly different information about numeric objects
class(y)
[1] "integer"
typeof(y)
[1] "integer"
mode(y)
[1] "numeric"
# Vectors have length (number of elements) but not dimensions (reserved for 2-dimensional objects) or attributes (more complex objects)
length(y)
[1] 100
dim(y)
NULL
attributes(y)
NULL
#--------------------------------------------------
# Example 3 -- a numeric vector object
z <- as.numeric(y)
z
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
# These provide slightly different information about numeric objects
class(z)
[1] "numeric"
typeof(z)
[1] "double"
mode(z)
[1] "numeric"
# Setting dimensions converts z to a matrix, which changes the output of the functions
dim(z) <- c(10,10)
z
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1   11   21   31   41   51   61   71   81    91
 [2,]    2   12   22   32   42   52   62   72   82    92
 [3,]    3   13   23   33   43   53   63   73   83    93
 [4,]    4   14   24   34   44   54   64   74   84    94
 [5,]    5   15   25   35   45   55   65   75   85    95
 [6,]    6   16   26   36   46   56   66   76   86    96
 [7,]    7   17   27   37   47   57   67   77   87    97
 [8,]    8   18   28   38   48   58   68   78   88    98
 [9,]    9   19   29   39   49   59   69   79   89    99
[10,]   10   20   30   40   50   60   70   80   90   100
# These provide slightly different information about numeric objects
class(z)
[1] "matrix" "array" 
typeof(z)
[1] "double"
mode(z)
[1] "numeric"
# Now z is a more complex object, with dimensions and attributes
length(z)
[1] 100
dim(z)
[1] 10 10
attributes(z)
$dim
[1] 10 10
# Head and tail produce the requested number of elements (vectors) or rows (2-dimensional objects) from the beginning or end, respectively (6 rows by default)
head(z)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1   11   21   31   41   51   61   71   81    91
[2,]    2   12   22   32   42   52   62   72   82    92
[3,]    3   13   23   33   43   53   63   73   83    93
[4,]    4   14   24   34   44   54   64   74   84    94
[5,]    5   15   25   35   45   55   65   75   85    95
[6,]    6   16   26   36   46   56   66   76   86    96
tail(z,3)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [8,]    8   18   28   38   48   58   68   78   88    98
 [9,]    9   19   29   39   49   59   69   79   89    99
[10,]   10   20   30   40   50   60   70   80   90   100
typeof(z)
[1] "double"
length(z)
[1] 100
class(z)
[1] "matrix" "array" 
str(z) # stands for "structure" of an object
 num [1:10, 1:10] 1 2 3 4 5 6 7 8 9 10 ...
head(z) # returns the first 6 elements (or rows, for a matrix) of an object
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1   11   21   31   41   51   61   71   81    91
[2,]    2   12   22   32   42   52   62   72   82    92
[3,]    3   13   23   33   43   53   63   73   83    93
[4,]    4   14   24   34   44   54   64   74   84    94
[5,]    5   15   25   35   45   55   65   75   85    95
[6,]    6   16   26   36   46   56   66   76   86    96

 

Representing Data in R

You have a vector representing levels of exercise undertaken by 5 subjects

        “l”, “n”, “n”, “i”, “l”

        where n = none, l = light, i = intense

What is the best way to represent this in R?

Here are some options:

exercise <- c("l", "n", "n", "i", "l") # (a)
exercise <- factor(c("l", "n", "n", "i", "l"), ordered = TRUE) # (b)
exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = FALSE) # (c)
exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = TRUE) # (d)

Solution

The correct solution is (d). The data is presented as a categorical variable with one of three values that have a clear order. Thus, we want to store the data as an ordered factor:

exercise <- factor(c("l", "n", "n", "i", "l"), levels = c("n", "l", "i"), ordered = TRUE)
exercise
[1] l n n i l
Levels: n < l < i

Missing data and special values

R supports both missing data and special values in data structures.

 

Missing Data

Missing data is represented as NA (Not Available) and can be used for all the vector types that we have covered, though the NA is displayed differently for factors:

# numeric
y1 <- c(0.5, NA, 0.7)
y1
[1] 0.5  NA 0.7
# integer
y2 <- c(1L, 2L, NA)
y2
[1]  1  2 NA
# logical
y3 <- c(TRUE, FALSE, NA)
y3
[1]  TRUE FALSE    NA
# character
y4 <- c("a", NA, "c", "d", "e")
y4
[1] "a" NA  "c" "d" "e"
# complex
y5 <- c(1+5i, 2-3i, NA)
y5
[1] 1+5i 2-3i   NA
# factor
y6 <- factor(y4)
y6
[1] a    <NA> c    d    e   
Levels: a c d e

 

The function is.na() indicates which elements in a vector contain missing data by returning a logical vector with the same number of elements (TRUE for NA, FALSE for other values):

x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
[1] FALSE  TRUE FALSE FALSE  TRUE
is.na(y)
[1] FALSE FALSE FALSE FALSE FALSE

 

The function anyNA() returns TRUE if the vector contains any missing values:

anyNA(x)
[1] TRUE
anyNA(y)
[1] FALSE
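Because is.na() returns a logical vector, it combines naturally with other base functions to locate, count, or drop missing values. A minimal sketch using the same x:

```r
x <- c("a", NA, "c", "d", NA)

which(is.na(x))   # positions of the missing values: 2 5
sum(is.na(x))     # number of missing values: 2
mean(is.na(x))    # proportion of missing values: 0.4
x[!is.na(x)]      # the non-missing elements: "a" "c" "d"
```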

 

By default, many functions will not return a useful result if their input contains NAs. Take sum() for example:

z <- c(1,1,2,3,5,NA,13)
sum(z)
[1] NA

 

The presence of any NA values in the input results in the function returning NA. If you get this result from a function, it is worth checking the help file (e.g. ?sum). Often the functions will include an argument na.rm that can be used to exclude NA values from analysis. sum() has this argument, but it is set to FALSE by default:

sum(z, na.rm = TRUE)
[1] 25
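Other summary functions in base R follow the same pattern. The sketch below applies mean() and max() to the same z, and also shows the alternative of dropping the NAs yourself before calling the function:

```r
z <- c(1, 1, 2, 3, 5, NA, 13)

mean(z)                 # NA
mean(z, na.rm = TRUE)   # 25 / 6, about 4.17
max(z, na.rm = TRUE)    # 13

# Alternatively, remove the NAs explicitly
sum(z[!is.na(z)])       # 25
```

Not every function has an na.rm argument, so filtering with is.na() is a useful fallback.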

 

Other Special Values

Inf is how R represents infinite values. You can have either positive or negative infinity.

1/0
[1] Inf
-1/0
[1] -Inf
10 * Inf
[1] Inf
1/Inf
[1] 0

 

Inf is generally treated as a real value, however, and is not removed by arguments like na.rm:

m <- c(1,2,Inf)
sum(m)
[1] Inf
sum(m, na.rm = TRUE)
[1] Inf
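If you do need to exclude infinite values, one option is to filter them out with is.finite(). A minimal sketch, with a negative infinity added to the vector for illustration:

```r
m <- c(1, 2, Inf, -Inf)

is.finite(m)           # TRUE TRUE FALSE FALSE
is.infinite(m)         # FALSE FALSE TRUE TRUE
sum(m[is.finite(m)])   # 3
```

Note that is.finite() is also FALSE for NA and NaN, so this filter removes those values as well.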

 

NaN means “Not a Number”. It represents the result of mathematical operations that are undefined. However, it can still be a placeholder in a numeric vector.

0/0
[1] NaN
2 * NaN
[1] NaN
Inf * NaN
[1] NaN
n <- c(1, 2, NaN)
typeof(n)
[1] "double"

 

Functions like sum() treat NaN values essentially the same way they treat NA values in most situations, and the na.rm argument will also filter out these values:

sum(n)
[1] NaN
sum(n, na.rm = TRUE)
[1] 3

 

You will run into the occasional function that differentiates between NA and NaN, so be aware that they may produce different behavior under some circumstances.
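The two test functions is.na() and is.nan() are a simple example of this difference. In the sketch below, v is an illustrative vector containing one of each:

```r
v <- c(1, NA, NaN)

is.na(v)    # FALSE TRUE TRUE: is.na() is TRUE for both NA and NaN
is.nan(v)   # FALSE FALSE TRUE: is.nan() is TRUE only for NaN
```

To select only the true NA values and not the NaNs, you can combine the two: is.na(v) & !is.nan(v).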


Key Points

  • R’s basic data types are character, numeric, integer, complex, and logical.

  • R’s data structures include the vector, list, matrix, data frame, and factor. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames).

  • Factors are used to represent categorical data.

  • Factors can be ordered or unordered.

  • Some R functions have special methods for handling factors.

  • The function dim() gives the dimensions of a data structure.