Data structures in R

## Prologue

Every programming language has a few ways of organizing data. The smallest variables, which cannot be divided further, are called atomic variables - more about them in the next section. These smaller variables can be arranged into more complex and useful systems for organizing data. In R, we have vectors, matrices, lists and data frames for organizing large amounts of data in useful ways. In Matlab, we have matrices, cells and structures. In Python, we have lists and dictionaries. However, lists in R and lists in Python are very different entities. Every coding platform has its own data structures, and here we shall focus specifically on the data structures used in R.

## Atomic Variables in R

Atomic variables in R are basic data structures that store a single value. The value of an atomic variable cannot be further divided into simpler components. There are six types of atomic variables in R:

- Numeric: used to store real numbers (e.g. 1.5, -3.14). Example: `x <- 3.14`
- Integer: used to store integers (e.g. 1, -5). Example: `y <- -5L` (the `L` suffix is explained in the section on integers below)
- Character: used to store character strings (e.g. "hello", "world"). Example: `z <- "hello"`
- Logical: used to store logical values (TRUE or FALSE). Example: `a <- TRUE`
- Complex: used to store complex numbers (e.g. 2+3i). Example: `b <- 2 + 3i`
- Raw: used to store raw data, represented as a sequence of bytes. Example: `c <- charToRaw("hello")`

It's important to note that atomic values in R behave as if they were immutable: when a variable is modified, R in general creates a modified copy and binds the name to it, rather than changing the original value in place.

## Non-atomic variables in R

The basic variables in R are all atomic: each of them stores values of a single data type. R does have non-atomic (recursive) structures, such as lists, which can hold values of different data types together; we discuss them later in this chapter. This is comparable to languages such as C and C++, where structures can store multiple values of different data types. This can be useful for some applications, but it can also make programs more difficult to write and debug.

We, the biologists, shall most frequently use the first four types of atomic variables, i.e., numeric, integer, character and logical. Let's discuss them in detail.

## Vectors in R

### The numeric

In biology, we encounter different formats of data. For instance, a zoologist may measure the lengths of Gangetic dolphins. Suppose one dolphin measures 2.44 meters. Lengths can take fractional values. Mathematically, such variables are termed continuous variables. Such data is stored in R as the ‘numeric’ data type. Lengths of several dolphins as measured in a given stretch of the river can be stored in a vector. For instance:

```r
dolphin_lengths <- c(2.44, 2.58, 1.28, 2.34, 1.57)
```

Now, if you type `dolphin_lengths` in the console you will see the following:

```r
dolphin_lengths
```

`> [1] 2.44 2.58 1.28 2.34 1.57`

In R, vectors are one-dimensional arrays. Different vectors can hold different kinds of data; however, one vector can hold only one kind. More on this in a while. You can now test the type of the vector as follows:

```r
class(dolphin_lengths)
```

`> [1] "numeric"`

### The integer

Now, imagine an epidemiologist is collecting data on the number of individuals infected with dengue in New Delhi in a given year. For example, in 2015 Delhi registered 15,730 cases of dengue. The number of infected individuals cannot be a fraction, and mathematically such data are called discrete variables, or integers. R stores such data as the ‘integer’ data type.

```r
patients_Del15 <- c(3008, 2987, 5987, 1876, 4987)
```

Now, if you test this vector to know its class:

```r
class(patients_Del15)
```

`> [1] "numeric"`

Yes, this one is also a numeric vector. If you want to construct an integer vector in R, you will have to add 'L' after each entry, as follows.

```r
patients_Del15 <- c(3008L, 2987L, 5987L, 1876L, 4987L)
class(patients_Del15)
```

`> [1] "integer"`

Why L? And why bother about integer vectors?
We shall revisit this.

### The characters

Next, suppose a clinician is recording the types of cancer among the patients a public hospital has admitted within a given period of time. Her entries are ‘pancreatic’, ‘breast’, ‘cervical’, ‘lung’ etc. R stores such data as the character data type. Sometimes these are also called strings.

```r
cancer_type <- c("pancreatic", "lung", "lung", "cervical", "head and neck", "pancreatic")
class(cancer_type)
```

`> [1] "character"`

### The logicals

Then, say, the clinician decides to record the smoking status of these patients. If the patient is a smoker, she assigns TRUE, otherwise she assigns FALSE in her records. Such Boolean variables are stored as the logical data type in R. Let's start with a simple example of logical vectors. Let's say we are listing which enantiomers of organic molecules are available in biological systems.

```r
D.Amino <- FALSE
L.Sugar <- FALSE
D.Sugar <- TRUE
L.Amino <- TRUE
s <- c(L.Sugar, D.Sugar, L.Amino, D.Amino)
print(s)
```

`> [1] FALSE TRUE TRUE FALSE`

Here we first assigned logical values to variables. Then we stitched those variables together in a vector `s`. Finally, we printed `s`. Let's go back to what our clinician is doing. She is recording the smoking status of her cohort of patients.

```r
smoking_status <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
class(smoking_status)
```

`> [1] "logical"`

## Coercion

Let's say a student, completely new to R, mistakenly decided that he will save the age, pulse rate and smoking status of a patient in a single vector. Thus he thought he will construct a vector for each patient.

```r
patient_001 <- c(46.7, 69, FALSE)
```

Then he typed the name of the vector, `patient_001`, in the console to check its contents.

```r
patient_001
```

`> [1] 46.7 69.0 0.0`

The output is definitely not what he was expecting. He tests the class of the vector to understand what is going on:

```r
class(patient_001)
```

`> [1] "numeric"`

There were two numeric items and one logical item in the vector `patient_001` during assignment.
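The same effect can be observed piece by piece. The following is only a minimal sketch: the names `age`, `pulse`, `smoker`, `types_before` and `combined` are hypothetical and introduced just for this illustration, while the values are the ones used in `patient_001`. It checks the type of each value on its own and then the type of the vector produced by `c()`.

```r
# Hypothetical names, used only for this illustration
age    <- 46.7    # a double
pulse  <- 69      # also a double (no L suffix)
smoker <- FALSE   # a logical

# Types of the individual values before they are combined
types_before <- c(typeof(age), typeof(pulse), typeof(smoker))
print(types_before)

# Type of the vector after combining the values with c()
combined <- c(age, pulse, smoker)
print(typeof(combined))

# The third element is no longer a logical value
print(combined[3])
```

The third element now prints as the number 0, which corresponds to the `0.0` in the output of `patient_001` above.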
How come everything became numeric? If there is more than one kind of data within a single vector, R converts them into a single type. This behavior is called coercion. R converts the items in a vector towards the more flexible data type. Here is the order of coercion, i.e., the flexibility of the data types in increasing order: logical, integer, double, character. In our example `patient_001`, there were numeric and logical data types. Due to coercion, everything was converted into double. Remember, since all items in a vector in R belong to the same data type, these vectors are sometimes called atomic vectors.

The following code illustrates the order of coercion. The type of the variable `x` changes from double to character as we force one member of `x` to be a character.

```r
x <- c(1, 2, 3, 4, 5)
print(typeof(x))
x[3] <- 'human'
print(typeof(x))
```

`> [1] "double"`

`> [1] "character"`

## Index of vectors

You may have noticed that when we printed the outputs of atomic variables or atomic vectors, the display begins with 1 in square brackets. This simply indicates the index of the first item displayed on that line. You may ask: how come atomic variables are indexed? After all, they contain just one value, right? What is the point of indexing them? Good question. The atomic variables in R are also vectors, which contain only one value. In R, there is no dedicated data structure for a variable with a single value. Be it one value or an array of values, everything is a vector.

Readers who are new to programming should notice that the indexing starts with 1. For readers who are already exposed to languages such as Python, this might be a surprise, since in Python indexing starts with 0.

## Length of a vector

You can know the number of items in a vector using the `length()` function. We shall talk about functions at length later; in this section we only talk about the `length()` function. Let's use a previously defined vector for the example.
```r
s <- c(L.Sugar, D.Sugar, L.Amino, D.Amino)
length(s)
```

`> [1] 4`

So, the `length()` function tells us here that the length of the vector `s` is 4, i.e., there are four items in vector `s`.

## Matrices

Vector, matrix, or list - all of these data structures are made up of multiple atomic variables, and they can be used to store and manipulate data in a more efficient and organized way. Next we discuss matrices and arrays. A matrix is a two-dimensional array of elements, where all elements are of the same type. Example:

```r
y <- matrix(1:6, nrow = 2, ncol = 3)
print(y)
```

```
>    [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
```

The colon between two numbers generates the vector containing all numbers within the range at unit intervals. So, `1:6` generated all integers from 1 to 6. The `matrix()` function arranged the numbers in a matrix, column-wise. We can also tell the `matrix()` function to arrange the numbers row-wise, by setting the `byrow` argument to `TRUE`. This argument is set to `FALSE` by default.

```r
y <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(y)
```

```
>    [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
```

Let us next create a matrix of random numbers having three rows and three columns.

```r
M <- matrix(rnorm(9), nrow = 3, ncol = 3)
print(M)
```

```
>          [,1]       [,2]       [,3]
[1,] -0.3322998  0.1121791  0.1277204
[2,] -0.6603980 -0.6899717  0.6453514
[3,] -0.9595657 -0.3490316 -1.5582268
```

## Arrays

Arrays and matrices are largely similar data structures in R; however, arrays can have more than two dimensions. For example:

```r
z <- array(1:12, dim = c(2, 3, 2))
print(z)
```

```
> , , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
```

We specify the dimensions of `z` as a vector `c(2,3,2)`, and we feed that to the argument `dim`. As you can guess, this vector means that we require 2 rows, 3 columns and two such 2 by 3 matrices. Notice that when we print the array, it is printed one item of the third dimension at a time.
`, , 1` means that the first item in the third dimension is displayed, which includes all rows and columns of the first two dimensions.

## Lists

A list is a collection of objects of different types, including other lists. Lists are especially useful when several types of data are required to be saved in the same data structure. Let's begin our discussion on lists with a simple example:

```r
Start.Codons_Ecoli <- list(c("ATG%", "GTG%", "TTG%"), c(83, 14, 3))
print(Start.Codons_Ecoli)
```

```
> [[1]]
[1] "ATG%" "GTG%" "TTG%"

[[2]]
[1] 83 14  3
```

The list named `Start.Codons_Ecoli` was created using the function `list()`, where `Start.Codons_Ecoli` is a list of two vectors. Notice that when we print the list, each of the vectors is kept in a field whose index is shown within double square brackets. Also notice that keeping the character vector and the numeric vector in two different fields did not cause any unsolicited trouble such as coercion.

Let's say we are keeping the information of a patient in a list, as follows:

```r
patient_k <- list(patient_id = "J49",
                  sex = "M",
                  age = 47.5,
                  smoker = TRUE)
print(patient_k)
```

```
> $patient_id
[1] "J49"

$sex
[1] "M"

$age
[1] 47.5

$smoker
[1] TRUE
```

Notice that the names of each of the fields, or the tags, are preceded by a `$` sign. You can use the same sign for accessing the fields in a list.

```r
patient_k$age
```

```
> [1] 47.5
```

Further, tags are not strictly necessary to build a list, and the fields can also be accessed by indexes in double square brackets, in the presence or absence of tags. Let's use both of our examples of lists again to show this:

```r
Start.Codons_Ecoli[[1]]
```

```
> [1] "ATG%" "GTG%" "TTG%"
```

With the second example:

```r
patient_k[[1]]
```

```
> [1] "J49"
```

### Lists vs Vectors

Importantly, unlike vectors, the components of lists do not have to be atomic. We can keep vectors, lists, data frames or matrices as the components of lists. For this reason, lists in R are sometimes also known as recursive vectors. Let's discuss one such example. We need to record the known allergens and the current medications and doses for our patient. We can extend the existing list as follows:

```r
patient_k$allergens <- c("pollen", "soy", "nuts")
patient_k$medicines <- list(drugA = "2-doses", drugB = "1-dose")
print(patient_k)
```

Now we look at the extended list:

```
> $patient_id
[1] "J49"

$sex
[1] "M"

$age
[1] 47.5

$smoker
[1] TRUE

$allergens
[1] "pollen" "soy"    "nuts"

$medicines
$medicines$drugA
[1] "2-doses"

$medicines$drugB
[1] "1-dose"
```

Any component of the list `patient_k` can be accessed by the right use of indexing. Indexes of several layers can be concatenated. Let's say we need to know the first allergen of the patient:

```r
patient_k$allergens[1]
```

```
> [1] "pollen"
```

## Data Frames

Data frames are table-like structures, where each column can have a different data type. Let's begin with a toy example:

```r
col1 <- c(1, 2, 3)
col2 <- c('a', 'b', 'c')
dd <- data.frame(col1, col2)
```

Let's look at `dd` now. You can double click on the variable displayed in the global environment, and you should see the following. Or, you can try the `View()` function in the console as well.
```r
View(dd)
```

|     | col1 | col2 |
| --- | ---- | ---- |
| 1   | 1    | a    |
| 2   | 2    | b    |
| 3   | 3    | c    |

### The most important R data structure

In our experience, data frames are arguably the most important data structure in R. In your day to day R coding you will encounter this data structure much more frequently than the other ones. The reason data frames are so popular is that they bring together the best and the most useful properties of different data structures in R. A data frame is basically a list of vectors, where all such vectors must be of the same length. Thus a data frame contains different types of atomic vectors seamlessly knitted together in a table-like format. That is exactly the major advantage of a data frame over a matrix: a matrix is an atomic vector with dimensions, and hence can contain only one type of data, such as strings or numerics.

Keeping different types of data together is central to data science today: the strings, the numeric and the logical items have to be together in a systematic table for data wrangling, and data frames do exactly that. Moreover, the powerful subsetting techniques of matrices and vectors are perfectly applicable to data frames. Thus they become the powerhouse of data analysis on the R platform. Hence, understanding data frames is a central pillar of learning to code in R.

### Generating data frames

Let's create a data frame using some of the vectors which we created in the previous sections.
```r
Patient_id <- c("J087", "W066", "W003", "C189", "H654", "D230")
Age <- c(39, 33, 78, 51, 44, 65)
cancer_type <- c("pancreatic", "lung", "lung", "cervical", "head and neck", "pancreatic")
smoking_status <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
cancer_data <- data.frame(Patient_id, Age, cancer_type, smoking_status,
                          stringsAsFactors = FALSE)
```

Now let's see what is inside the data frame `cancer_data`:

```r
View(cancer_data)
```

|     | Patient_id | Age | cancer_type   | smoking_status |
| --- | ---------- | --- | ------------- | -------------- |
| 1   | J087       | 39  | pancreatic    | FALSE          |
| 2   | W066       | 33  | lung          | TRUE           |
| 3   | W003       | 78  | lung          | TRUE           |
| 4   | C189       | 51  | cervical      | FALSE          |
| 5   | H654       | 44  | head and neck | FALSE          |
| 6   | D230       | 65  | pancreatic    | FALSE          |

We generated two new vectors: `Patient_id` and `Age`. We already had the vectors `cancer_type` and `smoking_status` from the examples of vectors - we just copied them here again. We put all of them together using the function `data.frame()`.

What is the use of the argument `stringsAsFactors`? In R versions before 4.0.0, character vectors incorporated into a data frame were converted to factors by default; from R 4.0.0 onwards the default is `FALSE`. We shall discuss the factor data structure in a later section. For now just keep in mind that this string to factor conversion is not generally useful, in our experience, for the analysis of most biological data. Hence we set the argument to `FALSE` explicitly here, which gives the same result in old and new versions of R. There could be scenarios when conversion to factors can be useful - we shall discuss those in due course.

### Creating data frames from matrices

Although it is not used very frequently, we wanted to note that data frames can be generated directly from matrices. We are using our previously generated matrix for this example.
```r
M <- matrix(rnorm(9), nrow = 3, ncol = 3)
D <- data.frame(M)
```

We can look at `D` now:

```r
View(D)
```

|     | X1         | X2         | X3          |
| --- | ---------- | ---------- | ----------- |
| 1   | -0.6300989 | -0.5303444 | 0.08022186  |
| 2   | 0.1215060  | 0.8166656  | 0.60272181  |
| 3   | 0.1670813  | 0.4464050  | -0.75050666 |

The numerical values you get when you try the code on your own machine will not match what we show here. `rnorm` is a random number generator; we shall discuss it later. This is just to show you that you can take any vector, convert it into a matrix, and then convert the matrix into a data frame.

## Factors

A factor is a categorical variable that can take on a limited number of possible values, called "levels". Let's start with a toy example:

```r
f <- factor(c("A", "B", "A"))
print(f)
```

```
> [1] A B A
Levels: A B
```

### Factor vs vector

The difference between vectors and factors is that R stores the values of a factor as integer codes that refer to its levels. Here is another example of a factor, which we open up to see what is inside:

```r
sex <- factor(c("Male", "Female", "Female"), levels = c("Female", "Male"))
unclass(sex)
```

```
> [1] 2 1 1
attr(,"levels")
[1] "Female" "Male"
```

Notice that at the core a factor saves the items as integers. Each integer is the position of the corresponding level: 1 stands for the first level, "Female", and 2 for the second level, "Male". The labels of the levels are stored separately as strings. However, the length of the factor is the number of items in the factor, just as for vectors.

```r
length(sex)
```

```
> [1] 3
```

### Anticipatory levels

It is possible to include anticipatory levels in a factor. The idea is that as more data is added to a factor, new levels may appear. However, it is important to introduce them while defining the factor.
```r
gene_products <- c("enzyme", "hormone", "RNA")
# Pass the levels by name; 'levels <- c(...)' would also create a stray variable
gene_prod_factor <- factor(gene_products,
                           levels = c("enzyme", "hormone", "transcription factors", "RNA"))
print(gene_prod_factor)
```

```
> [1] enzyme  hormone RNA
Levels: enzyme hormone transcription factors RNA
```

Notice that although the factor has only three items, it has four levels. Now we can append a few more inputs to the factor.

```r
gene_prod_factor[4] <- "RNA"
gene_prod_factor[5] <- "transcription factors"
print(gene_prod_factor)
```

```
> [1] enzyme                hormone               RNA                   RNA
[5] transcription factors
Levels: enzyme hormone transcription factors RNA
```

We could insert the string "transcription factors" in the factor because a level with the same string was already introduced while defining the factor. You cannot insert undefined levels in a factor: if you try, you get a warning and the new entry becomes `NA`:

```r
gene_prod_factor[6] <- "structural proteins"
print(gene_prod_factor)
```

```
Warning message:
In `[<-.factor`(`*tmp*`, 6, value = "structural proteins") :
  invalid factor level, NA generated
> [1] enzyme                hormone               RNA                   RNA
[5] transcription factors <NA>
Levels: enzyme hormone transcription factors RNA
```

We shall discuss `NA` at length in a different section. For now just keep in mind that factors do not allow items with a fresh level, once the levels are set.
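The behaviour of anticipatory levels can be checked with a short, self-contained sketch. The name `gp` is hypothetical and the data repeats the gene product example; `nlevels()`, `table()`, `is.na()` and `levels()` are base R functions. The last step, extending the levels of an existing factor before assigning the new value, is not part of the discussion above and is shown only as one possible workaround.

```r
# Hypothetical name 'gp'; same data as the gene product example
gp <- factor(c("enzyme", "hormone", "RNA"),
             levels = c("enzyme", "hormone", "transcription factors", "RNA"))

# Three items but four levels; the unused level is counted as zero
print(length(gp))
print(nlevels(gp))
print(table(gp))

# Assigning a value outside the levels emits a warning and stores NA
gp[4] <- "structural proteins"
print(is.na(gp))

# One possible workaround (an assumption, not from the text above):
# extend the levels first, then assign the value again
levels(gp) <- c(levels(gp), "structural proteins")
gp[4] <- "structural proteins"
print(gp)
```

In this sketch the levels are changed explicitly, so the rule still holds: a value can only be stored in a factor after its level has been defined.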