# Introduction to R -- data management
# Author: Jake Fisher

# This first lab will focus on how to manage data. We will cover:
# - loading data & objects
# - common data management tasks using dplyr & tidyr
# - managing lists
# - functions

##### Loading data & objects #####

# R is an object-oriented language, which puts it somewhere between statistical
# computing languages, like SAS, Stata, or SPSS, and traditional programming
# languages like Python, Java, or C++. We will focus mainly on using R for
# statistical computing.

# Everything in R is saved as an object in your R workspace. For example,
# let's save the number 10 to an object called "a":
a <- 10

# Object names can contain letters, periods, underscores, and numbers. Objects
# can contain almost anything. Let's save a new object:
R.is.free <- "...but you get what you pay for"

# Now we've saved a character string to the object R.is.free. We can print
# the value of that object to the screen by typing the name of the object:
R.is.free

# And we can list the objects present in the workspace with ls()
ls()

# We can remove (delete) objects from the workspace with rm()
rm(a)
ls()

# It's usually wise to start your scripts off by clearing out the workspace.
# You do that with the following command:
rm(list = ls())  # Note that we can nest two commands

# For a typical data analysis exercise, we usually use a "data.frame". This
# works like a rectangular dataset that you might see in any other program.
df <- data.frame(x = 1:10, y = 11:20)  # 1:10 creates a vector: 1, 2, ..., 10
df

# R holds everything in the workspace in memory (RAM). This is similar to
# Stata, and different from SAS, which reads in rows one at a time. Unlike
# Stata, R can hold multiple datasets in memory at once.
(df2 <- data.frame(a = letters[1:10], b = LETTERS[1:10]))

# (Aside: letters is a vector of "a", "b", ..., "z", and LETTERS is a vector of
# "A", "B", ..., "Z". We took the first 10 elements of each vector for this
# data.frame.
# Wrapping an assignment statement in parentheses will print the
# assigned object to the screen. E.g., (a <- "something") will print whatever
# is stored in a to the screen.)

# Both datasets are held in memory, which will make merges, etc., easy. It will
# also mean that we need to refer to a particular dataset when we try to work
# with the variables in that dataset.
ls()

# Most of the time, we will not enter data directly in our R script. Instead,
# we will read it in from an external file.

# It is often helpful to set the working directory, to tell R where to look for
# files.
setwd("C:/Users/fishe/Dropbox/SN&H Files/SampleData")  # Note: forward slashes

# If you want to check your working directory, you can use:
getwd()

# All other filenames will be relative to that location.

# R has functions to read in CSV files
dat <- read.csv("ahs_wpvar.csv", stringsAsFactors = FALSE)

# It can also read in Stata .dta files, and SAS .sas7bdat files using
# contributed packages (foreign, and sas7bdat, respectively).

# If you like to "look" at your data, RStudio can open a spreadsheet viewer,
# but really it's easier to look in the console
head(dat)  # only prints the first 6 rows

# We can also print just the column names using names(), the dimensions using
# dim(), or the number of rows or columns using nrow() or ncol()
names(dat)
dim(dat)
nrow(dat)
ncol(dat)

##### Data management with dplyr & tidyr #####

# Data management in base R is, frankly, a big headache. We will use a
# contributed package called dplyr to manage the data.

# If you haven't already, install dplyr. This only needs to be done once per
# computer you use R on:
install.packages("dplyr")

# To use a contributed package, you have to load it using the library command.
# This must be done in every R script you run that uses the package.
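# (A hedged aside, not part of the original lab: you can guard the install so
# the script runs cleanly on a fresh machine without reinstalling every time.
# requireNamespace returns TRUE when the package is already available.)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}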
library(dplyr)

# dplyr breaks data management into (about) 5 simple tasks: subsetting by rows,
# subsetting by columns, sorting, creating new variables, and summarizing

# To subset by rows, we use the filter command. This creates a dataset with
# only smokers:
smokers <- filter(dat, PSMOKES == 1)
head(smokers)  # note that PSMOKES is only 1's

# To subset by columns, we use the select command. This creates a dataset with
# only the columns for grade and smoker
grade.smoker <- select(dat, grade, PSMOKES)
head(grade.smoker)  # Notice there are only 2 columns now

# To sort, we use arrange
sorted.grade <- arrange(grade.smoker, grade)
head(sorted.grade)

# To create new variables, we use mutate
hs.smokers <- mutate(grade.smoker, hs = grade > 8)
head(hs.smokers)

# dplyr also provides the %>% (read: "pipe") operator, which allows you to
# string these simple commands together. The pipe passes the output of the
# previous command to the first argument of the next command, or to a position
# that you specify with a period. E.g., if you're sorting the data, these are
# equivalent:
head(arrange(hs.smokers, desc(grade)))
hs.smokers %>% arrange(., desc(grade)) %>% head(.)
hs.smokers %>% arrange(desc(grade)) %>% head

# I will use the latter construction from now on. Let's recreate hs.smokers
# using chained commands:
hs.smokers <- dat %>%
  select(grade, PSMOKES) %>%
  mutate(hs = grade > 8)

# Recoding variables with mutate deserves extra attention, because it's really
# important. The structure of a mutate command is
# mutate(new_var1 = (vector1), new_var2 = (vector2), ...)
# So, to create a new variable, we have to create a vector with the new values
# to assign to it. In the case above, grade > 8 created a logical (i.e.,
# TRUE/FALSE) vector indicating whether each element of the column grade was
# greater than 8.

# The most common uses for this are:
# 1. calculating a new number from an existing number (or numbers),
# 2.
#    creating categorical variables from numeric ones

# As an example of 1, let's calculate 3 variables: a logged variable, a squared
# variable, and an interaction of two numeric variables

# To modify a dataset "in place" -- meaning that you do not create a new object
# from the dataset -- you can use the %<>% operator from the magrittr package.

# Install magrittr if you haven't already:
# install.packages("magrittr")

# Load magrittr
library(magrittr)

dat %<>%
  mutate(idg.log = log(IDG_SS),
         idg.sq = IDG_SS ^ 2,
         noms.times.grade = totnoms * grade)
head(dat)

# As an example of 2, let's split IDG_SS into 3 categories. To do that, we will
# split nominations into categories.

# First, an aside -- let's see what the distribution of IDG_SS looks like.
# To look at the distribution, we use ggplot2, a popular visualization package
# install.packages("ggplot2")
library(ggplot2)
qplot(x = IDG_SS, data = dat, geom = "histogram")

# Let's do 3 categories: 0 (isolate), 1 - 10 (average), 11 - max (popular).
# We'll use derivedFactor from the mosaic package for this.
# install.packages("mosaic")
library(mosaic)
dat %<>%
  mutate(popularity = derivedFactor(
    isolate = IDG_SS == 0,
    average = IDG_SS < 11,
    .default = "high",
    .method = "first"))

# Note that derivedFactor is quite slow on large datasets. Other approaches to
# this include recode from the car package, or cut from base R.

# Let's check and make sure we did that right:
with(dat, table(popularity, IDG_SS))

# What does the distribution of this look like?
qplot(x = popularity, data = dat, geom = "bar")

# Fun with ggplot...
p <- qplot(x = popularity, y = IDG_SS, data = dat, geom = "boxplot") +
  coord_flip()
p
p + scale_y_sqrt()  # sqrt instead of log because of 0 values
p + theme_bw()

# dplyr also contains a number of other useful functions -- like the left_join,
# right_join, inner_join, etc., functions for merges -- that I will not cover
# here. Read the vignette for dplyr to get a better idea of how to use these
# functions.
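# (A quick sketch of one of those joins, on toy data.frames invented for
# illustration: left_join keeps every row of its first argument and fills in
# matching columns from the second.)
band <- data.frame(name = c("Mick", "John"), band = c("Stones", "Beatles"))
inst <- data.frame(name = c("John", "Keith"), plays = c("guitar", "bass"))
left_join(band, inst, by = "name")  # Mick gets NA for plays; Keith is dropped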
vignette("introduction", package = "dplyr")
vignette(package = "dplyr")  # view a list of all the vignettes for dplyr

# (Most packages have a vignette to explain how to use them. It's often worth
# reading.)

##### Managing lists #####

# data.frames are the R equivalent of a rectangular dataset in SAS, Stata, or
# SPSS. R's strength, however, is in its flexibility in the types of objects
# it can store. That's why it has become so popular for dealing with complex
# data sources, like networks or text. The data structure underlying all of
# these objects is a list.

# A list is exactly what it sounds like -- it is an ordered collection of
# objects. Those objects can be anything, including other lists!
my.list <- list(letters[1:10], dat, "all the things")
my.list

# We can refer to items in a list by number, using double brackets
my.list[[2]]  # dat

# We can take more than one at a time using single brackets:
my.list[1:2]  # letters[1:10] and dat

# Or we can name the elements in our list, and then refer to items by name
names(my.list) <- c("some_letters", "dat", "all_things")
my.list$all_things
my.list[["all_things"]]  # same

# Why is this important? For starters, a data.frame is really just stored as
# a list of columns
dat$grade  # prints just the column "grade"
str(dat)   # view the data structure for dat

# This is helpful if we want to calculate values on that column
mean(dat$grade)

# It's also useful because R can loop (i.e., perform the same action
# repeatedly) easily with lists.

# A typical loop looks something like this:
for (i in 1:3) {
  print(my.list[[i]])  # prints each element of my.list, one at a time
}

# But that's speaking R with a strong C (or SAS) accent. The R way of doing
# this is to use the *apply family of functions. Each one of them takes a
# function, and applies it to each element of the list. They are named with a
# prefix that indicates what kind of output you'll get -- lapply returns a list,
# and sapply returns a simplified object.
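# (A tiny self-contained illustration of that naming convention; the toy list
# here is invented for the example.)
toy.sums <- list(a = 1:3, b = 4:6)
lapply(toy.sums, sum)  # returns a list: $a is 6, $b is 15
sapply(toy.sums, sum)  # returns a named vector: a = 6, b = 15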
# Let's replicate the example above.
lapply(my.list, print)

# Now let's take a more practical example -- what if I wanted to calculate the
# mean for each of the columns in dat?
lapply(dat, mean)
sapply(dat, mean)  # note that this simplifies it into a vector
sapply(dat, mean, na.rm = T)  # we can pass additional arguments to the function
dat.means <- sapply(dat, mean, na.rm = T)  # save output
mean(dat$sdummy)

# Equivalent
dat.means <- c()  # create empty vector for storage
for (i in 1:ncol(dat)) {
  dat.means[i] <- mean(dat[[i]], na.rm = T)
}

# Notice the single brackets on the output. We're referring to a specific
# element of a vector by its index (this is called indexing). Indexing is
# another way to do recoding.

# For example, let's recode grade == 0 to NA (R's special value for missing)
dat[, "grade"]  # prints the column for grade
dat$grade       # equivalent
dat$grade == 0  # prints a logical (T/F) vector indicating when grade is 0
which(dat$grade == 0)  # prints the index of which values are TRUE
dat[dat$grade == 0, "grade"]  # prints just the elements of "grade" that are 0
dat$grade[dat$grade == 0]     # equivalent
dat[dat$grade == 0, "grade"] <- NA  # resets those values to NA
dat$grade[dat$grade == 0] <- NA     # equivalent

# You would repeat this for each category. (This is cumbersome compared to
# dplyr!)

# The *apply functions are useful whenever you want to loop through several
# things. For example, here's how you would run several different model
# specifications for a linear model.
# One linear model looks like this:
lm(POP_BEH ~ IDG_SS, data = dat)

# (For the record, a logistic regression looks like this:)
glm(PSMOKES ~ IDG_SS, data = dat, family = binomial(link = "logit"))

# Let's specify a list of models:
models <- c(  # c stands for 'combine' -- here it combines the formulas
  POP_BEH ~ IDG_SS + grade,
  POP_BEH ~ grade,
  POP_BEH ~ sex
)

# Now, to run all the models, we use lm with lapply
(fits <- lapply(models, lm, data = dat))

# Print the summaries for all the models:
lapply(fits, summary)

# Ta-da! The object that we saved is a list of fitted model outputs. You could
# similarly run the model using different subsets of the data:
sex.models <- lapply(1:2, function(x)
  lm(POP_BEH ~ IDG_SS, data = dat, subset = sex == x))

# The next section will explain how that last lapply statement worked.

##### Functions #####

# The other reason that R has become so popular is that it is very easy to build
# your own functions. Unlike SAS macros (or, to a lesser extent, Stata ado
# files), custom R functions are treated just like the built-in functions, and
# have a centralized distribution system (CRAN). This has made it very popular
# with statisticians, who use it to distribute software for cutting-edge
# analyses.

# You can, and sometimes should, build your own functions to make your code
# easier to read and maintain. The underlying goal is DRY: don't repeat
# yourself. If you find yourself copying and pasting code often, or if you
# have trouble parsing what a nested set of commands means, you should probably
# be writing a function.

# At their core, functions take an input and produce an output. We call the
# input the set of arguments that the function takes, and we call the output
# the return value of the function.
# Here's a simple function to calculate the mean manually:
newMean <- function(x) {
  # Calculates the mean of a numeric vector
  #
  # Args:
  #   x: a numeric vector
  #
  # Returns:
  #   the arithmetic mean of x
  return(sum(x) / length(x))
}

newMean(1:10)
mean(1:10)  # same

# Because R is open-source, you can view the code for almost any function. For
# example, look at the code for lm:
lm

# Virtually all functions come with a help file, too. Access that with:
?lm  # shows arguments, return values, etc.

# Of course, it's typically unwise to recreate existing functions, if for no
# other reason than that it's time consuming. Creating functions is useful
# because you can create custom output. For example, let's create a function
# that produces a PROC MEANS style summary of a numeric variable (mean, sd,
# min, max)
procMeans <- function(x) {
  # Creates a data.frame with the mean, sd, min, and max of a numeric variable
  #
  # Args:
  #   x: a numeric vector
  #
  # Returns:
  #   a data.frame, with columns for the mean, sd, min, and max of x
  return(data.frame(mean = mean(x), sd = sd(x), min = min(x), max = max(x)))
}

# Running this on one variable of dat:
procMeans(dat$IDG_SS)

# But, since a data.frame is just a list of columns, we can use lapply to loop
# through a bunch of columns at once:
lapply(dat[c("IDG_SS", "POP_BEH_SS", "PSMOKES", "grade")], procMeans)

# Not quite what we wanted -- let's pipe the result to the bind_rows function
# from dplyr, which squashes together the data.frames by rows
lapply(dat[c("IDG_SS", "POP_BEH_SS", "PSMOKES", "grade")], procMeans) %>%
  bind_rows

# We will see that this method is useful when we are calculating a series of
# statistics on networks, or on individuals in networks.
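# (To close, the same lapply + bind_rows pattern on a toy data.frame, so you
# can run it without the course dataset; the data here are invented.)
toy.df <- data.frame(a = 1:5, b = 6:10)
lapply(toy.df, procMeans) %>% bind_rows  # one summary row per column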