# Introduction to network data in R # Author: Jake Fisher # This first lab will focus on how to manage data. We will cover: # - loading data & objects # - common data mangement tasks using dplyr & tidyr # - constructing network objects with statnet # - save and recall network (incl. vertex and edge) attributes # Time permitting, we will also discuss: # - managing lists # - functions ##### Loading data & objects ##### # R is an object-oriented language, which puts it somewhere between statistical # computing languages, like SAS, Stata, or SPSS, and traditional programming # languages like Python, Java, or C++. we will focus mainly on using R for # statistical computing. # Everything in R is saved as an object in your R workspace. For example, # let's save the number 10 to an object called "a": a <- 10 # Object names can contain letters, periods, underscores, and numbers. Objects # can contain almost anything. Let's save a new object: R.is.free <- "...but you get what you pay for" # Now we've saved a character string to the object R.is.free . We can print # the value of that object to the screen by typing the name of the object: R.is.free # And we can list the objects present in the workspace with ls() ls() # We can remove (delete) objects from the workspace with rm() rm(a) ls() # It's usually wise to start your scripts off by clearing out the workspace. # You do that with the following command: rm(list = ls()) # Note that we can nest two commands # For a typical data analysis exercise, we usually use a "data.frame". This # works like a rectangular dataset that you might see in any other program. df <- data.frame(x = 1:10, y = 11:20) # 1:10 creates a vector: 1, 2, ..., 10 df # R holds everything in the workspace in memory (RAM). This is similar to # Stata, and different from SAS, which reads in rows one at a time. Unlike # Stata, R can hold multiple datasets in memory at once. (df2 <- data.frame(a = letters[1:10], b = LETTERS[1:10])) # (Aside: letters is a vector of "a", "b", ..., "z", and LETTERS is a vector of # "A", "B", ..., "Z". We took the first 10 elements of each vector for this # data.frame. Wrapping an assignment statement in parentheses will print the # assigned object to the screen. E.g., (a <- "something") will print whatever # is stored in a to the screen.) # Both datasets are held in memory, which will make merges, etc., easy. It will # also mean that we need to refer to a particular dataset when we try to work # with the variables in that dataset. ls() # Most of the time, we will not enter data directly in our R script. Instead, # we will read it in from an external file. # It is often helpful to set the working directory, to tell R where to look for # files. setwd("C:/Users/fishe/Box Sync/Home Folder jcf26/SNH2017/Instructor dropbox") # Note: forward slashes # If you want to check your working directory, you can use: getwd() # All other filenames will be relative to that location. # R has functions to read in CSV files dat <- read.csv("ahs_wpvar.csv", stringsAsFactors = FALSE) # It can also read in Stata .dta files, and SAS .sas7bdat files using # contributed packages (foreign, and sas7bdat, respectively). # If you like to "look" at your data, Rstudio can open a spreadsheet viewer, # but really it's easier to look in the console head(dat) # only prints the first 6 rows # We can also print just the column names using names(), the dimensions using # dim(), or the number of rows or columns using nrow() or ncol() names(dat) dim(dat) nrow(dat) ncol(dat) ##### Data management with dplyr & tidyr ##### # Data management in base R is, frankly, a big headache. We will use a new, # contributed package called dplyr to manage the data. # If you haven't already, install dplyr: # install.packages("dplyr") # This only needs to be done once per computer you use R on. # To use a contributed package, you have to load it using the library command. # This must be done in every R script you run that uses the package. library(dplyr) # dplyr breaks data management into (about) 5 simple tasks: subsetting by rows, # subsetting by columns, sorting, and creating new variables # To subset by rows, we use the filter command. This creates a dataset with # only smokers: smokers <- filter(dat, PSMOKES == 1) head(smokers) # note that PSMOKES is only 1's # To subset by columns, we use the select command. This creates a dataset with # only the columns for grade and smoker grade.smoker <- select(dat, grade, PSMOKES) head(grade.smoker) # Notice there are only 2 columns now # To sort, we use arrange sorted.grade <- arrange(grade.smoker, grade) head(sorted.grade) # To create new variables, we use mutate hs.smokers <- mutate(grade.smoker, hs = grade > 8) head(hs.smokers) # dplyr also provides the %>% (read: "pipe") operator, which allows you to # string these simple commands together. The pipe passes the output of the # previous command to the first argument of the next command, or to a position # that you specify with a period. E.g., if you're sorting the data, these are # equivalent: head(arrange(hs.smokers, desc(grade))) hs.smokers %>% arrange(., desc(grade)) %>% head(.) hs.smokers %>% arrange(desc(grade)) %>% head # I will use the latter construction from now on. Let's recreate hs.smokers # using chained commands: hs.smokers <- dat %>% select(grade, PSMOKES) %>% mutate(hs = grade > 8) # Recoding variables with mutate deserves extra attention, because it's really # important. The structure of a mutate command is # mutate(new_var1 = (vector1), new_var2 = (vector2), ...) # So, to create a new variable, we have to create a vector with the new values # to assign to it. In the case above, grade > 8 created a logical (i.e., # TRUE/FALSE) vector indicating whether each element of the column grade was # greater than 8. # The most common uses for this are: # 1. calculating a new number from an existing number (or numbers), # 2. creating categorical variables from numeric ones # As an example of 1, let's calculate 3 variables: a logged variable, a squared # variable, and an interaction of two numeric variables # To modify a dataset "in place" -- meaning that you do not make a new object # from the dataset, you can use the %<>% operator from the magrittr package. # Install magrittr if you haven't already: # install.packages("magrittr") # Load magrittr library(magrittr) dat %<>% mutate(idg.log = log(IDG_SS), idg.sq = IDG_SS ^ 2, noms.times.grade = totnoms * grade) head(dat) # As an example of 2, let's split IDG_SS into 3 categories. To do that, we will # split nominations into categories. # First, an aside -- let's see what the distribution of IDG_SS looks like. # To look at the distribution, we use ggplot2, a popular visualization package # install.packages("ggplot2") library(ggplot2) qplot(x = IDG_SS, data = dat, geom = "histogram") # Let's do 3 categories: 0 (isolate), 1 - 10 (average), 11 - max (popular). # We'll use derivedFactor from the mosaic package for this. # install.packages("mosaic") library(mosaic) dat %<>% mutate(popularity = derivedFactor( isolate = IDG_SS == 0, average = IDG_SS < 11, .default = "high", .method = "first")) # Note that derivedFactor is quite slow on large datasets. Other approaches to # this include recode from the car package, or cut from base R. # Let's check and make sure we did that right: with(dat, table(popularity, IDG_SS)) # What does the distribution of this look like? qplot(x = popularity, data = dat, geom = "bar") # Fun with ggplot... p <- qplot(x = popularity, y = IDG_SS, data = dat, geom = "boxplot") + coord_flip() p p + scale_y_sqrt() # sqrt instead of log because of 0 values p + theme_bw() # dplyr also contains a number of other useful functions -- like the left_join, # right_join, inner_join, etc., functions for merges -- that I will not cover # here. Read the vignette for dplyr to get a better idea of how to use these # functions. vignette("introduction", package = "dplyr") vignette(package = "dplyr") # view a list of all the vignettes for dplyr # (Most packages have a vignette to explain how to use them. It's often worth # reading.) ##### Creating a network object ##### # Following best practices, we will begin by setting up the workspace: rm(list = ls()) setwd("C:/Users/fishe/Box Sync/Home Folder jcf26/SNH2017/Instructor dropbox") # Note: forward slashes library(statnet) # lots of messages... # Statnet can create a network object from any of the common network data # structures: edgelists, adjacency lists, and adjacency matrices. # Let's start with an example of an adjacency matrix. Here I'll define a # network where a vertex is a member of my immediate family, and an edge # indicates "parent": people <- c("mom", "dad", "sister", "me") (adj.mat <- matrix(0, nrow = 4, ncol = 4)) # create empty matrix rownames(adj.mat) <- people colnames(adj.mat) <- people adj.mat["mom", "sister"] <- 1 adj.mat["mom", "me"] <- 1 adj.mat["dad", "sister"] <- 1 adj.mat["dad", "me"] <- 1 adj.mat # completed adjacency matrix # We can create a network object using the network() function (family <- network(adj.mat, matrix.type = "adjacency")) # Network objects have a number of special functions for them. We will cover # these in the next section. For now, let's plot the network plot(family) # not terribly informative # Note that we could have expressed the information above more compactly if we # just listed the cell values that should be 1 (i.e., mom -> sister, mom -> me, # etc.) and assumed everything else should be 0. That's called an edgelist. # We can generate a network from an edgelist, too: (edges <- data.frame(from = c(rep("mom", 2), rep("dad", 2)), to = rep(c("sister", "me"), 2))) family.2 <- network(edges, matrix.type = "edgelist") # We can show the adjacency matrix from the network object using as.matrix as.matrix(family.2) # order of the rows is different, but network is the same # And finally, we could compress this even more by holding one the sender in # one column, and the receiver in subsequent columns. This is called an # adjacency list. (adj.list <- data.frame(parent = c("mom", "dad"), child1 = rep("sister", 2), child2 = rep("me", 2))) # Statnet doesn't handle adjacency lists well, but they are the most common type # of raw survey network data. Fortunately, we can convert from an adjacency # list to an edgelist easily, with the tidyr package library(tidyr) library(dplyr) # for the pipe operator library(magrittr) # for the %<>% operator adj.list %>% gather(key = "nomination.number", value = "alter", -parent) # Gather strings your data out into a long column. We can select which # variables to string out -- in this case, we wanted to keep the column for # parent separate, and to repeat the values for each child. # If you drop the middle column, you have an edgelist... adj.list %>% gather(key = "nomination.number", value = "alter", -parent) %>% select(-nomination.number) # ...which you can pass to network! adj.list %>% # note that we didn't save the output gather(key = "nomination.number", value = "alter", -parent) %>% select(-nomination.number) %>% network(., matrix.type = "edgelist") # That's great, but how do we do it with a real dataset? Let's try with the # faux Add Health data. dat <- read.csv("ahs_wpvar.csv", stringsAsFactors = F) # So what's in here, anyway? names(dat) # First of all, there are several communities in these data. Let's pick the # first one: cmty1 <- dat %>% filter(commcnt == 1) # Notice that the "friend" variables contain "fnid", and no other variables do. # We can use that to create an edgelist: el <- cmty1 %>% select(ego_nid, contains("fnid")) %>% gather(key = "nomination", value = "alter", -ego_nid) head(el) # hmmmm.... contains 99999 and NA values. Let's standardize those el %>% arrange(ego_nid) %>% head el %<>% mutate(alter = ifelse(alter == 99999, NA, alter)) # Now you would think this would work as a network... el %>% select(-nomination) %>% network(., matrix.type = "edgelist") # but statnet also can't really deal with missing values in the edgelist. So # the best way to do this is to construct an empty network first, and then # add the edges from the edgelist. This has the advantage of including isolates # too. # To do that, first we get a list of the people who are involved in the network: # Get unique list of people in the network (ah.people <- with(el, unique(c(ego_nid, alter)))) # this would be all the people, including those out of school # ah.people <- unique(dat$ego_nid) # this is only people who took the survey el %<>% filter(!is.na(alter)) # drop missing cases # We're not going to do that, because it includes a lot of people who didn't # take the survey. # Now create an empty network add.health <- network.initialize(n = length(ah.people)) # statnet adds edges by index values, not by names of variables. So, to make # this work, we have to create an index value of 1, 2, ..., N, where N is the # number of people in the network el %<>% mutate(ego.idx = match(ego_nid, ah.people), alter.idx = match(alter, ah.people)) # Finally, add the edges from the edgelist add.edges(add.health, el$ego.idx, el$alter.idx) # Now, as a quick gut check, let's plot it: set.seed(919) plot(add.health) # (As an aside, note that you could create a list of all the networks using # lapply...) ##### Save and recall network attributes ##### # One of the benefits of statnet is that it's possible to save attributes for # different pieces of the network. We can save vertex attributes, like whether # the person is a smoker, edge attributes, like whether the ties is homophilous, # or network attributes, like the name of the school, or the survey wave. The # procedure for this is simple; let's illustrate a few. # First, network attributes are characteristics of the network as whole. Set # them with %n% add.health %n% "cmty" <- 1 add.health # note "cmty" in there # Second, vertex attributes are characteristics of each vertex. Set them with # %v% add.health %v% "sex" <- cmty1$sex # we can do this because vertices and data # are sorted in the same way # A special case of vertex attributes is vertex names, which you set with # network.vertex.names() network.vertex.names(add.health) <- cmty1$ego_nid # All of these methods can also be used to retrive network attributes add.health %v% "sex" # Third, edge attributes are characteristics of a pair of vertices. You can set # them with %e%, but the indexing is more complicated, and I will not cover it # here. # These are useful for plotting and advanced statistical methods. plot(add.health, vertex.col = "sex") ##### Managing lists ##### # data.frames are the R equivalent of a rectangular dataset in SAS, Stata, or # SPSS. R's strength, however, is in its flexibility in the types of objects # it can store. That's why it has become so popular for dealing with complex # data sources, like networks or text. The data structure underlying all of # these objects is a list. # A list is exactly what it sounds like -- it is an ordered collection of # objects. Those objects can be anything, including other lists! my.list <- list(letters[1:10], dat, "all the things") my.list # We can refer to items in a list by number, using double brackets my.list[[2]] # dat # We can take more than one at a time using single brackets: my.list[1:2] # letters[1:10] and dat # Or we can name the elements in our list, and then refer to items by name names(my.list) <- c("some_letters", "dat", "all_things") my.list$all_things my.list[["all_things"]] # same # Why is this important? For starters, a data.frame is really just stored as # a list of columns dat$grade # prints just the column "grade" str(dat) # view the data structure for dat # This is helpful if we want to calculate values on that column mean(dat$grade) # It's also useful, because R can loop (i.e., perform the same action # repeatedly) easily with lists. # A typical loop looks something like this: for (i in 1:3) { print(my.list[[i]]) # prints each element of my.list, one at a time } # But that's speaking R with a strong C (or SAS) accent. The R way of doing # this is to use the *apply family of functions. Each one of them takes a # function, and applies it to each element of the list. They are named with a # prefix that indicates what kind of output you'll get -- lapply returns a list, # and sapply returns a simplified object. # Let's replicate the example above. lapply(my.list, print) # Now let's take a more practical example -- what if I wanted to calculate the # mean for each of the columns in dat? lapply(dat, mean) sapply(dat, mean) # note that this simplifies it into a vector sapply(dat, mean, na.rm = T) # we can pass addition arguments to the function dat.means <- sapply(dat, mean, na.rm = T) # save output mean(dat$sdummy) # Equivalent dat.means <- c() # create empty vector for storage for (i in 1:ncol(dat)) { dat.means[i] <- mean(dat[[i]], na.rm = T) } # Notice the single brackets on the output. We're refering to a specific # element of a vector by its index (this is called indexing). Indexing is # another way to do recoding. # For example, let's recode grade == 0 to NA (R's special character for missing) dat[, "grade"] # prints the column for grade dat$grade # equivalent dat$grade == 0 # prints a logical (T/F) vector indicating when grade is 0 which(dat$grade == 0) # prints the index of which values are TRUE dat[dat$grade == 0, "grade"] # prints just the elements of "grade" that are 0 dat$grade[dat$grade == 0] # equivalent dat[dat$grade == 0, "grade"] <- NA # resets those values to NA dat$grade[dat$grade == 0] <- NA # equivalent # You would repeat this for each category. (This is cumbersome compared to # dplyr!) # The *apply functions are useful whenever you want to loop through several # things. For example, here's how you would run several different model # specifications for a linear model. # One linear model looks like this: lm(POP_BEH ~ IDG_SS, data = dat) # (For the record, a logistic regression looks like this:) glm(PSMOKES ~ IDG_SS, data = dat, family = binomial(link = "logit")) # Let's specify a list of models: models <- c( # Again, c stands for 'closure' -- it's a way of making a vector POP_BEH ~ IDG_SS + grade, POP_BEH ~ grade, POP_BEH ~ sex ) # Now, to run all the models, we use lm with lapply (fits <- lapply(models, lm, data = dat)) # Print the summaries for both the models: lapply(fits, summary) # Ta-da! The object that we saved is a list of fitted model outputs. You could # similarly run the model using different subsets of the data: sex.models <- lapply(1:2, function(x) lm(POP_BEH ~ IDG_SS, data = dat, subset = sex == x)) # The next section will explain how that last lapply statement worked. ##### Functions ##### # The other reason that R has become so popular is that it is very easy to build # your own functions. Unlike SAS macros (or, to a lesser extent, Stata ado # files), custom R functions are treated just like the built-in functions, and # have a centralized distribution system (CRAN). This has made it very popular # for statisticians, who use it to distribute software to do cutting edge # analyses. # You can, and sometimes should, build your own functions to make your code # easier to read and maintain. The underlying goal is DRY: don't repeat # yourself. If you find yourself copying and pasting code often, or if you # have trouble parsing what a nested set of commands means, you should probably # be writing a function. # At their core, functions take an input, and produce an output. We call input # the set of arguments that the function takes, and we call the output the # return value of the function. # Here's a simple function to calculate the mean manually: newMean <- function(x) { # Calculates the mean of a numeric vector # # Args: # x: a numeric vector # # Returns: # the arithmatic mean of x return(sum(x) / length(x)) } newMean(1:10) mean(1:10) # same # Because R is open-source, you can view the code for almost any function. For # example, look at the code for lm: lm # Virtually all functions come with a help file, too. Access that with: ?lm # shows arguments, return values, etc. # Of course, it's typically unwise to recreate existing functions, if for no # other reason than that it's time consuming. Creating functions is useful # because you can create custom output. For example, let's create a function # that produces a PROC MEANS style summary of a numeric variable (mean, sd, min, # max) procMeans <- function(x) { # Creates a data.frame with the mean, sd, min, and max of a numeric variable # # Args: # x: a numeric vector # # Returns: # a data.frame, with columns for the mean, sd, min, and max of x return(data.frame(mean = mean(x), sd = sd(x), min = min(x), max = max(x))) } # Running this on one variable of dat: procMeans(dat$IDG_SS) # But, since a data.frame is just a list of columns, we can use lapply to loop # through a bunch of columns at once: lapply(dat[c("IDG_SS", "POP_BEH_SS", "PSMOKES", "grade")], procMeans) # Not quite what we wanted -- let's pipe the result to the bind_rows function # from dplyr, which squashes together the data.frames by rows lapply(dat[c("IDG_SS", "POP_BEH_SS", "PSMOKES", "grade")], procMeans) %>% bind_rows # We will see that this method is useful when we are calculating a series of # statistics on networks, or on individuals in networks.