DATA MANAGEMENT

Download R source file


              # Introduction to network data in R
# Author: Jake Fisher

# This first lab will focus on how to manage data.  We will cover:
# - loading data & objects
# - common data mangement tasks using dplyr & tidyr
# - constructing network objects with statnet
# - save and recall network (incl. vertex and edge) attributes

# Time permitting, we will also discuss:
# - managing lists
# - functions

##### Loading data & objects #####
# R is an object-oriented language, which puts it somewhere between statistical
# computing languages, like SAS, Stata, or SPSS, and traditional programming
# languages like Python, Java, or C++.  we will focus mainly on using R for
# statistical computing.

# Everything in R is saved as an object in your R workspace.  For example,
# let's save the number 10 to an object called "a":
a <- 10

# Object names can contain letters, periods, underscores, and numbers.  Objects
# can contain almost anything.  Let's save a new object:
R.is.free <- "...but you get what you pay for"

# Now we've saved a character string to the object R.is.free .  We can print
# the value of that object to the screen by typing the name of the object:
R.is.free

# And we can list the objects present in the workspace with ls()
ls()

# We can remove (delete) objects from the workspace with rm()
rm(a)
ls()

# It's usually wise to start your scripts off by clearing out the workspace.
# You do that with the following command:
rm(list = ls())  # Note that we can nest two commands

# For a typical data analysis exercise, we usually use a "data.frame".  This
# works like a rectangular dataset that you might see in any other program.
df <- data.frame(x = 1:10, y = 11:20)  # 1:10 creates a vector: 1, 2, ..., 10
df

# R holds everything in the workspace in memory (RAM).  This is similar to 
# Stata, and different from SAS, which reads in rows one at a time.  Unlike
# Stata, R can hold multiple datasets in memory at once.
(df2 <- data.frame(a = letters[1:10], b = LETTERS[1:10]))

# (Aside: letters is a vector of "a", "b", ..., "z", and LETTERS is a vector of
# "A", "B", ..., "Z".  We took the first 10 elements of each vector for this
# data.frame.  Wrapping an assignment statement in parentheses will print the
# assigned object to the screen.  E.g., (a <- "something") will print whatever
# is stored in a to the screen.)

# Both datasets are held in memory, which will make merges, etc., easy.  It will
# also mean that we need to refer to a particular dataset when we try to work
# with the variables in that dataset.
ls()

# Most of the time, we will not enter data directly in our R script.  Instead,
# we will read it in from an external file.

# It is often helpful to set the working directory, to tell R where to look for
# files.
setwd("C:/Users/fishe/Box Sync/Home Folder jcf26/SNH2017/Instructor dropbox")  # Note: forward slashes

# If you want to check your working directory, you can use:
getwd()

# All other filenames will be relative to that location.
# R has functions to read in CSV files
dat <- read.csv("ahs_wpvar.csv", stringsAsFactors = FALSE)

# It can also read in Stata .dta files, and SAS .sas7bdat files using 
# contributed packages (foreign, and sas7bdat, respectively).

# If you like to "look" at your data, Rstudio can open a spreadsheet viewer,
# but really it's easier to look in the console
head(dat)  # only prints the first 6 rows

# We can also print just the column names using names(), the dimensions using
# dim(), or the number of rows or columns using nrow() or ncol()
names(dat)
dim(dat)
nrow(dat)
ncol(dat)

##### Data management with dplyr & tidyr #####

# Data management in base R is, frankly, a big headache.  We will use a new,
# contributed package called dplyr to manage the data.

# If you haven't already, install dplyr:
# install.packages("dplyr")
# This only needs to be done once per computer you use R on.

# To use a contributed package, you have to load it using the library command.
# This must be done in every R script you run that uses the package.
library(dplyr)

# dplyr breaks data management into (about) 5 simple tasks: subsetting by rows,
# subsetting by columns, sorting, and creating new variables

# To subset by rows, we use the filter command.  This creates a dataset with
# only smokers:
smokers <- filter(dat, PSMOKES == 1)

head(smokers)  # note that PSMOKES is only 1's

# To subset by columns, we use the select command.  This creates a dataset with
# only the columns for grade and smoker
grade.smoker <- select(dat, grade, PSMOKES)

head(grade.smoker)  # Notice there are only 2 columns now

# To sort, we use arrange
sorted.grade <- arrange(grade.smoker, grade)

head(sorted.grade)

# To create new variables, we use mutate
hs.smokers <- mutate(grade.smoker, hs = grade > 8)

head(hs.smokers)

# dplyr also provides the %>% (read: "pipe") operator, which allows you to
# string these simple commands together.  The pipe passes the output of the
# previous command to the first argument of the next command, or to a position
# that you specify with a period.  E.g., if you're sorting the data, these are
# equivalent:
head(arrange(hs.smokers, desc(grade)))
hs.smokers %>% arrange(., desc(grade)) %>% head(.)
hs.smokers %>% arrange(desc(grade)) %>% head

# I will use the latter construction from now on.  Let's recreate hs.smokers
# using chained commands:
hs.smokers <- dat %>%
  select(grade, PSMOKES) %>%
  mutate(hs = grade > 8)

# Recoding variables with mutate deserves extra attention, because it's really
# important.  The structure of a mutate command is 
# mutate(new_var1 = (vector1), new_var2 = (vector2), ...)
# So, to create a new variable, we have to create a vector with the new values
# to assign to it.  In the case above, grade > 8 created a logical (i.e., 
# TRUE/FALSE) vector indicating whether each element of the column grade was 
# greater than 8.

# The most common uses for this are:
# 1. calculating a new number from an existing number (or numbers),
# 2. creating categorical variables from numeric ones

# As an example of 1, let's calculate 3 variables: a logged variable, a squared
# variable, and an interaction of two numeric variables
# To modify a dataset "in place" -- meaning that you do not make a new object
# from the dataset, you can use the %<>% operator from the magrittr package.

# Install magrittr if you haven't already:
# install.packages("magrittr")
# Load magrittr
library(magrittr)
dat %<>%
  mutate(idg.log = log(IDG_SS), idg.sq = IDG_SS ^ 2, 
         noms.times.grade = totnoms * grade)

head(dat)


# As an example of 2, let's split IDG_SS into 3 categories.  To do that, we will
# split nominations into categories.

# First, an aside -- let's see what the distribution of IDG_SS looks like.
# To look at the distribution, we use ggplot2, a popular visualization package
# install.packages("ggplot2")
library(ggplot2)
qplot(x = IDG_SS, data = dat, geom = "histogram")

# Let's do 3 categories: 0 (isolate), 1 - 10 (average), 11 - max (popular).
# We'll use derivedFactor from the mosaic package for this.
# install.packages("mosaic")
library(mosaic)
dat %<>%
  mutate(popularity = derivedFactor(
    isolate = IDG_SS == 0,
    average = IDG_SS < 11,
    .default = "high",
    .method = "first"))

# Note that derivedFactor is quite slow on large datasets.  Other approaches to
# this include recode from the car package, or cut from base R.

# Let's check and make sure we did that right:
with(dat, table(popularity, IDG_SS))

# What does the distribution of this look like?
qplot(x = popularity, data = dat, geom = "bar")

# Fun with ggplot...
p <- qplot(x = popularity, y = IDG_SS, data = dat, geom = "boxplot") +
  coord_flip()

p

p + scale_y_sqrt()  # sqrt instead of log because of 0 values

p + theme_bw()

# dplyr also contains a number of other useful functions -- like the left_join,
# right_join, inner_join, etc., functions for merges -- that I will not cover
# here.  Read the vignette for dplyr to get a better idea of how to use these
# functions.
vignette("introduction", package = "dplyr")
vignette(package = "dplyr")  # view a list of all the vignettes for dplyr

# (Most packages have a vignette to explain how to use them.  It's often worth
# reading.)

##### Creating a network object #####

# Following best practices, we will begin by setting up the workspace:
rm(list = ls())
setwd("C:/Users/fishe/Box Sync/Home Folder jcf26/SNH2017/Instructor dropbox")  # Note: forward slashes
library(statnet)  # lots of messages...

# Statnet can create a network object from any of the common network data
# structures: edgelists, adjacency lists, and adjacency matrices.

# Let's start with an example of an adjacency matrix.  Here I'll define a
# network where a vertex is a member of my immediate family, and an edge
# indicates "parent":
people <- c("mom", "dad", "sister", "me")

(adj.mat <- matrix(0, nrow = 4, ncol = 4))  # create empty matrix
rownames(adj.mat) <- people
colnames(adj.mat) <- people

adj.mat["mom", "sister"] <- 1
adj.mat["mom", "me"] <- 1
adj.mat["dad", "sister"] <- 1
adj.mat["dad", "me"] <- 1

adj.mat  # completed adjacency matrix

# We can create a network object using the network() function
(family <- network(adj.mat, matrix.type = "adjacency"))

# Network objects have a number of special functions for them.  We will cover
# these in the next section.  For now, let's plot the network
plot(family)  # not terribly informative

# Note that we could have expressed the information above more compactly if we
# just listed the cell values that should be 1 (i.e., mom -> sister, mom -> me,
# etc.) and assumed everything else should be 0.  That's called an edgelist.
# We can generate a network from an edgelist, too:
(edges <- data.frame(from = c(rep("mom", 2), rep("dad", 2)),
                     to = rep(c("sister", "me"), 2)))
family.2 <- network(edges, matrix.type = "edgelist")

# We can show the adjacency matrix from the network object using as.matrix
as.matrix(family.2)  # order of the rows is different, but network is the same

# And finally, we could compress this even more by holding one the sender in
# one column, and the receiver in subsequent columns.  This is called an
# adjacency list.
(adj.list <- data.frame(parent = c("mom", "dad"), child1 = rep("sister", 2),
                        child2 = rep("me", 2)))

# Statnet doesn't handle adjacency lists well, but they are the most common type
# of raw survey network data.  Fortunately, we can convert from an adjacency
# list to an edgelist easily, with the tidyr package
library(tidyr)
library(dplyr)  # for the pipe operator
library(magrittr)  # for the %<>% operator

adj.list %>%
  gather(key = "nomination.number", value = "alter", -parent)

# Gather strings your data out into a long column.  We can select which
# variables to string out -- in this case, we wanted to keep the column for
# parent separate, and to repeat the values for each child.

# If you drop the middle column, you have an edgelist...
adj.list %>%
  gather(key = "nomination.number", value = "alter", -parent) %>%
  select(-nomination.number)

# ...which you can pass to network!
adj.list %>%  # note that we didn't save the output
  gather(key = "nomination.number", value = "alter", -parent) %>%
  select(-nomination.number) %>%
  network(., matrix.type = "edgelist")

# That's great, but how do we do it with a real dataset?  Let's try with the 
# faux Add Health data.
dat <- read.csv("ahs_wpvar.csv", stringsAsFactors = F)

# So what's in here, anyway?
names(dat)

# First of all, there are several communities in these data.  Let's pick the
# first one:
cmty1 <- dat %>% filter(commcnt == 1)

# Notice that the "friend" variables contain "fnid", and no other variables do.
# We can use that to create an edgelist:
el <- cmty1 %>%
  select(ego_nid, contains("fnid")) %>%
  gather(key = "nomination", value = "alter", -ego_nid)

head(el)  # hmmmm.... contains 99999 and NA values.  Let's standardize those
el %>% arrange(ego_nid) %>% head

el %<>% mutate(alter = ifelse(alter == 99999, NA, alter)) 

# Now you would think this would work as a network...
el %>%
  select(-nomination) %>%
  network(., matrix.type = "edgelist")

# but statnet also can't really deal with missing values in the edgelist.  So
# the best way to do this is to construct an empty network first, and then
# add the edges from the edgelist.  This has the advantage of including isolates
# too.

# To do that, first we get a list of the people who are involved in the network:


# Get unique list of people in the network
(ah.people <- with(el, unique(c(ego_nid, alter))))  # this would be all the people, including those out of school
# ah.people <- unique(dat$ego_nid)  # this is only people who took the survey

el %<>% filter(!is.na(alter))  # drop missing cases

# We're not going to do that, because it includes a lot of people who didn't
# take the survey.

# Now create an empty network
add.health <- network.initialize(n = length(ah.people))

# statnet adds edges by index values, not by names of variables.  So, to make
# this work, we have to create an index value of 1, 2, ..., N, where N is the
# number of people in the network
el %<>%
  mutate(ego.idx = match(ego_nid, ah.people), 
         alter.idx = match(alter, ah.people))


# Finally, add the edges from the edgelist
add.edges(add.health, el$ego.idx, el$alter.idx)

# Now, as a quick gut check, let's plot it:
set.seed(919)
plot(add.health)  

# (As an aside, note that you could create a list of all the networks using
# lapply...)

##### Save and recall network attributes #####
# One of the benefits of statnet is that it's possible to save attributes for
# different pieces of the network.  We can save vertex attributes, like whether
# the person is a smoker, edge attributes, like whether the ties is homophilous,
# or network attributes, like the name of the school, or the survey wave.  The
# procedure for this is simple; let's illustrate a few.

# First, network attributes are characteristics of the network as whole.  Set
# them with %n%
add.health %n% "cmty" <- 1
add.health  # note "cmty" in there

# Second, vertex attributes are characteristics of each vertex.  Set them with
# %v%
add.health %v% "sex" <- cmty1$sex  # we can do this because vertices and data 
# are sorted in the same way

# A special case of vertex attributes is vertex names, which you set with
# network.vertex.names()
network.vertex.names(add.health) <- cmty1$ego_nid

# All of these methods can also be used to retrive network attributes
add.health %v% "sex"

# Third, edge attributes are characteristics of a pair of vertices.  You can set
# them with %e%, but the indexing is more complicated, and I will not cover it
# here.

# These are useful for plotting and advanced statistical methods.
plot(add.health, vertex.col = "sex")

##### Managing lists #####
# data.frames are the R equivalent of a rectangular dataset in SAS, Stata, or
# SPSS.  R's strength, however, is in its flexibility in the types of objects
# it can store.  That's why it has become so popular for dealing with complex
# data sources, like networks or text.  The data structure underlying all of
# these objects is a list.

# A list is exactly what it sounds like -- it is an ordered collection of
# objects.  Those objects can be anything, including other lists!

my.list <- list(letters[1:10], dat, "all the things")
my.list

# We can refer to items in a list by number, using double brackets
my.list[[2]]  # dat

# We can take more than one at a time using single brackets:
my.list[1:2]  # letters[1:10] and dat

# Or we can name the elements in our list, and then refer to items by name
names(my.list) <- c("some_letters", "dat", "all_things")
my.list$all_things
my.list[["all_things"]]  # same

# Why is this important?  For starters, a data.frame is really just stored as
# a list of columns
dat$grade  # prints just the column "grade"

str(dat)  # view the data structure for dat

# This is helpful if we want to calculate values on that column
mean(dat$grade)

# It's also useful, because R can loop (i.e., perform the same action
# repeatedly) easily with lists.

# A typical loop looks something like this:
for (i in 1:3) {
  print(my.list[[i]])  # prints each element of my.list, one at a time
}

# But that's speaking R with a strong C (or SAS) accent.  The R way of doing
# this is to use the *apply family of functions.  Each one of them takes a
# function, and applies it to each element of the list.  They are named with a
# prefix that indicates what kind of output you'll get -- lapply returns a list,
# and sapply returns a simplified object.

# Let's replicate the example above.
lapply(my.list, print)

# Now let's take a more practical example -- what if I wanted to calculate the
# mean for each of the columns in dat?
lapply(dat, mean)
sapply(dat, mean)  # note that this simplifies it into a vector
sapply(dat, mean, na.rm = T)  # we can pass addition arguments to the function
dat.means <- sapply(dat, mean, na.rm = T)  # save output

mean(dat$sdummy)

# Equivalent
dat.means <- c()  # create empty vector for storage
for (i in 1:ncol(dat)) {
  dat.means[i] <- mean(dat[[i]], na.rm = T)
}

# Notice the single brackets on the output.  We're refering to a specific 
# element of a vector by its index (this is called indexing).  Indexing is
# another way to do recoding.
# For example, let's recode grade == 0 to NA (R's special character for missing)
dat[, "grade"]  # prints the column for grade
dat$grade  # equivalent

dat$grade == 0  # prints a logical (T/F) vector indicating when grade is 0
which(dat$grade == 0)  # prints the index of which values are TRUE

dat[dat$grade == 0, "grade"]  # prints just the elements of "grade" that are 0
dat$grade[dat$grade == 0]  # equivalent

dat[dat$grade == 0, "grade"] <- NA  # resets those values to NA
dat$grade[dat$grade == 0] <- NA # equivalent

# You would repeat this for each category.  (This is cumbersome compared to 
# dplyr!)

# The *apply functions are useful whenever you want to loop through several
# things.  For example, here's how you would run several different model
# specifications for a linear model.

# One linear model looks like this:
lm(POP_BEH ~ IDG_SS, data = dat)

# (For the record, a logistic regression looks like this:)
glm(PSMOKES ~ IDG_SS, data = dat, family = binomial(link = "logit"))

# Let's specify a list of models:
models <- c(  # Again, c stands for 'closure' -- it's a way of making a vector
  POP_BEH ~ IDG_SS + grade,
  POP_BEH ~ grade,
  POP_BEH ~ sex
)

# Now, to run all the models, we use lm with lapply
(fits <- lapply(models, lm, data = dat))

# Print the summaries for both the models:
lapply(fits, summary)

# Ta-da!  The object that we saved is a list of fitted model outputs.  You could
# similarly run the model using different subsets of the data:

sex.models <- lapply(1:2, function(x) lm(POP_BEH ~ IDG_SS, data = dat, 
                                         subset = sex == x))

# The next section will explain how that last lapply statement worked.

##### Functions ##### 
# The other reason that R has become so popular is that it is very easy to build
# your own functions.  Unlike SAS macros (or, to a lesser extent, Stata ado
# files), custom R functions are treated just like the built-in functions, and
# have a centralized distribution system (CRAN).  This has made it very popular
# for statisticians, who use it to distribute software to do cutting edge
# analyses.

# You can, and sometimes should, build your own functions to make your code 
# easier to read and maintain.  The underlying goal is DRY: don't repeat
# yourself.  If you find yourself copying and pasting code often, or if you
# have trouble parsing what a nested set of commands means, you should probably
# be writing a function.

# At their core, functions take an input, and produce an output.  We call input
# the set of arguments that the function takes, and we call the output the
# return value of the function.  

# Here's a simple function to calculate the mean manually:
newMean <- function(x) {
  # Calculates the mean of a numeric vector
  #
  # Args:
  #   x: a numeric vector
  #
  # Returns:
  #   the arithmatic mean of x
  return(sum(x) / length(x))
}

newMean(1:10)
mean(1:10)  # same

# Because R is open-source, you can view the code for almost any function.  For
# example, look at the code for lm:
lm

# Virtually all functions come with a help file, too.  Access that with:
?lm  # shows arguments, return values, etc.

# Of course, it's typically unwise to recreate existing functions, if for no
# other reason than that it's time consuming.  Creating functions is useful
# because you can create custom output.  For example, let's create a function
# that produces a PROC MEANS style summary of a numeric variable (mean, sd, min,
# max)
procMeans <- function(x) {
  # Creates a data.frame with the mean, sd, min, and max of a numeric variable
  #
  # Args:
  #   x: a numeric vector
  #
  # Returns:
  #   a data.frame, with columns for the mean, sd, min, and max of x
  return(data.frame(mean = mean(x), sd = sd(x), min = min(x), max = max(x)))
}

# Running this on one variable of dat:
procMeans(dat$IDG_SS)

# But, since a data.frame is just a list of columns, we can use lapply to loop
# through a bunch of columns at once:
lapply(dat[c("IDG_SS", "POP_BEH_SS", "PSMOKES", "grade")], procMeans)

# Not quite what we wanted -- let's pipe the result to the bind_rows function
# from dplyr, which squashes together the data.frames by rows
lapply(dat[c("IDG_SS", "POP_BEH_SS", "PSMOKES", "grade")], procMeans) %>%
  bind_rows

# We will see that this method is useful when we are calculating a series of
# statistics on networks, or on individuals in networks.