###########################################################
#    ************************************************     #
#    Social Networks and Health Training Program 2018     #
#    ************************************************     #
###########################################################

#**************************************************************************#
#From Raw Data to Network Objects: Data Cleaning for Social Network Analysis
#**************************************************************************#

#Author: Maria Cristina Ramos - Duke University

##################################################################
# INTRODUCTION: DATA CLEANING FOR SOCIAL NETWORKS ANALYSIS (SNA) #
##################################################################

#Start by clearing old data
rm(list = ls())

#Running a line of code in R: place the cursor on the line and press
#ctrl+enter (non-Mac) or cmd+enter (Mac). There is no need to highlight the
#line. The cursor will then move to the next line of code.
#If you do not want to move down a line, use alt+enter.
#If you want to execute just a piece of the line, highlight only that piece
#and press ctrl+enter.
#Use # for commenting.
#To comment more than one line at a time, highlight the code and press
#ctrl+shift+C.

#Data cleaning for SNA: planning and executing a series of tasks
#that transform raw data into objects that SNA tools will be able
#to analyze.

#Statnet and igraph are commonly used network analysis tools for R.
#SNA using statnet or igraph often starts with the creation of a
#network object from scratch.

#NOTE: statnet and igraph do not work well together. They share many function
#names, so loading both at the same time makes one package mask functions of
#the other. If you want to use both of them, do it sequentially: unload one
#before you load the other.
#To unload them use detach(package:statnet) or detach(package:igraph)

#This tutorial uses statnet. The data cleaning process is essentially the
#same for both statnet and igraph.
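
#A minimal sketch of the "use them sequentially" advice above, written as a
#small helper. unload_if_attached() is a hypothetical name introduced here for
#illustration; it is not part of statnet or igraph and only wraps base R's
#detach(). It makes no assumption about which packages are installed.
unload_if_attached <- function(pkg) {
  search_name <- paste0("package:", pkg)
  if (search_name %in% search()) {
    detach(search_name, character.only = TRUE)
    return(TRUE)   #the package was attached and has now been detached
  }
  FALSE            #nothing to do: the package was not attached
}
#Possible usage (assumes both packages are installed):
#library(igraph)
#unload_if_attached("igraph")
#library(statnet)
#library(statnet) typically also attaches component packages such as network
#and sna, so those may need to be detached as well.
#Another option is to leave a package unattached and call single functions
#with the :: prefix, for example igraph::degree() versus sna::degree().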
#IMPORTANT: load statnet with library(statnet) only once you are ready
#to construct a network object. Otherwise it will interfere with some
#data inspection functions.

#To create a network object with statnet, we need our data to be
#in a certain format.
#There are four possible formats from which statnet can construct a
#network:

# - Adjacency matrix
m <- matrix(rbinom(25,1,0.5),5,5)
colnames(m) <- c("Jim", "Molly", "Liann", "Jo", "Jaemin")
rownames(m) <- colnames(m)
diag(m) <- 0 #no self-nominations
m

# - Edgelist
elData<-data.frame(
  from_id=c("1","2","3","1","3","1","2"),
  to_id=c("1", "1", "1", "2", "2", "3", "3"),
  myEdgeWeight=c(1, 2, 1, 2, 5, 3, 9.5),
  stringsAsFactors=FALSE
)
elData

# - Incidence matrix
inci<-matrix(c(1,1,0,0, 0,1,1,0, 1,0,1,0),ncol=3,byrow=FALSE)
rownames(inci) <- c("Jim", "Molly","Dana", "Liann")
colnames(inci) <- c("e1", "e2","e3")
inci

# - Bipartite network
m <- matrix(rbinom(25,1,0.5),5,5)
rownames(m) <- c("Jim", "Molly", "Liann", "Jo", "Jaemin")
colnames(m) <- c("Baseball club", "Chorus", "Volunteering", "Debate club", "Writing group")
#No diag(m) <- 0 here: rows are people and columns are groups, so the
#diagonal has no special meaning in a bipartite matrix.
m

#plus a list of nodes with attributes (features such as gender, race, etc.)

#This workshop outlines a series of systematic steps to reduce the
#uncertainty that often comes with working with R and make the
#process easier!
#From paralyzed to deliberate!

################
# THE PROCESS  #
################
#1. Inspect/Evaluate Raw Data
#2. Make a Plan
#3. Construct Network Object
#4. Check your Work

###########################################
#   PHASE 1: INSPECT/EVALUATE YOUR DATA   #
###########################################
# Goal: identify your data manipulation needs.

# a) Identify useful bits for network objects
#    - What pieces will be the node ids?
#    - What pieces can be node attributes?
#    - What pieces will be the edges?
#    - What pieces can be edge attributes?

# b) Identify issues we have to deal with
#    - Issues at the observation level
#       - Extreme, nonsensical, and/or missing values.
#    - Issues at the structure level
#       - Unnecessary columns: columns that do not contain useful info for
#         our analysis.
#       - Preview rows (Qualtrics): apparent responses that are actually
#         you testing the survey.
#       - Rows/columns in the wrong format: columns that should be rows,
#         rows that should be columns, columns that should be combined, etc.

# TIP: Don't start manipulating (moving, joining, etc.) anything until you
# have taken a good look at your data. Get a good sense of what you have first.
# In the long run, it will be more efficient.

###########################################
#          PHASE 2: MAKE A PLAN           #
###########################################
# Goal: come up with a task or series of tasks to solve each issue. No code
# involved yet, just your plan. This will make the subsequent coding tasks
# much easier.

# a) Plan changes at the observation level
#    - Remove missing values? Impute missing values? Recode?
#      This decision depends on the source of the problem (e.g. a coding
#      error versus a truly extreme value).

# b) Plan tasks for restructuring the dataset
#    - First we need to choose the type of input structure for
#      network objects:
#       - Adjacency matrix
#       - Edgelist
#       - Incidence matrix
#       - Bipartite network
#       - + Nodelist
#    - Then, describe the series of tasks needed to restructure your current
#      dataset into the network object input (your roadmap)

###########################################
#    PHASE 3: CONSTRUCT NETWORK OBJECT    #
###########################################
# Goal: Implement your plan

# You implement your plan by finding a function that will
# do one or more of the tasks in your plan.

# KEY: Having ALREADY identified the type of object/data
# you need to manipulate and what task you want to do with
# it will make it easy to find a suitable function for your tasks.

#For example, I want to combine columns. Object: columns; task: combine
#Having identified these I can now google for a function that combines columns.
#I can now go look in a package created to manipulate columns.
#Before, I was just paralyzed, not knowing what to do.

# There are packages for specific types of objects/data
# (e.g. stringr for strings, tidyr for rows and columns)

# Where can you find suitable functions for your tasks?

# a) Cheat sheets!
#    https://www.rstudio.com/resources/cheatsheets/
#    see also our dropbox folder for relevant cheat sheets
#    Advantages of cheat sheets:
#    - You build familiarity with the package.
#    - They have visuals! Easy to see what the function will do.
#    - Quick, big picture of what is available.
#
# b) Package vignettes or manuals
#vignette(package = "dplyr") #list of topics for which there are vignettes about dplyr
#vignette("dplyr", package = "dplyr") #the first argument, "dplyr", is the
#topic listed as "Introduction to dplyr" in the list of topics, so we ask for it.
#    https://cran.r-project.org/web/packages/network/network.pdf
#    this is the manual of the package we will use to create our network object.

# c) Googling it

#################################
#   PHASE 4: CHECK YOUR WORK    #
#################################
# Goal: Make sure the new network object **makes sense**

#Look at basic descriptives of the network and see if they fit with
#what we know about our raw data.

#Now let's go through each phase in more detail by working with an example.

#############################################
#   EXAMPLE: DATAMED - Nomination Network   #
#############################################

#Installing and loading packages
#install.packages("openxlsx") #You will probably have to install this one.
#Either comment out or remove the line of code that installs a package right
#after installing it.
library(openxlsx)
library(dplyr)
library(tidyr)
library(stringr)

#Importing Data
#This is a synthetic (fake) dataset I created for the purposes of this lab.
#We have two files. The first file contains responses to a
#Qualtrics survey. The second file is a person codebook.
#"Respondents" are medical professionals and were asked to list up to ten
#other professionals with whom they discuss professional matters.

#File 1: Field data from Qualtrics
setwd("/Users/mariacristinaramos/Dropbox/Ongoing Projects/RA for Jim/Data Cleaning Workshop/")
list.files() #this lets you see what's in your working directory
datamed <- read.xlsx("datamed_raw.xlsx")

#File 2: Person Codebook
pcode <- read.xlsx("datamed_raw.xlsx", sheet = 2)

##################################
#   PHASE 1: INSPECT YOUR DATA   #
##################################
#2 Steps
#Step 1. Understand the Structure
#Step 2. Understand the Data at the Observation Level

#1. Understand the Structure of your Data
#########################################

#a) Check the class and size of the dataset.
class(datamed)
class(pcode)
#You can use class with any type of R object
#Most data come in tabular format, but still it is good to check.
dim(datamed) #number of rows and columns
dim(pcode)
#We can already see that we have more nodes in our person codebook (225)
#than responses to the survey (217).

#b) View the column names to get a first sense of what you have.
names(datamed)
names(pcode)
#RED FLAG: some of the column names are not very informative.
#We look at the survey and find out that
#Q32 = R's Name
#Q33 = R's Last Name
#Q.1 - Q.10 = R's Nominees
#Q34 = R's Gender
#Q35 = R's Age
#Q36 = R's Area within Medicine
#Q37 = R's Hospital Affiliation

#c) See a compact summary of the data
str(datamed)
glimpse(datamed) #dplyr's version of str()
str(pcode)
#str stands for structure
#The str() function tells you the dimensions of the data.
#In addition, the str() function tells you the class and
#first observations of each variable.
#The str() function is particularly useful to identify bits of
#information you can use in your analysis.
#In conjunction with your codebook or survey, you can identify
#variables that contain information that we can use for
#constructing our network.
#str() is also useful to identify NAs
#Note that R encodes missing values as NA.
#Look for other common missing-value encodings in your data (-1, 9999, N/A).

#RED FLAGS:
#unnecessary columns
#previews in DistributionChannel.
#R's name and last name in different columns.
#R's name and last name (IDs) are strings, nominee IDs are numbers
#NAs in attribute variables

#d) Look at the first and last rows
#head() lets you look at the first rows of a dataframe.
head(datamed) #first 6 rows (6 rows is the default)
#head(datamed, n=10) #first 10 rows. The n= tells R how many rows
#you would like to see
head(pcode)
#tail() lets you look at the last rows of the dataframe
tail(datamed) #last 6 rows
#tail(datamed, n=10) #set it to show you the last 10 rows.
tail(pcode)

#These functions let you see your dataset without
#cluttering the console. This is useful to detect
#whether your dataset is sorted in a particular way and to get a
#deeper understanding of what each variable contains.

#NOTICE: fewer NAs in attribute vars in tail. Could it be that those previews
#are the NAs?

#e) Summary of the distribution of each variable
summary(datamed)

#For numeric variables, this means looking at means,
#quartiles (including the median), and extreme values.
#summary() will produce different summaries
#depending on whether you are dealing with a character
#or factor variable.
#summary() helps reveal unusual or
#extreme values, missing values, special characters, etc.
#summary() tells you the number of NAs in each variable
#in the dataset.
#Also useful for assessing whether you will have enough
#variation in some variable to effectively use it as
#an attribute for your network objects

#RED FLAG: If we know the max. node_id is 227, we shouldn't have values
#greater than 227 in the nominations. However, we see that Q.1 has
#a max value of 609.

#2. Look more closely at Observations
#####################################
#You should do this for each variable of interest

#a) Tables for each variable of interest
#Important to find unusual observations, special characters, NAs, and
#variation in categories to see which attributes could be more distinguishing
#features.

#simple table
table(datamed$DistributionChannel) #we have 5 preview observations
table(datamed$Q34, useNA="ifany")
#IMPORTANT: $ selects a specific variable from the dataset.
#useNA="ifany" includes NAs in your table
table(datamed$Q36, useNA = "ifany")
#RED FLAG: those Unknowns in Q36... 11 Unknowns
table(datamed$Q37, useNA="ifany")
#RED FLAG: only 3 people from City Hospital. Probably not a great distinguishing
#attribute.

#crosstab
with(datamed, table(Q36, Q37, useNA = "ifany"))

#b) Visualization
hist(datamed$Q35) #histogram
boxplot(datamed$Q35) #boxplots are good for detecting outliers.
#ok. There are outliers. How do we find them?
which.max(datamed$Q35) #Which row contains the max value for Q35?
which(datamed$Q35>60) #Which rows contain values above 60 for Q35?

#see if we have any NAs in our person codebook
sum(is.na(pcode))

#We have looked at our data.
#Now we have a sense of the data structure and the issues we need to deal
#with.
#We have a better sense of what variables will be useful to keep for our
#network objects, how we should name those variables, and whether we have
#missing data issues we should deal with.

####################################
#       PHASE 2: MAKE A PLAN       #
####################################
#Remember: no code involved, just the plan.

#A. Plan for dealing with unusual observations
##############################################
#When dealing with unusual values in your data,
#you often must decide if they are just extreme or erroneous.

#1. Remove preview observations and keep observations with NAs
#in attribute vars.
#2. Remove the 609 in Q.1
#3. Turn the "Unknown"s in Q36 into NAs

#B. Plan for constructing network
#################################

#1. Choose the type of input structure for network objects:
#It seems like the easiest option is to go for an edgelist. This
#might vary according to dataset.

#2. Describe the series of tasks to restructure your current dataset into the
#network object input (your pseudo code):

#A. Fixing Node-id
#*****************
#Start by fixing the node_id issue since we will use that info in both our
#node list and edgelist
#Step 1. Join name and last name in pcode file.
#Step 2. Join Q32 and Q33 (R's name and last name).
#In this way we can match the two datasets with R's complete name.
#Step 3. Replace ID in our new nodelist dataframe with code from pcode.

#B. Constructing nodelist
#************************
#Step 1. Create nodelist dataframe.
#Step 2. Store new id, gender, area, and age cols in new dataframe.
#Step 3. Rename cols in nodelist dataframe

#C. Constructing edgelist
#************************
#Step 1. Create edgelist dataframe.
#Step 2. Store new id and nominations in new dataframe.
#Step 3. Move nominations from wide to long format
#Step 4. Remove NAs
#Step 5. Rename cols sender and target.

#3. Create network object using our nodelist and edgelist
#as input

#4. Add attributes

##########################################
#   PHASE 3: CONSTRUCT NETWORK OBJECTS   #
##########################################
#Every task you do follows the same process of data cleaning
#but on a small scale: you look at your data, you make a plan, you implement
#it, and you look again to check.

#1. Remove preview observations
names(datamed)
edgedata <- filter(datamed, DistributionChannel!="preview")
head(edgedata) #no more previews. Always, always check what you did.

#2. Remove the 609 in Q.1
edgedata$Q.1[edgedata$Q.1==609] <- NA
summary(edgedata)

#3. Turn the "Unknown"s in Q36 into NAs
edgedata$Q36[edgedata$Q36=="Unknown"] <- NA
table(edgedata$Q36, useNA = "ifany")

#A. Fixing Node-id
#*****************

#Step 1. Join name and last name in the person codebook file.
#A useful function to join the contents of multiple columns?
#?unite
#TIP: use ?functionname when you know of a function that might work,
#but you are not sure of what its arguments are
head(pcode)
pcode <- unite(pcode, "name", name:last_name, sep=" ", remove=TRUE)
head(pcode)
length(unique(pcode$name)) #because we have 227 unique names, we
#know no names were repeated
#sep=" " is very important. It tells R what should go between the pieces you
#will unite
#remove: whether the original columns you united should be removed or not.

#Step 2. Join Q32 and Q33 (R's name and last name).
str(edgedata)
edgedata <- unite(edgedata, "name", Q32:Q33, sep=" ", remove=TRUE)
head(edgedata)

#Step 3. Match name in new dataframe with name from node file.
edgedata <- right_join(pcode, edgedata, by="name")
head(edgedata)
pcode[pcode$node_id==95,] #looking at whether 95 corresponds to Marilyn
#Hulett in the node file to see if our matching worked correctly.

#We could have done this more efficiently by doing Step 2 and 3 together
#with pipes:
#edgedata <- unite(edgedata, "name", Q32:Q33, sep=" ", remove=TRUE) %>%
#  right_join(pcode, ., by="name")
#The dot (.) marks where the piped result goes. Without it, the piped data
#would be inserted as the first argument, ahead of pcode.

#The %>% pipe comes from the magrittr package and is re-exported by dplyr,
#so loading dplyr makes it available. We did that at the beginning of the code.
#Pipes *pass* whatever is on the left to the right. This is different from
#the usual R logic that goes from right to left.
#Usual R logic: a <- select(b,1:10). a equals a selection of b, its columns
#from 1 to 10
#Pipes: a <- b %>% select(1:10). a equals: take b, then select its columns
#from 1 to 10.
#TIP: verbalizing your code can help.

#B. Constructing nodelist
#************************
#Step 1. Create nodelist dataframe by joining our person codebook with the
#dataframe containing the attribute information. We need to fully join our
#datasets instead of only using our datamed dataset because the datamed dataset
#does not contain all the nodes since not everyone replied to the survey.
#Step 2. Store new id, gender, area, and age cols in new dataframe.
#Step 3. Rename cols in nodelist dataframe
str(edgedata)
nodelist <- full_join(pcode, edgedata, by="node_id") %>%
  select(node_id, Q34:Q36) %>%
  rename(gender=Q34, age=Q35, area=Q36) %>%
  arrange(node_id) #sort by id so that row order matches vertex order later
head(nodelist)
dim(na.omit(nodelist)) #Number of people with complete attributes. Subtract
#it from nrow(nodelist) to see for how many people you are missing
#at least one attribute.

#C. Constructing Edge list
#*************************
#Step 1. Create edgelist dataframe.
#Step 2. Store new id and nominations in
#new dataframe.
#Step 3. Move nominations from wide to long format
#Step 4. Remove NAs
#Step 5. Rename sender and target.
str(edgedata)
edgedata <- select(edgedata, node_id, Q.1:Q.10) %>%
  gather(label, target, Q.1:Q.10, na.rm=TRUE) %>%
  select(-label) %>%
  rename(sender=node_id) %>%
  arrange(sender)
head(edgedata)
#New steps (such as select) came up because of the function we used, but
#having a plan made it easier to move forward

#3. Create network object using our nodelist and edgelist
#as input
library(statnet)
mednet <- network.initialize(227, directed = TRUE) #network.initialize
#is very useful to make sure you include isolates.
mednet <- network.edgelist(as.matrix(edgedata), mednet) #edgelist passed as a matrix
#The same thing in one step:
#mednet <- network.edgelist(as.matrix(edgedata), network.initialize(227, directed = TRUE))

#4. Add node attributes
# Note: order of attributes you add must match vertex ids
# otherwise the attribute will get assigned to the wrong vertex
#This is how we see vertex names
mednet %v% "vertex.names" #yep, in the correct order.
all(nodelist$node_id == (mednet %v% "vertex.names")) #should be TRUE
mednet %v% "gender" <- nodelist$gender
mednet %v% "age" <- nodelist$age
mednet %v% "area" <- nodelist$area

##################################
#    PHASE 4: CHECK YOUR WORK    #
##################################
#quick check
mednet
dim(edgedata) #same number of edges

#checking attributes
summary(mednet)

#Now compare distributions of vertex attributes in the network with
#the raw data
head(nodelist)
#gender
table(mednet %v% "gender", useNA = "ifany")
table(datamed$Q34)
#area
table(mednet %v% "area", useNA = "ifany")
table(datamed$Q36)
#age
hist(mednet %v% "age")
hist(datamed$Q35)

plot(mednet)
#Do we see isolates?
#Does it make sense?
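
#A minimal sketch of a few extra sanity checks for this phase, in base R.
#check_edgelist() is a hypothetical helper introduced here for illustration
#(it is not part of statnet). It assumes an edgelist data frame with columns
#named sender and target, and node ids that run from 1 to n_nodes.
check_edgelist <- function(el, n_nodes) {
  ids <- c(el$sender, el$target)
  list(
    n_edges      = nrow(el),                                  #rows in the edgelist
    any_missing  = anyNA(ids),                                #leftover NAs?
    out_of_range = sum(ids < 1 | ids > n_nodes, na.rm = TRUE),#ids outside 1..n_nodes
    self_loops   = sum(el$sender == el$target, na.rm = TRUE), #self-nominations
    duplicates   = sum(duplicated(el[, c("sender", "target")])),
    isolates     = length(setdiff(seq_len(n_nodes), ids))     #ids in no edge
  )
}
#Toy data (made up for this sketch): 5 nodes, node 4 nominates itself and
#node 5 never appears.
toy_el <- data.frame(sender = c(1, 1, 2, 4), target = c(2, 3, 3, 4))
toy_checks <- check_edgelist(toy_el, n_nodes = 5)
toy_checks
#With the workshop data the call would be something like:
#check_edgelist(edgedata, n_nodes = 227)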
###########################################
#    TIPS: WHEN SOMETHING DOESN'T WORK    #
###########################################
#1. Try to understand the error message or look for clues in it.
#2. Google the error.
#3. Use help to see whether the arguments you used are wrong or not in the
#right format.
#4. Use the package manual. Run its toy examples and make substitutions until
#you find what the function didn't like.

#For more information
#?network #for information about creating networks using the network package.
#?attribute.methods #for information about setting, modifying, or deleting
#network, vertex, or edge attributes.
```
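
Tip 4 suggests working with toy examples. As a rough sketch of that idea, the wide-to-long edgelist step can be rehearsed on a tiny made-up data frame before touching the real data. The `toy` data below and its values are invented for illustration, and the reshaping uses base R only rather than the tidyr functions from the script.

```r
# Toy survey (made up): 3 respondents, up to 2 nominations each.
# NA means the respondent left that nomination slot empty.
toy <- data.frame(node_id = 1:3, Q.1 = c(2, 3, NA), Q.2 = c(3, NA, NA))
nominee_cols <- c("Q.1", "Q.2")

# Stack the nomination columns: repeat the sender ids once per column
# and line them up with the stacked targets.
long <- data.frame(
  sender = rep(toy$node_id, times = length(nominee_cols)),
  target = unlist(toy[nominee_cols], use.names = FALSE)
)

# Drop empty nominations, sort by sender, and reset the row names.
long <- long[!is.na(long$target), ]
long <- long[order(long$sender), ]
rownames(long) <- NULL
long
```

Because the input is small enough to reshape by hand, any mismatch between the result and what you expected points directly at the step that misbehaves.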