Dr. Hao Chen’s microarray project

1. Met with Dr. Hao Chen about her microarray data analysis. The strategy for this request is as follows:

a. Read the two references cited in the paper by Talor, J.M. et al. 2009, PLoS ONE (refs. 46 & 47) for pattern extraction analysis
b. Extract 27 possible patterns from the genes related to the esophagus developmental stages, and produce results like Figure 2 in Talor’s paper
c. Will provide consultation for clustering analysis for Hao
d. Discussed the authorship for possible publication
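
A minimal sketch of the pattern bookkeeping in step b. The notes do not say how the 27 patterns are composed; one reading, assumed here, is three levels over three positions (3^3 = 27). The same helper gives 16 patterns for two levels over four stages, which matches the 16 patterns mentioned later in these notes. The function names, the cutoff, and the example profile are all hypothetical and are not taken from Talor's paper.

```python
from itertools import product

def enumerate_patterns(levels, n_positions):
    """Return every possible pattern as a tuple of level labels."""
    return list(product(levels, repeat=n_positions))

def assign_pattern(values, cutoffs, levels):
    """Label each value by how many of the ascending cutoffs it exceeds.

    cutoffs must be sorted ascending and contain len(levels) - 1 entries.
    """
    labels = []
    for v in values:
        idx = sum(v > c for c in cutoffs)  # number of cutoffs exceeded
        labels.append(levels[idx])
    return tuple(labels)

# Assumed reading: 3 levels over 3 positions, and 2 levels over 4 stages
three_level = enumerate_patterns(("zero", "low", "high"), 3)
two_level = enumerate_patterns(("low", "high"), 4)
print(len(three_level), len(two_level))

# Hypothetical log-ratio profile of one gene over four stages
profile = [-0.8, -0.2, 0.9, 1.4]
print(assign_pattern(profile, cutoffs=[0.0], levels=("low", "high")))
```

Each gene is then counted under the pattern tuple it receives, and the genes per pattern can be written out as a spreadsheet.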

Got it into GeneSpring GX11.0
1. With help from JP from Bioinformatics center I was able to enter the data into GeneSpring GX11.0
2. Need to create the technology first and follow the self-explanatory steps
3. Create project -> experiment
4. Do the analysis within GeneSpring

Combining data from previous study (further request from Dr. Chen, Xiaoxin)

Goal: To determine what genes (especially transcription factors) may play critical roles at each stage during esophageal development.

1. E8.25: “E8.25 definitive endoderm” will develop into future esophagus, stomach, liver, intestine… Presumably some genes should be overexpressed and some others underexpressed in order to “specify” differentiation into E11.5 esophagus.
2. E11.5: Simple columnar epithelium, 1-2 layers
3. E15.5: Stratified squamous epithelium, 2-3 layers
4. P0: Stratified squamous epithelium, Early keratinization, 3-5 layers
5. P7: Stratified squamous epithelium, Late keratinization, 3-5 layers

Two kinds of array data are available for analysis:
1. Previous array data: E8.25 endoderm, E11.5 esophagus (Sherwood RI, Chen TY, Melton DA. Transcriptional dynamics of endodermal organ formation. Dev Dyn. 2009 Jan;238(1):29-42). Data can be downloaded from GEO datasets (GSE13040 record). There are 3 samples of “E8.25 definitive endoderm” and 3 samples of “E11.5 esophagus endoderm” in this dataset. Illumina mouseRef-8 v2 microarrays were used in this study.
2. Hao’s array data: E11.5 esophagus, E15.5 esophagus, P0 esophagus, P7 esophagus (3 samples of each). Initially we planned to include adult esophagus. Due to our own mistakes, we decided to move data analysis forward without these samples. You can access our original data from UNC. Hao gave you SAM data. I guess after we pool E8.25 data from the previous study, we need to do the analysis again.

A few issues for you to consider:
1. How to pool these two kinds of data together into a spreadsheet containing 15 samples (5 time points, 3 samples each time point)?
2. SAM: Do we need to do SAM before clustering? Seems to me that clustering is good enough to show tendency of gene expression changes.
3. Hierarchical clustering: Probably two categories (low and high) are enough instead of 3 categories (zero, low and high). We are interested in genes which increase over time, or decrease over time.
4. K-means clustering: I do not know what additional information we may get from K-means.
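
For issue 1, a rough sketch of one possible way to pool the two platforms. It is an assumption on my part, not a decided method: both datasets contain E11.5 esophagus, so each platform is expressed as log values relative to its own E11.5 mean, then only gene symbols present on both platforms are kept. The function names and toy values are hypothetical, and mapping probes to gene symbols (including several probes per gene) is assumed to be done already.

```python
from statistics import mean

def center_on_reference(expr, ref_columns):
    """Subtract each gene's mean over the reference columns (log scale)."""
    centered = {}
    for gene, values in expr.items():
        ref = mean(values[i] for i in ref_columns)
        centered[gene] = [v - ref for v in values]
    return centered

def pool_platforms(illumina, agilent, illumina_ref, agilent_ref, illumina_keep):
    """Merge two platforms on shared gene symbols after centering on E11.5.

    Only the Illumina columns listed in illumina_keep (the E8.25 samples) are
    carried over, so E11.5 is represented once, by the Agilent samples.
    """
    ill = center_on_reference(illumina, illumina_ref)
    agi = center_on_reference(agilent, agilent_ref)
    shared = sorted(set(ill) & set(agi))
    return {g: [ill[g][i] for i in illumina_keep] + agi[g] for g in shared}

# Toy data (hypothetical values). Illumina columns: E8.25 x2, E11.5 x2
illumina = {"Sox2": [8.0, 9.0, 10.0, 12.0], "GeneOnlyIllumina": [5.0, 5.0, 5.0, 5.0]}
# Agilent columns: E11.5 x2, E15.5, P0
agilent = {"Sox2": [1.0, 3.0, 4.0, 6.0], "GeneOnlyAgilent": [2.0, 2.0, 2.0, 2.0]}

pooled = pool_platforms(illumina, agilent,
                        illumina_ref=[2, 3], agilent_ref=[0, 1],
                        illumina_keep=[0, 1])
print(pooled)
print("genes lost from Illumina:", len(illumina) - len(pooled))
```

Genes missing from either platform drop out at the intersection step, so it is worth counting how many are lost before relying on the pooled table.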

What we would like to have in the end are:
1. A figure to show that our array data are of high quality (hierarchical clustering)
2. Clustering figures to show gene expression change over time, and corresponding spreadsheets
3. Genes which define each time point

New Milestone: an email from Dr. Chen, Xiaoxin proposing a divide-and-conquer strategy

Jianying,

Hao and I have discussed your PPT and 16 patterns. We can make some sense out of Pattern 1 to 6. We are not fully satisfied because some known critical genes (Cdx1, Cdx2, pax9) did not show up. I guess after we merged two datasets we lost a lot of data. So, let’s go for Plan B in our paper as follows:

1. We first show Agilent data (E11.5-E15.5-P0-P7, 38,290 rows).

a. patterns
b. pathways (GSA, GSEA, etc)

2. We then show Illumina data (E8.25-E11.5; 24,189 rows)

a. SAM analysis
b. pathways (GSA, GSEA, etc)

3. Finally we show E8.25-E11.5-E15.5-P0-P7 (5,652 genes)

a. merging two datasets
b. Patterns 1 to 6

For Part 1, you worked a little bit on Hao’s SAM data. It may not be good enough. You may start with the original array data. For Part 2, the other group did a little bit of data analysis. But their major focus was not esophageal development. Their data analysis was not optimal in finding genes/pathways which are critical for esophageal specification. For Part 3, you have finished all. Any questions please feel free to discuss.

Xiaoxin

Met with Xiaoxin, Hao, and ?

1. Adjust delta value according to q value: done, but need to automate the process.
2. Get new gene list (pending on Xiaoxin’s response)
3. Fisher’s exact test (pending on Xiaoxin’s response)
4. Send out the gene list output from SAM analysis; email Xiaoxin prior to downstream analysis
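
A possible starting point for automating item 1, assuming the SAM delta table has already been exported as rows of delta, number of genes called, and estimated FDR. The row layout, the field names, and the numbers are hypothetical. The rule used here, taking the smallest delta whose FDR is at or below the target so that the most genes are kept, is my assumption about what "adjust delta according to q value" should mean.

```python
def pick_delta(delta_table, max_fdr):
    """Return the row with the smallest delta whose FDR is <= max_fdr, or None."""
    passing = [row for row in delta_table if row["fdr"] <= max_fdr]
    if not passing:
        return None  # no delta in the table meets the target
    return min(passing, key=lambda row: row["delta"])

# Hypothetical delta table (not real results)
table = [
    {"delta": 0.2, "called": 5200, "fdr": 0.21},
    {"delta": 0.5, "called": 3100, "fdr": 0.08},
    {"delta": 0.8, "called": 1900, "fdr": 0.04},
    {"delta": 1.2, "called": 900, "fdr": 0.01},
]

best = pick_delta(table, max_fdr=0.05)
print(best)
```

If no row passes, the function returns None rather than silently relaxing the threshold, so the choice stays visible.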

Pending information from Hao

1. Four more arrays
2. Swap sample info

Action items:
1. Talk to JP about the GeneSpring license issue
2. GSA/GSEA analysis
3. Profiling plots…

As of May 18th,

1. Was able to get three DBs to work for our GSA. I may need to explore the OVR on the Agilent data (four stages), as I am not quite confident in the alleged “multiclass” advantage (by Tibshirani -:) of GSA (vs. GSEA).
2. Got the clustering code working
3. BUT, I ran into memory issues with R (on my computer). That said, just for “viewing” and/or “detecting mislabeling”, may I use only a subset of the genes? (I have tried this whenever I ran into memory issues.) I will figure out an inclusive solution for this in the future.
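
One way the "partial genes" idea in item 3 could look, sketched in Python for illustration only (the actual clustering code is in R). Keeping the genes with the largest variance across samples is a common filter and is my assumption here; the notes do not say how the subset was chosen. The function name and toy values are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(expr, n):
    """Return the n gene names with the largest variance across samples."""
    ranked = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)
    return ranked[:n]

# Toy expression table: gene -> values across samples (hypothetical)
expr = {
    "GeneA": [1.0, 1.1, 0.9, 1.0],   # nearly flat
    "GeneB": [0.0, 2.0, 4.0, 6.0],   # strong trend
    "GeneC": [1.0, 2.0, 1.0, 2.0],   # moderate variation
}

subset = top_variable_genes(expr, 2)
print(subset)
```

Clustering only the samples on such a subset should be enough to spot a mislabeled array, since flat genes add memory cost without helping to separate samples.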

Things are pending:

1. GeneSpring (have not got a chance to work on it yet)
2. Will test EPIG and make sure it works on this small laptop prior to having a more powerful Windows server set up.
(As we all agreed, this is not critical and should not hinder us from advancing with our paper writing)

Recommendation:

Maybe we should start to lay out what we have and start to draft the manuscript. I bet that we may need more (hopefully not significantly more) when we start to write.

Oops! Possible mislabeling!

Thank you so much for your prompt response. Xiaoxin and I went through the treeview file and feel that we should separate the samples of 35 and 36. Would you please cluster the 35 samples and the 36 samples individually? And exchange 36A1 and 35D3 before you do clustering. We will get 2 cluster files: one for 35 and another one for 36. We hope the results would be more organized.

Best,

Hao
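
A small sketch of the relabel-and-split step requested above, assuming sample names start with "35" or "36" and that the split uses the corrected labels. Only 36A1 and 35D3 come from the email; the other sample names and both helper functions are hypothetical.

```python
def swap_labels(labels, a, b):
    """Return a copy of labels with a and b exchanged."""
    mapping = {a: b, b: a}
    return [mapping.get(x, x) for x in labels]

def split_by_prefix(labels, prefixes):
    """Group column indices by the first prefix each label starts with."""
    groups = {p: [] for p in prefixes}
    for i, label in enumerate(labels):
        for p in prefixes:
            if label.startswith(p):
                groups[p].append(i)
                break
    return groups

# Column labels in array order; 35A1 and 36B2 are made-up examples
labels = ["35A1", "35D3", "36A1", "36B2"]

fixed = swap_labels(labels, "36A1", "35D3")
groups = split_by_prefix(fixed, ("35", "36"))
print(fixed)
print(groups)
```

The index lists in `groups` can then be used to pull the matching data columns for two separate clustering runs, one per group.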