Author Archives: Jin Tong
Old topic, but still necessary
An email from Julie,
Hi Jianying,
Here is the info you need. There are 4 different samples, all done in biological triplicate. All tissue is liver, taken from the left lobe.
LN = liquid nitrogen; tissue frozen in liquid nitrogen
PFPE_tissue = liver tissue fixed by the PAXgene method
OCT_LCM = liver tissue frozen in OCT, sectioned, and regions laser-microdissected
PFPE_LCM = liver tissue fixed by the PAXgene method, embedded in paraffin, sectioned, and regions laser-microdissected
Different kits were used for RNA extraction of LN, PFPE/PFPE_LCM, and OCT_LCM; 3 kits in total. Isolations were done on 3 different days. Let me know if you need more information, and thank you very much.
J.
Honestly, there is NOT much biology in this study; it is really a test of the sample preparation procedures. So, standard QC will be enough to answer the question.
ChromHMM — by Jason Ernst
Okay, as part of my effort for Paul’s project, I am looking into this software, ChromHMM.
Protected: Paralleling R processes
Pipeline: Enhancers & Genes
There was a meeting called by Dr. Paul Wade; his goal was to “associate a chromatin state to its regulated gene (expression level) changes”.
Here are two papers to start off with: Ernst’s Nature paper in 2011 and Ernst’s Nature Biotech paper in 2010.
In Paul’s mind, a pipeline would be the ideal outcome. To build such a pipeline, I can think of the following components:
1. Ernst's HMM model
2. A further statistical model
To get there, we should clear the following road blocks:
1. Can we reproduce what these two papers proposed?
2. Can we take our data as input? Do we need any data cleaning?
That being said, to get a pipeline, we should modularize the process into viable components:
1. Data cleaning module
2. Chromatin state identification module
3. RNAseq/MicroArray gene expression module
4. Association module
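The four modules above can be sketched as a chain of functions. Here is a minimal toy sketch in Python; every function name and the toy record format are hypothetical placeholders, and the thresholding step is only a stand-in for what ChromHMM would actually do:

```python
# Hypothetical sketch of the 4-module pipeline; not a real API.

def clean_data(raw):
    # Data cleaning module: drop records with missing signal values
    return [r for r in raw if r.get("signal") is not None]

def call_chromatin_states(records):
    # Chromatin state module: stand-in for ChromHMM; here we just
    # threshold the signal into "active" vs. "quiescent"
    return [{"region": r["region"],
             "state": "active" if r["signal"] > 0.5 else "quiescent"}
            for r in records]

def expression_levels(records):
    # Expression module: map each gene to its (toy) expression value
    return {r["gene"]: r["expr"] for r in records}

def associate(states, expr):
    # Association module: pair each region's state with its gene's expression
    # (toy data name regions after their genes, so the keys line up)
    return [(s["region"], s["state"], expr.get(s["region"], 0.0))
            for s in states]

def run_pipeline(raw_chip, raw_rna):
    states = call_chromatin_states(clean_data(raw_chip))
    expr = expression_levels(raw_rna)
    return associate(states, expr)

chip = [{"region": "geneA", "signal": 0.9},
        {"region": "geneB", "signal": 0.1},
        {"region": "geneC", "signal": None}]   # dropped by cleaning
rna = [{"gene": "geneA", "expr": 12.0}, {"gene": "geneB", "expr": 1.5}]
result = run_pipeline(chip, rna)
print(result)
```

The point of the sketch is only the modular shape: each module can be swapped out (e.g., the real ChromHMM call for the thresholding stand-in) without touching the others.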
A good R package on mixture models
Here is the mclust package.
Working on literature search
Found a very good link to the statistics department at Utah State University.
Basic procedure for building the prediction model
The data we have will be a mixture of two (sometimes three) different populations.
Our goal is to:
1. Fit the mixture model
2. Get a critical measurement for any given sample: a vector of “density/frequency/proportion” along the possible x coordinates
3. Use such a vector as the independent variable
4. Build a statistical model with the clinical outcome as the dependent variable
5. Train the model on the training data and predict on the unknown data
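The "density/frequency/proportion vector" step amounts to binning each sample's measurements on a shared grid of x coordinates, so every sample yields a comparable fixed-length feature vector. A minimal sketch, where the bin edges and the D.I. values are invented for illustration:

```python
# Turn one sample's raw measurements into a proportion-per-bin feature
# vector over a shared grid; edges and data are hypothetical.

def density_vector(values, edges):
    """Proportion of observations falling in each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    n = len(values)
    return [c / n for c in counts]

edges = [1.0, 1.5, 2.0, 2.3, 3.0, 5.0]          # hypothetical D.I. grid
sample = [1.8, 1.9, 2.0, 2.1, 2.4, 1.2, 1.9, 2.0]
vec = density_vector(sample, edges)
print(vec)   # one row of the design matrix for this sample
```

Stacking one such vector per sample gives the design matrix, with the clinical outcome as the dependent variable.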
Challenges exist in this research project:
1. Out of the three main clusters, information from the normal group masks the other two.
2. The two clinical diagnoses, OLK vs. normal, can have a similar group-three cell population, i.e. a few cells with D.I. > 2.3, or no cell with D.I. > 2.3 at all. These samples were diagnosed with further histo-pathology.
3. So it is challenging, but this can also become an opportunity. Can we show a higher correlation between OLK and OSCC than between normal and OSCC?
4. Are the measurements we “extract” critical?
5. Is there any statistical violation (with this analysis)?
Opportunities exist in this research project, and my hope:
1. Get a good fit on the data from a given sample
2. Identify critical points (first or second derivative) if any, zero otherwise
3. Build an SVM or other model based on these transformed data
4. Evaluate the model with cross-validation, 10-fold or leave-one-out
5. Create a ROC curve for model evaluation
6. Test on an independent dataset (hopefully coming soon)
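The cross-validation step can be sketched with leave-one-out and a toy nearest-centroid classifier standing in for the SVM. Everything here (the 1-D feature, the labels, the classifier choice) is made up for illustration:

```python
# Leave-one-out cross-validation sketch; nearest-centroid is a
# hypothetical stand-in for the eventual SVM.

def nearest_centroid_predict(train, test_x):
    # mean feature value per class; predict the class whose mean is closer
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append(x)
    centroids = {y: sum(xs) / len(xs) for y, xs in by_class.items()}
    return min(centroids, key=lambda y: abs(centroids[y] - test_x))

def loocv_accuracy(data):
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]   # hold out one sample
        if nearest_centroid_predict(train, x) == y:
            correct += 1
    return correct / len(data)

# toy 1-D feature (e.g., proportion of cells with D.I. > 2.3), labels 0/1
data = [(0.01, 0), (0.02, 0), (0.03, 0), (0.20, 1), (0.25, 1), (0.30, 1)]
acc = loocv_accuracy(data)
print(acc)
```

Sweeping a decision threshold over the held-out scores would then give the points of the ROC curve.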
It turns out that there is much I need to work out for this project.
1. The mixture model cannot be applied directly, so I need to choose alternative ways
2. Fit a predominant simple model and determine the excess <– caused by other phenomena
3. From the data, we can easily get the CDF and compare the fit to the original data
4. Get the deciles, maybe?
Thanks go to David Umbach, but the reality is that this is easier said than done. I need to think about it and understand it more.
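The CDF/decile idea is straightforward to sketch: the empirical CDF needs no model fitting at all, and the standard library's `statistics.quantiles` gives the nine cut points that split a sample into deciles. The sample values below are invented:

```python
# Empirical CDF and deciles from raw data; sample values are hypothetical.
import statistics

def ecdf(values, x):
    """Fraction of observations <= x."""
    return sum(1 for v in values if v <= x) / len(values)

sample = [1.1, 1.3, 1.8, 1.9, 2.0, 2.1, 2.2, 2.4, 2.6, 3.1]
deciles = statistics.quantiles(sample, n=10)   # 9 cut points

print(ecdf(sample, 2.0))   # empirical CDF evaluated at x = 2.0
print(deciles)
```

Comparing this empirical CDF against the CDF of a fitted simple model would expose the "excess" as the region where the two curves disagree.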
I found a very interesting PIA at UChicago; next I need to find out what software they are using and whether it is free. It turns out that the webpage was out of date, and here is the new one. I talked to Lei-Ann and got great information.
Protected: Rat microRNA body map project
Mixture model by a Canadian
I think I have found the original post from a Canadian group.
Besides Peter McDonald and his contribution to the MIX software, I found a Master’s thesis by Juan Du. And there is the mixdist R package.
I am amazed to find out that mixture models have been studied so intensively, by so many researchers and for so many years! David Dowe, for example. It is quite a paradox how deep an individual can go into one topic. I guess that is how a person becomes an expert.
Now, with the most superficial approach, I need to clear out some basic road blocks:
1. Detailed properties of the normal and gamma distributions
2. How a gamma distribution becomes a normal distribution
3. The chi-square test for goodness-of-fit and its degrees of freedom
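The gamma-to-normal road block can be checked numerically: a gamma with shape k and scale theta has mean k*theta and variance k*theta**2, and for large k its draws look normal with those moments. A quick sketch using only the standard library (sample size and parameters chosen arbitrarily):

```python
# Numerical check that a large-shape gamma is close to normal
# with mean k*theta and variance k*theta**2.
import random
import statistics

random.seed(0)
k, theta = 100.0, 2.0                       # large shape -> near-normal
draws = [random.gammavariate(k, theta) for _ in range(20000)]

mean = statistics.fmean(draws)
var = statistics.variance(draws)
print(mean, var)   # should be close to k*theta = 200 and k*theta**2 = 400
```

For small k the gamma is visibly right-skewed, which is exactly why a chi-square goodness-of-fit test is needed before trusting any normal-based fit.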
The general MIX program fits a set of data given “mixparameters” and proposed “kernels”, then comes back with a fit to the “histogram”, along with a chi-square test of the fit. It should report the parameters of the distributions that make up the mixture. It sounds like a good approach:
1. Start with a set of data coming from a mixture of distributions
2. Fit with MIX or mixdist and assess the fit with a chi-square test
3. Pick whichever fit wins and extract the parameters of the component distributions
4. Restore the mixture distribution with the known parameters (and/or the proportions??)
5. In the end, take the (second) derivatives and finish the data transformation
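The chi-square assessment in step 2 amounts to comparing the observed histogram counts against the expected counts under the fitted mixture, with df = (number of bins) - 1 - (number of fitted parameters). A minimal sketch with fabricated counts (the values below do not come from any real fit):

```python
# Chi-square goodness-of-fit statistic for a histogram vs. a fitted
# mixture; observed and expected counts here are fabricated.
observed = [6, 14, 25, 30, 22, 15, 18, 12, 8]
expected = [5.0, 15.0, 26.0, 29.0, 23.0, 14.0, 17.5, 12.5, 8.0]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n_params = 5                      # e.g., two normals: 2 means + 2 sds + 1 proportion
df = len(observed) - 1 - n_params
print(chi2, df)
```

The statistic would then be compared against the chi-square critical value for that df; a small chi2 relative to df means the mixture fit is acceptable.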
The next topic will be SVM or other classification procedures for building the prediction model.