My Project…

By: Jo Zhu

The research of the Hartemink group sits at the intersection of computer science, biology (genomics), and statistical science. The group studies a diverse range of topics, from transcription regulation networks to the control mechanisms of the eukaryotic cell cycle, but all of its projects share the use of Bayesian statistics, machine learning, and data mining to solve complex problems in biology.

Identifying the binding sites of transcription factors (TFs) is a crucial first step in understanding transcription regulation. My secondary mentor, Kevin, is working on using DNase digestion data to accurately identify TF occupancy in the human genome across different cell types. Such quantitative data would allow researchers to systematically investigate TF competition, cooperative binding, and cell-type-specific TF occupancy, and would add to our understanding of transcription regulation networks. He developed a model named MILLIPEDE (available at http://www.cs.duke.edu/~amink/software/), which uses a supervised learning strategy based on training logistic regression models. The data used to train the model (DNase digestion data, ChIP-Seq* data, PWM scores, etc.) are obtained from the open resources of the ENCODE project.

My project is to apply a similar model, CENTIPEDE (developed by Pique-Regi R et al.), to predict TF occupancy in the human genome across different cell types, and to evaluate its performance via correlation analysis. Unlike MILLIPEDE, CENTIPEDE is built on a Bayesian mixture model and an unsupervised learning strategy, i.e. it doesn't depend on ChIP-Seq* data for training. The purpose is to provide a comparison, or benchmark, for the performance of the MILLIPEDE model.
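To make the evaluation step concrete, here is a minimal sketch of a correlation analysis between predicted binding probabilities and a ChIP-Seq signal at the same candidate sites. The choice of correlation measure is not fixed by anything above, so Pearson correlation is an assumption here, and the function name `pearson` and all numbers are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values at five candidate sites (not real data).
predicted_prob = [0.95, 0.10, 0.80, 0.05, 0.60]   # model output per site
chip_signal = [120.0, 8.0, 95.0, 3.0, 40.0]       # ChIP-Seq read counts per site

r = pearson(predicted_prob, chip_signal)
print(f"Pearson r = {r:.3f}")
```

A rank-based measure such as Spearman correlation could be swapped in if the ChIP-Seq signal is heavily skewed.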

Some background on DNA footprinting and how it is used to predict TF binding:

It is understood that the binding of transcription factors protects DNA from being cleaved by nuclease (in this case, deoxyribonuclease I, or DNase I). This makes the digestion profile reflect the chromatin state of the genome, with nucleotides bound by a TF (or other proteins) being cleaved less frequently than unbound nucleotides. Aligning all fragments resulting from partial digestion to the genome and adding up the counts at each nucleotide position therefore yields a characteristic profile. This TF-generic DNase digestion profile tells us the likelihood of transcription factor binding. The difference between the digestion profiles of bound and unbound DNA fragments is clearly reflected in the figure below, with green indicating bound and red indicating unbound. (Source: http://genome.cshlp.org/content/21/3/447 Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Genome Res. 2011 Mar;21:447.)
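The counting step described above can be sketched in a few lines. This is a simplified illustration, assuming each aligned read has already been reduced to the genomic coordinate of its cut site. The function name `cut_profile` and the coordinates are hypothetical.

```python
from collections import Counter

def cut_profile(cut_positions, start, end):
    """Count DNase I cuts at each nucleotide position in [start, end)."""
    counts = Counter(p for p in cut_positions if start <= p < end)
    return [counts[p] for p in range(start, end)]

# Hypothetical cut-site coordinates (not real data).
cuts = [100, 101, 101, 103, 108, 109, 109, 109]
profile = cut_profile(cuts, 100, 110)
print(profile)  # one count per position from 100 to 109
```

In this toy profile the stretch from position 104 to 107 has no cuts at all, which is the kind of protected "footprint" a bound protein would be expected to leave.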

But how do we determine which TF is bound? TFs bind to DNA sequences with a certain specificity, and this binding motif is represented by a position weight matrix (PWM). The PWM score therefore reflects how closely a candidate sequence resembles that specific motif.
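As an illustration of how a PWM score can be computed, the sketch below uses one common formulation: the sum of log-odds of the motif probability against a background frequency at each position. The 4-bp motif, the uniform background, and the function name `pwm_score` are all assumptions for the example, and the scoring scheme used by MILLIPEDE or CENTIPEDE may differ.

```python
import math

# Hypothetical 4-bp motif: probability of each base at each motif position.
# These numbers are made up for illustration, not taken from any real TF.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency

def pwm_score(seq, pwm=PWM, background=BACKGROUND):
    """Sum of per-position log2 odds of the motif versus the background."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))

print(pwm_score("ACGT"))  # matches the motif consensus: positive score
print(pwm_score("TTTT"))  # mostly mismatches: negative score
```

A higher score means the sequence looks more like the motif than like background DNA.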

Putting these together, both models first scan the genome for sites that match the binding motif of the TF; these are the 'candidate binding sites'. The DNase digestion profile around each candidate binding site is then analyzed, along with other relevant information such as the distance to the transcription start site, to determine the probability of TF binding.
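The two-stage idea (scan for motif matches, then pull out the digestion profile around each match) can be sketched as follows. To keep the example short, motif matching uses a stand-in score, the fraction of bases equal to a consensus string, rather than a real PWM. All function names, the toy genome, and the cut counts are hypothetical, and the final step of turning each profile into a binding probability is left out.

```python
def match_fraction(site, consensus):
    """Stand-in motif score: fraction of positions equal to the consensus."""
    return sum(a == b for a, b in zip(site, consensus)) / len(consensus)

def candidate_sites(genome, consensus, min_score):
    """Return start offsets of windows whose stand-in score is >= min_score."""
    k = len(consensus)
    return [i for i in range(len(genome) - k + 1)
            if match_fraction(genome[i:i + k], consensus) >= min_score]

def flanking_profile(cut_counts, start, motif_len, flank):
    """Slice per-base cut counts for the motif plus `flank` bases per side.

    Windows that would run off either end are clipped to the available range.
    """
    lo = max(0, start - flank)
    hi = min(len(cut_counts), start + motif_len + flank)
    return cut_counts[lo:hi]

# Toy genome and per-base DNase cut counts of the same length (not real data).
genome = "TTACGTGGACGAAA"
cut_counts = [5, 4, 0, 0, 0, 0, 6, 3, 2, 1, 2, 3, 1, 0]

sites = candidate_sites(genome, "ACGT", min_score=0.75)
for start in sites:
    window = flanking_profile(cut_counts, start, motif_len=4, flank=2)
    print(start, genome[start:start + 4], window)
```

Each window of counts is what a model would then combine with other features, such as the distance to the transcription start site, to decide whether the site is likely bound.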

Procedurally, since the R scripts for the model and some example data (the NRSF transcription factor in the GM12878 cell line) are available (at http://genome.cshlp.org/content/21/3/447), I am starting by learning how to use the model and replicating the published results. Once I can reproduce those results exactly, I will run the model on different cell types using data from the ENCODE project to analyze its performance, which will serve as a comparison to the results obtained from the MILLIPEDE model.

 

* ChIP-Seq (chromatin immunoprecipitation followed by massively parallel DNA sequencing) is a technique for studying the interaction of a specific protein with DNA. In our project, ChIP-Seq data are assumed to accurately reflect the binding sites of each specific TF. For the MILLIPEDE model, they are therefore part of the training set in the supervised learning algorithm, while for the CENTIPEDE model they serve as the data set used to evaluate the accuracy of the predictions.

 

 
