Bimodal distribution

In the proliferated cell ploidy testing, I encountered a bimodality phenomino

Density-plots

There are a number of existing tests for the modality of a density underlying an observed distribution, including Silverman’s test, the Dip test, the Excess Mass test, and the MAP and RUNT tests.

As a simple illustration, consider a system described by a random variable X, which switches between two well defined states, 1 and 2 with probabilities p and 1-p. Assume that the conditional density of X given the state is normal in each of states 1 and 2 and denote it f1(x) and f2(x), respectively. Then the unconditional density will be p f1(x) + (1-p) f2(x). It can be easily observed that if the means of the two densities are different, then certain combinations of the standard deviations and the probability p result in a bimodal unconditional density.

A Bayesian’s approach
 

Tackle the problem with methylKit

It seems like a promising R package on this.

Installation needs a little more attention. Here are steps in details:


#Dependencies
install.packages("data.table")
source("http://www.bioconductor.org/biocLite.R")
biocLite("GenomicRanges")

#Installation
download.file("http://methylkit.googlecode.com/files/methylKit_0.5.7.tar.gz",destfile="methylKit_0.5.7.tar.gz")
install.packages("methylKit_0.5.7.tar.gz",repos=NULL,type="source")
unlink("methylKit_0.5.7.tar.gz")

Well, two hurdles were blocking me

1. I can't get the CpG island annotation for mouse from UCSC, when I followed the instruction :

For CpG island annotation, select "Expression and Regulation" from the "group" drop-down menu. Following that, select "CpG islands" from the "track" drop-down menu. Select "BED- browser extensible data" for the "output format". Click "get output" and on the following page click "get BED" without changing any options. save the output as a text file.
2. Our .sam file was not properly sorted. One extra step is need which is grep -v '^[[:space:]]*@' raw.sam | sort -k3,3 -k4,4n > sorted.sam 3. It turns out that this "sorting" is taking too much of the /tmp/ space to the extend that no server is designed to handle this problem, except wine. Thanks go to Frank who allows me to access wine2 temporarily. Both servers work with this just fine.

It turns out that we have three separate libraries and we have both raw and de-duplicated .sam files. So, firstly I need to combine those three files before anything can be done further. Solutions though

Basically, I need to

Generate deduplicated, merged-library, Picard-sorted & reordered SAM files for each animal

Let’s try samtools

  • Convert .sam to .bam: samtools view -S in.sam -bo out.bam
  • Sort .bam file: samtools sort out.bam out.bam.sorted
  • Merge multiple .bam files: samtools merge merged.bam 1.out.bam.sorted 2.out.bam.sorted 3.out.bam.sorted
  • Sort merged .bam file: samtools sort merged.bam merged.sorted.bam

Now, let’s take a look at picard tools

  • Convert .sam to .bam:
  • Merge multiple .bam files
  • sort them

What about unix command, a quite simple unix solution

  • Remove headers
  • tail -n +42 B6_M_1.L2x4.mm9.raw_bismark_deduplicated.sam temp_B6_M_1.L2x4.mm9.raw_bismark_deduplicated.sam

    tail -n +51 B6_M_1.L3x13.mm9.raw_bismark_deduplicated.sam temp_B6_M_1.L3x13.mm9.raw_bismark_deduplicated.sam

  • Concatenate them
    cat B6_M_1.L1x2.mm9.raw_bismark_deduplicated.sam temp_B6_M_1.L2x4.mm9.raw_bismark_deduplicated.sam temp_B6_M_1.L3x13.mm9.raw_bismark_deduplicated.sam > B6_M_1.merged.sam
  • Convert and sort them with picard tools
    picard-tools-1.42/SortSam.jar INPUT=hello.sam OUTPUT=hello.sorted.sam CREATE_INDEX=false SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
    picard-tools-1.42/SortSam.jar INPUT=hello.sorted.sam OUTPUT=hello.sorted.bam CREATE_INDEX=true SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.

Re-do DNAseq SNP calling

I found out that there were more data not used by Sara in her alignment:

Hi, Xiaojiang & Adam,

Per Xiaojiang’s presentation, it seems that we had three lanes of DNAseq. I used alignment Sara provided when I did ASE analysis. I would give it a try at least on the parental strains to see any changes with the SNV calls.  Your any attention will be appreciated.

FYI, the following were the files I used:

##===================================================================================
#ID     NISC_raw_BAM    NTP_ID  genotype
sample1 111025_MORPHEUS_D09D9ACXX.4.bam NTP_02_DNA      BL6 male
sample2 110922_MORPHEUS_C052AACXX.1.bam NTP_15_DNA      BL6 female
sample3 110922_MORPHEUS_C052AACXX.2.bam NTP_25_DNA      C3H male
sample4 110922_MORPHEUS_C052AACXX.4.bam NTP_32_DNA      C3H female
sample5 110922_MORPHEUS_C052AACXX.8.bam NTP_213_DNA     F1 male from NTP-02/32 [BL6m x C3Hf]
sample6 110922_MORPHEUS_C052AACXX.7.bam NTP_221_DNA     F1 female from NTP-02/32 [BL6m x C3Hf]
sample7 111014_NIOBE_C0522ACXX.8.bam    NTP-121_DNA     F1 male from NTP-15/25 [BL6f x C3Hm]
sample8 110922_MORPHEUS_C052AACXX.5.bam NTP_123_DNA     F1 female from NTP-15/25 [BL6f x C3Hm]
##===================================================================================

Thanks.

Jianying Li

Adam provided information on all the available data

Hi Jianying,

The DNA-seq raw data can be found in /ddn/gs1/project/mousemeth/DNAseq/raw_reads/NISC_111018.  There are 3 lanes each for NTP_15_DNA (BL6 female) and NTP_32_DNA (C3H female), and 1 each for the rest.  The BAM files contained in this directory were aligned with ELAND by the NISC pipeline, so you will probably want to start with the FASTQs.

Thanks,

Adam

So, I have to re-align all the data and call SNP myself.

Step 1. Create a new folder: SNPcalling under ~/project2013/
Step 2. Align raw files into mm9
[li11@ehscmplp14/bioinfo2 NISC_111018]$ ls -l NTP_15_*
-r--r--r-- 1 burkholderab ngsuser 56026816392 Oct 18  2011 NTP_15_DNA_110922_MORPHEUS_C052AACXX.1.1.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 56026816392 Oct 18  2011 NTP_15_DNA_110922_MORPHEUS_C052AACXX.1.2.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 41436725311 Oct 18  2011 NTP_15_DNA_110922_MORPHEUS_C052AACXX.1.bam
-r--r--r-- 1 burkholderab ngsuser 52997281364 Dec  5  2011 NTP_15_DNA_111115_NIOBE_D0B7PACXX.5.1.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 52997281364 Dec  5  2011 NTP_15_DNA_111115_NIOBE_D0B7PACXX.5.2.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 38105590148 Dec  5  2011 NTP_15_DNA_111115_NIOBE_D0B7PACXX.5.bam
-r--r--r-- 1 burkholderab ngsuser 52256364140 Dec  5  2011 NTP_15_DNA_111115_NIOBE_D0B7PACXX.6.1.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 52256364140 Dec  5  2011 NTP_15_DNA_111115_NIOBE_D0B7PACXX.6.2.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 37258274997 Dec  5  2011 NTP_15_DNA_111115_NIOBE_D0B7PACXX.6.bam
[li11@ehscmplp14/bioinfo2 NISC_111018]$ ls -l NTP_32_*
-r--r--r-- 1 burkholderab ngsuser 42257747650 Oct 18  2011 NTP_32_DNA_110922_MORPHEUS_C052AACXX.4.1.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 42257747650 Oct 18  2011 NTP_32_DNA_110922_MORPHEUS_C052AACXX.4.2.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 28324903863 Oct 18  2011 NTP_32_DNA_110922_MORPHEUS_C052AACXX.4.bam
-r--r--r-- 1 burkholderab ngsuser 42775914395 Dec  5  2011 NTP_32_DNA_111115_NIOBE_D0B7PACXX.7.1.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 42775914395 Dec  5  2011 NTP_32_DNA_111115_NIOBE_D0B7PACXX.7.2.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 29834785339 Dec  5  2011 NTP_32_DNA_111115_NIOBE_D0B7PACXX.7.bam
-r--r--r-- 1 burkholderab ngsuser 42938647452 Dec  5  2011 NTP_32_DNA_111115_NIOBE_D0B7PACXX.8.1.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 42938647452 Dec  5  2011 NTP_32_DNA_111115_NIOBE_D0B7PACXX.8.2.sanger.fastq
-r--r--r-- 1 burkholderab ngsuser 30675427143 Dec  5  2011 NTP_32_DNA_111115_NIOBE_D0B7PACXX.8.bam