Learning CPAN

When I tried to install GD module, I had error saying lsdgp 2.0.28 or higher is required.

Frank helped me with a command “yum install GD” 🙂

To get some modules work, sometimes one has to take care of the dependencies. i.e. Font::TTF

Circos: an information aesthetic for comparative genomics

At Steve’s lab meeting, I came across with this paper, published on Genome Research. It seems to be a good piece of software with full length documentation. Circos has been used in many scientific publications. However, to make it work, it is not trivial.

I am trying to document what I have and hopefully, I can make it working in my hand. Fingers crossed 🙂

Since I am the root on SeqBig, I am testing the installation on this machine. Really, the goal would be if I could create a virtualmachine so that I can give it to whoever wants to use it.

Step 1. Get all the CPAN modules installed. Not trivial though..

Honestly, it did not work out as well as I expected. Therefore, ONLY way I made it work is using my virtual machine.

Here is another related page that has more information.

In fact, there are many groups who are consistently searching for tools for draw a circular genome results like circos, i.e. a paper published on Bioinformatics call circleator does similar plots also.

MARCO transcription factor study

Okay, I am trying to help a researcher for her project involving the MARCO genes. She was interested in the transcription binding site upstream of this gene.

I asked my college and got his help with ARE domain analysis. And, I also found this post done by Dave Tang ??

It is very helpful document.

To plot a sequence logo, researchers in the this community tend to use weblogo to plot

Understand the blat and blat output

Here is the deal, I need to parse the “blat” output with two, three, and even four blocks to figure out where the deletion is exactly located:

Sample file format
match    T gap   strand          block   blockSizes      qStarts          tStarts
         bases                   count
298      9962    +       3       69,205,24,      1,70,275,       4132,14162,14368,
299      63      -       3       5,107,188,      0,5,112,        2932,2999,3107,

298      1       +       2       136,163,        1,137,          2970,3107,
213      1       -       2       172,42,         82,257,         137,310,


4-blocks are yet to be implemented in the future

For 2-block positive strand case

298      1       +       2       136,163,        1,137,          2970,3107,
298      1       +       2       147,152,        1,148,          2299,2447,
299      96      +       2       293,6,          1,294,          14350,14739,

Solution is:

3107 - 2970 - 136  = 1   --> [3107,  3107] 

DelStart =  2970 + 136 + 1 [3107]
DelEnd:  =  2970 + 136 + 1 [3107]
# Match stops at 3106 (2970 + 136), with 3107 missing, another match starts at 3108


2447 - 2299 - 147  = 1   --> [2447,  2447] 
DelStart =  2299 + 147 + 1 [2447]
DelEnd:  =  2299 + 147 + 1 [2447]

# Match stops at 2446 (2299 + 147), with 2447 missing, another match starts at 2448

14739 - 14350 - 293  = 96   --> [14644,  14739] 
DelStart =  14350 + 293 + 1   [14644]
DelEnd:  =  14350 + 293 + 96  [14739]

# Match stops at 14643 (14350 + 293 ), with [14644-14739] missing, another match starts at 14740

For 2-block negative strand case, same as POSITIVE strand !!!

298      402     -      2       7,293,          0,7,    263,672,
#Query:@HWI-M01825:53:000000000-A8MJM:1:1101:16256:1461      

672 - 7 - 263 = 402  --> [264, 270]  [271, 672] 

# Match starts at 264 and stops at 270 with 7 bps aligned, breaks [271, 672] , 
match start at 673
# As a result ==> break starts at 271 (= 263 + 7 + 1), with 402 long deletion
Therefore:
DelStart: 263 + 7 + 1 [271]
DelEnd:   263 + 7 +  402  [672] 

BLAT results details as following:
----------------------------------------------------------------

Side by Side Alignment


0001 ggggagggggtgatctaaaacactctttacgccggcttctattgacttgg 0050
<<<< |||||||||||||||||||||||||||||||||||||||||||||||||| <<<<
0965 ggggagggggtgatctaaaacactctttacgccggcttctattgacttgg 0916

0051 gttaatcgtgtgaccgcggtggctggcacgaaattgaccaaccctggggt 0100
<<<< |||||||||||||||||||||||||||||||||||||||||||||||||| <<<<
0915 gttaatcgtgtgaccgcggtggctggcacgaaattgaccaaccctggggt 0866

0101 tagtatagcttagttaaactttcgtttattgctaaaggttaatcactgct 0150
<<<< |||||||||||||||||||||||||||||||||||||||||||||||||| <<<<
0865 tagtatagcttagttaaactttcgtttattgctaaaggttaatcactgct 0816

0151 gtttcccgtgggggtgtggctaggctaagcgttttgagctgcattgctgc 0200
<<<< |||||||||||||||||||||||||||||||||||||||||||||||||| <<<<
0815 gtttcccgtgggggtgtggctaggctaagcgttttgagctgcattgctgc 0766

0201 gtgcttgatgcttgtcccttttgatcgtggtgatttagagggtgaactca 0250
<<<< ||||||||||||||| |||||||||||||||||||||||||||||||||| <<<<
0765 gtgcttgatgcttgttccttttgatcgtggtgatttagagggtgaactca 0716

0251 ctggaacgggggtgcttgcatgtgtaatcttactaagagctaa 0293
<<<< ||||||||||| ||||||||||||||||||||||||||||||| <<<<
0715 ctggaacggggatgcttgcatgtgtaatcttactaagagctaa 0673

-------------------------------------------------------------------
0294 tggaaag 0300
<<<< ||||||| <<<<
0270 tggaaag 0264

-------------------------------------------------------------------

More on negative cases:

213      1       -       2       172,42,         82,257,         137,310,
298      1       -       2       37,262,         0,37,   3069,3107,
298      222     -       2       293,6,          1,294,          11425,11940,
298      4939    -       2       239,60,         1,240,          717,5895,


310   -  172 -137     =  1   -->  [138, 309]     
==> break starts at 310   (137   + 172 + 1 ) with 1 bp long deletion
DelStart: 137   + 172 + 1 [310]
DelEnd:   137   + 172 + 1 [310]

3107  - 37  -3069     =  1   -->  [3070, 3106]   
==> break starts at 3107  (3069  + 37  + 1) with 1 bp long deletion
DelStart: 3069 + 37  + 1 [3107]
DelEnd:   3069 + 37  + 1 [3107]

11940 - 293 - 11425   = 222  -->  [11426, 11939] 
==> braak starts at 11719 (11425 + 293 + 1) with 222 bps long deletion
DelStart: 11425 + 293 + 1    [11719]
DelEnd:   11425 + 293  + 222 [11940]

5895  - 239 - 717     = 4939 -->  [718,   5894 ] 
==> break starts at 957   (717   + 239 + 1) with 4939 bps long deletion
DelStart: 717 + 239 + 1    [957]
DelEnd:   717 + 239  + 4939 [5895]


Solution is: there was one-base off !!!



Solution is:

Sample read:

Hs_Mito_ref:
Case: 3-blocks positive strand

298      9962    +       3       69,205,24,      1,70,275,       4132,14162,14368,

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|251831106|ref|NC_012920.1|_ChrM                                    437   e-122
gi|251831106|ref|NC_012920.1|_ChrM                                    135   7e-32



>gi|251831106|ref|NC_012920.1|_ChrM 
          Length = 16569

 Score = 437 bits (1127), Expect = e-122
 Identities = 229/230 (100%)
 Strand = Plus / Plus

Query: 71    caatctcaattacaatatatacaccaacaaacaatgttcaaccagtaactactactaatc 130
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 14163 caatctcaattacaatatatacaccaacaaacaatgttcaaccagtaactactactaatc 14222

Query: 131   aacgcccataatcatacaaagcccccgcaccaataggatcctcccgaatcaaccctgacc 190
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 14223 aacgcccataatcatacaaagcccccgcaccaataggatcctcccgaatcaaccctgacc 14282

Query: 191   cctctccttcataaattattcagcttcctacactattaaagtttaccacaaccaccaccc 250
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 14283 cctctccttcataaattattcagcttcctacactattaaagtttaccacaaccaccaccc 14342

Query: 251   catcatactctttcacccacagcac-aatcctacctccatcgctaacccc 299
             ||||||||||||||||||||||||| ||||||||||||||||||||||||
Sbjct: 14343 catcatactctttcacccacagcaccaatcctacctccatcgctaacccc 14392


 Score = 135 bits (348), Expect = 7e-32
 Identities = 69/69 (100%)
 Strand = Plus / Plus

Query: 2    catacccccgattccgctacgaccaactcatacacctcctatgaaaaaacttcctaccac 61
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 4133 catacccccgattccgctacgaccaactcatacacctcctatgaaaaaacttcctaccac 4192

Query: 62   tcaccctag 70
            |||||||||
Sbjct: 4193 tcaccctag 4201

Another 3 block examples with both breaks greater than 15

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|251831106|ref|NC_012920.1|_ChrM                                    472   e-133
gi|251831106|ref|NC_012920.1|_ChrM                                     92   7e-19
gi|251831106|ref|NC_012920.1|_ChrM                                     14   2e+05



>gi|251831106|ref|NC_012920.1|_ChrM 
          Length = 16569

 Score = 472 bits (1219), Expect = e-133
 Identities = 244/246 (99%)
 Strand = Plus / Plus

Query: 48    caattctccgatccgtccctaacaaactaggaggcgtccttgccctattactatccatcc 107
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 15582 caattctccgatccgtccctaacaaactaggaggcgtccttgccctattactatccatcc 15641

Query: 108   tcatcctagcaataatccccatcctccatatatccaaacaacaaagcataatacttcgcc 167
             ||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||
Sbjct: 15642 tcatcctagcaataatccccatcctccatatatccaaacaacaaagcataatatttcgcc 15701

Query: 168   cactaagccaatcactttattgactcctagccgcagacctcctcattctaacctgaatcg 227
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 15702 cactaagccaatcactttattgactcctagccgcagacctcctcattctaacctgaatcg 15761

Query: 228   gaggacaaccagtaagctacccttttaccatcattggaccagtagcatccgtactatact 287
             ||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||
Sbjct: 15762 gaggacaaccagtaagctacccttttaccatcattggacaagtagcatccgtactatact 15821

Query: 288   tcacaa 293
             ||||||
Sbjct: 15822 tcacaa 15827


 Score = 92 bits (237), Expect = 7e-19
 Identities = 47/47 (100%)
 Strand = Plus / Plus

Query: 1    catcccgatggtgcagccgctattaaaggttcgtttgttcaacgatt 47
            |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3004 catcccgatggtgcagccgctattaaaggttcgtttgttcaacgatt 3050


 Score = 14 bits (36), Expect = 2e+05
 Identities = 7/7 (100%)
 Strand = Plus / Plus

Query: 294   cctgtcc 300
             |||||||
Sbjct: 15885 cctgtcc 15891

Case: 3-blocks negative strand


296      10788   -       3       113,182,5,      0,113,295, 5320,16167,16403,


Sequences producing significant alignments:                      (bits) Value

gi|251831106|ref|NC_012920.1|_ChrM                                    347   1e-95
gi|251831106|ref|NC_012920.1|_ChrM                                    216   2e-56
gi|251831106|ref|NC_012920.1|_ChrM                                     10   3e+06



>gi|251831106|ref|NC_012920.1|_ChrM 
          Length = 16569

 Score = 347 bits (895), Expect = 1e-95
 Identities = 179/182 (98%)
 Strand = Minus / Plus

Query: 187   ccaatccacatcaaaaccccctcctcatgcttacaagcaagtacagcaatcaaccctcaa 128
             |||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||
Sbjct: 16168 ccaatccacatcaaaaccccctccccatgcttacaagcaagtacagcaatcaaccctcaa 16227

Query: 127   ctatcacacatcaactgcaactccaaagtcacccctcacccattaggataccaacaaacc 68
             |||||||||||||||||||||||||||| ||||||||||||| |||||||||||||||||
Sbjct: 16228 ctatcacacatcaactgcaactccaaagccacccctcacccactaggataccaacaaacc 16287

Query: 67    tacccacccttaacagtacatagtacataaagccatttaccgtacatagcacattacagt 8
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 16288 tacccacccttaacagtacatagtacataaagccatttaccgtacatagcacattacagt 16347

Query: 7     ca 6
             ||
Sbjct: 16348 ca 16349


 Score = 216 bits (558), Expect = 2e-56
 Identities = 112/113 (99%)
 Strand = Minus / Plus

Query: 300  catcaccctccttaacctttacttctacctacgcctaatctactccacctcaatcacact 241
            |||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||
Sbjct: 5321 catcaccctccttaacctctacttctacctacgcctaatctactccacctcaatcacact 5380

Query: 240  actccccatatctaacaacgtaaaaataaaatgacagtttgaacatacaaaac 188
            |||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 5381 actccccatatctaacaacgtaaaaataaaatgacagtttgaacatacaaaac 5433


 Score = 10 bits (25), Expect = 3e+06
 Identities = 5/5 (100%)
 Strand = Minus / Plus

Query: 5     catcc 1
             |||||
Sbjct: 16404 catcc 16408

Case: 3-blocks negative strand — bad example

275      13167   -   3       82,132,62,      24,106,238,     3024,3107,16405,

Output:

BLASTN 2.2.11 [blat]

Reference:  Kent, WJ. (2002) BLAT - The BLAST-like alignment tool

Query= @HWI-M01825:53:000000000-A8MJM:1:1101:23159:10038
         (300 letters)

Database: ../../../reference/hs_mit_seq.fasta 
           1 sequences; 16,569 total letters

Searching.done
                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|251831106|ref|NC_012920.1|_ChrM                                    405   e-113
gi|251831106|ref|NC_012920.1|_ChrM                                    122   6e-28



>gi|251831106|ref|NC_012920.1|_ChrM 
          Length = 16569

 Score = 405 bits (1044), Expect = e-113
 Identities = 213/215 (99%)
 Strand = Minus / Plus

Query: 276  attaaaggttcgtttgttcaacgattaaagtcctacgtgatctgagttcagaccggagta 217
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3025 attaaaggttcgtttgttcaacgattaaagtcctacgtgatctgagttcagaccggagta 3084

Query: 216  atccaggtcggtttctatctac-ttcaaattcctccctgtacgaaaggacaagagaaata 158
            |||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||
Sbjct: 3085 atccaggtcggtttctatctacnttcaaattcctccctgtacgaaaggacaagagaaata 3144

Query: 157  aggcctacttcacaaagcgccttcccccgtaaatgatatcatctcaacttagcattatac 98
            |||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||
Sbjct: 3145 aggcctacttcacaaagcgccttcccccgtaaatgatatcatctcaacttagtattatac 3204

Query: 97   ccacacccacccaagaacagggtttgttaagatgg 63
            |||||||||||||||||||||||||||||||||||
Sbjct: 3205 ccacacccacccaagaacagggtttgttaagatgg 3239


 Score = 122 bits (315), Expect = 6e-28
 Identities = 62/62 (100%)
 Strand = Minus / Plus

Query: 62    tcctccgtgaaatcaatatcccgcacaagagtgctactctcctcgctccgggcccataac 3
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 16406 tcctccgtgaaatcaatatcccgcacaagagtgctactctcctcgctccgggcccataac 16465

Query: 2     ac 1
             ||
Sbjct: 16466 ac 16467

Learning and using mitomap

Today, I registered with mitomap and started to learn how to use it.

It is quite interesting that deletion often occurs at the direct repeat, i.e.

DelJunction    DelSize  RepeatLocation          RepeatType 
10058:14593	-4534	10059-10063/14593-14597	D, 5/5
10169:14435	-4265	10161-10165/14424-14428	D, 5/5

In the first case, the deletion starts at 10058, right before the first repeat (10059-10063); and ends at 14593, right before the second “direct repeat” starts. As a result, one repeat (10059-10063) disappears.

In the second case, the deletion starts at 10169, after the first repeat (10161-10165); and ends at 14435, which contains that the second “direct repeat”. As a result, one repeat (14424-14428) disappears.

So, to better understand the scenario helps to implement the analytical processes.

There is a Perl module developed by the team called MitoMaster. It turns out to be a good tool

Now, with our mitochondrial project, we can search for possible “repeats” that flank a deletion.

For a given deletion junction: 10058:14593

  • Case I: search up to 15 bp after break starts (10059 - 10074), against up to 15 bps right after the break ends: 14593
  • Case II: search up to 15 bp before break starts (10043- 10058), against up to 15 bps before after the break ends: 14578 - 14593
  • For each case, "number of repeat base" = min (15 bp, lengthOfDel)

I got help from tcoffee for local alignment Perl scripts.

While working on this project, I had experienced MUCH hassle with blat and blat output, especially to determine the “deletion” start and stop. It is also has something to do with the “zero” base coordinates or “one” base coordinates across different systems. Therefore, I created a separate post, hopefully could help me to get the bottom of this issues.

I found a very interesting alignment with G451E

In this directory: /ddn/gs1/home/li11/project2014/Copeland/IlluminaData/blatApp/gapHeatMap/
Issues this command: awk -F”\t” ‘{ if ($18 ==3 && $8 == 16054 ) print $1, “\t”, $8 ,”\t”, $9, “\t”, $10, “\t”, $18 , “\t”, $19, “\t”,$20 , “\t”, $21 }’ G451E_R1.psl

I found:

108 16054 – @HWI-M01825:53:000000000-A8MJM:1:1101:17663:13886 3 50,4,55, 191,241,245, 148,200,16256,
295 16054 – @HWI-M01825:53:000000000-A8MJM:1:1117:6130:11630 3 69,4,227, 0,69,73, 129,200,16256,
295 16054 + @HWI-M01825:53:000000000-A8MJM:1:2117:21755:9298 3 178,4,118, 0,178,182, 20,200,16256,

Based on my deletion detection rule:

diff1 = 200 - (148 + 4)  = 48
diff1 = 200 - (129 + 4)  = 67
diff1 = 200 - (20 + 4)   = 176

diff2 = 16256 - (200 + 50)   = 16006
diff2 = 16256 - (200 + 69)   = 15987 
diff2 = 16256 - (200 + 178)  = 15878 

 
totalDelLength = 16006 + 48   = 16054
totalDelLength = 15987 + 67   = 16054
totalDelLength = 15878 + 176  = 16054

#First one:

if (diff1 > 15)  ==> deletion is [153, 200]
DelStart = (148 + 4)  + 1 
DelEnd   = 200

if (diff2 > 15)  ==> deletion is [16055, 16256]
DelStart = 16006 + 48  + 1
DelEnd   = 16256

#Second one: 
if (diff1 > 15)  ==> deletion is [134, 200]
DelStart = (129 + 4)  + 1 
DelEnd   = 200

if (diff2 > 15)  ==> deletion is [16055, 16256]
DelStart = 15987 + 67  + 1
DelEnd   = 16256

#Third case:
if (diff1 > 15)  ==> deletion is [25, 200]
DelStart = (20 + 4)  + 1 
DelEnd   = 200

if (diff2 > 15)  ==> deletion is [16055, 16256]
DelStart = 15878 + 176  + 1
DelEnd   = 16256

Working with microRNA affy array

It is kind of shame as I did not save the “dat” object when I did the micorRNA array data analysis for Mahmet. Mahmet recently requested for the “expression” data for the microRNA affyarray data. I realized that my R code is NOT working anymore. Here are the error messages

dat <- read.celfiles (filenames=paste(celpath,pheno$ChipBarcode,".CEL",sep=""), sampleNames=rownames(pheno), phenoData=pheno)

##===================================================================================
#Error in checkSlotAssignment(object, name, value) :
# assignment of an object of class “data.frame” is not valid for slot ‘phenoData’ in an object of class “ExpressionFeatureSet”; is(value, “AnnotatedDataFrame”) is not TRUE
#Error during wrapup: cannot open the connection
##====================================================================================

With the following command:

# The following command works, but RMA will fail (also pvac) because this it
# NOT an AffyBatch
dat <- read.celfiles (filenames=paste(celpath,pheno$ChipBarcode,".CEL",sep=""))
pData(dat) <- pheno
dt.rma = rma(dat) # run RMA algorithm

#Error in (function (classes, fdef, mtable) :
# unable to find an inherited method for function ‘probeNames’ for signature ‘”ExpressionFeatureSet”’
#Error during wrapup: cannot open the connection

It definitely worked in the past. However, the consequences are that I have to redo it!

Let’s look at the cdf for microRNA array

Some debate on AffyBatch vs oligo pacakges