Learning cutadapt

Posted on January 31, 2013 by Jin Tong

A good debate over “cutadapt” vs. “fastx-toolkit”
It turns out cutadapt has a good user community. More details can be found here.

Adam’s commands:

cutadapt -m 1 -e 0 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_0mismatch.fastq &
cutadapt -m 1 -e 0.056 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_1mismatch.fastq &
cutadapt -m 1 -e 0.11 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_2mismatch.fastq &
cutadapt -m 1 -e 0.167 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_3mismatch.fastq &

So, based on cutadapt documentation:

-m : minimum length after trimming the adapter, default is 0
-e : number of errors divided by the length of matched base. So, if 31 bases matched with five mismatches in a 36 bp long read, it is 5/31 ~= 0.16

A few years later, cutadapt has much comprehensive user guide

Weird Picard error and solution

Posted on January 23, 2013 by Jin Tong

While working on the miRNA project, I encountered weird error with Picard, sam_bam convert and even remove duplicate. I thought that it would be caused from jave security problem. In the end, it turned out that it was cause by “duplicate miRBase IDs”

grep hsa-let-7c  /ddn/gs1/home/li11/refDB/miRBase/hsa_combined.fa
>hsa-let-7c MIMAT0000064 Homo sapiens let-7c
>hsa-let-7c MI0000064 Homo sapiens let-7c stem-loop

I have to replace all “spaces”.

Now, with WES project, I am using Picard to remove the duplicate. Here are some good postes

1. People are getting high duplication in WES
2. How to determine removed reads for PE sample
3. A thorough Picard documentation
4. Remove or not remove??
5. Duplication metric in details

Why Picard Markduplicate does not produce histogram

1. First post concludes that it is caused by multiple-groups

, but I do NOT think so.

Protected: miRNA quantification

Posted on January 23, 2013 by Jin Tong

Protected: My awk tricks

Posted on January 16, 2013 by Jin Tong

BWA alignment faliure

Posted on January 15, 2013 by Jin Tong

It occurred when I tried to align miRNAseq data to miRBase. The raw sequence was pre-processed by removing the RNA PCR Primer sequence using fastx-toolkit. The error message is like the following:

[bns_restore_core] fail to open file '/ddn/gs1/home/li11/refDB/miRBase/hsa_combined.fa.nt.ann'. Abort!

What does it mean?

I think the error occurred owing to “no” alignment being reported. As the miRBase is too short and no aligner will be able to handle the following case:



    The number of 'N' characters to append at the 5'-end and at the 3'-end of each entry. This may be useful for mapping as most mappers can not match reads which start before the reference sequence entry. For example

     Mapping to cel-let-7 MIMAT0000001 Caenorhabditis elegans let-7
     TTGAGGTAGTAGGTTGTATAGTTA ..read
      TGAGGTAGTAGGTTGTATAGTT  ..mirbase entry

    This sequence can not be mapped to the corresponding mirbase entry with any Mapper. When using GEM instead it is possible to map this read with two mismatches. It is however necessary to append several 'N's at the 5'-end and 3'-end first. For example:

     Mapping to cel-let-7 MIMAT0000001 Caenorhabditis elegans let-7
      TTGAGGTAGTAGGTTGTATAGTTA   ..read
     NNTGAGGTAGTAGGTTGTATAGTTNN  ..mirbase entry

    Now the read could be mapped to the reference sequence (let-7). default=0

Solutions

Step 1 Count the shortest miRbase sequence
Step 2 Append Ns to both 5' and 3' ends

I had alignment from BWA, I noticed very strange phenomena. With sam format -flag 4, it still showed “match” in cIGAR string:

64NDWAAXX:5:079:19712:16612 4 hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p 45 25 14M * 0 0 CCATGACCGCTCTG 0A+>:<*A>?>A X0:
i:1 X1:i:0 MD:Z:9G4 XG:i:0 NM:i:1 XM:i:1 XN:i:1 XO:i:0 XT:A:U

Here is what is true:

[li11@ehscmplp09/bioinfo1 perlLibs]$ perl /ddn/gs1/home/li11/perlLibs/seqManip.pl /ddn/gs1/home/li11/refDB/miRBase/hsa_mature_ready_to_map.fa  hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p 

Output:
Seen sequence hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p, with seq as NNNNNNNNNNNAGTTTTGCAGGTTTGCATCCAGCNNNNNNNNNNN

And,

[li11@ehscmplp09/bioinfo1 perlLibs]$ perl searchFastqSeq.pl  ~/project2013/miRNAseq/processedSeq/TQE_14_24/ES583_trimmed_1mismatch.minQS20.fastq.14_24.fastq  @64NDWAAXX:5:079:19712:16612

Output:
Seen sequence @64NDWAAXX:5:079:19712:16612, with seq as CCATGACCGCTCTG

Protected: Bioconductor R package admin

Posted on January 14, 2013 by Jin Tong

microRNAseq QC

Posted on January 11, 2013 by Jin Tong

I had found very confused FastQC report for Rick’s miRNAseq experiment. I did not understand and wanted to confirm what I have seen.

I downloaded the data in Kuchen, S et al paper on GenomeResearch, 2010. Using SRR042443.sra as an example to evaluate the data quality. Detail see command_01112013.txt

Using SRR042443.fastq as an example

Step 1: fastq-dump to unzip the data: 
Step 2: fastQC analysis
Step 3: I saw similar results as our in-house, but I did not see the level of "over-represented sequence" matched 100% to RNA PCR primer sequences.
Step 4: Discussed with Adam, David and Pierre and decided to perform "adapter" sequencing trimming.

Strategy:

Step 1. Remove "primer sequence" by JYL (fastx-toolkit) or by AB (cutadapt)
Step 2. Align to miRBase (failed with Primer Sequence removed input file)
Step 3. Test alignment with raw file, while trying to figure out the error.