Protected: Understand miRNA
Understand BWA SAM file format — to get the “perfect match” alignment
A good wiki site to start off with
The same question has been asked before. Answer? don’t know
Another post on seqanswers comparing different aligners.
SAM file tag format numeric tag explained.
And, picard provides online sam tag check
SAM format SAM_format helps to extract “perfect match”.
With special tags to BWA alignment
here
I got help from our collaborator, Florian my own note
To get all flag counts:
samtools view ES583_miRbaseMature_bwa.bam | grep -v "@" | awk -F"\t" 'BEGIN{print "flag\toccurrences"} {a[$2]++} END{for(i in a)print i"\t"a[i]}'
flag occurrences
4 2109951
20 1096
0 1190274
16 10244
Now, let’s focus on flag 0
Scenario I
NNNNNNNNNNN TGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN pppppppppppGTGAGGTAGTAGGTTGTATAGTT
With one base off : NM:i:1
With one mismatch in the alignment : XM:i:1
With one ambiguous base in the reference: XN:i:1
Scenario II, same as I, but reported differently by BWA
NNNNNNNNNNN TGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN pppppppppppGTGAGGTAGTAGGTTGTATAGTT (in scenario I) pppppppppppTTGAGGTAGTAGGTTGTATAGTT
With no base off : NM:i:0
With no mismatch in the alignment : XM:i:0
With one ambiguous base in the reference: XN:i:1
Scenario III, same as I, but shifted to the right
NNNNNNNNNNN TGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN pppppppppppGTGAGGTAGTAGGTTGTATAGTT (in scenario I) ppppppppppp TGAGGTAGTAGGTTGTATAGTTT
With one base off : NM:i:1
With one mismatch in the alignment : XM:i:1
With one ambiguous base in the reference: XN:i:1
Scenario IV, a perfect case
NNNNNNNNNNNTGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN pppppppppppTGAGGTAGTAGGTTGTATAGTT
With no base off : NM:i:0
With no mismatch in the alignment : XM:i:0
With no ambiguous base in the reference: XN:i:0
Protected: Yuxia’s project — aligning TQE_14_24 data against miRBase matured miRNA
Tissue specific expression
A request from David Fargo
This is for Jack. They have population (~1,000 samples) 27K methyl chip data and want to relate changes in those data to changes in expression using public data. They are interested in different tissues as well as developmental programs (e.g. stem cells). They’ve classified the methyl data as hypo- and hyper-methylation changes as well as CpG island associated and not associated. This was his note: Here's the gene list and their IDs that are covered on the 27K array. I guess the two things I can think of would be tissue-specific expression levels (perhaps tricky because one would want good comparability between the methods used for the different tissues, and also some developmental program expression data (since some these may only come on at specific times during early development and then only briefly). The later may be the toughest to find. If you can get us that data, we could then look at the different subsets of genes that we're interested in.
David found out the following:
Tissue-specific Gene Expression and Regulation (TiGER) at Wilmer Institue a paper or can we get these a paper or Human Body map might be good?? Perhaps ideally would be the data in gene cards (biogps) for all genes.
My response was posted within a day from the request:
Hi, David & all,
First of all, I have to admit that I have very limited knowledge about the tissue specific expression in general. I just followed the links David provided and here is what I understood so far:
TiGER database was built off two of their previous papers. But, what they called “tissues specific expression” was really based upon the “published EST sequences” deposited in the public domain (i.e. NCBI). And they applied some statistical assessment to determine whether some “ESTs” were significantly “represented” or skewed based on their empirical observation and/or collection. It can be easily skewed toward some favorite tissues that researchers had published thus far. Use an analogy of P53 pathway as an example. Almost all microarray expression results will hit P53 from pathway analysis, just because this pathway has been studied a lot!! So, I would not suggest this database. One good part of the source is their two earlier paper, when they tried to find CRM that associated with the “tissues specific expression”. The framework is quite portable, at least provides a starting point for researching the association between “methylation status” and “expression”.
TiSGeD was also based on a collection from publicly available microarray experiment, but provided very vague definition of “well curated publicly” good data. Plus (as you indicated) it has a dead link anyway.
Shyamasundar, R. et al paper on Genome Biology (2005) was a good pick. They literally did 115 microarray experiment, and data are available on GEO (GSE2193).
The body map project at Ensembl is the most advanced source, as they extended the common microarray platform to RNAseq. The only caveat is their gene model, which was okay according to their pilot study with Zebra fish. Still, it is cross species, who knows whether their model holds well. There are raw sequences (pair end RNAseq) available and a live database. The best part is that the database is “supposed” searchable (although need some learning curve with their API).
I agree with you that “Weizmann Institute of Science” was a great resource, since they had evidence from three platforms concurrently confirming the “tissue specific” expression pattern. However, it creates its own problem, what if they do NOT agree with each other? Which one does one believe? I believe that their database is also searchable, but it definitely needs more work and people may need to acquire academic license or something. One company LifeMap Sciences, Inc must have acquired a license from them.
My take on is:Tier one: Download the microarray GEO (GSE2193), published by Shyamasundar, R. et al and build a Gene x Tissue matrix. Maybe get more data as explained by TiSGeD.
Possibly test out the methodologies explained in Yu, X et al BMC Bioinformatics (2006) and Yu, X et al NAR (2006)
Try to append the quantifiable “methylation status”, “CpG status”, or “other relevant information” to the subset of the above mentioned matrix (by tissue), try to run “association like” analysis to get the “significant information”.Tier two, try the same route with Human Body Map RNAseq “expression” data.
Lastly, try to contact Weizmann Institute of Sciences (with a bag of $ ) to see whether we can get a license of their database and some advice how to search the database. This is really a god resource.
Regarding Jack’s comment on the tissue for experiment acquired at different development stage, I guess that we can only do whatever we have and we don’t have any control over this. Make sense?
This is just my 2 cents, and I really don’t know how complex it can get.
Jianying Li
Learning cutadapt
A good debate over “cutadapt” vs. “fastx-toolkit”
It turns out cutadapt has a good user community. More details can be found here.
Adam’s commands:
cutadapt -m 1 -e 0 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_0mismatch.fastq & cutadapt -m 1 -e 0.056 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_1mismatch.fastq & cutadapt -m 1 -e 0.11 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_2mismatch.fastq & cutadapt -m 1 -e 0.167 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_3mismatch.fastq &
So, based on cutadapt documentation:
-m : minimum length after trimming the adapter, default is 0 -e : number of errors divided by the length of matched base. So, if 31 bases matched with five mismatches in a 36 bp long read, it is 5/31 ~= 0.16
A few years later, cutadapt has much comprehensive user guide
Weird Picard error and solution
While working on the miRNA project, I encountered weird error with Picard, sam_bam convert and even remove duplicate. I thought that it would be caused from jave security problem. In the end, it turned out that it was cause by “duplicate miRBase IDs”
grep hsa-let-7c /ddn/gs1/home/li11/refDB/miRBase/hsa_combined.fa >hsa-let-7c MIMAT0000064 Homo sapiens let-7c >hsa-let-7c MI0000064 Homo sapiens let-7c stem-loop
I have to replace all “spaces”.
Now, with WES project, I am using Picard to remove the duplicate. Here are some good postes
1. People are getting high duplication in WES 2. How to determine removed reads for PE sample 3. A thorough Picard documentation 4. Remove or not remove?? 5. Duplication metric in details
Why Picard Markduplicate does not produce histogram
1. First post concludes that it is caused by multiple-groups
, but I do NOT think so.
Protected: miRNA quantification
Protected: My awk tricks
BWA alignment faliure
It occurred when I tried to align miRNAseq data to miRBase. The raw sequence was pre-processed by removing the RNA PCR Primer sequence using fastx-toolkit. The error message is like the following:
[bns_restore_core] fail to open file '/ddn/gs1/home/li11/refDB/miRBase/hsa_combined.fa.nt.ann'. Abort!
What does it mean?
I think the error occurred owing to “no” alignment being reported. As the miRBase is too short and no aligner will be able to handle the following case:
The number of 'N' characters to append at the 5'-end and at the 3'-end of each entry. This may be useful for mapping as most mappers can not match reads which start before the reference sequence entry. For example
Mapping to cel-let-7 MIMAT0000001 Caenorhabditis elegans let-7
TTGAGGTAGTAGGTTGTATAGTTA ..read
TGAGGTAGTAGGTTGTATAGTT ..mirbase entry
This sequence can not be mapped to the corresponding mirbase entry with any Mapper. When using GEM instead it is possible to map this read with two mismatches. It is however necessary to append several 'N's at the 5'-end and 3'-end first. For example:
Mapping to cel-let-7 MIMAT0000001 Caenorhabditis elegans let-7
TTGAGGTAGTAGGTTGTATAGTTA ..read
NNTGAGGTAGTAGGTTGTATAGTTNN ..mirbase entry
Now the read could be mapped to the reference sequence (let-7). default=0
Solutions
Step 1 Count the shortest miRbase sequence Step 2 Append Ns to both 5' and 3' ends
I had alignment from BWA, I noticed very strange phenomena. With sam format -flag 4, it still showed “match” in cIGAR string:
64NDWAAXX:5:079:19712:16612 4 hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p 45 25 14M * 0 0 CCATGACCGCTCTG 0A+>:<*A>?>A X0:
i:1 X1:i:0 MD:Z:9G4 XG:i:0 NM:i:1 XM:i:1 XN:i:1 XO:i:0 XT:A:U
Here is what is true:
[li11@ehscmplp09/bioinfo1 perlLibs]$ perl /ddn/gs1/home/li11/perlLibs/seqManip.pl /ddn/gs1/home/li11/refDB/miRBase/hsa_mature_ready_to_map.fa hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p Output: Seen sequence hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p, with seq as NNNNNNNNNNNAGTTTTGCAGGTTTGCATCCAGCNNNNNNNNNNN
And,
[li11@ehscmplp09/bioinfo1 perlLibs]$ perl searchFastqSeq.pl ~/project2013/miRNAseq/processedSeq/TQE_14_24/ES583_trimmed_1mismatch.minQS20.fastq.14_24.fastq @64NDWAAXX:5:079:19712:16612 Output: Seen sequence @64NDWAAXX:5:079:19712:16612, with seq as CCATGACCGCTCTG