Understand BWA SAM file format — to get the “perfect match” alignment

A good wiki site to start off with

The same question has been asked before. Answer? don’t know

Another post on seqanswers comparing different aligners.

SAM file tag format numeric tag explained.

And, picard provides online sam tag check

SAM format SAM_format helps to extract “perfect match”.

With special tags to BWA alignment here

I got help from our collaborator, Florian my own note

To get all flag counts:

samtools view  ES583_miRbaseMature_bwa.bam | grep -v "@" | awk -F"\t" 'BEGIN{print "flag\toccurrences"} {a[$2]++} END{for(i in a)print i"\t"a[i]}'
flag    occurrences
4       2109951
20      1096
0       1190274
16      10244

Now, let’s focus on flag 0

Scenario I

NNNNNNNNNNN TGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN
pppppppppppGTGAGGTAGTAGGTTGTATAGTT

With one base off : NM:i:1
With one mismatch in the alignment : XM:i:1
With one ambiguous base in the reference: XN:i:1

Scenario II, same as I, but reported differently by BWA

NNNNNNNNNNN TGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN
pppppppppppGTGAGGTAGTAGGTTGTATAGTT (in scenario I)
pppppppppppTTGAGGTAGTAGGTTGTATAGTT

With no base off : NM:i:0
With no mismatch in the alignment : XM:i:0
With one ambiguous base in the reference: XN:i:1

Scenario III, same as I, but shifted to the right

NNNNNNNNNNN TGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN
pppppppppppGTGAGGTAGTAGGTTGTATAGTT (in scenario I)
ppppppppppp TGAGGTAGTAGGTTGTATAGTTT

With one base off : NM:i:1
With one mismatch in the alignment : XM:i:1
With one ambiguous base in the reference: XN:i:1

Scenario IV, a perfect case

NNNNNNNNNNNTGAGGTAGTAGGTTGTATAGTTNNNNNNNNNNN
pppppppppppTGAGGTAGTAGGTTGTATAGTT

With no base off : NM:i:0
With no mismatch in the alignment : XM:i:0
With no ambiguous base in the reference: XN:i:0

Tissue specific expression

A request from David Fargo

This is for Jack.  They have population (~1,000 samples) 27K methyl chip data and want to relate changes in those data to changes in expression using public data. They are interested in different tissues as well as developmental programs (e.g. stem cells).  They’ve classified the methyl data as hypo- and hyper-methylation changes as well as CpG island associated and not associated.

This was his note:

Here's the gene list and their IDs that are covered on the 27K array.  I guess the two things I can think of would be tissue-specific expression levels (perhaps tricky because one would want good comparability between the methods used for the different tissues, and also some developmental program expression data (since some these may only come on at specific times during early development and then only briefly).  The later may be the toughest to find.  If you can get us that data, we could then look at the different subsets of genes that we're interested in.

David found out the following:

Tissue-specific Gene Expression and Regulation (TiGER)  at Wilmer Institue 

 a paper 


or can we get these

 a paper 

or

 Human Body map might be good?? 





Perhaps ideally would be the data in  gene cards (biogps) for all genes.


My response was posted within a day from the request:

Hi, David & all,

First of all, I have to admit that I have very limited knowledge about the tissue specific expression in general. I just followed the links David provided and here is what I understood so far:

TiGER database was built off two of their previous papers. But, what they called “tissues specific expression” was really based upon the “published EST sequences” deposited in the public domain (i.e. NCBI). And they applied some statistical assessment to determine whether some “ESTs” were significantly “represented” or skewed based on their empirical observation and/or collection. It can be easily skewed toward some favorite tissues that researchers had published thus far. Use an analogy of P53 pathway as an example. Almost all microarray expression results will hit P53 from pathway analysis, just because this pathway has been studied a lot!! So, I would not suggest this database. One good part of the source is their two earlier paper, when they tried to find CRM that associated with the “tissues specific expression”. The framework is quite portable, at least provides a starting point for researching the association between “methylation status” and “expression”.

TiSGeD was also based on a collection from publicly available microarray experiment, but provided very vague definition of “well curated publicly” good data. Plus (as you indicated) it has a dead link anyway.

Shyamasundar, R. et al paper on Genome Biology (2005) was a good pick. They literally did 115 microarray experiment, and data are available on GEO (GSE2193).

The body map project at Ensembl is the most advanced source, as they extended the common microarray platform to RNAseq. The only caveat is their gene model, which was okay according to their pilot study with Zebra fish. Still, it is cross species, who knows whether their model holds well. There are raw sequences (pair end RNAseq) available and a live database. The best part is that the database is “supposed” searchable (although need some learning curve with their API).
I agree with you that “Weizmann Institute of Science” was a great resource, since they had evidence from three platforms concurrently confirming the “tissue specific” expression pattern. However, it creates its own problem, what if they do NOT agree with each other?  Which one does one believe? I believe that their database is also searchable, but it definitely needs more work and people may need to acquire academic license or something. One company LifeMap Sciences, Inc must have acquired a license from them.
My take on is:

Tier one: Download the microarray GEO (GSE2193), published by Shyamasundar, R. et al and build a Gene x Tissue matrix. Maybe get more data as explained by TiSGeD.
Possibly test out the methodologies explained in Yu, X et al BMC Bioinformatics (2006) and Yu, X et al NAR (2006)
Try to append the quantifiable “methylation status”, “CpG status”, or “other relevant information” to the subset of the above mentioned matrix (by tissue), try to run “association like” analysis to get the “significant information”.

Tier two, try the same route with Human Body Map RNAseq “expression” data.

Lastly, try to contact Weizmann Institute of Sciences (with a bag of $ ) to see whether we can get a license of their database and some advice how to search the database. This is really a god resource.

Regarding Jack’s comment on the tissue for experiment acquired at different development stage, I guess that we can only do whatever we have and we don’t have any control over this. Make sense?

This is just my 2 cents, and I really don’t know how complex it can get.

Jianying Li

Learning cutadapt

A good debate over “cutadapt” vs. “fastx-toolkit”
It turns out cutadapt has a good user community. More details can be found here.

Adam’s commands:

cutadapt -m 1 -e 0 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_0mismatch.fastq &
cutadapt -m 1 -e 0.056 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_1mismatch.fastq &
cutadapt -m 1 -e 0.11 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_2mismatch.fastq &
cutadapt -m 1 -e 0.167 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG -O 5 /ddn/gs1/project/nextgen/post/paules/NG-104-Paules-Cui/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq > ES565_trimmed_3mismatch.fastq &

So, based on cutadapt documentation:

-m : minimum length after trimming the adapter, default is 0
-e : number of errors divided by the length of matched base. So, if 31 bases matched with five mismatches in a 36 bp long read, it is 5/31 ~= 0.16

A few years later, cutadapt has much comprehensive user guide

Weird Picard error and solution

While working on the miRNA project, I encountered weird error with Picard, sam_bam convert and even remove duplicate. I thought that it would be caused from jave security problem. In the end, it turned out that it was cause by “duplicate miRBase IDs”

grep hsa-let-7c  /ddn/gs1/home/li11/refDB/miRBase/hsa_combined.fa
>hsa-let-7c MIMAT0000064 Homo sapiens let-7c
>hsa-let-7c MI0000064 Homo sapiens let-7c stem-loop

I have to replace all “spaces”.

Now, with WES project, I am using Picard to remove the duplicate. Here are some good postes

1. People are getting high duplication in WES
2. How to determine removed reads for PE sample
3. A thorough Picard documentation
4. Remove or not remove??
5. Duplication metric in details

Why Picard Markduplicate does not produce histogram

1. First post concludes that it is caused by multiple-groups

, but I do NOT think so.

BWA alignment faliure

It occurred when I tried to align miRNAseq data to miRBase. The raw sequence was pre-processed by removing the RNA PCR Primer sequence using fastx-toolkit. The error message is like the following:

[bns_restore_core] fail to open file '/ddn/gs1/home/li11/refDB/miRBase/hsa_combined.fa.nt.ann'. Abort!

What does it mean?

I think the error occurred owing to “no” alignment being reported. As the miRBase is too short and no aligner will be able to handle the following case:



    The number of 'N' characters to append at the 5'-end and at the 3'-end of each entry. This may be useful for mapping as most mappers can not match reads which start before the reference sequence entry. For example

     Mapping to cel-let-7 MIMAT0000001 Caenorhabditis elegans let-7
     TTGAGGTAGTAGGTTGTATAGTTA ..read
      TGAGGTAGTAGGTTGTATAGTT  ..mirbase entry

    This sequence can not be mapped to the corresponding mirbase entry with any Mapper. When using GEM instead it is possible to map this read with two mismatches. It is however necessary to append several 'N's at the 5'-end and 3'-end first. For example:

     Mapping to cel-let-7 MIMAT0000001 Caenorhabditis elegans let-7
      TTGAGGTAGTAGGTTGTATAGTTA   ..read
     NNTGAGGTAGTAGGTTGTATAGTTNN  ..mirbase entry

    Now the read could be mapped to the reference sequence (let-7). default=0

Solutions

Step 1 Count the shortest miRbase sequence
Step 2 Append Ns to both 5' and 3' ends

I had alignment from BWA, I noticed very strange phenomena. With sam format -flag 4, it still showed “match” in cIGAR string:

64NDWAAXX:5:079:19712:16612 4 hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p 45 25 14M * 0 0 CCATGACCGCTCTG 0A+>:<*A>?>A X0:
i:1 X1:i:0 MD:Z:9G4 XG:i:0 NM:i:1 XM:i:1 XN:i:1 XO:i:0 XT:A:U

Here is what is true:

[li11@ehscmplp09/bioinfo1 perlLibs]$ perl /ddn/gs1/home/li11/perlLibs/seqManip.pl /ddn/gs1/home/li11/refDB/miRBase/hsa_mature_ready_to_map.fa  hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p 

Output:
Seen sequence hsa-miR-19b-1-5p_MIMAT0004491_Homo_sapiens_miR-19b-1-5p, with seq as NNNNNNNNNNNAGTTTTGCAGGTTTGCATCCAGCNNNNNNNNNNN

And,

[li11@ehscmplp09/bioinfo1 perlLibs]$ perl searchFastqSeq.pl  ~/project2013/miRNAseq/processedSeq/TQE_14_24/ES583_trimmed_1mismatch.minQS20.fastq.14_24.fastq  @64NDWAAXX:5:079:19712:16612

Output:
Seen sequence @64NDWAAXX:5:079:19712:16612, with seq as CCATGACCGCTCTG