Before starting the re-alignment effort in full force, I would let David and the SVA users know that, besides a new reference sequence, there will be a Samtools upgrade.
Here are some easy checks I would recommend.
(A) Comparison to SVA routines
- “pipeline concordance” with concordance computed with SVA
- average coverage in the pipeline with coverage computed with SVA
We’ve done this for the genome samples but not yet for the exome samples.
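As a rough illustration of the two comparisons in (A), here is a minimal Python sketch. It assumes the per-base depth comes from `samtools depth` output (tab-separated chromosome, position, depth) and that genotype calls from each source have already been loaded into dictionaries keyed by site. The function names, the 2% tolerance, and the definition of concordance as the fraction of shared sites with identical genotypes are my assumptions, not what the pipeline or SVA actually does.

```python
from typing import Dict, Iterable, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def mean_depth(depth_lines: Iterable[str]) -> float:
    """Average depth over the positions reported in `samtools depth` output.

    Each line is expected to be tab-separated: chromosome, position, depth.
    Positions that are not reported do not contribute to the average.
    """
    total = 0
    n = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        total += int(fields[2])
        n += 1
    return total / n if n else 0.0


def concordance(calls_a: Dict[Site, str], calls_b: Dict[Site, str]) -> float:
    """Fraction of sites called in both sets whose genotypes agree.

    This definition is an assumption made for illustration only.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0.0
    agree = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return agree / len(shared)


def within_tolerance(x: float, y: float, rel_tol: float = 0.02) -> bool:
    """True if x and y differ by at most rel_tol relative to the larger value."""
    return abs(x - y) <= rel_tol * max(abs(x), abs(y))
```

In practice one would feed `mean_depth` the lines of a depth file for a sample, then pass the result and the SVA-reported coverage for the same sample to `within_tolerance`, and flag any sample where the check fails.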
(B) Consider emailing the SVA users to ask which basic checks we could run, or asking them to run some checks themselves on a few exome samples aligned with the new build.
(D) Consider asking Mingfu whether his Erds results (genome) look reasonable. Erds uses the SNVs generated in the pipeline.
(E) You might also consider moving up to a newer version of Samtools.
The most recent is 0.1.17.
From what I’ve read, I would at least go up to version 0.1.13.
Listed below are some of the notable changes I’ve extracted from this file (http://sourceforge.net/projects/samtools/files/samtools/0.1.17/).
* bug fixes
* pileup command dropped
* Bugfix: some reads without coordinates but given on the reverse strand are lost in merging.
* Added the `depth' command to samtools to compute the per-base depth with a simpler interface. File `bam2depth.c', which implements this command, is the recommended example on how to use the mpileup APIs.
* Added `samtools mpileup -L' to skip INDEL calling in regions with excessively high coverage. Such regions dramatically slow down mpileup.
* The most important though largely invisible modification is the change of the order of genotypes in the PL VCF/BCF tag. This is to conform the upcoming VCF spec v4.1. The change means that 0.1.13 is not backward compatible with VCF/BCF generated by samtools older than r921 inclusive. VCF/BCF generated by the new samtools will contain a line `##fileformat=VCFv4.1' as well as the samtools version number.
* Construct per-sample consensus to reduce the effect of nearby SNPs in INDEL calling. This reduces the power but improves specificity.
* Fixed an integer overflow in INDEL calling. This bug produces wrong INDEL genotypes for longer short INDELs, typically over 10bp.
* Fixed an out-of-boundary bug in mpileup when the read base is `N'.
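Since the PL genotype-order change is the item most likely to affect downstream tools, here is a small Python sketch of the ordering defined by the VCF v4.1 spec, where the likelihood for genotype j/k (with j <= k) sits at index k(k+1)/2 + j. It also includes a check for the `##fileformat=VCFv4.1' header line mentioned in the list. The function names are hypothetical and this is not samtools code. It only describes the v4.1 ordering and says nothing about what older samtools versions emitted.

```python
from typing import Iterable, List, Tuple


def pl_index(j: int, k: int) -> int:
    """Index of genotype j/k in the PL field under VCF v4.1 ordering."""
    if j > k:
        j, k = k, j  # genotypes are unordered, so normalise to j <= k
    return k * (k + 1) // 2 + j


def pl_order(n_alleles: int) -> List[Tuple[int, int]]:
    """Genotypes in the order their likelihoods appear in PL (VCF v4.1)."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


def has_vcf41_header(lines: Iterable[str]) -> bool:
    """True if any meta-information line declares fileformat VCFv4.1."""
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines come first; stop at the header row
        if line.strip() == "##fileformat=VCFv4.1":
            return True
    return False
```

For a site with one reference and two alternate alleles, `pl_order(3)` gives 0/0, 0/1, 1/1, 0/2, 1/2, 2/2. A quick sanity check on a few VCFs from the upgraded pipeline would be to confirm the header line is present before any downstream tool reads the PL values.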