Re-calibration with GATK

Please refer to my site for details on cracking the GATK ;

Please refer to my site for details on Learning with the GATK (April 8th, 2011) ;

Preparation:

Java stuffs:

Three copies of GATK executable are located on /nfs/*/SOFTWARE
Java home and jre path were passed in in the scripts.

Input file preparation

Working script: /home/chavi/jl407/GATK-testing/workingDir/gatk_titv_sam12_CountCov.sh (derived from Jessica’s titv script)
Reference file:/nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX/human_ref_36_50.fa
Input file: /nfs/chgv/seqanalysis2/ALIGNMENT/samples/mchd002A2/combined/combined_rmdup.bam
Output file: /home/chavi/jl407/GATK-testing/dataOutput/mchd002A2_gatk_recal.csv

I tried to run a test GATK process with the following command (on dscr),:

java -Xmx4g -jar /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083/GenomeAnalysisTK.jar

-T CountCovariates
–assume_single_sample_reads mchd002A2
-R $reference
-B:eval,VCF $vcf_with_header
-I $inputBAM
–default_read_group mchd002A2
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile $output

Also, I noticed that once I run the CountVariates analysis, the software generates a *.rod.fai file

Now, I am transferring the process to our sva.lsrc cluster, with copy_data.pl drafted by Abanish Singh, it allows coping data from a drive mounted on one unix server to a drive mounted on a remote host using SSH connection.

April 13th, 2010
Now, I was able to test the multi-thread (12 processors on dscr) CountVariates, it seems that it does shorten the computing time, but NOT by 12 times as it claimed from Broad. We’ll see how long it takes.

OK, testing the same process with human dbsnp v4.0, dbsnp-ncbi


Preparing BAM files by adding read group information:

Since our merged BAM files from human genome sequence projects do NOT have the read group information, I have to add those information using the samtools .

A couple of available strategies available to us

1. Going forward, we should add this information by creating an extra step in the existing pipeline. With “-r” option in sampe or samse step.

2. For now, trying to add @RG information into the individual bam fils. Detail instruction has been posted at seqanswers forum site. Broad GATK site also listed the instruction to properly prepare the bam file for base calling score recalibration.

As a trial, using the available individual .bam file from dukscz0106, testing the procedure.

Raw file location: /nfs/testrun/GATK-recal/samples/dukscz0106/precal/Run*
Working directory: /home/jl407/project_2011/GATK_test/workingDir
Script: combine_bam.sh

Well, the trick/mistake here is to “sort” first before “merge”

To verify that the @RG has been incorporated into the combined .bam file, simply use samtools view -h .bam | more


Running recalibration step by step


1. Step one: CountCovariates:

java -jar -Xmx6g GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R $reference
-B:eval,VCF $vcf_with_header
-I $inputBAM
–default_read_group mchd002A2
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile $output
Totally: 8.9 hours

2. Step two: TableCalibration:

java -jar -Xmx6g GenomeAnalysisTK.jar
-T TableRecalibration
-R $reference
-I $inputBAM
–default_read_group mchd002A2
–default_platform illumina
-recalFile $tabRecalFile
–out $recalBAM
*Note,
-outputBam is deprecated, use –out instead
Currently, not supporting parallel, only single thread
Estimated: 22 hours


2.1. Step 2.1: Processing recalibrated BAM file:

Takes approx. 40 minutes for a genome bam file ~ 100GB. If the bam file has been properly sorted (which is the case in our alignment pipeline), a simple command will do the trick:

dir/samtools index sample_recal.bam

Instead of four steps go get a properly indexed bam file

But, need to be sent to the cluster.

3. Step three: CountCovariates (on recalibrated bam file):

java -jar -Xmx6g GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R $reference
-B:eval,VCF $vcf_with_header
-I $inputBAM
–default_read_group mchd002A2
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile $output
Totally: 8.9 hours


4. Step four: AnalyzeCovariates (on raw bam file):

java -jar -Xmx6g $gatk_dir/AnalyzeCovariates.jar
-outputDir $outputDir
-resources $resourcesDir
-Rscript $RDir (/opt/R271/bin/R) , we have to provide this Rscript parameter.
-recalFile $tabRecalFile


5. Step five: AnalyzeCovariates (on recalibrated bam file):

java -jar -Xmx6g $gatk_dir/AnalyzeCovariates.jar
-outputDir $outputDir
-resources $resourcesDir
-Rscript $RDir (/opt/R271/bin/R) , we have to provide this Rscript parameter.
-recalFile $tabRecalFile

!! Need to fix the R code for assessment plots

Assessing the gain from base calling score recalibration

Criteria:

Broad QC metrics:

1. QQ plot
2. RMSE report
3. Q30 distribution
4.

These steps have been tested and they are working properly. Now, we need to incorporate these into new pipeline with recalibration.

Jessica’s comparison on samtools version:

1. Number of raw/filtered SNP
2. Overlap of filtered autosomal SNVs with dbSNP
3. Ration of homo to hetero SNVs
4. Proportion of hetero on duckscz0106
5. Concordance of SNVs with genotyping chip computed with SVA
6. Ti/Tv ratio (check)
7. Number of indels
8. Indel genome sensitivity
9. Indel Positive Predictive Value (ppv)
10. SNV genotype sensitivity
11. SNV genotype ppv

Calling variant with samtools v0.1.12a

Using mchd002A2 as an example:

1. script: /home/chavi/jl407/GATK-testing/workingDir/call_var_from_bam.sh
2. input bam: /nfs/chgv/seqanalysis3/GATK/samples/mchd002A2/postCal/mchd002A2_combined_recal.bam
3. output directory:/nfs/chgv/seqanalysis3/GATK/samples/mchd002A2/postCal
4. logfile: /home/chavi/jl407/GATK-testing/log/calling_var_samtools12.out/err

A little warning from Jessica: it can take 12 – 24 hours! In fact, it takes up to 4 days!!

Starting from here, a series of assessment will be carried out.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.