Test-drive assessment on als9c2

Following the assessment thread done earlier , doing the real assessment on als9c2 with the focus on the following metrics.

1. ti/tv ratio

2. number of raw and filtered SNV

3. number of indel called by samtools

4. proportion of heterozygous SNVs on the x chr compared to the total number of SNVs in the X

5. concordance of SNVs with genotyping chip computed with SVA

6. overlap of filtered autosomal SNVs with dbsnp

7. Ratio of homo to hetero SNVs


For raw bam files

Here are the major steps

Input files: a bam file (with or without RG and/or recalibration)
Systems involved: dscr & lsrc
Software involved: samtools

Step One: get variant call file (samtools mpileup –> vcf format)

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/combined_rmdup.bam
3. OutputDir: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/
4. Submit job: qsub /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/combined_rmdup.bam /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/

(This has normally been done for raw bam file in the existing pipeline)

There has been error complaining about samtools mpileup, it turns out that it is error with read group information! Well, re-did the “add read group” step, still NOT working!!

samtools: bam_plcmd.c:596: group_smpl: Assertion `id >= 0 && id n’ failed.

On the recalibrated bam , no_RG case file (without RG due to error)

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh
2. Input: /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam
3. OutputDir: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/
4. Submit job: qsub /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/
5. Sent May 18th, 2011 @10:25 a.m.; finished by May 23rd @4:49 a.m.

Unfortunately, there has been errors with process, ended up in variant call truncated at chromosome 9! Could be “disk space issue”.

Comparing size of file1:

For no_RG case:
[jl407@head4 noRG_reCal]$ ls -al file1
-rw-r–r– 1 jl407 chavi 264284334703 May 23 04:49 file1

For raw bam case:
[jl407@head4 noRG_reCal]$ ls -al /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/file1
-rw-r–r– 1 jl407 chavi 264495324652 May 4 08:52 /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/file1

Compare bam file:
[jl407@head4 noRG_reCal]$ ls -al /nfs/chgv/seqanalysis2/GATK/recalBam/
total 157200748
drwxr-xr-x 2 jl407 chavi 4096 May 18 11:39 .
drwxr-xr-x 11 jl407 chavi 4096 May 23 15:11 ..
-rw-r–r– 1 jl407 chavi 160649148678 May 18 10:13 als9c2.build37.recal.noRG.bam
-rw-r–r– 1 jl407 chavi 8763904 May 18 11:39 als9c2.build37.recal.noRG.bam.bai
[jl407@head4 noRG_reCal]$ ls -al /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/combined_rmdup.bam
-rw-r–r– 1 jl407 chavi 98071917844 Apr 29 14:23 /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/combined_rmdup.bam

Need to re-run this, starts on May 24th, @ 4:00 p.m.
Step two: Split vcf files by snps and indels

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/runningDir/genome_vcf_files, this file contains a list of var_flt_vcf to process.

als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf

For no_RG case:

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/runningDir/genome_vcf_files, this file contains a list of var_flt_vcf to process.

als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf

Step three: Filter vcf variants

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/runningDir/vcf_snp_filelist snp

vcf_snp_filelist contains a list of var_flt_vcf.snp to process.
als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp

For no_RG case:

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/runningDir/vcf_snp_filelist snp

vcf_snp_filelist contains a list of var_flt_vcf.snp to process.
als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp

Result: should be in the following format, two files with the same name as your variant file will be created:
.filtered (variants that passed QC threshold)
.filtered.out (variants which failed QC)

Step four: ti/tv ratio

Dependency:

Java path: export PATH=/home/csem/tm103/DSCR/jre1.6.0_13/bin:$PATH
GATK package: gatk_dir=/nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083
Reference DB (build 37): reference=/nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta

Commands to issue

1. Samples ( als9c2) dir:/nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/
2. Script :/nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Output: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/gatk_titv.csv
5. Command: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/

For no_RG case: Changes in Italic

1. Two input information:
a. Samples ( als9c2) dir:/nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/
b. Output file name: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/als9c2.build37.noRG.recal.gatk_titv.csv
2. Script :/nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Command: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/ als9c2.build37.noRG.recal.gatk_titv.csv

FIXME: this output needs to be “sample” oriented!!!, i.e. als9c2.build37.precal.gatk_titv.csv

Step five: Homozygous/heterozygous SNV ratio

Check to see if the genotype is listed in the 10th column in the VCF file

Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl
Usage: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl vcf_file > printTO
Procedure: cd to “/nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/” and issue command: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp.filtered > als9c2.build37.precal.homhet.snp.ratio.hets

For noRG_recal case:

Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl
Usage: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl vcf_file > printTO
Procedure: cd to “/nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/” and issue command: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp.filtered > als9c2.build37.noRG.recal.homhet.snp.ratio.hets

Step six: Concordance check with SVA

(1) After the “variant call” in step one, generate bco files in the DSCR

Use the following 2 lines from this script
/nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco.sh:


Modify all the parameters, including Java path, samtools location, reference DB, etc.

export PATH=/home/csem/tm103/DSCR/jre1.6.0_13/bin:$PATH
sam_dir=/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a
reference_fasta_file=/nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta

If it runs manually, a few subdirectories need to be created properly.

a. bco_dir=$combined_dir/bco
b. $bco_dir/$sample
c. seqsata_sample_dir=/nfs/testrun/GATK-recal/assessmentComp/samples/$sample (for the results to be transferred back to our sva clusters).


Example command:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco_build37.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/ /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal

/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/bcftools/bcftools view -cg /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//file1 > /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bcftools_file
java -jar /nfs/chgv/seqanalysis2/GATK/scripts/vcf2bco.jar /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bcftools_file /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bco/
rsync -a –partial –timeout=10000 -e ‘ssh’ -r /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bco sva6.lsrc.duke.edu:/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal

For noRG_case:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco_build37.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/ /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal

/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/bcftools/bcftools view -cg /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//file1 > /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bcftools_file

java -jar /nfs/chgv/seqanalysis2/GATK/scripts/vcf2bco.jar /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bcftools_file /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bco/

rsync -a –partial –timeout=10000 -e ‘ssh’ -r /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bco sva6.lsrc.duke.edu:/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal

At last step, those bco files should be transferred back to LSRC

(2) Create an SVA project (meaning get a .gsap file created)
Since this is alignment against build 37, there are many changes to the sva project. Thanks to Dongliang, who helped me with the set ups:

Currently,
The new build is existing at: /nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1_b37/

The .gsap file needs to be modified accordingly:

1. Menu File -> Open example project file
2. Copy all the lines after line “#Reference database version”
3. Replace the lines in the same place in your build 36 .gsap project file 4. Remove all the starting “#” in the “[REFBIN]=” lines. Because your project has all chromosomes, while the example only has X.
4. At last, Dongliang had to modified the permission for ./datasource/ensembl/61_37f/variation_feature2.txt

MODIFIED .gsp file!! May 18th, 2011

.GSAP file for SVA project: /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal/vcf_als9c2_precal.gsap


(3) Topstrand file names for als9c2:

/nfs/testrun/GATK-recal/svaprojects/vcf/genome/topstrandfilenames (Illumina created/provided this file)

/nfs/testrun/GATK-recal/svaprojects/vcf/genome/topstrandfilenames, the content in this file is as following.
SAMPLE_ID GENOTYPING_ID TopStrand_files
als9c2, als009c2, /nfs/sva01/SequenceAnalyses/TopStrandFiles/als006-015_10samples__MAY_4_10_TopStrand.csv

(4) Use this version of SVA: (It is about 120 mitues.

/nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1_b37/

(Make sure to uncheck the filtering option in sva)

With Xwin, I was able to ssh to sva0.lsrc.duke.edu and perform the “annotation” . Started @7:30 a.m. May 19th, 2011.

(5) Check concordance:

SVA version: /nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1_b37/

Case I: Once “annotate” is done, go directly to the following steps to get concordance.

Step 1. From the SVA UI, choose “summary -> check SNP quality using genotyping dataset”
Step 2: Filling in information for four fields, information is stored at

Four fields
Genoptying final report: //nfs/sva01/SequenceAnalyses/TopStrandFiles/als006-015_10samples__MAY_4_10_TopStrand.csv
sample ID (sequencing): als9c2
sample ID (genotyping): als009c2
output check report to: /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal

Step 3: Check “use the SNP list derived from the loaded final report”
Step 4: Click “Create”

Case II: If the project is closed, load the project by clicking “load” button and go back to the steps mentioned above.

Results:

For pre-calibration case

Performing quality check using genotype data (user SNP list).
> Sample ID (sequencing): als9c2
Sample ID (genotyping): als009c2
Note: 584882 SNPs are loaded from user SNP report.
– Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
– 297962 SNPs (with rs# listed in genotype data (user SNP list) are loaded from your resequencing project.
– 148881 SNPs on bottom strand (Illumina).
– Among these, 296597 SNPs are found in genotype data (user SNP list) .
– Among the 297962 SNPs loaded from resequencing project, there are:
294755 match with genotype data (user SNP list);
1842 mismatch;
1365 not found in genotype data (user SNP list).
– Among the 1842 mismatches, there are
21 hom (seq) – hom (geno)
32 hom (seq) – het (geno)
18 het (seq) – het (geno)
1771 het (seq) – hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.nomatch.txt
[Part 2] N = 286770
– Among the 286770 SNPs included in the genotyping chip but not derived from sequencing, there are:
286767 that can be correctly mapped to reference sequence;
Among these 286767 there are 281402 matches;
and 5365 mismatches, including 3699 het (geno) and 1666 hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.part2.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.part2.nomatch.txt

Results for recalibrated results

Performing quality check using genotype data (user SNP list).
> Sample ID (sequencing): als9c2
Sample ID (genotyping): als009c2
Note: 584882 SNPs are loaded from user SNP report.
– Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
– 299193 SNPs (with rs# listed in genotype data (user SNP list) are loaded from your resequencing project.
– 149493 SNPs on bottom strand (Illumina).
– Among these, 297794 SNPs are found in genotype data (user SNP list) .
– Among the 299193 SNPs loaded from resequencing project, there are:
296497 match with genotype data (user SNP list);
1297 mismatch;
1399 not found in genotype data (user SNP list).
– Among the 1297 mismatches, there are
21 hom (seq) – hom (geno)
32 hom (seq) – het (geno)
20 het (seq) – het (geno)
1224 het (seq) – hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.nomatch.txt
[Part 2] N = 285979
– Among the 285979 SNPs included in the genotyping chip but not derived from sequencing, there are:
282538 that can be correctly mapped to reference sequence;
Among these 282538 there are 138818 matches;
and 143720 mismatches, including 2429 het (geno) and 141291 hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.part2.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.part2.nomatch.txt

As the results show, we see improvement on the concordance with both part I and part II. And, it is more important that we see improvement with part II.

So the question exists, why/what causes low concordance check with genotyp chip results???


Step seven: Overlap of filtered autosomal SNVs with dbsnp

Here we need a little bit more attention. Since we are trying to compare analysis against build 37, we need snp information for build 37. It is available at NCBI/dbsnp

Download dbsnp file as following:

1. Download b132_SNPChrPosOnRef_37_1.bcp.gz
2. Check download completeness
[jl407@head4 refDB]$ md5sum b132_SNPChrPosOnRef_37_1.bcp.gz
07acf58908cf857b62e9de99a1d56533 b132_SNPChrPosOnRef_37_1.bcp.gz
[jl407@head4 refDB]$ cat b132_SNPChrPosOnRef_37_1.bcp.gz.md5
MD5(b132_SNPChrPosOnRef_37_1.bcp.gz)= 07acf58908cf857b62e9de99a1d56533

1. Samples (info) located: /nfs/chgv/seqanalysis2/GATK/doc/snplistFile
2. Script (dbsnp_overlap.pl) located: /nfs/chgv/seqanalysis2/GATK/scripts/
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Command: qsub /nfs/chgv/seqanalysis2/GATK/scripts/dbsnp_overlap_b37.pl /nfs/chgv/seqanalysis2/GATK/doc/snplistFile
5. Results were written to log file

For no_RG case:

The only change is where the var_flt_vcf.snp.filtered is located:

als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp.filtered
als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp.filtered

Here is the concordance checking results: concordance_als9c2_b37

July 19th, 2011

Wanted to test “using 1M vs snp loaded from the project”.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.