Test run on NA12878

Refer to the page on work@chgv

Without ReadGroup

There has been an error with adding the readgroup information, therefore, I am testing this procedure just like I did with earlier trial (without read group information)

Getting ready for NA12878 sampel recalibration:

Step 1: making symbolic links to raw bam file and raw bai file
Step 2: Send this script and start the process raw bam –> cntcor.csv –> recal bam –> index recal bam –> recal.cntcor.csv
Step 3: send the following command:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/GATK_recal.pl na12878 /nfs/chgv/seqanalysis2/GATK/rawBam/na12878.build37.raw.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.raw.bam.noRG.CntCov.csv /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.recal.bam.noRG.CntCov.csv

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083//GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/rawBam/na12878.build37.raw.bam
–default_read_group na12878
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.raw.bam.noRG.CntCov.csv

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083//GenomeAnalysisTK.jar
-T TableRecalibration
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-I /nfs/chgv/seqanalysis2/GATK/rawBam/na12878.build37.raw.bam
–default_read_group na12878
–default_platform illumina
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.raw.bam.noRG.CntCov.csv
–out /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam

/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/samtools index /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083//GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam
–default_read_group na12878
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.recal.bam.noRG.CntCov.csv

Help needed from Jessica

Since Jessica has compared 1000 genome sample obtained locally to snp released from 1000 Genome sites, I would like to learn what she had gone through. The following would get me started:

1. Where is snp file stored locally
2. Assuming we have gone through the entire process, we should have:
a. ram bam (aligned to build36) with index
b. output from call_variant script using “samtools12”, maybe a copy from “samtools7”
3. Since Jessica does not have a way to compare vcf format directly, she converted VCF format SNP back to pileup format
a. a script that does this conversion
b. where are the output from the conversion
4. Assuming Jessica has converted “1000 genome vcf snp” to “pileup” format
a. where are the pileup format 1000 genome snp file
5. Comparing script, that compares two pileup snp files

Jessica’s response

Hi Jianying,

These are the scripts I used for SNV comparison. I don’t have a routine that compares two VCF files and my conversion from VCF to pileup is not perfect.

This script converts a VCF SNV file to pileup format:
/nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/evaluation/vcf_topileup_snps.pl
The genotype column in the VCF file is hard coded as 9 (see 5th line in the code). You can change this accordingly.

Comparing two SNV files in pileup format:
I used this .sh script to do the SNP file comparison /nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/evaluation/snp_file_comparison.sh
This is the perl script that is called inside the .sh script:
/nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/evaluation/compare_snp_files.pl

Jessica

Another email:

Hi Jianying,

Here are the answers to your questions:
1. Where is 1000 genome snp file stored locally? If we LOST it,
which file we should re-download from 1000 genome?
I think this file might have been deleted soon after Sam Dickson left.
I would recommend re-downloading it.

2. Assuming we have gone through the entire process, we should have:
a. Raw bam (aligned to build36) with index and c. Output files from “samtools7″ (in pileup format) This is located here: /nfs/seqsata01/ALIGNMENT/BWA_OUTPUT/na12878/combined

b. Output files from call_variant script using “samtools12″ (in vcf format) /nfs/chgv/seqanalysis/ALIGNMENT/QC/samtools_version/sam12/genome/na12878/var_flt_vcf

3. Since we do not have a way to compare vcf format directly,
Jessica converted VCF format SNP back to pileup format a. The script that does this conversion Sent in my previous email this morning.

4. Assuming Jessica has converted “1000 genome vcf snp” to “pileup” format
a. Where are the pileup format 1000 genome snp file Please use my script to do re-convert the VCF into pileup format.

5. Comparing script, that compares two pileup snp files, and
report that it generated.
Sent in my previous email this morning.

Please let me know if you have any additional questions.

Jessica

Help from Dongliang to download 1000 genome snp file in vcf format:

Here are the links to the download:

I downloaded: CEU.trio.2010_03.genotypes.vcf.gz to /nfs/testrun/GATK-recal/snp_1000G_b36

Here are the first 20 lines:

[jl407@sva snp_1000G_b36]$ head -20 CEU.trio.2010_03.genotypes.vcf
##fileformat=VCFv4.0
##INFO=
##INFO=
##INFO=
##INFO=
##reference=human_b36_both.fasta
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878
1 52066 rs28402963 T C . PASS AA=C;DP=84 GT:GQ:DP 1/0:44:23 1/0:43:20 1/0:70:36
1 695745 . G A . PASS AA=.;DP=124 GT:GQ:DP 1|0:100:34 0|0:62:20 1|0:100:56
1 742429 rs3094315 G A . PASS AA=g;DP=132;HM2 GT:GQ:DP 1|1:100:38 1|1:59:30 1|1:100:44
1 742584 rs3131972 A G . PASS AA=a;DP=160;HM3 GT:GQ:DP 1|1:100:50 1|1:100:33 1|1:100:60
1 744366 rs3115859 G A . PASS AA=g;DP=127 GT:GQ:DP 1|1:80:31 1|1:100:34 1|1:100:45
1 746243 rs3131963 T A . PASS AA=t;DP=105 GT:GQ:DP 1|1:52:29 1|1:43:24 1|1:100:56
1 746775 rs6699990 A G . PASS AA=N;DP=120 GT:GQ:DP 0|1:100:34 0|0:89:30 0|0:100:46
1 747503 rs3115853 G A . PASS AA=-;DP=113 GT:GQ:DP 1|1:100:37 1|1:61:25 1|1:100:27
1 747597 rs4951929 C T . PASS AA=c;DP=129 GT:GQ:DP 1|1:100:36 1|1:86:28 1|1:100:56
1 747799 rs4951862 C A . PASS AA=c;DP=97 GT:GQ:DP 1|1:67:26 1|1:44:13 1|1:100:62

Strategy:
1. Run “comparing script” on two pileup format snp file
2. Run “conversion script” on a given vcf format to pileup
3. Maybe download vcf format snp file from 1000 genome
4. Run “recal” on build 36 na12878 bam file -> calibrated bam -> variant

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.