Adding read group information

There has been much more headache adding this piece of information than I have expected. And, I think it is worthwhile to create a new thread just for this topic.

Ultimately, we would like to add this information into our SAMPE/SAMSE step using the -r option.

Working directory: /home/chavi/jl407/GATK-testing/testAddRG

Recalibration steps and expected output files:

Step 1 This should disappear once the new pipeline is updated (a -r option should be added at BWA step), for now, we need to manually add “read group” information and merge the bam files. But, there has been an error using this option .

Another source with details on the samtools options are here .
Merge BAM files with read group information, and further processing the bam file etc.: sampleName.RG.rmdup.bam

Just for testing purposes, will add read group into each individual bam file. A help was posted at SEQanswers forum by freeseek as:

echo -e “@RGtID:gatSM:hstLB:gatPL:Illumina” > rg.txt
samtools view -h ga.bam | cat rg.txt – | awk ‘{ if (substr($1,1,1)==”@”) print; else printf “%stRG:Z:gan”,$0; }’ | samtools view -uS – | samtools rmdup – – | samtools rmdup -s – aln.bam

With my modification as

echo -e “@RGtID:gatSM:hstLB:gatPL:Illumina” > rg.txt
samtools view -h ga.bam | cat rg.txt – | awk ‘{ if (substr($1,1,1)==”@”) print; else printf “%stRG:Z:gan”,$0; }’ | samtools view -bS -o aln.bam –

It did NOT work!!!

Take a look at the following screen shot:

[jl407@head4 testAddRG]$ samtools view -H /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run1/s.3.RG.bam
@RG ID:Run_1_3 SM:Run_1_3 LB:ga PL:Illumina

[jl407@head4 testAddRG]$ samtools view /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run1/s.3.RG.bam | more
…………….
XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 RG:Z:ga
………………

Each step takes 20 minutes!!

To view where the “read group has been properly” added, use the samtools view option.

OK, let’s try the most dumb way!!!

Use script: /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/addRG.sh, which sends this command:
/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/samtools merge -rh /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/rg.txt – /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run1/s.2.bam /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run1/s.3.bam /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run1/s.4.bam /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run1/s.5.bam /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run2/s.7.bam /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/Run2/s.8.bam | samtools rmdup – – | samtools rmdup -s – als9c2_rmdup_RG.bam

Sent to run @ May 18th, 9:11 a.m. Well, it worked!! Finished @ May 19th, 23:04

However, I need to “sort” it at the last step.

Test-drive assessment on als9c2

Following the assessment thread done earlier , doing the real assessment on als9c2 with the focus on the following metrics.

1. ti/tv ratio

2. number of raw and filtered SNV

3. number of indel called by samtools

4. proportion of heterozygous SNVs on the x chr compared to the total number of SNVs in the X

5. concordance of SNVs with genotyping chip computed with SVA

6. overlap of filtered autosomal SNVs with dbsnp

7. Ratio of homo to hetero SNVs


For raw bam files

Here are the major steps

Input files: a bam file (with or without RG and/or recalibration)
Systems involved: dscr & lsrc
Software involved: samtools

Step One: get variant call file (samtools mpileup –> vcf format)

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/combined_rmdup.bam
3. OutputDir: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/
4. Submit job: qsub /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/combined_rmdup.bam /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/

(This has normally been done for raw bam file in the existing pipeline)

There has been error complaining about samtools mpileup, it turns out that it is error with read group information! Well, re-did the “add read group” step, still NOT working!!

samtools: bam_plcmd.c:596: group_smpl: Assertion `id >= 0 && id n’ failed.

On the recalibrated bam , no_RG case file (without RG due to error)

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh
2. Input: /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam
3. OutputDir: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/
4. Submit job: qsub /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b37.sh /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/
5. Sent May 18th, 2011 @10:25 a.m.; finished by May 23rd @4:49 a.m.

Unfortunately, there has been errors with process, ended up in variant call truncated at chromosome 9! Could be “disk space issue”.

Comparing size of file1:

For no_RG case:
[jl407@head4 noRG_reCal]$ ls -al file1
-rw-r–r– 1 jl407 chavi 264284334703 May 23 04:49 file1

For raw bam case:
[jl407@head4 noRG_reCal]$ ls -al /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/file1
-rw-r–r– 1 jl407 chavi 264495324652 May 4 08:52 /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/file1

Compare bam file:
[jl407@head4 noRG_reCal]$ ls -al /nfs/chgv/seqanalysis2/GATK/recalBam/
total 157200748
drwxr-xr-x 2 jl407 chavi 4096 May 18 11:39 .
drwxr-xr-x 11 jl407 chavi 4096 May 23 15:11 ..
-rw-r–r– 1 jl407 chavi 160649148678 May 18 10:13 als9c2.build37.recal.noRG.bam
-rw-r–r– 1 jl407 chavi 8763904 May 18 11:39 als9c2.build37.recal.noRG.bam.bai
[jl407@head4 noRG_reCal]$ ls -al /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/combined_rmdup.bam
-rw-r–r– 1 jl407 chavi 98071917844 Apr 29 14:23 /nfs/chgv/seqanalysis2/ALIGNMENT/samples/als9c2/build37/combined/combined_rmdup.bam

Need to re-run this, starts on May 24th, @ 4:00 p.m.
Step two: Split vcf files by snps and indels

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/runningDir/genome_vcf_files, this file contains a list of var_flt_vcf to process.

als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf

For no_RG case:

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/runningDir/genome_vcf_files, this file contains a list of var_flt_vcf to process.

als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf

Step three: Filter vcf variants

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/runningDir/vcf_snp_filelist snp

vcf_snp_filelist contains a list of var_flt_vcf.snp to process.
als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp

For no_RG case:

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/runningDir/vcf_snp_filelist snp

vcf_snp_filelist contains a list of var_flt_vcf.snp to process.
als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp

Result: should be in the following format, two files with the same name as your variant file will be created:
.filtered (variants that passed QC threshold)
.filtered.out (variants which failed QC)

Step four: ti/tv ratio

Dependency:

Java path: export PATH=/home/csem/tm103/DSCR/jre1.6.0_13/bin:$PATH
GATK package: gatk_dir=/nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083
Reference DB (build 37): reference=/nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta

Commands to issue

1. Samples ( als9c2) dir:/nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/
2. Script :/nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Output: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/gatk_titv.csv
5. Command: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/

For no_RG case: Changes in Italic

1. Two input information:
a. Samples ( als9c2) dir:/nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/
b. Output file name: /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/als9c2.build37.noRG.recal.gatk_titv.csv
2. Script :/nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Command: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/ als9c2.build37.noRG.recal.gatk_titv.csv

FIXME: this output needs to be “sample” oriented!!!, i.e. als9c2.build37.precal.gatk_titv.csv

Step five: Homozygous/heterozygous SNV ratio

Check to see if the genotype is listed in the 10th column in the VCF file

Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl
Usage: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl vcf_file > printTO
Procedure: cd to “/nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/” and issue command: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp.filtered > als9c2.build37.precal.homhet.snp.ratio.hets

For noRG_recal case:

Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl
Usage: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl vcf_file > printTO
Procedure: cd to “/nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/” and issue command: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp.filtered > als9c2.build37.noRG.recal.homhet.snp.ratio.hets

Step six: Concordance check with SVA

(1) After the “variant call” in step one, generate bco files in the DSCR

Use the following 2 lines from this script
/nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco.sh:


Modify all the parameters, including Java path, samtools location, reference DB, etc.

export PATH=/home/csem/tm103/DSCR/jre1.6.0_13/bin:$PATH
sam_dir=/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a
reference_fasta_file=/nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta

If it runs manually, a few subdirectories need to be created properly.

a. bco_dir=$combined_dir/bco
b. $bco_dir/$sample
c. seqsata_sample_dir=/nfs/testrun/GATK-recal/assessmentComp/samples/$sample (for the results to be transferred back to our sva clusters).


Example command:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco_build37.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/ /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal

/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/bcftools/bcftools view -cg /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//file1 > /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bcftools_file
java -jar /nfs/chgv/seqanalysis2/GATK/scripts/vcf2bco.jar /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bcftools_file /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bco/
rsync -a –partial –timeout=10000 -e ‘ssh’ -r /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal//bco sva6.lsrc.duke.edu:/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal

For noRG_case:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco_build37.sh /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/ /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal

/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/bcftools/bcftools view -cg /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//file1 > /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bcftools_file

java -jar /nfs/chgv/seqanalysis2/GATK/scripts/vcf2bco.jar /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bcftools_file /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bco/

rsync -a –partial –timeout=10000 -e ‘ssh’ -r /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal//bco sva6.lsrc.duke.edu:/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal

At last step, those bco files should be transferred back to LSRC

(2) Create an SVA project (meaning get a .gsap file created)
Since this is alignment against build 37, there are many changes to the sva project. Thanks to Dongliang, who helped me with the set ups:

Currently,
The new build is existing at: /nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1_b37/

The .gsap file needs to be modified accordingly:

1. Menu File -> Open example project file
2. Copy all the lines after line “#Reference database version”
3. Replace the lines in the same place in your build 36 .gsap project file 4. Remove all the starting “#” in the “[REFBIN]=” lines. Because your project has all chromosomes, while the example only has X.
4. At last, Dongliang had to modified the permission for ./datasource/ensembl/61_37f/variation_feature2.txt

MODIFIED .gsp file!! May 18th, 2011

.GSAP file for SVA project: /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal/vcf_als9c2_precal.gsap


(3) Topstrand file names for als9c2:

/nfs/testrun/GATK-recal/svaprojects/vcf/genome/topstrandfilenames (Illumina created/provided this file)

/nfs/testrun/GATK-recal/svaprojects/vcf/genome/topstrandfilenames, the content in this file is as following.
SAMPLE_ID GENOTYPING_ID TopStrand_files
als9c2, als009c2, /nfs/sva01/SequenceAnalyses/TopStrandFiles/als006-015_10samples__MAY_4_10_TopStrand.csv

(4) Use this version of SVA: (It is about 120 mitues.

/nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1_b37/

(Make sure to uncheck the filtering option in sva)

With Xwin, I was able to ssh to sva0.lsrc.duke.edu and perform the “annotation” . Started @7:30 a.m. May 19th, 2011.

(5) Check concordance:

SVA version: /nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1_b37/

Case I: Once “annotate” is done, go directly to the following steps to get concordance.

Step 1. From the SVA UI, choose “summary -> check SNP quality using genotyping dataset”
Step 2: Filling in information for four fields, information is stored at

Four fields
Genoptying final report: //nfs/sva01/SequenceAnalyses/TopStrandFiles/als006-015_10samples__MAY_4_10_TopStrand.csv
sample ID (sequencing): als9c2
sample ID (genotyping): als009c2
output check report to: /nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal

Step 3: Check “use the SNP list derived from the loaded final report”
Step 4: Click “Create”

Case II: If the project is closed, load the project by clicking “load” button and go back to the steps mentioned above.

Results:

For pre-calibration case

Performing quality check using genotype data (user SNP list).
> Sample ID (sequencing): als9c2
Sample ID (genotyping): als009c2
Note: 584882 SNPs are loaded from user SNP report.
– Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
– 297962 SNPs (with rs# listed in genotype data (user SNP list) are loaded from your resequencing project.
– 148881 SNPs on bottom strand (Illumina).
– Among these, 296597 SNPs are found in genotype data (user SNP list) .
– Among the 297962 SNPs loaded from resequencing project, there are:
294755 match with genotype data (user SNP list);
1842 mismatch;
1365 not found in genotype data (user SNP list).
– Among the 1842 mismatches, there are
21 hom (seq) – hom (geno)
32 hom (seq) – het (geno)
18 het (seq) – het (geno)
1771 het (seq) – hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.nomatch.txt
[Part 2] N = 286770
– Among the 286770 SNPs included in the genotyping chip but not derived from sequencing, there are:
286767 that can be correctly mapped to reference sequence;
Among these 286767 there are 281402 matches;
and 5365 mismatches, including 3699 het (geno) and 1666 hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.part2.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/preCal.part2.nomatch.txt

Results for recalibrated results

Performing quality check using genotype data (user SNP list).
> Sample ID (sequencing): als9c2
Sample ID (genotyping): als009c2
Note: 584882 SNPs are loaded from user SNP report.
– Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
– 299193 SNPs (with rs# listed in genotype data (user SNP list) are loaded from your resequencing project.
– 149493 SNPs on bottom strand (Illumina).
– Among these, 297794 SNPs are found in genotype data (user SNP list) .
– Among the 299193 SNPs loaded from resequencing project, there are:
296497 match with genotype data (user SNP list);
1297 mismatch;
1399 not found in genotype data (user SNP list).
– Among the 1297 mismatches, there are
21 hom (seq) – hom (geno)
32 hom (seq) – het (geno)
20 het (seq) – het (geno)
1224 het (seq) – hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.nomatch.txt
[Part 2] N = 285979
– Among the 285979 SNPs included in the genotyping chip but not derived from sequencing, there are:
282538 that can be correctly mapped to reference sequence;
Among these 282538 there are 138818 matches;
and 143720 mismatches, including 2429 het (geno) and 141291 hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.part2.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/als9c2/noRG_reCal.part2.nomatch.txt

As the results show, we see improvement on the concordance with both part I and part II. And, it is more important that we see improvement with part II.

So the question exists, why/what causes low concordance check with genotyp chip results???


Step seven: Overlap of filtered autosomal SNVs with dbsnp

Here we need a little bit more attention. Since we are trying to compare analysis against build 37, we need snp information for build 37. It is available at NCBI/dbsnp

Download dbsnp file as following:

1. Download b132_SNPChrPosOnRef_37_1.bcp.gz
2. Check download completeness
[jl407@head4 refDB]$ md5sum b132_SNPChrPosOnRef_37_1.bcp.gz
07acf58908cf857b62e9de99a1d56533 b132_SNPChrPosOnRef_37_1.bcp.gz
[jl407@head4 refDB]$ cat b132_SNPChrPosOnRef_37_1.bcp.gz.md5
MD5(b132_SNPChrPosOnRef_37_1.bcp.gz)= 07acf58908cf857b62e9de99a1d56533

1. Samples (info) located: /nfs/chgv/seqanalysis2/GATK/doc/snplistFile
2. Script (dbsnp_overlap.pl) located: /nfs/chgv/seqanalysis2/GATK/scripts/
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Command: qsub /nfs/chgv/seqanalysis2/GATK/scripts/dbsnp_overlap_b37.pl /nfs/chgv/seqanalysis2/GATK/doc/snplistFile
5. Results were written to log file

For no_RG case:

The only change is where the var_flt_vcf.snp.filtered is located:

als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/var_flt_vcf.snp.filtered
als9c2 /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/var_flt_vcf.snp.filtered

Here is the concordance checking results: concordance_als9c2_b37

July 19th, 2011

Wanted to test “using 1M vs snp loaded from the project”.

Test run on NA12878

Refer to the page on work@chgv

Without ReadGroup

There has been an error with adding the readgroup information, therefore, I am testing this procedure just like I did with earlier trial (without read group information)

Getting ready for NA12878 sampel recalibration:

Step 1: making symbolic links to raw bam file and raw bai file
Step 2: Send this script and start the process raw bam –> cntcor.csv –> recal bam –> index recal bam –> recal.cntcor.csv
Step 3: send the following command:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/GATK_recal.pl na12878 /nfs/chgv/seqanalysis2/GATK/rawBam/na12878.build37.raw.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.raw.bam.noRG.CntCov.csv /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.recal.bam.noRG.CntCov.csv

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083//GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/rawBam/na12878.build37.raw.bam
–default_read_group na12878
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.raw.bam.noRG.CntCov.csv

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083//GenomeAnalysisTK.jar
-T TableRecalibration
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-I /nfs/chgv/seqanalysis2/GATK/rawBam/na12878.build37.raw.bam
–default_read_group na12878
–default_platform illumina
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.raw.bam.noRG.CntCov.csv
–out /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam

/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/samtools index /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083//GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build37.recal.noRG.bam
–default_read_group na12878
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/na12878.build37.recal.bam.noRG.CntCov.csv

Help needed from Jessica

Since Jessica has compared 1000 genome sample obtained locally to snp released from 1000 Genome sites, I would like to learn what she had gone through. The following would get me started:

1. Where is snp file stored locally
2. Assuming we have gone through the entire process, we should have:
a. ram bam (aligned to build36) with index
b. output from call_variant script using “samtools12”, maybe a copy from “samtools7”
3. Since Jessica does not have a way to compare vcf format directly, she converted VCF format SNP back to pileup format
a. a script that does this conversion
b. where are the output from the conversion
4. Assuming Jessica has converted “1000 genome vcf snp” to “pileup” format
a. where are the pileup format 1000 genome snp file
5. Comparing script, that compares two pileup snp files

Jessica’s response

Hi Jianying,

These are the scripts I used for SNV comparison. I don’t have a routine that compares two VCF files and my conversion from VCF to pileup is not perfect.

This script converts a VCF SNV file to pileup format:
/nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/evaluation/vcf_topileup_snps.pl
The genotype column in the VCF file is hard coded as 9 (see 5th line in the code). You can change this accordingly.

Comparing two SNV files in pileup format:
I used this .sh script to do the SNP file comparison /nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/evaluation/snp_file_comparison.sh
This is the perl script that is called inside the .sh script:
/nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/evaluation/compare_snp_files.pl

Jessica

Another email:

Hi Jianying,

Here are the answers to your questions:
1. Where is 1000 genome snp file stored locally? If we LOST it,
which file we should re-download from 1000 genome?
I think this file might have been deleted soon after Sam Dickson left.
I would recommend re-downloading it.

2. Assuming we have gone through the entire process, we should have:
a. Raw bam (aligned to build36) with index and c. Output files from “samtools7″ (in pileup format) This is located here: /nfs/seqsata01/ALIGNMENT/BWA_OUTPUT/na12878/combined

b. Output files from call_variant script using “samtools12″ (in vcf format) /nfs/chgv/seqanalysis/ALIGNMENT/QC/samtools_version/sam12/genome/na12878/var_flt_vcf

3. Since we do not have a way to compare vcf format directly,
Jessica converted VCF format SNP back to pileup format a. The script that does this conversion Sent in my previous email this morning.

4. Assuming Jessica has converted “1000 genome vcf snp” to “pileup” format
a. Where are the pileup format 1000 genome snp file Please use my script to do re-convert the VCF into pileup format.

5. Comparing script, that compares two pileup snp files, and
report that it generated.
Sent in my previous email this morning.

Please let me know if you have any additional questions.

Jessica

Help from Dongliang to download 1000 genome snp file in vcf format:

Here are the links to the download:

I downloaded: CEU.trio.2010_03.genotypes.vcf.gz to /nfs/testrun/GATK-recal/snp_1000G_b36

Here are the first 20 lines:

[jl407@sva snp_1000G_b36]$ head -20 CEU.trio.2010_03.genotypes.vcf
##fileformat=VCFv4.0
##INFO=
##INFO=
##INFO=
##INFO=
##reference=human_b36_both.fasta
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878
1 52066 rs28402963 T C . PASS AA=C;DP=84 GT:GQ:DP 1/0:44:23 1/0:43:20 1/0:70:36
1 695745 . G A . PASS AA=.;DP=124 GT:GQ:DP 1|0:100:34 0|0:62:20 1|0:100:56
1 742429 rs3094315 G A . PASS AA=g;DP=132;HM2 GT:GQ:DP 1|1:100:38 1|1:59:30 1|1:100:44
1 742584 rs3131972 A G . PASS AA=a;DP=160;HM3 GT:GQ:DP 1|1:100:50 1|1:100:33 1|1:100:60
1 744366 rs3115859 G A . PASS AA=g;DP=127 GT:GQ:DP 1|1:80:31 1|1:100:34 1|1:100:45
1 746243 rs3131963 T A . PASS AA=t;DP=105 GT:GQ:DP 1|1:52:29 1|1:43:24 1|1:100:56
1 746775 rs6699990 A G . PASS AA=N;DP=120 GT:GQ:DP 0|1:100:34 0|0:89:30 0|0:100:46
1 747503 rs3115853 G A . PASS AA=-;DP=113 GT:GQ:DP 1|1:100:37 1|1:61:25 1|1:100:27
1 747597 rs4951929 C T . PASS AA=c;DP=129 GT:GQ:DP 1|1:100:36 1|1:86:28 1|1:100:56
1 747799 rs4951862 C A . PASS AA=c;DP=97 GT:GQ:DP 1|1:67:26 1|1:44:13 1|1:100:62

Strategy:
1. Run “comparing script” on two pileup format snp file
2. Run “conversion script” on a given vcf format to pileup
3. Maybe download vcf format snp file from 1000 genome
4. Run “recal” on build 36 na12878 bam file -> calibrated bam -> variant

Test-drive recalibration on als9c2

Refer to the page on work@chgv

Without ReadGroup

There has been an error with adding the readgroup information, therefore, I am testing this procedure just like I did with earlier trial (without read group information)

Input file preparation


Step 1: CountCovariates:

Script: /nfs/chgv/seqanalysis2/GATK/scripts/gatk_CountCov_noRG.sh
sample: als9c2
inputBAM: /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2.build37.raw.bam
output: /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.noRG.CntCov.csv
Command sent:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_CountCov_noRG.sh als9c2 /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2.build37.raw.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.noRG.CntCov.csv

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083/GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2.build37.raw.bam
–default_read_group als9c2
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile : /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.noRG.CntCov.csv

2. Step two: TableCalibration:

Script: /nfs/chgv/seqanalysis2/GATK/scripts/gatk_TableRecal_noRG.sh
sample: als9c2
inputBAM: /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2.build37.raw.bam
tabRecalFile: /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.noRG.CntCov.csv
recalBAM: /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam
Command sent:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_TableRecal_noRG.sh als9c2 /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2.build37.raw.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.noRG.CntCov.csv /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam

java -jar -Xmx12g GenomeAnalysisTK.jar
-T TableRecalibration
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-I /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2.build37.raw.bam
–default_read_group als9c2
–default_platform illumina
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.noRG.CntCov.csv
–out /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam
*Note,
-outputBam is deprecated, use -–out instead
Currently, not supporting parallel, only single thread
Time: 22.75 hours

2.1. Step 2.1: Indexing recalibrated BAM file:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_indexBAM.sh /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam

3. Step three: CountCovariates (on recalibrated bam file):

Script: /nfs/chgv/seqanalysis2/GATK/scripts/gatk_CountCov_noRG.sh
sample: als9c2
inputBAM: /nfs/chgv/seqanalysis2/GATK/recalBam
/als9c2.build37.recal.noRG.bam
output: /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.recal.bam.noRG.CntCov.csv
Command sent:

qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_CountCov_noRG.sh als9c2 /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.recal.bam.noRG.CntCov.csv

java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083/GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.noRG.bam
–default_read_group als9c2
–default_platform illumina
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.recal.bam.noRG.CntCov.csv

Totally: 8.9 hours

4. Step four: AnalyzeCovariates (on raw bam file):

java -jar -Xmx6g $gatk_dir/AnalyzeCovariates.jar
-outputDir $outputDir
-resources $resourcesDir
-Rscript $RDir (/opt/R271/bin/R) , we have to provide this Rscript parameter.
-recalFile $tabRecalFile

Command sent: qsub /scriptdir/gatk_analCov.sh /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.noRG.CntCov.csv /nfs/chgv/seqanalysis2/GATK/samples/als9c2/preCal/

FIXME: BUT, the problem is that R script did not generate corresponding figures.

5. Step five: AnalyzeCovariates (on recalibrated bam file):

java -jar -Xmx6g $gatk_dir/AnalyzeCovariates.jar
-outputDir $outputDir
-resources $resourcesDir
-Rscript $RDir (/opt/R271/bin/R) , we have to provide this Rscript parameter.
-recalFile $tabRecalFile

Command sent: qsub /scriptdir/gatk_analCov.sh /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.recal.bam.noRG.CntCov.csv /nfs/chgv/seqanalysis2/GATK/samples/als9c2/noRG_reCal/

!! Need to fix the R code for assessment plots

With ReadGroup

ERRORS

When used samtools mpileup, I found this error “[mpileup] 1 samples in 1 input files
samtools: bam_plcmd.c:596: group_smpl: Assertion `id >= 0 && id n’ failed.” and later found out that the RG was not added as I expected. Here are the answer to my question at SEQanswers. Therefore, I have to recreate all the recalibrated bam files.

Use the dumb way to add ReadGroup information, hopefully it works. Fingers crossed

If it works, a pipeline GATK run will be fired up after checking the bam file and indexing it.

Step one: Check he bam file

samtools view -H /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/als9c2_rmdup_RG.bam

[jl407@head4 als9c2]$ samtools view -H als9c2_rmdup_RG.bam
@RG ID:Run_1_2 SM:Run_1_2 LB:ga PL:Illumina
@RG ID:Run_1_3 SM:Run_1_3 LB:ga PL:Illumina
@RG ID:Run_1_4 SM:Run_1_4 LB:ga PL:Illumina
@RG ID:Run_1_5 SM:Run_1_5 LB:ga PL:Illumina
@RG ID:Run_2_7 SM:Run_2_7 LB:ga PL:Illumina
@RG ID:Run_2_8 SM:Run_2_8 LB:ga PL:Illumina

samtools view /nfs/chgv/seqanalysis/ALIGNMENT/samples/als9c2/als9c2_rmdup_RG.bam | grep “RG”

HWI-ST426_0113:3:1101:1497:1989#0 99 1 177398310 29 100M = 177398565 355 ATTTGTCCAAAATGAGGAATACAATCTCTTATTCCAAAGAGCTATAAAAAAGAAGATAATCAATCTACT
TGAGAAGCCAATTGTTGGCATACATAAAGGC 6-./11142D<@DD@D;DD@6@@@??9??;?=7?<:6::99942555535=;=6<66<<?6<?=::6:7<9<DD=9=??<==6496145<494 XT:A:U NM:i:1 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:
1 XO:i:0 XG:i:0 MD:Z:92C7 RG:Z:s.3

Step two: mv the bam file to /nfs/chgv/seqanalysis2/GATK/rawBam/

Step three: index the bam file

qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_indexBAM.sh /nfs/chgv/seqanalysis2/GATK/rawBam/

Step four: fire up the “GATK pipeline”

qsub /nfs/chgv/seqanalysis2/GATK/scripts/GATK_recal_RG.pl als9c2 /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2_rmdup_RG.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.RG.CntCov.csv /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.RG.bam /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.recal.bam.RG.CntCov.csv

Log file:

Process command:
java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083/GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2_rmdup_RG.bam
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.RG.CntCov.csv

Time to start the process is: Thu May 19 11:37:59 2011
Time when the process finishes is: Thu May 19 11:37:59 2011

Process command:
java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083/GenomeAnalysisTK.jar
-T TableRecalibration
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-I /nfs/chgv/seqanalysis2/GATK/rawBam/als9c2_rmdup_RG.bam
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.raw.bam.RG.CntCov.csv
–out /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.RG.bam

Time to start the process is: Thu May 19 11:37:59 2011
Time when the process finishes is: Thu May 19 11:37:59 2011

Process command:
/nfs/chgv/seqanalysis/SOFTWARE/samtools-0.1.12a/samtools index /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.RG.bam

Time to start the process is: Thu May 19 11:37:59 2011
Time when the process finishes is: Thu May 19 11:37:59 2011

Process command:
java -jar -Xmx12g /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083/GenomeAnalysisTK.jar
-T CountCovariates
-nt 12
-R /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
-B:eval,VCF /home/chavi/jl407/GATK-testing/referenceFiles/00-All.vcf
-I /nfs/chgv/seqanalysis2/GATK/recalBam/als9c2.build37.recal.RG.bam
-cov ReadGroupCovariate
-cov QualityScoreCovariate
-cov DinucCovariate
-cov CycleCovariate
-recalFile /nfs/chgv/seqanalysis2/GATK/resultsDir/als9c2.build37.recal.bam.RG.CntCov.csv

Time to start the process is: Thu May 19 11:37:59 2011
Time when the process finishes is: Thu May 19 11:37:59 2011

Consideration for updating pipleline

Here are some suggestions for updating our pipeline:

1. Shall pass in the parameter for major programs as well as the parameter setting, i.e. BWA, Samtools, etc. Should return “null” for those can be null

2. For BWA samse/pe procedure, need to add “-r” option for adding the read group information.

3. Should consolidate codes (bash and tcsh should be merged) and use subversion control on version

4. Validate all input, i.e. run_summary information from core facility. There was a version issue with our exiting pipeline . Mixture of pipeline version and sample_info file.

5. We shall document our “pipeline” with version

6. Mixture of different codes: perl, bash, tcsh, etc. These need to be documented.

7. To many sub-directories, hard to manage and keep track.

8. Too many copy and paste, which can potentially introduce errors. Need to eliminate these steps and create log for each step.

9. A database/lims will be helpful for storing all these information.

How we may do better

Part I, Version control on all third party software, BWA, samtools, Picard, GATK, etc. Version control on “pipeline” itself. Goal: easy to manage, migrate and update. Modularize and easy for plugging in.

Part II, Subversion control on all codes, cleaning up code, and consolidate of mixture codes

Part III, Implement (if not) ASPERA for file transferring for anything for data management

Part IV, Error checking/QC system

Part V, In-house database and LIMS to replace Excel spread sheet, for sample tracking. In-house database for keep log for any analysis. Web programming helper needed.

SVA project

This is my first time to use SVA, for recalibration project.

Project URL: Sequence Variant Analyzer

A sample .gsap file is looked like the following:

############################################################
#SequenceVariantAnalyzer project file.
#Authors: Dongliang Ge & David B. Goldstein
#Duke Institute for Genome Sciences & Policy, Center for Human Genome Variation
#Note: lines after the commenting symbol (#) will be ignored in the analysis.
############################################################

#Output file
[OUTPUT]=/nfs/svaprojects/vcf/genome/vcf_mchd002A2_recal.sva

[PEDINFO]=/nfs/svaprojects/master.ped
[COLLECTNONCARRIERINFO]=OFF

#######################################################################
#The following section lists the inputs and outputs used in the annotation. You need to specify your files.
#######################################################################

#Whole-exome sequence samples#

[INPUT]=mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/var_flt_vcf.snp.filtered.vcf
[COVERAGE]=1,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.1.bco
[COVERAGE]=2,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.2.bco
[COVERAGE]=3,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.3.bco
[COVERAGE]=4,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.4.bco
[COVERAGE]=5,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.5.bco
[COVERAGE]=6,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.6.bco
[COVERAGE]=7,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.7.bco
[COVERAGE]=8,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.8.bco
[COVERAGE]=9,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.9.bco
[COVERAGE]=10,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.10.bco
[COVERAGE]=11,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.11.bco
[COVERAGE]=12,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.12.bco
[COVERAGE]=13,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.13.bco
[COVERAGE]=14,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.14.bco
[COVERAGE]=15,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.15.bco
[COVERAGE]=16,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.16.bco
[COVERAGE]=17,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.17.bco
[COVERAGE]=18,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.18.bco
[COVERAGE]=19,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.19.bco
[COVERAGE]=20,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.20.bco
[COVERAGE]=21,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.21.bco
[COVERAGE]=22,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.22.bco
[COVERAGE]=X,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.X.bco
[COVERAGE]=Y,mchd002A2,/nfs/svaprojects/vcf/genome/mchd002A2/bco/mchd002A2_.Y.bco

[HMMSVMINLOD]=0

###########################################################################
#The following section lists an genome buffering option. Enabling this option will speedup the loading an
#integrated genome browser in this software. You will need to specify a file to store the buffered information.
###########################################################################

#Below is an example of a buffering file:
#[GenomeBrowserBuffer]=E:projectsSequenceVariantAnalyzerdatachr6smallTest.gbb
[GenomeBrowserBufferSwitch]=on
[GenomeBrowserBuffer]=/local2/dg48/projects/sequencing/gbb/cns.gb2

##################################################################
#The following section lists the annotation options.
##################################################################

#If specified as “on”, this option tells SequenceVariantAnalyzer to skip the annotation procedures
# and directly load the binary output instead.

[SKIPANNOTATION]=on

#Define upstream/downstream span
[UPSTREAM]=1000
[DOWNSTREAM]=1000
#Define intronic exon boundary. Intronic SNPs located within specified number of bases from
# exon boundary will fall into this category.
#This number has to be greater than 2.
[INTRON_EXON_BOUNDARY_INTO_INTRON]=8
[INTRON_EXON_BOUNDARY_INTO_EXON]=3
#Define threshold to group structural variations with recorded DGV CNVs – defualt is 50%.
[DGVCNVOVERLAP]=0.50
#Specify comprehensive or abbreviated annotation output. Note: comprehensive annotations include functionality for all relevant transcripts,
#while abbreviated output only includes the potentially most important one, but at a risk of losing information.
[DETAILED_OUTPUT]=on

############################################################################
#The following section lists the databases used in the annotation. You do not need to change them unless you
# need to use other versions of the databases.
############################################################################

#Reference database version
[COREVERSION]=50_36l

#Reference sequence, could be multiple lines, binary
[REFBIN]=1,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.1.fa.bin
[REFBIN]=2,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.2.fa.bin
[REFBIN]=3,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.3.fa.bin
[REFBIN]=4,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.4.fa.bin
[REFBIN]=5,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.5.fa.bin
[REFBIN]=6,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.6.fa.bin
[REFBIN]=7,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.7.fa.bin
[REFBIN]=8,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.8.fa.bin
[REFBIN]=9,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.9.fa.bin
[REFBIN]=10,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.10.fa.bin
[REFBIN]=11,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.11.fa.bin
[REFBIN]=12,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.12.fa.bin
[REFBIN]=13,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.13.fa.bin
[REFBIN]=14,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.14.fa.bin
[REFBIN]=15,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.15.fa.bin
[REFBIN]=16,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.16.fa.bin
[REFBIN]=17,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.17.fa.bin
[REFBIN]=18,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.18.fa.bin
[REFBIN]=19,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.19.fa.bin
[REFBIN]=20,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.20.fa.bin
[REFBIN]=21,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.21.fa.bin
[REFBIN]=22,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.22.fa.bin
[REFBIN]=X,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.X.fa.bin
[REFBIN]=Y,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.Y.fa.bin
[REFBIN]=MT,./datasource/ensembl/50_36l/Homo_sapiens.NCBI36.50.dna.chromosome.MT.fa.bin

#RepeatMasker result folder
[REPEATMASKERFOLDER]=./datasource/repeatMasker

#A switch to turn on/off function for generating ‘N’ bases in reference sequence
[REGENERATEREFNCALLS]=OFF

#The following are the annotation databases
#Core
[GENEID]=./datasource/ensembl/50_36l/gene_stable_id.txt
[TRANSCRIPT]=./datasource/ensembl/50_36l/transcript.txt
[TRANSCRIPTID]=./datasource/ensembl/50_36l/transcript_stable_id.txt
[TRANSLATION]=./datasource/ensembl/50_36l/translation.txt
[EXON]=./datasource/ensembl/50_36l/exon.txt
[EXONTRANSCRIPT]=./datasource/ensembl/50_36l/exon_transcript.txt
[XREF]=./datasource/ensembl/50_36l/xref.txt
[OBJECTXREF]=./datasource/ensembl/50_36l/object_xref.txt
[GENE]=./datasource/ensembl/50_36l/gene.txt
[SEQREGION]=./datasource/ensembl/50_36l/seq_region.txt
#GO
[GO]=./datasource/ensembl/50_36l/term.txt

#Existing variations
[VARIATION]=./datasource/ensembl/50_36l/variation.txt
[VARIATIONFEATURE1]=./datasource/ensembl/50_36l/variation_feature.001.txt
[VARIATIONFEATURE2]=./datasource/ensembl/50_36l/variation_feature.002.txt
[VARIATIONALLELE1]=./datasource/ensembl/50_36l/allele.001.txt
[VARIATIONALLELE2]=./datasource/ensembl/50_36l/allele.002.txt
[VARIATIONSAMPLE]=./datasource/ensembl/50_36l/sample.txt

#Protein sequence variation annotation (pre-calculated)
[PROTEINVAR]=./datasource/proteinpoly/mapp.out
#Hapmap
[HAPMAP]=./datasource/hapmap/hapmap_r23a.bim
#Illumina 1M chip
[ILLUMINA1M]=./datasource/illumina/Human1Mv1_snptable.txt
#CNV
[DGVCNV]=./datasource/CNV/DGV/variation.hg18.v3.txt
[CNVTAGGING]=./datasource/CNV/tagging/CNV_tagging_SNPs.txt
#KEGG pathway files
[KEGGMAPTITLE]=./datasource/kegg/Aug2508/map_title.tab
[KEGGGENEPATHWAY]=./datasource/kegg/Aug2508/genes_pathway.list
#Venter’s variants
[VENTERSNPS]=./datasource/venter/HuRef.InternalHuRef-NCBI.snp.bin
[VENTERINDELS]=./datasource/venter/HuRef.InternalHuRef-NCBI.indel.bin
#1000 Human Genome SNP
[1000HUMANGENOMESNP]=./datasource/1kgenome/CEU.frequency.filter
#OMIM Gene map
[OMIMGENEMAP]=./datasource/omim/genemap

#An *optional* protein sequence output
#[PROTEINFASTAOUTPUTFOLDER]=./datasource/ensembl/50_36l/proteinfasta

#######################################################
#End of the SequenceVariantAnalyzer project script file.
#######################################################