GATK assessment

This is the assessment of GATK recalibration, using our current variant calling pipeline.

Samples aligned to build 36:

dukscz0106
mchd002A2
na12878

Major steps:

Step 1: Calling variants with samtools 0.1.12a

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b36.sh
2. Input: /nfs/chgv/seqanalysis2/GATK/recalBam/dukscz0106.build36.recal.noRG.bam
3. OutputDir: /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/
4. Submit job: qsub /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b36.sh /nfs/chgv/seqanalysis2/GATK/recalBam/dukscz0106.build36.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/
5. Sent @7:50 a.m. June 11th, but the queue was very long!

For mchd002A2: qsub /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b36.sh /nfs/chgv/seqanalysis2/GATK/recalBam/mchd002A2.build36.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/samples/mchd002A2/noRG_recal/

For na12878, the job was run on lsrc and the results were transferred back to dscr.

For comparison purposes, one was run on the dscr as well @1:55 p.m. June 23rd, 2010: qsub /nfs/chgv/seqanalysis2/GATK/scripts/call_var_from_bam_b36.sh /nfs/chgv/seqanalysis2/GATK/recalBam/na12878.build36.recal.noRG.bam /nfs/chgv/seqanalysis2/GATK/samples/na12878/noRG_reCal/
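For reference, here is a minimal sketch of the core commands a script like call_var_from_bam_b36.sh typically wraps with samtools 0.1.12a; the actual options and filter settings in the script may differ, and the reference path below is borrowed from the recalibration notes later in this page.

#!/bin/bash
# Sketch only -- substitute the real reference, input BAM, and output directory
ref=/nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX/human_ref_36_50.fa
bam=$1       # e.g. /nfs/chgv/seqanalysis2/GATK/recalBam/dukscz0106.build36.recal.noRG.bam
outdir=$2    # e.g. /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/

# Pile up reads and call variants into a binary BCF (the var_raw_bcf file)
samtools mpileup -ugf $ref $bam | bcftools view -bvcg - > $outdir/var_raw_bcf

# Convert the BCF to VCF and apply the standard depth filter to produce var_flt_vcf
bcftools view $outdir/var_raw_bcf | vcfutils.pl varFilter -D 100 > $outdir/var_flt_vcf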


Here are a few QC metrics we can check:

1. Ti/Tv ratio

2. Number of raw and filtered SNVs

3. Number of indels called by samtools

4. Proportion of heterozygous SNVs on the X chromosome compared to the total number of SNVs on X

5. Ratio of homozygous to heterozygous SNVs

Split VCF files into SNPs and indels

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/var_flt_vcf (var_raw_bcf is the binary file from which var_flt_vcf was generated; note that this "var_flt_vcf" is really the RAW SNV/indel file).
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir and submit the job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/runningDir/dukscz0106_vcf_files (this file lists the var_flt_vcf file to process):

dukscz0106 /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/var_flt_vcf

5. For mchd002A2: qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/runningDir/mchd002A2_vcf_files

6. For na12878: qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/runningDir/na12878_vcf_files

7. For na12878_raw: qsub /nfs/chgv/seqanalysis2/GATK/scripts/split_snps_indels.pl /nfs/chgv/seqanalysis2/GATK/samples/na12878/no_reCal/na12878_noGATK_vcf_files
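The internals of split_snps_indels.pl are not shown here; as a rough sketch of the equivalent split, assuming the samtools convention of tagging indel records with INDEL in the INFO column (file names are placeholders):

# Split a var_flt_vcf into .snp and .indel files, copying the header into both
awk '/^#/         {print > "var_flt_vcf.snp"; print > "var_flt_vcf.indel"; next}
     $8 ~ /INDEL/ {print > "var_flt_vcf.indel"; next}
                  {print > "var_flt_vcf.snp"}' var_flt_vcf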

Filter vcf variants
/nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/sam12/vcf_varfilter.pl

Your variant list should be in the format;

Two files with the same name as your variant file will be created:
.filtered (variants that passed QC threshold)
.filtered.out (variants which failed QC)

Example usage:
qsub /nfs/chgv/seqanalysis/ALIGNMENT/QC/Scripts/sam12/vcf_varfilter.pl /nfs/chgv/seqanalysis/ALIGNMENT/QC/samtools_version/sam12/genome/vcf_snp_filelist snp

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl
2. Input: /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/var_flt_vcf.snp
3. RunningDir/OutputDir: /nfs/chgv/seqanalysis2/GATK/runningDir
4. cd to the RunningDir, and submit job:
qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/runningDir/vcf_snp_dukscz0106 snp
(the file list, here vcf_snp_dukscz0106, contains the var_flt_vcf.snp file to process):

dukscz0106 /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/var_flt_vcf.snp

5. For mchd002A2: qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/runningDir/vcf_snp_mchd002A2 snp

6. For na12878: qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/runningDir/vcf_snp_na12878 snp

7. For na12878_raw: qsub /nfs/chgv/seqanalysis2/GATK/scripts/vcf_varfilter.pl /nfs/chgv/seqanalysis2/GATK/samples/na12878/no_reCal//vcf_snp_na12878_raw snp
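vcf_varfilter.pl applies the QC thresholds internally; purely as an illustration of the .filtered / .filtered.out split it produces, here is a sketch with hypothetical quality and depth cutoffs (the real thresholds live in the script):

# Hypothetical cutoffs: keep records with QUAL >= 20 and DP >= 4
awk -v minq=20 -v mindp=4 '
  /^#/ { print > (FILENAME ".filtered"); next }
  {
    dp = 0
    if (match($8, /DP=[0-9]+/)) dp = substr($8, RSTART + 3, RLENGTH - 3)
    if ($6 + 0 >= minq && dp + 0 >= mindp) print > (FILENAME ".filtered")
    else                                   print > (FILENAME ".filtered.out")
  }' var_flt_vcf.snp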

ti/tv ratio

1. Samples ( dukscz0106 & mchd002A2) located: /nfs/chgv/seqanalysis2/GATK/samples/
2. Script (gatk_titv_b36.sh) located: /nfs/chgv/seqanalysis2/GATK/scripts/
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Usage: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv_b36.sh /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal dukscz0106.gatk_titv.csv
5. For mchd002A2: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv_b36.sh /nfs/chgv/seqanalysis2/GATK/samples/mchd002A2/noRG_recal/ mchd002A2.gatk_titv.csv
6. For na12878: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv_b36.sh /nfs/chgv/seqanalysis2/GATK/samples/na12878/noRG_reCal_lsrc/ na12878.gatk_titv.csv

7. For na12878_raw: qsub /nfs/chgv/seqanalysis2/GATK/scripts/gatk_titv_b36.sh /nfs/chgv/seqanalysis2/GATK/samples/na12878/no_reCal/ na12878.gatk_titv.csv
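The gatk_titv_b36.sh script writes its results to the .csv file given on the command line; as a quick sanity check, the Ti/Tv ratio can also be computed directly from a filtered SNP VCF (a sketch, counting only single-base substitutions):

# Count transitions (A<->G, C<->T) vs transversions in a filtered SNP file
awk '!/^#/ && length($4) == 1 && length($5) == 1 {
       pair = toupper($4 $5)
       if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++
       else tv++
     }
     END { printf "ti=%d tv=%d ti/tv=%.3f\n", ti, tv, ti / tv }' var_flt_vcf.snp.filtered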

Homozygous/heterozygous SNV ratio
Check to see if the genotype is listed in the 10th column in the VCF file

Script: /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl
Usage: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl vcf_file > printTO
Example: perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl var_flt_vcf.snp.filtered > vcf.snp.filtered.hets
Usage: cd to /workingDir/ and run the script:

perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl var_flt_vcf.snp.filtered > dukscz0106.hets

perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl var_flt_vcf.snp.filtered > mchd002A2.hets

perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl var_flt_vcf.snp.filtered > na12878.hets

perl /nfs/chgv/seqanalysis2/GATK/scripts/vcf_hets.pl var_flt_vcf.snp.filtered > na12878_raw.hets
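vcf_hets.pl reads the genotype from the 10th column of the VCF; here is a minimal sketch of the same hom/het counting, assuming GT is the first subfield of that column (as in samtools VCF output):

# Count heterozygous vs homozygous-alternate genotypes from column 10
awk '!/^#/ {
       split($10, f, ":"); gt = f[1]
       if (gt == "0/1" || gt == "1/0" || gt == "0|1" || gt == "1|0") het++
       else if (gt == "1/1" || gt == "1|1") hom++
     }
     END { printf "hom=%d het=%d hom/het=%.3f\n", hom, het, hom / het }' var_flt_vcf.snp.filtered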

Step 2: Generate SVA project with samtools 0.1.12a

1. Script: /nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco_build36.sh
2. Input (dscr): /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/file1
3. OutputDir (lsrc): /nfs/testrun/GATK-recal/assessmentComp/samples/dukscz0106/noRG_reCal/
4. Submit job: qsub /nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco_build36.sh dukscz0106 /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/ /nfs/testrun/GATK-recal/assessmentComp/samples/dukscz0106/noRG_reCal/

5. Job for mchd002A2: qsub /nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco_build36.sh mchd002A2 /nfs/chgv/seqanalysis2/GATK/samples/mchd002A2/noRG_recal/ /nfs/testrun/GATK-recal/assessmentComp/samples/mchd002A2/noRG_reCal

For dukscz0106, sent @12:05 p.m. June 11th!
For mchd002a2, sent @8:36 p.m. June 20th!

Here we can check:


6. concordance of SNVs with genotyping chip computed with SVA

Step 3. Overlap of filtered autosomal SNVs with dbsnp

1. Samples ( dukscz0106 ) located: /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/noRG_reCal/var_flt_vcf.snp.filtered
2. Script (dbsnp_overlap.pl) located: /nfs/chgv/seqanalysis2/GATK/scripts/dbsnp_overlap.pl
3. Log files: /nfs/chgv/seqanalysis2/GATK/logs/
4. Command: qsub /nfs/chgv/seqanalysis2/GATK/scripts/dbsnp_overlap.pl /nfs/chgv/seqanalysis2/GATK/doc/dukscz0106.snp
5. For mchd002A2: qsub /nfs/chgv/seqanalysis2/GATK/scripts/dbsnp_overlap.pl /nfs/chgv/seqanalysis2/GATK/doc/mchd002A2.snp
6. For na12878: qsub /nfs/chgv/seqanalysis2/GATK/scripts/dbsnp_overlap.pl /nfs/chgv/seqanalysis2/GATK/doc/na12878.snp
7. Results were written to log file


7. overlap of filtered autosomal SNVs with dbsnp
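dbsnp_overlap.pl writes its counts to the log file; as a sketch of a positional overlap check, assuming a dbSNP (build 36) list with chromosome and position in the first two columns and chromosomes named 1-22 without a chr prefix (the dbSNP file name below is a placeholder):

# Fraction of filtered autosomal SNVs whose chrom:pos appears in a dbSNP position list
awk '!/^#/ && $1 ~ /^([1-9]|1[0-9]|2[0-2])$/ { print $1 ":" $2 }' var_flt_vcf.snp.filtered | sort -u > sample_pos.txt
awk '{ print $1 ":" $2 }' dbsnp_b36_positions.txt | sort -u > dbsnp_pos.txt
overlap=$(comm -12 sample_pos.txt dbsnp_pos.txt | wc -l)
total=$(wc -l < sample_pos.txt)
echo "dbSNP overlap: $overlap / $total"

Note: comm -12 requires both inputs to be sorted, which the sort -u steps above guarantee.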

Learning from Jessica on variant calling


Concordance check with SVA

(1) Generate bco files in the DSCR
Use the following 2 lines from this script
/nfs/chgv/seqanalysis2/GATK/scripts/bcftools_coverage_bco.sh:

Modify all the parameters, including Java path, samtools location, reference DB, etc.

If it is run manually, a few subdirectories need to be created properly:

a. bco_dir=$combined_dir/bco
b. $bco_dir/$sample
c. seqsata_sample_dir=/nfs/testrun/GATK-recal/assessmentComp/samples/$sample (for the results to be transferred back to our sva clusters).

Note: For mchd002A2, job sent May 2nd @ 11:00 a.m. and finished @ 1:02 a.m. May 3rd
Note: For dukscz0106, job sent May 3rd @ 11:16 a.m. and finished @ 3:30 a.m. May 4th
Usage: qsub bcftools_coverage_bco_jyl.sh /nfs/chgv/seqanalysis2/GATK/samples/dukscz0106/postCal dukscz0106

As the last step, those bco files should be transferred back to LSRC.

(2) Create an SVA project (meaning get a .gsap file created)

Example of SVA project:
/nfs/svaprojects/vcf/genome/vcf_genomes_3.gsap

For details, see the explanation of the .gsap file:

/nfs/testrun/GATK-recal/assessmentComp/samples/mchd002A2/vcf_mchd002A2_recal.gsap
/nfs/testrun/GATK-recal/assessmentComp/samples/dukscz0106/vcf_dukscz0106_recal.gsap

(3) Topstrand file names for the 2 samples:
/nfs/svaprojects/vcf/genome/topstrandfilenames (Illumina created/provided this file)

/nfs/testrun/GATK-recal/svaprojects/vcf/genome/topstrandfilenames; the content of this file is as follows. The original copy was saved, and an individual file is used per sample.

mchd002A2, MCHD002, /nfs/sva01/SequenceAnalyses/TopStrandFiles/MCHD_Family__FEB_11_2010_TOP_FinalReport.csv

dukscz0106, Duk-Schiz0106DEF, /nfs/sva01/SequenceAnalyses/TopStrandFiles/Duke_Scz_16_samples__DEC_09_09_FinalReport_TOP.csv

(4) Use this version of SVA (the run takes about 120 minutes):

/nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1 (how to use this program??)
(Make sure to uncheck the filtering option in sva)

Step 1. Grab any machine along the hallway and log in with netID/password
Step 2. Make sure that the drive has been mounted
Step 3. cd to the directory /nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1/, as this is the version for new VCF format
Step 4. Launch SVA: ./SequenceVariantAnalysis.sh
Step 5. From File -> Open, browse to the directory that contains the .gsap file, and make sure that all the parameters are set up properly
Step 6. Click “Annotate”

(5) Check concordance:

SVA version: /nfs/goldstein/goldsteinlab/software/SequenceVariantAnalyzer_1.1/

Case I: Once “annotate” is done, go directly to the following steps to get concordance.

Step 1. From the SVA UI, choose “summary -> check SNP quality using genotyping dataset”
Step 2: Fill in the information for the four fields; the information is stored at
/nfs/testrun/GATK-recal/svaprojects/vcf/genome/topstrandfilenames, the content of which is:
mchd002A2, MCHD002, /nfs/sva01/SequenceAnalyses/TopStrandFiles/MCHD_Family__FEB_11_2010_TOP_FinalReport.csv

Four fields
Genotyping final report: /nfs/sva01/SequenceAnalyses/TopStrandFiles/MCHD_Family__FEB_11_2010_TOP_FinalReport.csv
sample ID (sequencing): mchd002A2
sample ID (genotyping): MCHD002
output check report to: /nfs/testrun/GATK-recal/assessmentComp/samples/mchd002A2

Step 3: Check “use the SNP list derived from the loaded final report”
Step 4: Click “Create”

Case II: If the project is closed, load the project by clicking “load” button and go back to the steps mentioned above.

Results:

Performing quality check using genotype data (user SNP list).
> Sample ID (sequencing): mchd002A2
Sample ID (genotyping): MCHD002
Note: 584882 SNPs are loaded from user SNP report.
– Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
– 300585 SNPs (with rs# listed in genotype data (user SNP list) are loaded from your resequencing project.
– 150265 SNPs on bottom strand (Illumina).
– Among these, 300351 SNPs are found in genotype data (user SNP list) .
– Among the 300585 SNPs loaded from resequencing project, there are:
299325 match with genotype data (user SNP list);
1026 mismatch;
234 not found in genotype data (user SNP list).
– Among the 1026 mismatches, there are
16 hom (seq) – hom (geno)
34 hom (seq) – het (geno)
21 het (seq) – het (geno)
955 het (seq) – hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/mchd002A2/noRG_reCal.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/mchd002A2/noRG_reCal.nomatch.txt
[Part 2] N = 283814
– Among the 283814 SNPs included in the genotyping chip but not derived from sequencing, there are:
283814 that can be correctly mapped to reference sequence;
Among these 283814 there are 280139 matches;
and 3675 mismatches, including 2798 het (geno) and 877 hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/mchd002A2/noRG_reCal.part2.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/mchd002A2/noRG_reCal.part2.nomatch.txt

For sample dukscz0106, the concordance check results are:

Performing quality check using genotype data (user SNP list).
> Sample ID (sequencing): dukscz0106
Sample ID (genotyping): Duk-Schiz0106DEF
Note: 583754 SNPs are loaded from user SNP report.
– Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
– 297679 SNPs (with rs# listed in genotype data (user SNP list) are loaded from your resequencing project.
– 148586 SNPs on bottom strand (Illumina).
– Among these, 297646 SNPs are found in genotype data (user SNP list) .
– Among the 297679 SNPs loaded from resequencing project, there are:
294629 match with genotype data (user SNP list);
3017 mismatch;
33 not found in genotype data (user SNP list).
– Among the 3017 mismatches, there are
20 hom (seq) – hom (geno)
124 hom (seq) – het (geno)
88 het (seq) – het (geno)
2785 het (seq) – hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/dukscz0106/noRG_reCal.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/dukscz0106/noRG_reCal.nomatch.txt
[Part 2] N = 287131
– Among the 287131 SNPs included in the genotyping chip but not derived from sequencing, there are:
287128 that can be correctly mapped to reference sequence;
Among these 287128 there are 280499 matches;
and 6629 mismatches, including 5652 het (geno) and 977 hom (geno)
Detailed lists were written to:
/nfs/testrun/GATK-recal/assessmentComp/samples/dukscz0106/noRG_reCal.part2.match.txt
/nfs/testrun/GATK-recal/assessmentComp/samples/dukscz0106/noRG_reCal.part2.nomatch.txt

concordance ratio = (294629+280499)/(297646+287128) = 0.9835

QC metrics checked: comparison_to_report_JM

SVA concordance only: concordance_comparison

For a detailed comparison with the concordance check done on the raw BAM file, all the SVA project logs are located at: /nfs/svaprojects/vcf/genome/concordance

For the na12878 sample, we don’t have genotyping array data to compare to (as we do with the in-house samples). However, we do have the 1000 Genomes SNP releases.

Step 1. Download the 1000 Genomes release
Step 2. Compare our in-house VCF file to the 1000 Genomes release.
Script located: /nfs/chgv/seqanalysis2/GATK/scripts
Script usage: qsub /nfs/chgv/seqanalysis2/GATK/scripts /dir/ /dir/
Script output: Results will be printed to log file.
Step 3. Comparison results are here: concordance_na12878

Dr. Hao Chen’s microarray project

1. Met with Dr. Hao Chen about her microarray data analysis; the strategy for this request is as follows:

a. Read the two references (ref. 46 & 47) in the paper by Talor, J.M. et al. 2009, PLoS ONE, for pattern extraction analysis
b. Extract 27 possible patterns from the esophagus developing stage related genes, and produce Figure 2-like results as in Talor’s paper
c. Will provide consultation on clustering analysis for Hao
d. Discussed the authorship for possible publication

Got it into GeneSpring GX11.0
1. With help from JP from the Bioinformatics Center, I was able to load the data into GeneSpring GX 11.0
2. Need to create the technology first and follow the self-explanatory steps
3. Create a project -> experiment
4. Do the analysis within GeneSpring

Combining data from the previous study (further request from Dr. Chen, Xiaoxin)

Goal: To determine what genes (especially transcription factors) may play critical roles at each stage during esophageal development.

1. E8.25: “E8.25 definitive endoderm” will develop into future esophagus, stomach, liver, intestine… Presumably some genes should be overexpressed and some others underexpressed in order to “specify” differentiation into E11.5 esophagus.
2. E11.5: Simple columnar epithelium, 1-2 layers
3. E15.5: Stratified squamous epithelium, 2-3 layers
4. P0: Stratified squamous epithelium, Early keratinization, 3-5 layers
5. P7: Stratified squamous epithelium, Late keratinization, 3-5 layers

Two kinds of array data are available for analysis:
1. Previous array data: E8.25 endoderm, E11.5 esophagus (Sherwood RI, Chen TY, Melton DA. Transcriptional dynamics of endodermal organ formation. Dev Dyn. 2009 Jan;238(1):29-42). Data can be downloaded from GEO DataSets (GSE13040 record). There are 3 samples of “E8.25 definitive endoderm” and 3 samples of “E11.5 esophagus endoderm” in this dataset. Illumina mouseRef-8 v2 microarrays were used in this study.
2. Hao’s array data: E11.5 esophagus, E15.5 esophagus, P0 esophagus, P7 esophagus (3 samples of each). Initially we planned to include adult esophagus. Due to our own mistakes, we decided to move data analysis forward without these samples. You can access our original data from UNC. Hao gave you SAM data. I guess after we pool the E8.25 data from the previous study, we will need to do the analysis again.

A few issues for you to consider:
1. How to pool these two kinds of data together into a spreadsheet containing 15 samples (5 time points, 3 samples each time point)?
2. SAM: Do we need to do SAM before clustering? It seems to me that clustering is good enough to show the tendency of gene expression changes.
3. Hierarchical clustering: Probably two categories (low and high) are enough instead of 3 categories (zero, low and high). We are interested in genes which increase over time, or decrease over time.
4. K-means clustering: I do not know what additional information we may get from K-means.

What we would like to have in the end are:
1. A figure to show that our array data are of high quality (hierarchical clustering)
2. Clustering figures to show gene expression change over time, and corresponding spreadsheets
3. Genes which define each time point

New Milestone — an email from Dr. Chen, Xiaoxin proposing a divide-and-conquer strategy

Jianying,

Hao and I have discussed your PPT and 16 patterns. We can make some sense out of Pattern 1 to 6. We are not fully satisfied because some known critical genes (Cdx1, Cdx2, pax9) did not show up. I guess after we merged two datasets we lost a lot of data. So, let’s go for Plan B in our paper as follows:

1. We first show Agilent data (E11.5-E15.5-P0-P7, 38,290 rows).

a. patterns
b. pathways (GSA, GSEA, etc)

2. We then show Illumina data (E8.25-E11.5; 24,189 rows)

a. SAM analysis
b. pathways (GSA, GSEA, etc)

3. Finally we show E8.25-E11.5-E15.5-P0-P7 (5,652 genes)

a. merging two datasets
b. Patterns 1 to 6

For Part 1, you worked a little bit on Hao’s SAM data. It may not be good enough. You may start with original array data. For Part 2, the other group did a little bit of data analysis. But their major focus was not esophageal development. Their data analysis was not optimal in finding genes/pathways which are critical for esophageal specification. For Part 3, you have finished all. Any questions please feel free to discuss.

Xiaoxin

Met with Xiaoxin, Hao, and ?

1. Adjust delta value according to q-value; done, but need to automate the process.
2. Get new gene list (pending Xiaoxin’s response)
3. Fisher’s exact test (pending Xiaoxin’s response)
4. Gene list output from SAM analysis was sent out; email Xiaoxin prior to downstream analysis

Pending information from Hao

1. four more arrays,
2. swap sample info

Action items:
1. Talk to JP about the GeneSpring license issue
2. GSA/GSEA analysis
3. Profiling plots…

As of May 18th,

1. Was able to get three DBs working for our GSA. (I may need to explore the OVR approach on the Agilent data (four stages), as I am not quite confident about the alleged “multiclass” advantage (per Tibshirani :-) of GSA vs. GSEA.)
2. Got the code working for clustering
3. BUT, I ran into memory issues with R (on my computer). That said, just for “viewing” and/or “detecting mislabeling”, can I use just a subset of genes? (I used to try this whenever I ran into memory issues.) I will figure out a more inclusive solution for this in the future.

Things are pending:

1. GeneSpring (have not got a chance to work on it yet)
2. Will test EPIG and make sure it works on this small laptop prior to having a more powerful Windows server set up.
(As we all agreed, this is not critical and should not hinder us from advancing with our paper writing)

Recommendation:

Maybe we should start to lay out what we have and start to draft the manuscript. I bet that we may need more (hopefully not significantly more) when we start to write.

Oops! Possible mislabeling!

Thank you so much for your prompt response. Xiaoxin and I went through the treeview file and feel that we should separate the samples of 35 and 36. Would you plz cluster 35 samples and 36 samples individually? And exchange 36A1 and 35D3 before you do clustering. We will get 2 cluster files: one for 35 and another one for 36. We hope the results would be more organized.

Best,

Hao

Clean up disk

There seems to be an issue here at CHGV with disk space:

First of all, I need to clean up the disk on /nfs/svaprojects/ and transfer my files to /nfs/testRun/

Next is /nfs/chgv/seqanalysis3/

Hi Jianying,

There’s no way you could have known this but Latasha has top priority for using seqanalysis3 and she regularly uses all of that space.
In the future, only use seqanalysis3 after consulting with Latasha.

Jessica

OK, here is the deal:

On the dscr cluster, seqanalysis3 has the highest priority for running the pipeline, so stay away from it.
Therefore, try to use /nfs/chgv/seqanalysis2/ instead

On the lsrc cluster:

I have /nfs/testrun/ directory, which has 2T space. Should be enough for now.

Review on samtools :-)

It is time to review samtools, since there have been many new releases since v0.1.7.

Jessica Maia has been comparing the features and analysis results from samtools version 0.1.7 vs. v0.1.12, although v0.1.13 (which changed the order of genotypes displayed in the variant files to conform to the Variant Call Format 4.1 spec) and v0.1.14 (gender-aware variant calls on the X chromosome) have already come out.

mpileup has been a good tool for calling SNPs and indels and reporting them in VCF format; I need to make sure it works in my hands.

A copy of samtools v0.1.12a is located at: /nfs/seqsata01/ALIGNMENT/Software/samtools-0.1.12a

Getting RG (read group) information added to the BAM file is giving me quite a headache!

It is quite confusing among samtools, awk, and qsub…

This command seems to work from the command line: samtools view -h /nfs/testrun/GATK-recal/samples/dukscz0106/precal/Run1/s.5.bam | cat /nfs/testrun/GATK-recal/samples/dukscz0106/precal/rg_temp.txt - | awk '{ if (substr($1,1,1)=="@") print; else printf "%s\tRG:Z:ga\n",$0; }' | samtools view -uS - | samtools rmdup - - | samtools rmdup -s - /nfs/testrun/GATK-recal/samples/dukscz0106/precal/aln.bam

BUT, it does not seem to work when it gets wrapped into the qsub script. Problem solved: it was probably some weird characters.

OK, let’s try the following steps:

1. Sort the individual .bam files: echo "$sam_dir/samtools sort $run_dir/s.$lane.bam $run_dir/sorted.Run1.$lane"

2. Merge and add @RG information: $samtoolsCMD merge -rh /nfs/testrun/GATK-recal/samples/dukscz0106/precal/rg.txt $mergedBAM $dataDir/s.1.bam $dataDir/s.2.bam $dataDir/s.3.bam $dataDir/s.4.bam $dataDir/s.5.bam $dataDir/s.6.bam $dataDir/s.7.bam $dataDir/s.8.bam

3. Remove PCR duplication:
#Picard
java -jar -Xmx14g $picard_dir/MarkDuplicates.jar TMP_DIR=$combined_dir VALIDATION_STRINGENCY=SILENT INPUT=$combined_bam OUTPUT=$combined_rmdup_bam METRICS_FILE=$duplicate_metrics REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=5000000 VERBOSITY=WARNING ASSUME_SORTED=true
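Tying the three steps above together, a consolidated sketch with placeholder directories, lane range, and Picard location (the real script, combine_bam.sh, may differ in detail):

#!/bin/bash
# Placeholders -- adjust paths, lane range, and the Picard location
sam_dir=/nfs/seqsata01/ALIGNMENT/Software/samtools-0.1.12a
run_dir=/nfs/testrun/GATK-recal/samples/dukscz0106/precal/Run1
combined_dir=/nfs/testrun/GATK-recal/samples/dukscz0106/precal
picard_dir=/path/to/picard    # placeholder

# 1. Sort each per-lane BAM (sort BEFORE merge)
for lane in 1 2 3 4 5 6 7 8; do
  $sam_dir/samtools sort $run_dir/s.$lane.bam $run_dir/sorted.Run1.$lane
done

# 2. Merge the sorted BAMs; -h attaches the @RG header records from rg.txt,
#    -r tags each read with an RG inferred from the file it came from
$sam_dir/samtools merge -rh $combined_dir/rg.txt $combined_dir/combined.bam $run_dir/sorted.Run1.*.bam

# 3. Remove PCR duplicates with Picard
java -Xmx14g -jar $picard_dir/MarkDuplicates.jar \
  TMP_DIR=$combined_dir VALIDATION_STRINGENCY=SILENT \
  INPUT=$combined_dir/combined.bam OUTPUT=$combined_dir/combined_rmdup.bam \
  METRICS_FILE=$combined_dir/duplicate_metrics \
  REMOVE_DUPLICATES=true ASSUME_SORTED=true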

Re-calibration with GATK

Please refer to my site for details on cracking the GATK.

Please refer to my site for details on Learning with the GATK (April 8th, 2011).

Preparation:

Java setup:

Three copies of the GATK executable are located under /nfs/*/SOFTWARE.
The Java home and JRE paths were passed in via the scripts.

Input file preparation

Working script: /home/chavi/jl407/GATK-testing/workingDir/gatk_titv_sam12_CountCov.sh (derived from Jessica’s titv script)
Reference file:/nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX/human_ref_36_50.fa
Input file: /nfs/chgv/seqanalysis2/ALIGNMENT/samples/mchd002A2/combined/combined_rmdup.bam
Output file: /home/chavi/jl407/GATK-testing/dataOutput/mchd002A2_gatk_recal.csv

I tried to run a test GATK process with the following command (on dscr):

java -Xmx4g -jar /nfs/chgv/seqanalysis/SOFTWARE/GenomeAnalysisTK-1.0.5083/GenomeAnalysisTK.jar \
  -T CountCovariates \
  --assume_single_sample_reads mchd002A2 \
  -R $reference \
  -B:eval,VCF $vcf_with_header \
  -I $inputBAM \
  --default_read_group mchd002A2 \
  --default_platform illumina \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -cov CycleCovariate \
  -recalFile $output

Also, I noticed that once I run the CountCovariates analysis, the software generates a *.rod.fai file

Now, I am transferring the process to our sva.lsrc cluster, using copy_data.pl drafted by Abanish Singh, which allows copying data from a drive mounted on one Unix server to a drive mounted on a remote host over an SSH connection.

April 13th, 2010
Now, I was able to test the multi-threaded (12 processors on dscr) CountCovariates; it seems that it does shorten the computing time, but NOT by 12x as claimed by the Broad. We’ll see how long it takes.

OK, testing the same process with human dbsnp v4.0, dbsnp-ncbi


Preparing BAM files by adding read group information:

Since our merged BAM files from human genome sequencing projects do NOT have read group information, I have to add this information using samtools.

A couple of strategies are available to us:

1. Going forward, we should add this information by creating an extra step in the existing pipeline, using the “-r” option in the sampe or samse step.

2. For now, try to add @RG information into the individual BAM files. Detailed instructions have been posted at the SEQanswers forum site. The Broad GATK site also lists instructions on how to properly prepare the BAM file for base call quality score recalibration.

As a trial, test the procedure using the available individual .bam files from dukscz0106.

Raw file location: /nfs/testrun/GATK-recal/samples/dukscz0106/precal/Run*
Working directory: /home/jl407/project_2011/GATK_test/workingDir
Script: combine_bam.sh

Well, the trick (and the earlier mistake) here is to “sort” first, before “merge”.

To verify that the @RG has been incorporated into the combined .bam file, simply use samtools view -h .bam | more
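To list just the read-group records (rather than paging through the whole header), something like this also works, assuming the samtools build supports -H (print header only); otherwise use -h as above:

# Print only the @RG lines from the merged BAM header
samtools view -H combined_rmdup.bam | grep '^@RG'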


Running recalibration step by step


1. Step one: CountCovariates:

java -jar -Xmx6g GenomeAnalysisTK.jar \
  -T CountCovariates \
  -nt 12 \
  -R $reference \
  -B:eval,VCF $vcf_with_header \
  -I $inputBAM \
  --default_read_group mchd002A2 \
  --default_platform illumina \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -cov CycleCovariate \
  -recalFile $output
Total: 8.9 hours

2. Step two: TableRecalibration:

java -jar -Xmx6g GenomeAnalysisTK.jar \
  -T TableRecalibration \
  -R $reference \
  -I $inputBAM \
  --default_read_group mchd002A2 \
  --default_platform illumina \
  -recalFile $tabRecalFile \
  --out $recalBAM
*Note:
-outputBam is deprecated; use --out instead.
Currently it does not support parallel execution, only a single thread.
Estimated: 22 hours


2.1. Step 2.1: Processing recalibrated BAM file:

Takes approx. 40 minutes for a genome bam file ~ 100GB. If the bam file has been properly sorted (which is the case in our alignment pipeline), a simple command will do the trick:

dir/samtools index sample_recal.bam

This replaces the four-step process for getting a properly indexed BAM file.

But it still needs to be sent to the cluster.

3. Step three: CountCovariates (on recalibrated bam file):

java -jar -Xmx6g GenomeAnalysisTK.jar \
  -T CountCovariates \
  -nt 12 \
  -R $reference \
  -B:eval,VCF $vcf_with_header \
  -I $inputBAM \
  --default_read_group mchd002A2 \
  --default_platform illumina \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -cov CycleCovariate \
  -recalFile $output
Total: 8.9 hours


4. Step four: AnalyzeCovariates (on raw bam file):

java -jar -Xmx6g $gatk_dir/AnalyzeCovariates.jar \
  -outputDir $outputDir \
  -resources $resourcesDir \
  -Rscript $RDir \
  -recalFile $tabRecalFile
($RDir is /opt/R271/bin/R; we have to provide this Rscript parameter.)


5. Step five: AnalyzeCovariates (on recalibrated bam file):

java -jar -Xmx6g $gatk_dir/AnalyzeCovariates.jar \
  -outputDir $outputDir \
  -resources $resourcesDir \
  -Rscript $RDir \
  -recalFile $tabRecalFile
($RDir is /opt/R271/bin/R; we have to provide this Rscript parameter.)

!! Need to fix the R code for assessment plots

Assessing the gain from base calling score recalibration

Criteria:

Broad QC metrics:

1. QQ plot
2. RMSE report
3. Q30 distribution
4.

These steps have been tested and they are working properly. Now, we need to incorporate these into new pipeline with recalibration.

Jessica’s comparison on samtools version:

1. Number of raw/filtered SNP
2. Overlap of filtered autosomal SNVs with dbSNP
3. Ratio of homozygous to heterozygous SNVs
4. Proportion of heterozygous SNVs for dukscz0106
5. Concordance of SNVs with genotyping chip computed with SVA
6. Ti/Tv ratio (check)
7. Number of indels
8. Indel genome sensitivity
9. Indel Positive Predictive Value (ppv)
10. SNV genotype sensitivity
11. SNV genotype ppv

Calling variants with samtools v0.1.12a

Using mchd002A2 as an example:

1. script: /home/chavi/jl407/GATK-testing/workingDir/call_var_from_bam.sh
2. input bam: /nfs/chgv/seqanalysis3/GATK/samples/mchd002A2/postCal/mchd002A2_combined_recal.bam
3. output directory:/nfs/chgv/seqanalysis3/GATK/samples/mchd002A2/postCal
4. logfile: /home/chavi/jl407/GATK-testing/log/calling_var_samtools12.out/err

A little warning from Jessica: it can take 12 – 24 hours! In fact, it takes up to 4 days!!

Starting from here, a series of assessments will be carried out.

Running the pipeline

Here is the instruction from Jessica for running the pipeline:

Hi Jianying,

There is about 1.6 T of space left in this drive for you:
/nfs/chgv/seqanalysis2

When I aligned this sample, it took up 1.4 T of space.

Here’s the script I used to create the alignment scripts:
/nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/devel/bwa_sam12.pl

Usage:
perl bwa_sam12.pl dukscz0106 genome seqanalysis2 seqsata09 seqsata09 all high jl407 not 1 sva.lsrc.duke.edu

It will create the following directory:
/nfs/chgv/seqanalysis2/ALIGNMENT/samples/dukscz0106

At this point, you can look at the manual to see which scripts should
be run first, and how to submit your jobs.

Here’s the information that you will need to run concordance:

SAMPLE_ID, GENOTYPING_ID. TopStrand_filename
dukscz0106, Duk-Schiz0106DEF,
/nfs/sva01/SequenceAnalyses/TopStrandFiles/TopStrandFiles/Duke_Scz_16_samples__DEC_09_09_FinalReport_TOP

Let me know if you have any questions.

Jessica

Met with Latasha and Jessica; I will run one pipeline process with Latasha when the next batch of data is released.

Met with Latasha today and she walked me through the major steps of running the pipeline. There is one big difference between whole genome sequencing and exome sequencing.

Can we modify the BWA step to add the read group? Currently, our command looks like the following:
$bwa_dir/bwa sampe -P -a $a $scratch_dir/$index_name/$bwa_database_name $run_dir/s.$lane.1.sai $run_dir/s.$lane.2.sai $run_dir/s.$lane.1.maq.fastq $run_dir/s.$lane.2.maq.fastq > $run_dir/s.$lane.sam

Can we add the “-r” option?

Need to make sure that option “-r” does work.

For now, we can use samtools to add read group information.
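For reference, the -r option of bwa sampe takes a complete read-group header line; here is a sketch using the variables from the existing command above, with hypothetical ID/SM/PL values (as noted above, this still needs to be verified in our pipeline):

# Hypothetical read-group string -- the ID, SM, and PL values are placeholders
$bwa_dir/bwa sampe -P -a $a \
  -r '@RG\tID:run1_lane5\tSM:dukscz0106\tPL:ILLUMINA' \
  $scratch_dir/$index_name/$bwa_database_name \
  $run_dir/s.$lane.1.sai $run_dir/s.$lane.2.sai \
  $run_dir/s.$lane.1.maq.fastq $run_dir/s.$lane.2.maq.fastq > $run_dir/s.$lane.sam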

Alert

Running whole genome alignment on dscr against human reference genome build 37.1, as done by the 1000 Genomes Project. But I found an incompatibility with the existing script bwa_sam12.pl. Double-checked with Jessica, who said that the existing script that Latasha is running was modified to accommodate the new sample_info.txt format. For details, see the help from Jessica post.

Here was the rationale for testing the new human reference database as part of the GATK recalibration effort.

Two whole genome sequencing studies were involved:
NA12878 (1000 genome released fastq)
Als9c2 (HiSeq @CHGV)
Aligner: bwa-0.5.9
BWA alignment parameter: default + “-a” for insert size
Reference DB: /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
Samtools: 0.1.12a
GATK: 1.0.5083
SNP in vcf format: 00-All.vcf (human_9606/VCF/v4.0/)
Covariates: readgroup*, Qraw, dinucleotide, machine cycle
Memory per node: 16 GB
Processor involved: 12#
Note*: not available for NA12878
# : whenever parallel capacity is available

Running one genome alignment on new DSCR OS

A file Y:\Bioinformatics\Alignments\ALIGNMENTS.xls keeps track of which samples have been run/released and which ones are in the backlog.

DSCR has updated its OS recently; details are at the bioinformatics wiki.

Also, the drives have been updated as well!

Step 1: perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/sample_info.pl ~/temp/genomeSample.txt /nfs/chgv/seqpipe01/WGseq/Run_Summary.txt
Step 2: perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/bwa.pl dodvc414268 genome seqpipe02 seqsata09 seqsata09 all high jl407 not 1 compute1.lsrc.duke.edu
Step 3: perl /nfs/chgv/seqpipe02/ALIGNMENT/samples/dodvc414268/Scripts/qsub.pl all

There was an error with Aspera; therefore, we had to kill the job and re-generate the script with “rsync” copy. Here is the email:

Hi All,

Hee Shin is working with Joe to get Aspera to do multiple transfers at once. Until things are sorted out, we’re going to go back to using Rsync. Here are the steps you need to take:

[1] Kill your copying jobs if they are still running on the DSCR
[2] Re-generate your copy.sh scripts in batch mode following Section 1.12.1 “Generating some Alignment Scripts – Batch mode” in the documentation. This step will generate new copy.sh scripts which will use Rsync instead of Aspera.
[3] Re-submit your new copy.sh scripts following Section 2.2.2 in the manual “Submitting one task – Batch mode”.

One more thing: the host name you use should be sva6.lsrc.duke.edu.

Jessica

Here are my scripts:

perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/submit_sample_jobs.pl ~/temp/genomeSample.txt genome seqpipe02 bwa seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu

Correction step by step

Step 1, kill the stalling job:

[jl407@login-05 combined]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1893785 0.27644 g1cpdodvc4 jl407 r 06/21/2011 10:42:28 highprio.q@chavi-n13 1
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n13 1 1
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n03 1 2
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n01 1 3
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 4
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 1
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 2
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n11 1 3
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n05 1 4
1951582 0.26889 g1aln3.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n12 1 1
1951583 0.26889 g1aln3.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 1
1953275 0.25373 callVar jl407 r 06/23/2011 13:58:15 lowprio.q@igspnih-n06 1
1951581 0.00000 g1spe2.dod jl407 hqw 06/23/2011 10:59:25 1 1-4:1
1951584 0.00000 g1spe3.dod jl407 hqw 06/23/2011 10:59:25 1 1
1951585 0.00000 g1mrgF_dod jl407 hqw 06/23/2011 10:59:25 1
[jl407@login-05 combined]$ qdel 1893785
jl407 has registered the job 1893785 for deletion
[jl407@login-05 combined]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n13 1 1
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n03 1 2
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n01 1 3
1951579 0.26889 g1aln2.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 4
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 1
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 2
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n11 1 3
1951580 0.26889 g1aln2.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n05 1 4
1951582 0.26889 g1aln3.1.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n12 1 1
1951583 0.26889 g1aln3.2.d jl407 r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 1
1953275 0.25373 callVar jl407 r 06/23/2011 13:58:15 lowprio.q@igspnih-n06 1
1951581 0.00000 g1spe2.dod jl407 hqw 06/23/2011 10:59:25 1 1-4:1
1951584 0.00000 g1spe3.dod jl407 hqw 06/23/2011 10:59:25 1 1
1951585 0.00000 g1mrgF_dod jl407 hqw 06/23/2011 10:59:25 1

Step 2, remove existing copy script:

[jl407@login-05 Scripts]$ pwd
/nfs/chgv/seqpipe02/ALIGNMENT/samples/dodvc414268/Scripts
[jl407@login-05 Scripts]$ ls -al
total 52
drwxr-xr-x 4 jl407 chavi 4096 Jun 17 16:37 .
drwxr-xr-x 6 jl407 chavi 4096 Jun 17 16:37 ..
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 ALN
-rw-r--r-- 1 jl407 chavi 1112 Jun 17 16:37 check_copy.sh
-rw-r--r-- 1 jl407 chavi 547 Jun 17 16:37 concordance.sh
-rw-r--r-- 1 jl407 chavi 243 Jun 17 16:37 convert.py
-rw-r--r-- 1 jl407 chavi 1089 Jun 17 16:37 copy.sh
-rw-r--r-- 1 jl407 chavi 600 Jun 17 16:37 coverage.sh
-rw-r--r-- 1 jl407 chavi 802 Jun 17 16:37 erds.sh
-rw-r--r-- 1 jl407 chavi 3438 Jun 17 16:37 merge_final.sh
-rw-r--r-- 1 jl407 chavi 586 Jun 17 16:37 pileup2bco.q
-rw-r--r-- 1 jl407 chavi 1376 Jun 17 16:37 qsub.pl
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 SAMPE
[jl407@login-05 Scripts]$ mkdir temp
[jl407@login-05 Scripts]$ mv copy.sh temp/
[jl407@login-05 Scripts]$ ls -al
total 52
drwxr-xr-x 5 jl407 chavi 4096 Jun 24 15:21 .
drwxr-xr-x 6 jl407 chavi 4096 Jun 17 16:37 ..
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 ALN
-rw-r--r-- 1 jl407 chavi 1112 Jun 17 16:37 check_copy.sh
-rw-r--r-- 1 jl407 chavi 547 Jun 17 16:37 concordance.sh
-rw-r--r-- 1 jl407 chavi 243 Jun 17 16:37 convert.py
-rw-r--r-- 1 jl407 chavi 600 Jun 17 16:37 coverage.sh
-rw-r--r-- 1 jl407 chavi 802 Jun 17 16:37 erds.sh
-rw-r--r-- 1 jl407 chavi 3438 Jun 17 16:37 merge_final.sh
-rw-r--r-- 1 jl407 chavi 586 Jun 17 16:37 pileup2bco.q
-rw-r--r-- 1 jl407 chavi 1376 Jun 17 16:37 qsub.pl
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 SAMPE
drwxr-xr-x 2 jl407 chavi 4096 Jun 24 15:21 temp

Step 3. Re-generate copy script:

perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/submit_sample_jobs.pl ~/temp/genomeSample.txt genome seqpipe02 bwa seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu

[jl407@login-05 Scripts]$ perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/submit_sample_jobs.pl ~/temp/genomeSample.txt genome seqpipe02 bwa seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu

No directories were created
perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/bwa.pl dodvc414268 genome seqpipe02 seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu
[jl407@login-05 Scripts]$ ls -al
total 56
drwxr-xr-x 5 jl407 chavi 4096 Jun 24 15:22 .
drwxr-xr-x 6 jl407 chavi 4096 Jun 17 16:37 ..
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 ALN
-rw-r--r-- 1 jl407 chavi 1112 Jun 17 16:37 check_copy.sh
-rw-r--r-- 1 jl407 chavi 547 Jun 17 16:37 concordance.sh
-rw-r--r-- 1 jl407 chavi 243 Jun 17 16:37 convert.py
-rw-r--r-- 1 jl407 chavi 1061 Jun 24 15:22 copy.sh
-rw-r--r-- 1 jl407 chavi 600 Jun 17 16:37 coverage.sh
-rw-r--r-- 1 jl407 chavi 802 Jun 17 16:37 erds.sh
-rw-r--r-- 1 jl407 chavi 3438 Jun 17 16:37 merge_final.sh
-rw-r--r-- 1 jl407 chavi 586 Jun 17 16:37 pileup2bco.q
-rw-r--r-- 1 jl407 chavi 1376 Jun 17 16:37 qsub.pl
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 SAMPE
drwxr-xr-x 2 jl407 chavi 4096 Jun 24 15:21 temp

Step 4. Check new script and resubmit the script.

perl /nfs/chgv/seqpipe02/ALIGNMENT/samples/dodvc414268/Scripts/qsub.pl copy

April 11th, 2011

Today, I spent five hours at the Duke Medical employee orientation, which turned out to be irrelevant to me, as I am not a health care provider 🙂 But it was still very useful and helpful to get information about benefits, general safety, and mandatory trainings.

Alarm set for signing up for medical and other benefits within the first 60 days of employment.

Ok, talked to Joe, who is always helpful!

1. Get mapped linux drives from Windows (via Samba drive) from Qian, and Joe helped with the set up.
2. Met with Dongliang for advice. According to Dongliang, David needs a solid and confident assessment on implementing the GATK into our existing pipeline, with three primary questions (earlier page).
3. Will meet with Jessica for help again 🙂
4. Wanted to test whether Jessica’s script runs in my hands.
5. Talked to Jessica and got a lot of information!
6. Jessica will help: (1) get the allowable space for me to rerun (2) get BWA 0.4.9 pipeline scripts (3) will focus on dukscz0106 sample for the comparison purpose (4) first get ti/tv ratio tested!!!
7. Need to get read group information from Ryan (he agrees to help out)

First pipeline run, yahoo! To evaluate ti/tv ratio for dukscz0106 sample
1. Script (gatk_titv_sam12_jl.sh) is located: /home/chavi/jl407/GATK-testing/workingDir/
2. Input bam file is: /nfs/chgv/seqanalysis2/ALIGNMENT/samples/dukscz0106/combined/combined_rmdup.bam
3. vcf file is: /home/chavi/jl407/GATK-testing/testRunJM/dukscz0106/vcf_snp_with_header_for_gatk
4. qsub gatk_titv_sam12_jl.sh