Running the pipeline

Here is the instruction from Jessica for running the pipeline:

Hi Jianying,

There is about 1.6 T of space left in this drive for you:
/nfs/chgv/seqanalysis2

When I aligned this sample, it took up 1.4 T of space.

Here’s the script I used to create the alignment scripts:
/nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/devel/bwa_sam12.pl

Usage:
perl bwa_sam12.pl dukscz0106 genome seqanalysis2 seqsata09 seqsata09
all high jl407 not 1 sva.lsrc.duke.edu

It will create the following directory:
/nfs/chgv/seqanalysis2/ALIGNMENT/samples/dukscz0106

At this point, you can look at the manual to see which scripts should
be run first, and how to submit your jobs.

Here’s the information that you will need to run concordance:

SAMPLE_ID, GENOTYPING_ID. TopStrand_filename
dukscz0106, Duk-Schiz0106DEF,
/nfs/sva01/SequenceAnalyses/TopStrandFiles/TopStrandFiles/Duke_Scz_16_samples__DEC_09_09_FinalReport_TOP

Let me know if you have any questions.

Jessica

Met with Latashia and Jessica and will run one pipeline process with Latashia when the next batch of data is released.

Met with Latashia today and she walked me through the major steps running the pipeline. There is one big difference between whole genome sequencing and exom sequencing.

Can we modify the BWA by adding ReadGroup? Currently, our command looks like the following:
$bwa_dir/bwa sampe -P -a $a $scratch_dir/$index_name/$bwa_database_name $run_dir/s.$lane.1.sai $run_dir/s.$lane.2.sai $run_dir/s.$lane.1.maq.fastq $run_dir/s.$lane.2.maq.fastq > $run_dir/s.$lane.sam

Can we add “-r” option ?

Need to make sure that option “-r” does work.

For now, we can sue samtools to add read group information

Alert

Running whole genome alignment on dscr against human reference genome build 37.1 as did by 1000 genome. But, found out an incompatibility with exiting script bwa_sam12.pl. Double checked with Jessica, who said that exiting script that Latasha is running was modified to accomodate new sample_info.txt format. Details see help from Jessica post.

Here was the rationale testing the new human reference database as part of GATK recalibration effort.

Two whole genome sequence studies involved
NA12878 (1000 genome released fastq)
Als9c2 (HiSeq @CHGV)
Aligner: bwa-0.5.9
BWA alignment parameter: default + “-a” for insert size
Reference DB: /nfs/chgv/seqanalysis/ALIGNMENT/BWA_INDEX_JYL/human_g1k_v37.fasta
Samtools: 0.1.12a
GATK: 1.0.5083
SNP in vcf format: 00-All.vcf (human_9606/VCF/v4.0/)
Covariates: readgroup*, Qraw, dinucleotide, machine cycle
Memory per node: 16 GB
Processor involved: 12#
Note*: not available for NA12878
# : whenever parallel capacity is available

Running one genome alignment on new DSCR OS

A file Y:BioinformaticsAlignmentsALIGNMENTS.xls keeps track of what sample have been run/released and/or which one is in backlog.

DSCR has updated OS recently, and details at bioinformatics wiki

Also, the drives have been updated as well!

Step 1: perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/sample_info.pl ~/temp/genomeSample.txt /nfs/chgv/seqpipe01/WGseq/Run_Summary.txt
Step 2: perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/bwa.pl dodvc414268 genome seqpipe02 seqsata09 seqsata09 all high jl407 not 1 compute1.lsrc.duke.edu
Step 3: perl /nfs/chgv/seqpipe02/ALIGNMENT/samples/dodvc414268/Scripts/qsub.pl all

There was an error with aspera, therefore, we had to kill the job and re-generate the script with “rsync” copy. Here is the email:

Hi All,

Hee Shin is working with Joe to get Aspera to do multiple transfers at once. Until things are sorted out, we’re going to go back to using Rsync. Here are the steps you need to take:

[1] Kill your copying jobs if they are still running on the DSCR [2] Re-generate your copy.sh scripts in batch mode following section
1.12.1 “Generating some Alignment Scripts – Batch mode” in the documentation. This step will generate new copy.sh scripts which will use Rsync instead of Aspera.
[3] Re-submit your new copy.sh scripts following Section 2.2.2 in the manual “Submitting one task – Batch mode”.

One more thing: the host name you use should be sva6.lsrc.duke.edu.

Jessica

Here are my scripts:

perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/submit_sample_jobs.pl ~/temp/genomeSample.txt genome seqpipe02 bwa seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu

Correction step by step

Step 1, kill the stalling job:

[jl407@login-05 combined]$ qstat
job-ID prior name user ————R 1893785 0.27644 g1cpdodvc4 jl407 1951579 0.26889 g1aln2.1.d jl407 1951579 0.26889 g1aln2.1.d jl407 1951579 0.26889 g1aln2.1.d jl407 1951579 0.26889 g1aln2.1.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951582 0.26889 g1aln3.1.d jl407 1951583 0.26889 g1aln3.2.d jl407 1953275 0.25373 callVar jl407 1951581 0.00000 g1spe2.dod jl407 1951584 0.00000 g1spe3.dod jl407 1951585 0.00000 g1mrgF_dod jl407 [jl407@login-05 combined]$ jl407 has registered the [jl407@login-05 combined]$ qstat
job-ID prior name user ————R 1951579 0.26889 g1aln2.1.d jl407 1951579 0.26889 g1aln2.1.d jl407 1951579 0.26889 g1aln2.1.d jl407 1951579 0.26889 g1aln2.1.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951580 0.26889 g1aln2.2.d jl407 1951582 0.26889 g1aln3.1.d jl407 1951583 0.26889 g1aln3.2.d jl407 1953275 0.25373 callVar jl407 1951581 0.00000 g1spe2.dod jl407 1951584 0.00000 g1spe3.dod jl407 1951585 0.00000 g1mrgF_dod jl407 state submit/start at queue slots ja-task-ID
12;————————————————————————————————–
r 06/21/2011 10:42:28 highprio.q@chavi-n13 1
r 06/23/2011 10:59:28 highprio.q@chavi-n13 1 1
r 06/23/2011 10:59:28 highprio.q@chavi-n03 1 2
r 06/23/2011 10:59:28 highprio.q@chavi-n01 1 3
r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 4
r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 1
r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 2
r 06/23/2011 10:59:28 highprio.q@chavi-n11 1 3
r 06/23/2011 10:59:28 highprio.q@chavi-n05 1 4
r 06/23/2011 10:59:28 highprio.q@chavi-n12 1 1
r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 1
r 06/23/2011 13:58:15 lowprio.q@igspnih-n06 1
hqw 06/23/2011 10:59:25 1 1-4:1
hqw 06/23/2011 10:59:25 1 1
hqw 06/23/2011 10:59:25 1
qdel 1893785
job 1893785 for deletion
state submit/start at queue slots ja-task-ID
12;————————————————————————————————–
r 06/23/2011 10:59:28 highprio.q@chavi-n13 1 1
r 06/23/2011 10:59:28 highprio.q@chavi-n03 1 2
r 06/23/2011 10:59:28 highprio.q@chavi-n01 1 3
r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 4
r 06/23/2011 10:59:28 highprio.q@chavi-n02 1 1
r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 2
r 06/23/2011 10:59:28 highprio.q@chavi-n11 1 3
r 06/23/2011 10:59:28 highprio.q@chavi-n05 1 4
r 06/23/2011 10:59:28 highprio.q@chavi-n12 1 1
r 06/23/2011 10:59:28 highprio.q@chavi-n10 1 1
r 06/23/2011 13:58:15 lowprio.q@igspnih-n06 1
hqw 06/23/2011 10:59:25 1 1-4:1
hqw 06/23/2011 10:59:25 1 1
hqw 06/23/2011 10:59:25 1

Step 2, remove existing copy script:

[jl407@login-05 Scripts]$ pwd
/nfs/chgv/seqpipe02/ALIGNMENT/samples/dodvc414268/Scripts
[jl407@login-05 Scripts]$ ls -al
total 52
drwxr-xr-x 4 jl407 chavi 4096 Jun 17 16:37 .
drwxr-xr-x 6 jl407 chavi 4096 Jun 17 16:37 ..
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 ALN
-rw-r–r– 1 jl407 chavi 1112 Jun 17 16:37 check_copy.sh
-rw-r–r– 1 jl407 chavi 547 Jun 17 16:37 concordance.sh
-rw-r–r– 1 jl407 chavi 243 Jun 17 16:37 convert.py
-rw-r–r– 1 jl407 chavi 1089 Jun 17 16:37 copy.sh
-rw-r–r– 1 jl407 chavi 600 Jun 17 16:37 coverage.sh
-rw-r–r– 1 jl407 chavi 802 Jun 17 16:37 erds.sh
-rw-r–r– 1 jl407 chavi 3438 Jun 17 16:37 merge_final.sh
-rw-r–r– 1 jl407 chavi 586 Jun 17 16:37 pileup2bco.q
-rw-r–r– 1 jl407 chavi 1376 Jun 17 16:37 qsub.pl
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 SAMPE
[jl407@login-05 Scripts]$ mkdir temp
[jl407@login-05 Scripts]$ mv copy.sh temp/
[jl407@login-05 Scripts]$ ls -al
total 52
drwxr-xr-x 5 jl407 chavi 4096 Jun 24 15:21 .
drwxr-xr-x 6 jl407 chavi 4096 Jun 17 16:37 ..
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 ALN
-rw-r–r– 1 jl407 chavi 1112 Jun 17 16:37 check_copy.sh
-rw-r–r– 1 jl407 chavi 547 Jun 17 16:37 concordance.sh
-rw-r–r– 1 jl407 chavi 243 Jun 17 16:37 convert.py
-rw-r–r– 1 jl407 chavi 600 Jun 17 16:37 coverage.sh
-rw-r–r– 1 jl407 chavi 802 Jun 17 16:37 erds.sh
-rw-r–r– 1 jl407 chavi 3438 Jun 17 16:37 merge_final.sh
-rw-r–r– 1 jl407 chavi 586 Jun 17 16:37 pileup2bco.q
-rw-r–r– 1 jl407 chavi 1376 Jun 17 16:37 qsub.pl
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 SAMPE
drwxr-xr-x 2 jl407 chavi 4096 Jun 24 15:21 temp

Step 3. Re-generate copy script:

perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/submit_sample_jobs.pl ~/temp/genomeSample.txt genome seqpipe02 bwa seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu

[jl407@login-05 Scripts]$ perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/submit_sample_jobs.pl ~/temp/genomeSample.txt genome seqpipe02 bwa seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu

No directories were created
perl /nfs/chgv/seqanalysis/SOFTWARE/alignment_scripts/bwa.pl dodvc414268 genome seqpipe02 seqsata09 seqsata09 copy high jl407 batch 1 sva6.lsrc.duke.edu
[jl407@login-05 Scripts]$ ls -al
total 56
drwxr-xr-x 5 jl407 chavi 4096 Jun 24 15:22 .
drwxr-xr-x 6 jl407 chavi 4096 Jun 17 16:37 ..
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 ALN
-rw-r–r– 1 jl407 chavi 1112 Jun 17 16:37 check_copy.sh
-rw-r–r– 1 jl407 chavi 547 Jun 17 16:37 concordance.sh
-rw-r–r– 1 jl407 chavi 243 Jun 17 16:37 convert.py
-rw-r–r– 1 jl407 chavi 1061 Jun 24 15:22 copy.sh
-rw-r–r– 1 jl407 chavi 600 Jun 17 16:37 coverage.sh
-rw-r–r– 1 jl407 chavi 802 Jun 17 16:37 erds.sh
-rw-r–r– 1 jl407 chavi 3438 Jun 17 16:37 merge_final.sh
-rw-r–r– 1 jl407 chavi 586 Jun 17 16:37 pileup2bco.q
-rw-r–r– 1 jl407 chavi 1376 Jun 17 16:37 qsub.pl
drwxr-xr-x 2 jl407 chavi 4096 Jun 17 16:37 SAMPE
drwxr-xr-x 2 jl407 chavi 4096 Jun 24 15:21 temp

Step 4. Check new script and resubmit the script.

perl /nfs/chgv/seqpipe02/ALIGNMENT/samples/dodvc414268/Scripts/qsub.pl copy

onlinenote

Running the pipeline

Leave a Reply Cancel reply