Work progress doc (Oct. 2010)

Starting from today, I am using this blog to keep track of my work, replacing word, notepad, etc. It goes by month and hopefully, it helps!

10252010

(1). Check my QC modules

[jyli@lbg-compute ~]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
—————————————————————————————————————–
147313 0.55500 TrimAdaptR jyli r 10/20/2010 14:02:12 tcga.q@lbg-comp7-pvt.bioinf.un 1
147941 0.55500 uniqReadRe jyli r 10/20/2010 20:45:20 tcga.q@lbg-comp11-pvt.bioinf.u 1
147948 0.55500 AlignStatR jyli r 10/20/2010 20:48:20 tcga.q@lbg-comp10-pvt.bioinf.u 1

Three QC modules are still running. Send out the forth one just to correct the adapter contamination output!

However, two more issues came up:

1. What about the threshold of qc genome?
2. IllQualStat?

Solution:
1. It might be easier to implement a system to monitor the flag, view it individually (for upcoming new one), and reset to “zero” if possible
3. Need write permission for BCtrend modules
4. Switch to seqware user is urgent!

(2). Check Ying Du’s data processing

Ying’s email:

Hi Jianying,

It seems that that seqware has already processed these two flow cells by last weekend.Thanks a lot. Here I just have some questions regarding some individual file content and hope you can help.

1. In the file, say, “101004_UNC5-RDR300700_00028_FC_62GLPAAXX.1.trimmed.annotated.gene.quantification.txt”:
a. what are the column headers? I figure that the first two are: gene and read.
b. How about the last two? Are they all RPKM?
c. If so, why there are two RPKM columns?
d. what does “?|100130436” mean? Is this a real gene or not?

The following is an example file:

?|100130426 3 0.558823529411765 0.316155262260932
?|100133144 21 1.32119205298013 0.747466414749356
?|100134869 18 0.858218318695107 0.485538319959073
?|10357 58 6.93081761006289 3.9211206111859
?|10431 574 38.8085867620751 22.0754422121086
— —
Solution: responded itemize

2. What’s the difference among the following files, particularly, c, and a & d?
a. “*.trimmed.annotated.exon.quantification.txt”
b. “*.trimmed.annotated.gene.quantification.txt”
c. “*.trimmed.annotated.spljxn.quantification.txt”
d. “*.trimmed.annotated.transcript.quantification.txt”

(3) Check html report on flags

(4) Start figuring out “generic” plot, which has been long postponed!!!

(5) Respond to Brian about work progress

In the past week:

1. Learned how to monitor seqware run, dissect problems and debug if necessary. Detail info Here

a. Consistently check LIMS looking for erroneous record, i.e. errors, stalled processing, flowcell without lane info. etc.
b. Check two cron jobs: runfolder2srf and seqware-bwa alignment
c. Check log files
d. Manually start some failed run
e. For some strange errors, need to look deeper in the log files, experiment information etc.
(1) Whether a touch file exists?
(2) Whether experimental information was correct, i.e. RNAseq was mis-placed as whole genome shot gun
(3) Email Matt, if it looks like an error links to PIPE process

2. QC module currating

a. Add BCtrend into workflow
b. Modified IllQualStat for proper parameterization
c. Modified TrimCountAdapater script and code, helped Matt with perl script and suggested strategy to further adapted it for small RNA project
d. Updates corresponding qc modules
e. Lower stringent of alignQual, uniqRead, and adapterCont, and qsubed retrospective fixing runs

3. Worked on Ying Du’s request

4. QC html report
a. Modified cronjob for nightly run on qc flag
b. Update all key_value in the module for proper MetaDB update and facilitate future query

For this week:
1. Check all QC modules to ensure proper fixing happened
2. Solve some mystery, i.e.

*********[jyli@lbg-compute ~]$ diff /datastore/nextgenout/seqware-analysis/illumina/100804_UNC6-RDR300211_00021_FC_629PLAAXX/seqware-0.7.0_RNASeqAlignmentBWA-0.7.1/seqware-0.7.0_BWA-0.7.0/100804_UNC6-RDR300211_00021_FC_629PLAAXX.6.adapterCont.txt  /datastore/nextgenout/seqware-analysis/illumina/100804_UNC6-RDR300211_00021_FC_629PLAAXX/seqware-0.7.0_RNASeqAlignmentBWA-0.7.4/seqware-0.7.0_BWA-0.7.0/100804_UNC6-RDR300211_00021_FC_629PLAAXX.6.adapterCont.txt

But, in qsub log, I don't see "100804_UNC6-RDR300211_00021_FC_629PLAAXX/seqware-0.7.0_RNASeqAlignmentBWA-0.7.4/seqware-0.7.0_BWA-0.7.0/100804_UNC6-RDR300211_00021_FC_629PLAAXX.6.adapterCont.txt ", why??

*********

2. Need to get switch off /seqware/ user for future LIMS/pipeline curation
3. Figure out a qc-flag updating system for “questionable” flag and flag reset (need Brian’s input)
4. Generic plot (QC flag, and all), and aggregate_plot still has some bugs ..
5. Need write permission on newly generated files (by Brian), will provide a list of file extension for Brian for periodically fixing
6. For those flowcells that got entered multiple times, need to figure out a systematically to fix that problem, i.e. [jyli@lbg-compute seqware-0.7.0_BWA-0.7.0]$ cd ../../../100603
100603-UNC3-RDR300156_0007/ 100603_UNC3-RDR300156_0007_61MM1AAXX/

1027201

1. Check uniqRead module fixing
a. From the qsub, it seems that it has finished
[jyli@swmaster ~]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
—————————————————————————————————————–
147313 0.55500 TrimAdaptR jyli r 10/20/2010 14:02:12 tcga.q@lbg-comp7-pvt.bioinf.un 1
147948 0.55500 AlignStatR jyli r 10/20/2010 20:48:20 tcga.q@lbg-comp10-pvt.bioinf.u 1
151838 0.55500 AdapterCon jyli r 10/25/2010 12:21:30 tcga.q@lbg-comp6-pvt.bioinf.un 1
b. But, there are still many missing flags in the html report, i.e. 101004_UNC5-RDR300700_00028_FC_62GLPAAXX.8
c. Go go the directory: /datastore/nextgenout/seqware-analysis/illumina/101004_UNC5-RDR300700_00028_FC_62GLPAAXX/seqware-0.7.0_RNASeqAlignmentBWA-0.7.4/seqware-0.7.0_BWA-0.7.0/
d. But, 101011_UNC8-RDR3001640_00037_FC_62M6WAAXX.1 has the results!! (done on 10-26-2010)
e. What is going on with “101004_UNC5”?

Solution:
a. Go to the directory and check log “/home/jyli/svnroot/TCGA-dev/scripts/user/jyli/QC_modules/uniqReadDir”
b. Could not find a record associated with 101004, less uniqReadRetro.o147941, but did find 101011
c. Check script and check DB
d. Use this command to get query result written into a file, stalled for some reason? Well, has to be swmaster!!!
**************************
psql -h swmaster -U seqwarero -W seqware_meta_db -c “select f.file_path from processing as p, processing_files as pf, file as f where f.file_path like ‘%uniqRead.txt’ and p.algorithm = ‘uniqRead’ and p.processing_id = pf.processing_id and pf.file_id = f.file_id; ” > uniqRead_files.txt
**************************
e. Guess, find out the answer: it exists in DB but no record on the file system
f. Need help from Brian for DB debugging!!!
g. It seems that it happened randomly, but cross the board, HELP!!
h. Another case: 100908_UNC8-RDR3001640_00020_FC_62ENHAAXX.7!! Lane 7 failed (so did lane 1), but two 100908_UNC8-RDR3001640_00020_FC_62ENHAAXX.4.uniqRead.txt were generated??
i. Regardless of what happened, try to run the script on the “erroneous” lane and found out that the reason it failed, because no .srf file existed and it failed at the converting step and further no .fastq file for such process

From group meeting:

Two urgent stuffs need a deadline:

1. Friday (10-29-10), for the qc threshold fixing
New strategy: instead of re-run module, just fixing the qc file, re-cal threshold and replacing the file.
2. Next week (first week in Nov.) for aggregate report

10282010

Code review from Brian
*************
Jianying, most have the error:

net.sourceforge.seqware.pipeline.modules.qc.IllQualStat.init
‘infile’ is not a recognized option
The method ‘init’ exited abnormally so the Runner will terminate here!
seqware_meta_db=> select algorithm, count(algorithm) from processing where status = ‘failed’ group by algorithm

For Jianying:

>>> perl/bin/calculate_adapter_contamination.pl <<$output”);
-#print OUT “query_filet$infileninput_seqt$primernadapter_seqt$rcoutn”;
+print OUT “query_filet$infileninput_seqt$primernadapter_seqt$rcoutn”;

$err = “”;
if (! -e “$infile”) { $err = “input file does not exist”; }
@@ -92,10 +92,9 @@
$pctA = sprintf(“%.2f”, 100*$ct2/$ct1);
$pctAD = sprintf(“%.2f”, 100*$ctsize0/$ct1);
$mean = $efflengthsum/$ct1;
– if ($pctA < 15){$flag =0;} #Lowever to low stringency, 10192010 JYL
– #print OUT "run_statustSUCCESSnadapter_ctt$ct2nadapter_percntt$pctAnmean_effective_read_lengtht$meannadapter_dimer%t$pctADn";
– print OUT "run_statustSUCCESSnadapter_ctt$ct2nadapter_percntt$pctAnadaptor_flagt$flagn";
– # print OUT "adaptor_flagt$flagn";
+ if ($pctA >> perl/bin/sw_module_TrimCountAdapter.pl <<<

15 needs to be a parameter

Ans: Yes, this should happen in 0.7.5

Index: perl/bin/sw_module_TrimCountAdapter.pl
===================================================================
— perl/bin/sw_module_TrimCountAdapter.pl (revision 1788)
+++ perl/bin/sw_module_TrimCountAdapter.pl (working copy)
@@ -109,13 +109,7 @@
}
close(OUTFQ);
$pct = sprintf("%.2f", 100*$adpreadct/$allreadct);

-$flag = 1;
-if ($pct >> perl/bin/sw_module_alignStat.pl <<<

30 and 90 need to be parameters not hardcoded.

Ans: Yes, this should happen in 0.7.5

Index: perl/bin/sw_module_alignStat.pl
===================================================================
— perl/bin/sw_module_alignStat.pl (revision 1788)
+++ perl/bin/sw_module_alignStat.pl (working copy)
@@ -57,8 +57,7 @@
print OUT "alignstat_total_readt$totalCountn";
print OUT "alignstat_mapped_readt$mappedCountn";
print OUT "alignstat_mapped_percntt$percentagen";
– #if ($percentage 85){
– if ($percentage 90){ #lowered the
stringency for reporting purpose, per group meeting discussion with
Brian and Matt on 10132010
+ if ($percentage 85){
$flag = 1;
}else{ $flag = 0; }
print OUT “alignstat_flagt$flagn”;

>>> perl/bin/sw_module_readDepth.pl << 50 && uniquePerc 30 && uniquePerc 1000000) { $totalFlag = 0;}
-#if ($totalSeq > 2000000) { $totalFlag = 0;} #lowered stringency 10202010 JYL
+if ($uniquePerc > 50 && uniquePerc 2000000) { $totalFlag = 0;}
print O “input_filet$infilen”;
print O “total_readst$totalSeqn”;
print O “unique_readst$uniqueSeqn”;

Java code:

>>> java/src/net/sourceforge/seqware/pipeline/modules/qc/geneCoverage.java <<<

Did you change the workflow XML?

Ans: No, I don't know how to change workflow XML.

Index: java/src/net/sourceforge/seqware/pipeline/modules/qc/geneCoverage.java
===================================================================
— java/src/net/sourceforge/seqware/pipeline/modules/qc/geneCoverage.java (revision
1788)
+++ java/src/net/sourceforge/seqware/pipeline/modules/qc/geneCoverage.java (working
copy)
@@ -224,9 +224,8 @@
ReturnValue ret = new ReturnValue();
ret.setExitStatus(ReturnValue.SUCCESS);
ret.setRunStartTstmp(new Date());
– String output = (String)options.valueOf("outfile");
– String outplot = (String)options.valueOf("geneCoveragePlot");

+ String output = (String)options.valueOf("outfile");
+ String outplot = (String)options.valueOf("covergePlotFile");
StringBuffer cmd = new StringBuffer();
cmd.append(options.valueOf("perl") + " " +
options.valueOf("script") + " " + options.valueOf("infile") + " ");
cmd.append(options.valueOf("outfile") + " " +
options.valueOf("geneCoveragePlot") + " ");
@@ -245,7 +244,6 @@

//Need to fix here for plot in R, jyl
– //FIXME Only register single file??
FileMetadata fm2 = new FileMetadata();
fm2.setMetaType("png/geneCoverage plot");
fm2.setFilePath(outplot);

**************

Fixing QC flags ad hoc

Flag stringency has been lowered and located at: /home/jyli/current_projects/aggregat_report/qc_flags/collected_flags.xlsx

Current flag setting in the working modules:

uniqRead:

if ($uniquePerc > 30 && $uniquePerc &lt 95 1000000) { $totalFlag = 0;}
if ($uniqFlag ==1 || $totalFlag==1){$flag = 1;}

adapterCont/TrimCountAdapt:

if ($pctA < 15){$flag =0;}
if ($pct < 15) {$flag =0;}

IllQualStat:

if ($lines[$i] < 70) {$C_flag = 1;}
if ($lines[$i] < 70) {$I_flag = 1;}
if (($C_flag == 1) && ($I_flag == 1))
{ print OUT "illqualstat_flagt1n";
}else {print OUT "illqualstat_flagt0n";}

alignStat:

if ($percentage 90){ #lowered the stringency for reporting purpose, per group meeting discussion with Brian and Matt on 10132010
$flag = 1;
}else{ $flag = 0; }

qcGenome:

if ($pctR > 10 || $pctV > 5){$flag = 1;}

BCtrend:

$FLAG = 2;
if ($drop > $FLAG)
{
print OUT “$lineCntt$dropn”;
$flag = 1;
}

1. Instead of running the modules again (which takes much time), I am re-writing scripts to re-calculate the threshold and update the flags.
2. Workingdir: /home/jyli/svnroot/TCGA-dev/scripts/user/jyli/QC_modules/qcFlagFixing
3. For each QC flag
a. uniqRead: uniqRead_reCal.pl, fixing_uniqRead_flags.job
b. alignStat: alignStat_reCal.pl, fixing_alignStat_flags.job
c. qcGenome: qcGenome_reCal.pl, fixing_qcGenome_flags.job
d. adaptercont: adapterCont_reCal.pl fixing_adapterCont_flags.job

10292010

1. Get to watch over the seqware pipeline to catch errors early on
2. Two major module errors: srf2fastq and illqualstat. Brian fixed IllQualStat, and Matt is working with Brian to fix the other
3. Matt sent document for proper handling of parameterization in Perl program

***begining of program:

use strict;
use Getopt::Long;

my ($primer, $infile, $outfastq, $outstats, $seed_length, $flag_bad_adpt_value, $flag_bad_adpt_symbol, $min_read_length);
my $argSize = scalar(@ARGV);
my $getOptResult = GetOptions(‘primer=s’ => $primer,
‘infile=s’ => $infile,
‘outfastq=s’ => $outfastq,
‘outstats=s’ => $outstats,
‘seed_length=s’=>$seed_length,
‘flag_bad_adpt_value=i’ => $flag_bad_adpt_value,
‘flag_bad_adpt_symbol=s’ => $flag_bad_adpt_symbol,
‘min_read_length=i’ => $min_read_length);

usage() if ( $argSize < 14 || !$getOptResult)

***end of program, or where ever your functions are:
"
# statement displayed to user if parameters not properly filled
sub usage {
print "Unknown option: @_n" if ( @_ );
print "usage: program –primer [PRIMER or reverse complement of adapter, ACTG are possible] –infile [FASTQ format of reads] –outfastq [FASTQformat of trimmed reads] –outstats [tab-delimited text file of trimming statistics] –seed_length [Minumum length of the first adapter nucleotides you want to match sequencing reads] –flag_bad_adpt_value [Possible values are integers between [0..100]. What limit do you want to assign to adapter percentage so that the fastq file is marked 'bad'] –flag_bad_adpt_symbol [Possible values are 'lt' or 'gt'; Do you want to assign a 'bad' flag if the amount of adapter in fastq file is 'less than'(lt) or 'greater than'(gt) the 'flag_bad_adpt_value' paramter] –min_read_length [minimum number of nucleotides in each read of the filtered fastq file, the rest are thrown away] [[–help|-?]n";
exit;
}
************************************

ERRORS!!
There is problem with adapterCont files, 23 of them were in the TrimCountAdapt format!!
Solution: return the module first to convert those files to the correct format!
Script used: /home/jyli/svnroot/TCGA-dev/scripts/user/jyli/QC_modules/qcFlagFixing/TrimCountAdapterSingleRun.pl

Once this is fixed, will run adapter recaluation!

From Friday group meeting

1. Need to watch the seqware pipeline and identify problems if any
2. Need to re-run IllQualStat since this had the most failure from the pipeline run
3. Need to start/complete the aggregate plot urgently!!!!!

Checking QC output performance per Brian’s request

################
Also, you were going to make sure that all your code is checked in and you’re good for the 0.7.4 tag, is all your perl/java code checked in and passing compilation?
################

Using “100901_UNC6-RDR300211_00028_FC_62ERHAAXX” (lane 7) as an example since it has 328 success with no errors:

a. adapterCont:
run_status SUCCESS
adapter_ct 119
adapter_percnt 1.09
adaptor_flag 0

b. alignStat:

alignstat_query_file /datastore/scratch/users/brianoc/pegasus/RNASeqAlignmentBWA/run0262/./temp33634210274080599152657268018340463/100901_UNC6-RDR300211_00028_FC_62ERHAAXX.7.trimmed.annotated.sam
alignstat_total_read 10667
alignstat_mapped_read 8844
alignstat_mapped_percnt 82.9099090653417
alignstat_flag 0

c. BCtrend
bctrend_flag 0

d. IllQualStat:
experiment 100901_UNC6-RDR300211_00028_FC_62ERHAAXX
machine UNC6-RDR300211
lane 7
lane_yield 2313
clusters_raw 313252
clusters_raw_sd 22114
clusters_pf 253570
clusters_pf_sd 11888
1st_cycle_int 370
1st_cycle_int_sd 15
percnt_intensity_after_20_cycles 78.88
percent_intensity_after_20_cycles_sd 1.61
percent_pf_dlusters 81.08
percent_pf_clusters_sd 2.06
percent_align 0
percnet_align_sd
alignment_score 0
alignment_score_pf
percent_error_rate 0
illqualstat_flag 0

e. QCGenome:
qcgenome_query_file datastore/nextgenout/seqware-analysis/illumina/100901_UNC6-RDR300211_00028_FC_62ERHAAXX/seqware-0.7.0_RNASeqAlignmentBWA-0.7.3/seqware-0.7.0_BWA-0.7.0/100901_UNC6-RDR300211_00028_FC_62ERHAAXX.7.trimmed.fastq
qcgenome_read_total 10667
reads_rRNA 49
percnt_rRNA 0.46
reads_viral 4
percnt_viral 0.04
qcgenome_flag 0

f. uniqRead:
input_file /home/jyli/svnroot/TCGA-dev/scripts/user/jyli/QC_modules/uniqReadDir/100901_UNC6-RDR300211_00028_FC_62ERHAAXX.7.fastq
total_reads 10872
unique_reads 10737
unique_percent 98.7582781456954
read_pf_flag 1

Another example: 100810_UNC2-RDR300275_00019_FC_629FEAAXX (lane 4)

adapterCont:
run_status SUCCESS
adapter_ct 940956
adapter_percnt 4.42
adaptor_flag 0

alignStat:
alignstat_query_file /datastore/scratch/users/brianoc/pegasus/RNASeqAlignmentBWA/run0163/./temp763227124195285300994823188765076/100810_UNC2-RDR300275_00019_FC_629FEAAXX.4.trimmed.annotated.sam
alignstat_total_read 21230415
alignstat_mapped_read 18109190
alignstat_mapped_percnt 85.2983326044262
alignstat_flag 0

BCtrend:
bctrend_flag 0

IllQualStat:
experiment 100810_UNC2-RDR300275_00019_FC_629FEAAXX
machine UNC2-RDR300275
lane 4
lane_yield 1619
clusters_raw 202556
clusters_raw_sd 16202
clusters_pf 177569
clusters_pf_sd 13108
1st_cycle_int 419
1st_cycle_int_sd 13
percnt_intensity_after_20_cycles 79.64
percent_intensity_after_20_cycles_sd 1.99
percent_pf_dlusters 87.71
percent_pf_clusters_sd 1.10
percent_align 0
percnet_align_sd
alignment_score 0
alignment_score_pf
percent_error_rate 0
illqualstat_flag 0

QCGenome:
qcgenome_query_file datastore/nextgenout/seqware-analysis/illumina/100810_UNC2-RDR300275_00019_FC_629FEAAXX/seqware-0.7.0_RNASeqAlignmentBWA-0.7.4/seqware-0.7.0_BWA-0.7.0/100810_UNC2-RDR300275_00019_FC_629FEAAXX.4.trimmed.fastq
qcgenome_read_total 21230415
reads_rRNA 223212
percnt_rRNA 1.05
reads_viral 135055
percnt_viral 0.64
qcgenome_flag 0

uniqRead:
input_file /home/jyli/svnroot/TCGA-dev/scripts/user/jyli/QC_modules/uniqReadDir/100810_UNC2-RDR300275_00019_FC_629FEAAXX.4.fastq
total_reads 21308392
unique_reads 11634508
unique_percent 54.6005911661471
read_pf_flag 0

Checking the modules