R apply usage

Scenario 1

I have a datamatrix of RNAseq data, the ID column looks something like the following

-----------------
head(geneIDs)
-----------------
"Xkr4|497097|locus1of1|1"  
"Sox17|20671|locus1of1|5"  
"Mrpl15|27395|locus1of1|3"
"Lypla1|18777|locus1of1|1" 
"Tcea1|21399|locus1of1|3"  
"Rgs20|58175|locus1of1|3" 

I want to parse them out and retain only the gene symbols, “sapply” here is very handy

as.vector(sapply (geneIDs, FUN = function (x) {strsplit(x, "\\|")[[1]][1]}))

Running the above line get me the following:

"Xkr4"  
"Sox17"  
"Mrpl15"
"Lypla1" 
"Tcea1"  
"Rgs20" 

WGS and WES variant calling — alignment score interpretation

Our collaborator, Dr. Laura Riva, indicates a criteria to filter out variant in her paper.

“Second, we removed base substitutions with a median alignment score of mutation-reporting reads <140 and complex indels."

This leads to my research of what it is.

Firstly, there is an explanation of alignment score by the Broad Institute

grep -v "@" seq_aln.sam | awk -F"\t" 'BEGIN{print "flag\toccurrences"} {a[$2]++} END{for(i in a)print i"\t"a[i]}'

Single Cell Analysis

Main steps:

1. Cellranger — get read
2. Seurat — analysis
3. Deconvolute — analysis

CIBERSORTx:  https://cibersortx.stanford.edu/
 
A benchmark for RNA-seq deconvolution analysis under dynamic testing environments
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02290-6
 
Benchmarking of cell type deconvolution pipelines for transcriptomics data
https://www.nature.com/articles/s41467-020-19015-1

STAR – RNAseq – GATK – Mutect2

Scenario:

For RNAseq data, if one uses STAR for the alignment, it could present difficulty for GATK toolkit for the following reasons (https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/)

STAR uses a similar scoring scheme to Tophat except that the value for uniquely mapped reads is 255 instead of 50.
The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/[number of loci the read maps to])) for multi-mapping reads. This scheme is same as the one used by Tophat…

In the BQSR steps, we will have this:

64162342 reads (100.00% of total) failing MappingQualityUnavailableFilter

Others (http://seqanswers.com/forums/showthread.php?t=21593) reported GTAK related challenge:

“255 actually in GATK’s CountCovariatesWalker is filtered out by MappingQualityUnavailableFilter in the top of walker:”

It will fail at the Mutect2 variant calling step.

From the SAM specification (http://samtools.github.io/hts-specs/SAMv1.pdf):

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

In the GATK legacy post (https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2013-06-06-2013-02-12/2187-Can-I-bypass-the-MappingQualityUnavailableFilter-in-Unified-Genotyper)

Bismark sets MAPQ to 255 for all alignments.

“ -rf ReassignMappingQuality -DMQ 60 ”

People have posted the same errors here at (https://github.com/bcbio/bcbio-nextgen/issues/2163) and propose the following work around

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

Solution:

Work around with STAR

“So what is needed I think is just adding –outSAMmapqUnique 60 in STAR CLI.”

Work around with GTAK

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

Alternatively, one can add the option “-T PrintReads -U ALLOW_N_CIGAR_READS” in the BQSR step to allow CIGAR string with “N”.

And, can turn off the filter with the option “–disable-read-filter MappingQualityAvailableReadFilter” in the mutect2 calling method

Top 75 bioinformatics blogs

Haha, there are ONLY up to 69 of them and it is not hard to make top 75, lol. I would like to see them all.

Over the time, I would like to collect them and rank them (by personal view only now) hopefully according to some “objective metric”.

  1. But, I like to rank Titus Brown’s site above Dave Tang’s. No offense to Dave, it it my personal opinion that I find Titus’ contains higher quality of work. Sorry Dave!. Titus’ home page has an interesting name
  2. Here is one from Dave Tang

 

Machine learning note

Here, an old topic but keeps coming to me — machine learning. Java? Python? or R?

Let’s focus on some readings.

A guy posts decision making algorithm with implementation in python, 
of course!

He does not reply why data link is broken, which can be easily found 
here at kaggle competition site.

Here is a very simple tutorial about decision tree algorithm.

Another example here should be enough

Found another quite basic explanation

More on decision tree — iterative preorder traversal

Here is the first example

Decision Tree implementation

A simple python example.

Another guy provides smile  (Statistical Machine Intelligence and 
Learning Engine) package implemented in Java

A guy posts decision making algorithm with implementation in python, 
of course!

Hierarchical Dirichlet Process in Cancer Study

From a recent paper on Nature Genetics by our collaborator Laura, from the supplemental data I learned that she used Hierarchical Dirichlet Process to extract the signature.

In fact, it is her colleague Nicola Roberts, who first posted the HDP R package in 2015. Nicola also has a vignette on his package using the TCGA data to publish his HDP package.

It is quite interesting but the process is a computational intensive process.

An Rshiny app for index calculation

Following several good help links

A complete one that works.
It links to a github project on this topic.
At this moment, I am stuck with scenario that the app works 
independently, but does NOT work once I put it as one of the tabs. 
And, ultimately, it is the case how to separate into separate files.
Here is a post on modulize the app
The links an example of theRShiny output a modoulized way.

In order to modify the current Signature Analysis app, I need to re-visit Rshiny basic

I am reading this tutorial to refresh my knowledge.