Of course the hosting company we use is: dreamhost
A doomain name can be purchase from namecheap
I learned quite a lot from the following three youtube videos
David give a step-by-step tutorial on building a website.
Of course the hosting company we use is: dreamhost
A doomain name can be purchase from namecheap
I learned quite a lot from the following three youtube videos
David give a step-by-step tutorial on building a website.
I am trying to run jupyter notebook from our network server and I am connecting it from my windows machine with PuTTY. It works and I want to document this for future reference.
This link provides a very good help!
Basic set up via PuTTY connection from windows
Tunneling set up
Cn040701 is the proper name for ib01, which is the one I usually log on to.
The to run jupyter:
jupyter notebook –no-browser –port=8889 –ip=0.0.0.0
And finally I went to http://localhost:8889 to open it in a browser.
Scenario 1
I have a datamatrix of RNAseq data, the ID column looks something like the following
----------------- head(geneIDs) ----------------- "Xkr4|497097|locus1of1|1" "Sox17|20671|locus1of1|5" "Mrpl15|27395|locus1of1|3" "Lypla1|18777|locus1of1|1" "Tcea1|21399|locus1of1|3" "Rgs20|58175|locus1of1|3"
I want to parse them out and retain only the gene symbols, “sapply” here is very handy
as.vector(sapply (geneIDs, FUN = function (x) {strsplit(x, "\\|")[[1]][1]}))
Running the above line get me the following:
"Xkr4" "Sox17" "Mrpl15" "Lypla1" "Tcea1" "Rgs20"
Our collaborator, Dr. Laura Riva, indicates a criteria to filter out variant in her paper.
“Second, we removed base substitutions with a median alignment score of mutation-reporting reads <140 and complex indels."
This leads to my research of what it is.
Firstly, there is an explanation of alignment score by the Broad Institute grep -v "@" seq_aln.sam | awk -F"\t" 'BEGIN{print "flag\toccurrences"} {a[$2]++} END{for(i in a)print i"\t"a[i]}'
Main steps:
1. Cellranger — get read
2. Seurat — analysis
3. Deconvolute — analysis
CIBERSORTx: https://cibersortx.stanford.edu/ A benchmark for RNA-seq deconvolution analysis under dynamic testing environments https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02290-6 Benchmarking of cell type deconvolution pipelines for transcriptomics data https://www.nature.com/articles/s41467-020-19015-1
Scenario:
For RNAseq data, if one uses STAR for the alignment, it could present difficulty for GATK toolkit for the following reasons (https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/)
STAR uses a similar scoring scheme to Tophat except that the value for uniquely mapped reads is 255 instead of 50.
The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/[number of loci the read maps to])) for multi-mapping reads. This scheme is same as the one used by Tophat…
In the BQSR steps, we will have this:
64162342 reads (100.00% of total) failing MappingQualityUnavailableFilter
Others (http://seqanswers.com/forums/showthread.php?t=21593) reported GTAK related challenge:
“255 actually in GATK’s CountCovariatesWalker is filtered out by MappingQualityUnavailableFilter in the top of walker:”
It will fail at the Mutect2 variant calling step.
From the SAM specification (http://samtools.github.io/hts-specs/SAMv1.pdf):
MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
In the GATK legacy post (https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2013-06-06-2013-02-12/2187-Can-I-bypass-the-MappingQualityUnavailableFilter-in-Unified-Genotyper)
Bismark sets MAPQ to 255 for all alignments.
“ -rf ReassignMappingQuality -DMQ 60 ”
People have posted the same errors here at (https://github.com/bcbio/bcbio-nextgen/issues/2163) and propose the following work around
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
Solution:
Work around with STAR
“So what is needed I think is just adding –outSAMmapqUnique 60 in STAR CLI.”
Work around with GTAK
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
Alternatively, one can add the option “-T PrintReads -U ALLOW_N_CIGAR_READS” in the BQSR step to allow CIGAR string with “N”.
And, can turn off the filter with the option “–disable-read-filter MappingQualityAvailableReadFilter” in the mutect2 calling method
Haha, there are ONLY up to 69 of them and it is not hard to make top 75, lol. I would like to see them all.
Over the time, I would like to collect them and rank them (by personal view only now) hopefully according to some “objective metric”.
Here, an old topic but keeps coming to me — machine learning. Java? Python? or R?
Let’s focus on some readings.
A guy posts decision making algorithm with implementation in python, of course! He does not reply why data link is broken, which can be easily found here at kaggle competition site. Here is a very simple tutorial about decision tree algorithm. Another example here should be enough Found another quite basic explanation
More on decision tree — iterative preorder traversal
Here is the first example
Decision Tree implementation
A simple python example. Another guy provides smile (Statistical Machine Intelligence and Learning Engine) package implemented in Java A guy posts decision making algorithm with implementation in python, of course!
From a recent paper on Nature Genetics by our collaborator Laura, from the supplemental data I learned that she used Hierarchical Dirichlet Process to extract the signature.
In fact, it is her colleague Nicola Roberts, who first posted the HDP R package in 2015. Nicola also has a vignette on his package using the TCGA data to publish his HDP package.
It is quite interesting but the process is a computational intensive process.
I am working on AccordGWAS pipeline and need SLURM system to submit job to the cluster. Well, that gives me a reason to touch this:
How come I can miss this ultimate source!
Here is my first link, and I find this very useful tips.
The second one is smange that seems to do a good job!
More on SLURM/sbatch examples