I am collecting RNAseq analysis protocols.
Simple weblinks
Griffith lab tutorial From Harvard University training
Details about the methods
More on the statistical modeling part
DESeq2 tutorial 1 DESeq2 tutorial 2
Learning sepsis for this competition.
Collecting the parameters for sepsis prediction
Risk factors
Sepsis and septic shock are more common if you:
Are very young or very old
Have a compromised immune system
Have diabetes or cirrhosis
Are already very sick, often in a hospital intensive care unit
Have wounds or injuries, such as burns
Have invasive devices, such as intravenous catheters or breathing tubes
Have previously received antibiotics or corticosteroids
Bolts and nuts
We need git for version control.
I want to give credit to Kaggle, where I learned my first modeling from Trevor's Titanic Project.
Ask Guannan to install CentOS 7 as a dual boot on a Windows laptop.
Getting ready for the competition
Unfortunately, because we tried to enter this competition as freelance individuals, the organizer did not grant us the privilege. As a result, we have to give up and stop here. That does not prevent us from pursuing our study of data science, though. As a matter of fact, it prompted me to start a new site: Freelance Data Scientists.
I am following the Docker tutorial link to install and configure Docker on CentOS 7 Linux.
Turn on the Docker daemon
sudo systemctl start docker
A few interesting Docker images that are helpful:
You can get a specific Ubuntu version with this command:
docker run -it ubuntu:16.04 /bin/bash
This link explains what a Docker image is.
I am able to reproduce a few examples as a learning process
"bullet": a very simple Rshiny example, but it is useful!
My friend Master Xu has kindly shared his Docker development, aluminiWSU, which gave me a good tutorial.
I am attempting to build a Docker image for Kevin Day’s SignatureAnalysis, hopefully as my first “real” deployment with a Docker image.
SignatureAnalysis is an Rshiny application developed by a summer student, Mr. Kevin Day.
For a Docker Rshiny application, you can NOT install "shiny"!!
Here is the Docker image for a "pull" request: here.
My Docker log in is
Docker examples
For whatever reason, this becomes my first trial example. Wanted to do one for this competition though.
It turns out that my disk usage has grown into an issue and I need to clean it up. Here I am documenting the process.
Linux command to check disk space
df command - Shows the amount of disk space used and available on Linux file systems.
du command - Displays the amount of disk space used by the specified files and by each subdirectory.
btrfs fi df /device/ - Shows disk space usage information for a btrfs-based mount point/file system.
How much have I used? The du command does not behave as I expected; see the following:
du -h ~ : it lists everything!!
du -sh ~ : it takes forever!!
It is okay to check a specific directory
du -sh project2018/NTP_exome_project/
27T project2018/NTP_exome_project/
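To find out which subdirectories are worth cleaning, something like the following may help. It is only a minimal sketch, assuming GNU du and sort; largest_subdirs is a hypothetical helper name, not something from the tools above.

```shell
#!/bin/sh
# Sketch: list the immediate subdirectories of a directory, largest first.
# largest_subdirs is a hypothetical helper name (an assumption, not a standard tool).

largest_subdirs() (
    dir=${1:-.}      # directory to inspect (default: current directory)
    top=${2:-10}     # how many subdirectories to show
    # -k: sizes in KiB; --max-depth=1: one line per immediate subdirectory,
    # plus one total line for the directory itself; -x: stay on one file system.
    # sort -rn puts the largest first, so the total line comes out on top.
    du -k -x --max-depth=1 -- "$dir" 2>/dev/null | sort -rn | head -n "$((top + 1))"
)
```

For example, largest_subdirs project2018 5 would print the total for project2018 followed by its five largest subdirectories. It still has to walk the whole tree, so it is no faster than du -sh, but the output shows where to look next.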
Here is a magic command
find /ddn/gs1/home/li11/project2017/exomeSeq/withDBsnp/ -user li11 -type f -printf "%s\n" | awk '{t+=$1} END {print t}'
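The magic command above can be wrapped in a small reusable function. This is a minimal sketch, assuming GNU find (for -printf); total_bytes is a hypothetical helper name. Note that it adds up file sizes in bytes, which is not quite the same thing as the disk blocks that du reports.

```shell
#!/bin/sh
# Sketch: total size in bytes of all regular files under a directory,
# optionally restricted to one owner.
# total_bytes is a hypothetical helper name (an assumption).

total_bytes() (
    dir=$1
    user=${2:-}      # optional: user name or numeric user ID
    if [ -n "$user" ]; then
        find "$dir" -user "$user" -type f -printf '%s\n'
    else
        find "$dir" -type f -printf '%s\n'
    fi | awk '{ t += $1 } END { printf "%.0f\n", t }'
    # printf "%.0f" prints 0 for an empty tree and avoids
    # scientific notation for very large totals.
)
```

The result can be piped through numfmt --to=iec (GNU coreutils) for a human-readable figure, for example total_bytes project2017/exomeSeq li11 | numfmt --to=iec.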
To get protein-coding information:
Eventually, I will get a workflow for this.
I need the correct snpEff output.
An uninformative post
A guy tried to write an R module to convert MuTect1 calls to VCF format.
The worst problem was the genome build. The so-called black6 genome for the NTP mouse strain is NOT appropriate for annotation; one has to rely on mm10 instead.
From the GATK main page, one can create a user’s account to download the non-commercial use package.
MuTect VCF format posted here
MuTect2 documentation
Mutect2 has many good features and is designed to work for INDEL, but it does not produce signature context.
MuTect1 output format is NOT well defined
There was a good forum post on the formatting question
Discerning the differences between the outputs of the two MuTect versions.
This is an astonishingly good post
I am using MuTect and MuTect2 for a WES project, and here are some useful documents I have collected over time.
MuTect1 was a component of GATK.
GATK protocol — revisit
It has been some time since I last used GATK. Many improvements have been made since then, and I would like to revisit this software for a WES project.
Here is one from SEQanswers in 2012, but it is still very useful
samtools sort .bwa.bam .bwa.sort
#Index
samtools index .bwa.sort.bam
#mark duplicate
java -Xmx5g -jar MarkDuplicates.jar INPUT=.bwa.sort.bam OUTPUT=.bwa.sort.deduped.bam METRICS_FILE=.duplicates REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
#Realignment based on known insert sites (Using Java 1.7 from now on as required by GATK)
java -Xmx5g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R Reference.fa -I .bwa.sort.deduped.arg.bam -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o .realign.intervals -S LENIENT
java -Xmx5g -jar GenomeAnalysisTK.jar -T IndelRealigner -R Reference.fa -I .bwa.sort.deduped.arg.bam -targetIntervals .realign.intervals -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o .bwa.sort.deduped.arg.realigned.bam -S LENIENT
java -Xmx5g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Reference.fa -l INFO -I .bwa.sort.deduped.arg.realigned.bam -knownSites 1000G_phase1.indels.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites dbsnp_137.hg19.vcf -o .recalibration_report.grp -S LENIENT
java -Xmx5g -jar GenomeAnalysisTK.jar -T PrintReads -R Reference.fa -l INFO -I .bwa.sort.deduped.arg.realigned.bam -BQSR .recalibration_report.grp -o .bwa.sort.deduped.arg.realigned.recalibrated.bam -S LENIENT
Came across a paper in BMC Genomics comparing five different somatic SNP callers.
GATK UnifiedGenotyper in NaiveSubtract
MuTect1
SomaticSniper from the original paper
Installation help document for Strelka and the original paper in Bioinformatics
VarScan2 from the original paper
The classical samtools method also works.
Calling variants with samtools/bcftools
Help from samtools protocol
Help from samtools/bcftools protocol
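For reference, a typical bcftools call looks roughly like the sketch below. It assumes a reasonably recent bcftools (one that provides the mpileup subcommand); ref.fa, sample.bam and calls.vcf.gz are hypothetical file names, and call_variants and the DRY_RUN switch are my own illustration rather than anything from the linked protocols.

```shell
#!/bin/sh
# Sketch: call variants from one BAM file with bcftools.
# call_variants and DRY_RUN are hypothetical names (assumptions);
# the bcftools options themselves are standard:
#   mpileup -f : reference FASTA
#   call -m    : multiallelic caller
#   call -v    : output variant sites only
#   call -Oz   : write bgzip-compressed VCF, -o : output file

call_variants() (
    ref=$1
    bam=$2
    out=$3
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show the command that would be run.
        printf '%s\n' "bcftools mpileup -f $ref $bam | bcftools call -mv -Oz -o $out"
    else
        bcftools mpileup -f "$ref" "$bam" | bcftools call -mv -Oz -o "$out"
    fi
)
```

Running DRY_RUN=1 call_variants ref.fa sample.bam calls.vcf.gz prints the pipeline without executing it, which is handy for checking file names before a long run.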
Starting from the SomaticSignatures R package, I am documenting my work on cancer mutation signatures.
Someone else is also interested in this and has published signR
Since I updated NMF, this SomaticSignatures function breaks
Popular WES analysis protocols
GeneStack hits the top of the list, way to go!
The most popular free tool is bedtools, with an example and a tutorial example by the Quinlan Lab
More users' examples are available in the main bedtools manual
Nicola Roberts has a Hierarchical Dirichlet Process method for extracting cancer mutation signatures.
License-based bioinformatics software
Some important links
A note on the CIGAR strings
The original SAM manual
A note on bitwise flag interpretation
A real time calculator
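As a reminder of how the FLAG and CIGAR fields work, here is a minimal sketch in plain sh and awk. The bit values and the set of reference-consuming operations follow the SAM specification; decode_flag, cigar_ref_len and the short bit labels are hypothetical names of my own.

```shell
#!/bin/sh
# Sketch: decode a SAM FLAG and measure the reference span of a CIGAR string.
# decode_flag and cigar_ref_len are hypothetical helper names (assumptions).

decode_flag() (
    flag=$1
    bit=1
    out=""
    # Names are listed in bit order: 0x1, 0x2, 0x4, ... 0x800.
    for name in paired proper_pair unmapped mate_unmapped reverse mate_reverse \
                read1 read2 secondary qc_fail duplicate supplementary; do
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$(( bit * 2 ))
    done
    printf '%s\n' "${out:-none}"
)

cigar_ref_len() (
    # Walk the CIGAR one <length><operation> pair at a time.
    # Only M, D, N, = and X consume reference bases.
    awk -v c="$1" 'BEGIN {
        len = 0
        while (match(c, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(c, 1, RLENGTH - 1) + 0
            op = substr(c, RLENGTH, 1)
            if (index("MDN=X", op)) len += n
            c = substr(c, RLENGTH + 1)
        }
        print len
    }'
)
```

For example, decode_flag 99 prints paired,proper_pair,mate_reverse,read1, and cigar_ref_len 8M2I4M1D3M prints 16 (the 2I is skipped because insertions do not consume the reference).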