DII challenge

Learning about sepsis for this competition.

Collecting the parameters for sepsis prediction

    Blood pressure: If sepsis progresses to septic shock, blood pressure drops dramatically. This may lead to death.
    High levels of lactic acid in your blood

 

Risk factors
Sepsis and septic shock are more common if you:

Are very young or very old
Have a compromised immune system
Have diabetes or cirrhosis
Are already very sick, often in a hospital intensive care unit
Have wounds or injuries, such as burns
Have invasive devices, such as intravenous catheters or breathing tubes
Have previously received antibiotics or corticosteroids

Nuts and bolts

We need git for version control.
I want to give credit to Kaggle, where I learned my first modeling from Trevor's Titanic project.
Ask Guannan to install CentOS 7 as a dual boot on a Windows laptop.

Getting ready for the competition

	
  • Create a team (07/16/2019); team lead: Guannan; team name: shenzhou
  • Create a GitHub team called DII_ShenZhou
  • Unfortunately, because we tried to enter this competition as freelance individuals, the organizer did not grant us the privilege. As a result, we had to give up and stop here. That does not prevent us from pursuing the study of data science. As a matter of fact, it prompted me to start a new site: Freelance Data Scientists.

    My Docker notes

    Follow the Docker tutorial link to install and configure Docker on CentOS 7 Linux.

    Start the Docker daemon

    sudo systemctl start docker
    

    A few interesting Docker images that are helpful:

    You can run a specific Ubuntu version with this command:

    docker run -it ubuntu:16.04 /bin/bash
    

    This link explains what a Docker image is.

    I was able to reproduce a few examples as a learning exercise:

    "bullet": a very simple R Shiny example, but a useful one!
    My friend Master Xu kindly shared his Docker development project, aluminiWSU, which gave me a good tutorial.
    

    I am attempting to build a Docker image for Kevin Day’s SignatureAnalysis, hopefully as my first “real” deployment with a Docker image.

    SignatureAnalysis is an R Shiny application developed by a summer student, Mr. Kevin Day.
    For a Dockerized R Shiny application, you can NOT install "shiny"!!
    The Docker image to "pull" is available here.
    

    My Docker login is

     
    	
  • dockerli11
  • password -- my initial one
  • Docker examples

    For whatever reason, this became my first trial example.
    I wanted to do one for this competition, though.
    

    Clean up disk space — linux

    It turns out that my disk usage has grown into an issue and I need to clean it up. Here I document the process.

    Linux commands to check disk space

    df command - Shows the amount of disk space used and available on Linux file systems.
    du command - Display the amount of disk space used by the specified files and for each subdirectory.
    btrfs fi df /device/ - Show disk space usage information for a btrfs based mount point/file system.
    

    How much have I used? Running du on the whole home directory is not practical:

    du -h ~ : it lists everything !!
    du -sh ~ : it takes forever!!
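One way to make the output manageable is to report only the first level of the tree and sort it by size. This does not make the scan any faster (du still walks every file), but it shows where the space went, so you can then drill into one directory at a time. A minimal sketch, assuming GNU coreutils (`du --max-depth`, `sort -h`); the function name `top_dirs` is hypothetical:

```shell
# top_dirs DIR [N]: print DIR's total and its largest direct subdirectories,
# largest first, limited to N lines (default 10).
# Hypothetical helper; assumes GNU du and GNU sort.
top_dirs() {
    local dir="${1:-.}" n="${2:-10}"
    # --max-depth=1 limits the report (not the scan) to DIR and its direct subdirectories.
    # sort -rh orders human-readable sizes (K, M, G, T) from largest to smallest.
    du -h --max-depth=1 "$dir" 2>/dev/null | sort -rh | head -n "$n"
}
```

For example, `top_dirs ~ 20` would list the home directory total followed by its biggest subdirectories.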
    

    It is okay to check a specific directory

    du -sh  project2018/NTP_exome_project/
    27T     project2018/NTP_exome_project/
    

    Here is a magic command that sums the sizes, in bytes, of all regular files owned by a given user under a directory

    find /ddn/gs1/home/li11/project2017/exomeSeq/withDBsnp/ -user li11 -type f -printf "%s\n" | awk '{t+=$1} END {print t}' 
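The find/awk pipeline prints a raw byte count. The same idea can be wrapped in a small function so that the directory and user are arguments. This is only a sketch, assuming GNU find (for `-printf`); the function name `owned_bytes` is hypothetical:

```shell
# owned_bytes DIR USER: total size in bytes of regular files under DIR owned by USER.
# Hypothetical helper; assumes GNU find for -printf.
owned_bytes() {
    # %s prints each file's size in bytes, one per line; awk adds them up.
    # printf "%.0f" keeps large totals out of scientific notation, which some
    # awk implementations may otherwise use, and prints 0 when nothing matches.
    find "$1" -user "$2" -type f -printf '%s\n' | awk '{t+=$1} END {printf "%.0f\n", t}'
}
```

To read the result in K/M/G/T units, the output can be piped through GNU `numfmt`, for example `owned_bytes project2018/NTP_exome_project li11 | numfmt --to=iec`.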
    

    Using GATK – MuTect/MuTect2

    From the GATK main page, one can create a user account to download the package for non-commercial use.

    MuTect VCF format posted here
    MuTect2 documentation
    MuTect2 has many good features and is designed to handle indels as well, but it does not produce the signature context.

    MuTect1 output format is NOT well defined

    There was a good forum post on the formatting question.

    Discerning the differences between the outputs of the two MuTect versions.

    This is an astonishingly good post.

    I am using MuTect and MuTect2 for a WES project, and here are some useful documents I collected over time.

    MuTect1 was part of GATK.

    GATK protocol — revisit
    It has been some time since I last used GATK. Many improvements have been made since then, and I would like to revisit this software for a WES project.

    Here is a protocol from SEQanswers posted in 2012; it is still very useful.

    #Sort (legacy samtools 0.1.x syntax: input BAM, then output prefix)
    samtools sort .bwa.bam .bwa.sort
    
    #Index 
    samtools index .bwa.sort.bam
    
    #Mark duplicates
    java -Xmx5g -jar MarkDuplicates.jar INPUT=.bwa.sort.bam OUTPUT=.bwa.sort.deduped.bam METRICS_FILE=.duplicates REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
    
    
    #Realignment based on known indel sites (using Java 1.7 from now on, as required by GATK)
    java -Xmx5g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R Reference.fa -I .bwa.sort.deduped.arg.bam -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o .realign.intervals -S LENIENT
    
    java -Xmx5g -jar GenomeAnalysisTK.jar -T IndelRealigner -R Reference.fa -I .bwa.sort.deduped.arg.bam -targetIntervals .realign.intervals -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o .bwa.sort.deduped.arg.realigned.bam -S LENIENT
    
    #Base quality score recalibration against known variant sites
    java -Xmx5g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Reference.fa -l INFO -I .bwa.sort.deduped.arg.realigned.bam -knownSites 1000G_phase1.indels.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites dbsnp_137.hg19.vcf -o .recalibration_report.grp -S LENIENT
    
    #Write the recalibrated reads using the recalibration report
    java -Xmx5g -jar GenomeAnalysisTK.jar -T PrintReads -R Reference.fa -l INFO -I .bwa.sort.deduped.arg.realigned.bam -BQSR .recalibration_report.grp -o .bwa.sort.deduped.arg.realigned.recalibrated.bam -S LENIENT
    

    Calling somatic variants

    I came across a paper in BMC Genomics comparing five different somatic SNP callers.

    GATK UnifiedGenotyper in NaiveSubtract
    MuTect1
    SomaticSniper from the original paper
    Installation help document for Strelka and the original paper in Bioinformatics
    VarScan2 from the original paper
    The classical samtools method also works.
    
    

    Calling variants with samtools/bcftools

    Help from samtools protocol
    Help from samtools/bcftools protocol
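As a rough illustration of what such a protocol boils down to, here is a sketch of the usual two-step bcftools pipeline. It assumes bcftools 1.x is installed and calls variants on a single sample, so it is not a tumor/normal somatic comparison. The function name `call_variants`, the `DRY_RUN` switch, and the file names in the usage line are hypothetical:

```shell
# call_variants REF BAM OUT: pile up reads and call SNPs/indels into a compressed VCF.
# Hypothetical helper; set DRY_RUN=1 to print the pipeline instead of running it.
call_variants() {
    local ref="$1" bam="$2" out="$3"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bcftools mpileup -Ou -f $ref $bam | bcftools call -mv -Oz -o $out"
        return 0
    fi
    # mpileup computes genotype likelihoods from the alignments (-Ou: uncompressed BCF on the pipe).
    # call -m uses the multiallelic caller, -v keeps variant sites only, -Oz writes a bgzipped VCF.
    bcftools mpileup -Ou -f "$ref" "$bam" | bcftools call -mv -Oz -o "$out"
}
```

Running `DRY_RUN=1 call_variants Reference.fa sample.bam calls.vcf.gz` prints the command line first, which is a cheap way to check the arguments before starting a long job.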
    

    Working with cancer mutation signatures

    Starting from the SomaticSignatures R package, I am documenting my work on cancer mutation signatures.

    Someone else is also interested in this and has published signR.

    Since I updated NMF, this SomaticSignatures function breaks.

    Popular WES analysis protocols

    GeneStack tops the list, way to go!
    The most popular free tool is bedtools, with an example and a tutorial by the Quinlan Lab
    More user examples are available in the main bedtools manual
    

    Nicola Roberts has a Hierarchical Dirichlet Process method to extract cancer mutation signatures.

    License-based bioinformatics software