I am accumulating some cool shell scripts in linux
Scenario 1
Case: I own ” /nfs/sva01/SequenceAnalyses/Concordance/BUILD37/b37_exome_concordance.csv”, so Latasha’s concordance check fails
Solution:
Step 1. Create a shortList
Step 2. Run this script:
for k in `cat shortList `; do echo $k; qsub /nfs/seqsata12/ALIGNMENT/BUILD37/EXOME/$k/Scripts/concordance.sh; done;
Scenario 2
Case: “bwa59_b37_v4.pl” has a bug, therefore symbolic link does NOT work.
Solution
Step 1: create a shortList
Step 2: run the following script,for k in `cat shortList `; do echo $k; cd /nfs/seqsata12/ALIGNMENT/BUILD37/EXOME/$k/combined ; ln -s var_flt_vcf var_flt_vcf.vcf ; cd ../../ ; done;
Scenario 3
Case: There are many columns in/nfs/sva01/SequenceAnalyses/Concordance/BUILD37/b37_exome_concordance.csv, and it is a comma-delimit files. I want a specific column
Solution
Step 1. Replacing the “,” by tab
Step 2. Use “cut -f” command
Step 3. Replacing unnecessary characters, i.e. quote, space etc.
Step 4. Grep the pattern..Run the following scripts,
sed ‘s/,/t/g’ temp.csv | cut -f 4 | more
sed ‘s/,/t/g’ temp.csv | cut -f 4 | sed ‘s/”//g’ | grep -w “NA”
Scenario 4
Case: Finding record between two files
Solution
Step 1. Samples in a file that exist in another file (intersection)
for k in `cat file1`; do grep $k file2 > /dev/null && echo $k ; done;Step 2. Samples in a file that do not exist in another file (negation)
for k in `cat file1`; do grep $k file2 > /dev/null || echo $k ; done;Step 3. Real scenario:
for k in `cat samples_wo_topstrandFile`; do grep $k /nfs/sva01/SequenceAnalyses/Concordance/BUILD37/b37_exome_concordance.csv > /dev/null && echo $k ; done;
Scenario 5
Case: There are many columns in /nfs/chgv/seqraw01/summaries/exRealign_Summary_b37.txt, and we wanted to extract “sample name information” from column three. Unfortunately, there are various number of directory layers. Also, there are duplicate sample entries.
Solution
Step 1. Use “cut -f 3 ” to extract the third column
Step 2. The split column three by “/”
Step 3. The use awk to print the last field
Step 4. Sort the list and get unique onlyWatch out for those “quote”, they many not work if directly copied to Unix/Linux environment. Special handling may be needed!
Run the following scripts,
cut -f 3 ~/exRealign_Summary_b37.tx | sed ‘s///t/g’ > allSamples.txt
awk ‘{print $NF}’ allSamples.txt > allSampleList.txt
sort -t $’t’ -k1,1 allSampleList.txt | uniq > uniqSampleList.txtOkay, in a single line:
cut -f3 /nfs/chgv/seqraw01/summaries/exRealign_Summary_v1.8_02.02.12.txt | sed ‘s///t/g’ | awk ‘{print $NF}’ | sort -t $’t’ -k1,1 | uniq | wc -l
Scenario 6
Case: There are many columns in a file and are tab-delimited, I want to know how many columns it has
Solution
For tab-delimited:
[jl407@sva no_filter]$ awk -F’\t’ ‘{print NF;exit}’ ../concordanceMeta0001.part2.match.txt
3
Okay, there are only 3 columns!!For space-delimited:
[li11@ehscmplp09/bioinfo1 RNAseqPrep]$ awk -F ‘ ‘ ‘{print NF; exit }’ refDB/UCSC_KnownGene_Dec2009/KnownGeneDec09.hg19.exon.txt
3Bingo!!
Scenario 7
Case: I have a string and want to shorten a character in the beginning and end simultaneously.
Solution
$string =~ s/^.(.*).$/$1/;
Scenario 8
Case: I have two lists, and I want to get intersection, union, and unique list.
Solution
#Getting intersection
comm -12 <(sort noMatchedTotal_inHouse) <(sort noMatch_intersect)
grep -xF -f set1 set2
sort set1 set2 | uniq -d#Getting unique list
comm -23 <(sort noMatchedTotal_inHouse) <(sort noMatch_intersect) > noMatch_uniq_inHousecomm -23 <(sort noMatchedTotal_SVA) <(sort noMatch_intersect) > noMatch_uniq_SVA
#Getting union list
sort -n set1 set2 | uniqcat set1 set2 | awk ‘!found[$1]++’
awk ‘!found[$1]++’ set1 set2
awk ‘{print $0}’ set1 set2 | sort -u
Scenario 9
Case: Working with two loops in bash search
for k in `cat ~/tmp_sample_list`; do echo $k; for f in seqsata01 seqsata02 seqsata03 seqsata04 seqsata05 seqsata06 seqsata07 seqsata08 seqsata09 seqsata10 seqsata11 ; do echo /nfs/$f/ALIGNMENT/BWA_OUTPUT; ls -l /nfs/$f/ALIGNMENT/BWA_OUTPUT | grep “$k”; done; done;
Scenario 10
Case: There was an error caused by loosing of a computer node “compute2 was shut down abruptly”. So, we had to replace all “copy.sh” scripts”. Here is the bash script Joe provided. Pretty cool!
for i in `ls -1 */Scripts/copy.sh | cut -d’/’ -f1`; do cat $i/Scripts/copy.sh | sed ‘s/compute2/sva10/g’ > $i/Scripts/newcopy.sh | mv $i/Scripts/newcopy.sh $i/Scripts/copy.sh ; done;
Please be aware that this is “-1” (one) after “ls” but not “l”
Scenario 11
Case: I have a huge text files and I wanted to strip out the first 3 lines.
To exclude the first three lines of a file
tail -n +3 file > newfileTo exclude/remove blank lines from a text file, use the following command:
$ sed ‘/^$/d’ input.txt > output.txt
$ grep -v ‘^$’ input.txt > output.txtTo remove whitespace after the ID in genomic sequence, I used this trick. It seems working pretty well.
perl -plane ‘s/\s+.+$//’ PipeIN genome.fa PipeOUT new_genome.faI have a huge text files and I wanted to strip out the last line, use the following GNU command
sed -i '$ d' foo.txt
Scenario 12
Case: I want to eliminate empty lines in a file
Solution, use “sed” to process files under linux
sed ‘/^$/d’ myFile > tt
mv tt myFile
Another useful application with “sed”, to delete lines matching the expression
sed ‘/^expression/ d’ infile.txt > outfile.txt
To parse out the “Read depth information in a VCF format file”, it becomes quite convenient
cut -f8 C3_M_fltVCF.vcf | cut -d ";" -f1 | sed s'/DP=//' | sed s'/INDEL//' | sed '/^$/d' > temp.txt mv temp.txt C3_M_fltVCF.vcf.DP.txt
Scenario 13
Have to seek for faster SNV calling with samtools, I found this –> mpileup function is very helpful
Solution
With xargs, I parallel the snp call by chromosomes (-P is the number of parallel processes started parallel
samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"
Then merge the results afterwards
samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'
Scenario 14
I have a list of imprinted SNP-genes across mouse genomes, I want to know how many SNP/genes in each chromosome. Therefore, I need a bash loop script to quickly get an assessment.
Solution:
for ((i=1; i<=19; i++)); do echo chr$i; grep -w chr$i GSE27016_imprinted_genes.mm9.bed | wc -l ; echo; done ; for i in {1..19}; do echo chr$i; done;
Scenario 15
I have Illumina exome capture region file and want to know how many bases are covered. It can be extended to ask for the sum of any column of numeric data.
Solution:
awk '{SUM +=$5} END {print SUM; }' TruSeq-Exome-65mb-BED-file_with_MT 62085295 awk '{SUM +=$3-$2} END {print SUM; }' SureSelect_All_Exon_37mb_hg19_bed_with_MT 38831632 awk '{SUM +=$3-$2} END {print SUM; }' SureSelect_All_Exon_50mb_hg19_bed_with_MT 51663197
Scenario 16
Du, Ying asks me to help her to append an extra unique ID column to a column file
awk '{print $1"_"$2"_"$3}' TruSeq-Exome-65mb-BED-file_with_MT > temp.txt paste -d" " temp.txt TruSeq-Exome-65mb-BED-file_with_MT > temp2.txt
Scenario 17
In our FPKM file, I noticed duplicates “IDs”. What are those duplicate records?
cd ~/project2012/ASE/FPKM_RNAseq/FPKM cut -f1 C3_M_3.mm9.genes.fpkm_tracking.fpkm.txt | perl -ne 'print if $SEEN{$_}++'
Scenario 18
In my ASE list, it has ^M at the end of each line. I need to remove them in Unix
To view it: cat -v ASE_gene_jyl.txt cat ASE_gene_jyl,txt | sed 's/^M//g' > temp mv temp ASE_gene_jyl.txt
Scenario 19
I am working on getting the structured snp list requested by Dr. Fargo, and I need to feed snp numbers with leading zero. The answer is not with unix script but a perl script
my $num = "123"; $num = sprintf("%06d", $num); $num=~ tr/ /0/;
Scenario 20
Justin from NCSU needs help to modify his bed file for Galaxy. Input if like this:
1 0 3000236 0 1 3000236 3000272 1 1 3000272 3000311 0 1 3000311 3000347 1 1 3000347 3000628 0 1 3000628 3000645 1
And, he wanted:
chr1 0 3000236 0 chr1 3000236 3000272 1 chr1 3000272 3000311 0 chr1 3000311 3000347 1 chr1 3000347 3000628 0 chr1 3000628 3000645 1
Solution
awk '{print "chr"$1}' fileIN.txt > modCol1 cut -f2-4 fileIN.txt > reminders paste modCol1 reminders > newFile rm -f modCol1 rm -f reminders
Scenario 21: I wanted to extract file name from a string with path
../rawFastq/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq
Solution, well bash script
myString="../rawFastq/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq" OIFS="$IFS" IFS='/' myArray=($myString) IFS=$OIFS echo ${#myArray[@]} i=0 while [ $i -lt ${#myArray[@]} ] do echo "Element $i:${myArray[$i]}" (( i=i+1 )) done #print out file name only lastIndex=${#myArray[@]}-1 echo "${myArray[$lastIndex]}"
Scenario 22: I created unique 22mers from Rick’s microRNAseq project total 7257, and I wanted to get a fasta format of it
Solution, pure bash script
Head file looks like the following "x" "GAACCAGAAGGCTGTCTGAACT" "AGAAGGCTGTCTGAACTTTTGA" "AGGCTGTCTGAACTTTTGAAAC" "GGAACGAGTGTGAGTCTGAAAC" "GAACGAGTGTGAGTCTGAAACC" "AACGAGTGTGAGTCTGAAACCA" Step 1. Get rid of the first row: tail -n +2 overall_overlap.txt > temp.seq mv temp.seq overall_overlap.txt Step 2. Remove the "quotes": cat overall_overlap.txt | sed 's/\"//g' > temp.seq mv temp.seq overall_overlap.txt Step 3. Final script: for ((i=1; i\seq"$i; sed -n $i", "$i"p" overall_overlap.txt ; done > overall_overlap.fa
Scenario 23: There was a weird error with picard tool, and I need to delete a few lines in a large .sam file
It appears that it is unhappy to find two header lines for Bismark in the header (lines that start with "@PG ID:Bismark"). Very weird. If I take the first 1000 lines of your SAM file and put them in a new file, remove one of the "@PG ID: Bismark" lines, and run sort then it doesn't give an error. Interesting that I've ne ver seen this error when working with these files. Anyway, if you get rid of all but one of the @PG lines, it should be fine. That might be a pain. Sorry. I think Picard has a "ReplaceHeader" utility (not sure of the exact name). I've never used it, but these might be an ideal time to try it out!
The solution is: sed command as following
I love this sed help doc
Scenario 24
I wanted to use the regular expression in bash script, and further create processing script
I found a useful online help. So, I wrote my script like the following:
#!/usr/bin/bash bam=$1 re="([^/]+).bam.sorted.bam" if [[ $bam =~ $re ]]; then root=${BASH_REMATCH[1]} fi bcf=$root.var.raw.bcf vcf=$root.var.flt.vcf echo "samtools mpileup -C50 -uf /ddn/gs1/home/li11/project2014/Copeland/reference/hs_mit_seq.fasta $bam | bcftools view -bvcg - > $bcf" echo "bcftools view $bcf | ~/tools/samtools-0.1.18/bcftools/vcfutils.pl varFilter -D100 > $vcf"
Scenario 25
Removing unnecessary characters in a file
- It has been this case in the past and with “cat” all hidden
- characters are shown.
- To remove them, here is the post
- perl -i -pe ‘s/\r//g’ inputfile : remove unseen special characters
- perl -i -pe ‘s/[^\x20-\x7f]//g’ examplefile : remove ALL unseen
- special characters
- Sometimes, it affects database columns and it needs to be
cleaned to avoid query not getting back expected results.
- This works better:
- tr -cd ‘\11\12\40-\176’ PipeIN input PipeOUT myfile2
Okay, it does happen to me that the ctrl-M did NOT work.
For those of you saying this doesn’t work, I think I might have your answer. Please note this line in the article: “Type the following command (to get ^M type CTRL+V followed by CTRL+M):” You can’t just type the carat symbol and a capital M. You have to hit CTRL-V and then hit CTRL-M. I had quite a time figuring that out…. but once you do it right, it works great.
Scenario 26
Find a match and get the next line(s) or previous (lines)
Get the next line
- grep -A1 some-file will do it
- cat ../FastQC/MES1192_170504_M00421_B55HN.12-5224-1132-2.2.sanger_fastqc/fastqc_data.txt | grep -A2 “Overrepresented”
Get the previous line
- cat adapter_condition.txt | grep -B1 “>>Adapter Content”
Scenario 27 — different checksum method
There are at least three checksum methods that one can encounter
a file Mus_musculus.GRCm38.cds.all.fa.gz from Ensembl. They can be different results and cause confusion.
bash-4.2$ sum Mus_musculus.GRCm38.cds.all.fa.gz 40501 17135 bash-4.2$ cksum Mus_musculus.GRCm38.cds.all.fa.gz 414929315 17545827 Mus_musculus.GRCm38.cds.all.fa.gz bash-4.2$ md5sum Mus_musculus.GRCm38.cds.all.fa.gz 9af75c6fe212439e45c58357f4e1cdac Mus_musculus.GRCm38.cds.all.fa.gz
Scenario 28 — securely copy files in Linux
I have done some analysis on the server and want to securely transfer them onto my local linux machine. the command is “scp”, and an example would be something like the following:
scp li11@scigate.niehs.nih.gov:project2019/RNAseqProj/results/DE/DE_adjp05.zip .