Quick & dirty shell scripts in Linux

I am accumulating some cool shell scripts for Linux here.

Scenario 1

Case: I own "/nfs/sva01/SequenceAnalyses/Concordance/BUILD37/b37_exome_concordance.csv", so Latasha's concordance check fails.

Solution:
Step 1. Create a shortList
Step 2. Run this script:
for k in `cat shortList `; do echo $k; qsub /nfs/seqsata12/ALIGNMENT/BUILD37/EXOME/$k/Scripts/concordance.sh; done;
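
For the record, shortList here is assumed to be a plain text file with one sample ID per line; a while-read version of the same loop (just a sketch) avoids surprises with word splitting:
while read -r k; do
    echo "$k"
    qsub "/nfs/seqsata12/ALIGNMENT/BUILD37/EXOME/$k/Scripts/concordance.sh"   # same qsub call as above
done < shortList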

Scenario 2

Case: "bwa59_b37_v4.pl" has a bug; therefore the symbolic link does NOT work.

Solution
Step 1: create a shortList
Step 2: run the following script,

for k in `cat shortList `; do echo $k; cd /nfs/seqsata12/ALIGNMENT/BUILD37/EXOME/$k/combined ; ln -s var_flt_vcf var_flt_vcf.vcf ; cd ../../ ; done;
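
A variant of the same loop (a sketch) that does the cd in a subshell, so the working directory never drifts between iterations:
for k in `cat shortList`; do echo $k; ( cd /nfs/seqsata12/ALIGNMENT/BUILD37/EXOME/$k/combined && ln -s var_flt_vcf var_flt_vcf.vcf ); done;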

Scenario 3

Case: There are many columns in /nfs/sva01/SequenceAnalyses/Concordance/BUILD37/b37_exome_concordance.csv, and it is a comma-delimited file. I want a specific column.

Solution

Step 1. Replace the "," with a tab
Step 2. Use the "cut -f" command
Step 3. Remove unnecessary characters, i.e. quotes, spaces, etc.
Step 4. Grep the pattern.

Run the following scripts,
sed 's/,/\t/g' temp.csv | cut -f 4 | more
sed 's/,/\t/g' temp.csv | cut -f 4 | sed 's/"//g' | grep -w "NA"
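
For comparison, awk can split on the comma directly, so the tab substitution is not needed (a sketch using the same column 4 and NA pattern; like the sed version, it assumes no quoted commas inside fields):
awk -F',' '{gsub(/"/,"",$4); print $4}' temp.csv | grep -w "NA"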

Scenario 4

Case: Finding records shared (or not shared) between two files

Solution

Step 1. Samples in a file that exist in another file (intersection)
for k in `cat file1`; do grep $k file2 > /dev/null && echo $k ; done;

Step 2. Samples in a file that do not exist in another file (negation)
for k in `cat file1`; do grep $k file2 > /dev/null || echo $k ; done;

Step 3. Real scenario:

for k in `cat samples_wo_topstrandFile`; do grep $k /nfs/sva01/SequenceAnalyses/Concordance/BUILD37/b37_exome_concordance.csv > /dev/null && echo $k ; done;
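
If both files are plain one-ID-per-line lists, grep -f can do the whole comparison in one pass (a sketch, assuming exact full-line IDs):
grep -Fx -f file2 file1     # IDs in file1 that also appear in file2
grep -Fxv -f file2 file1    # IDs in file1 that do not appear in file2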

Scenario 5

Case: There are many columns in /nfs/chgv/seqraw01/summaries/exRealign_Summary_b37.txt, and we wanted to extract the "sample name information" from column three. Unfortunately, there are varying numbers of directory layers. Also, there are duplicate sample entries.

Solution

Step 1. Use "cut -f 3" to extract the third column
Step 2. Then split column three by "/"
Step 3. Then use awk to print the last field
Step 4. Sort the list and keep unique entries only

Watch out for those "quotes"; they may not work if directly copied into a Unix/Linux environment. Special handling may be needed!

Run the following scripts,
cut -f 3 ~/exRealign_Summary_b37.txt | sed 's/\//\t/g' > allSamples.txt
awk '{print $NF}' allSamples.txt > allSampleList.txt
sort -t $'\t' -k1,1 allSampleList.txt | uniq > uniqSampleList.txt

Okay, in a single line:

cut -f3 /nfs/chgv/seqraw01/summaries/exRealign_Summary_v1.8_02.02.12.txt | sed 's/\//\t/g' | awk '{print $NF}' | sort -t $'\t' -k1,1 | uniq | wc -l
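
Since awk can also split on "/" itself, the sed step can be dropped (a sketch of the same count):
cut -f3 /nfs/chgv/seqraw01/summaries/exRealign_Summary_v1.8_02.02.12.txt | awk -F'/' '{print $NF}' | sort -u | wc -l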

Scenario 6

Case: There are many columns in a file, it is tab-delimited, and I want to know how many columns it has.

Solution

For tab-delimited:
[jl407@sva no_filter]$ awk -F'\t' '{print NF; exit}' ../concordanceMeta0001.part2.match.txt
3
Okay, there are only 3 columns!!

For space-delimited:
[li11@ehscmplp09/bioinfo1 RNAseqPrep]$ awk -F' ' '{print NF; exit}' refDB/UCSC_KnownGene_Dec2009/KnownGeneDec09.hg19.exon.txt
3

Bingo!!

Scenario 7

Case: I have a string and want to trim one character off the beginning and the end simultaneously (in Perl).

Solution

$string =~ s/^.(.*).$/$1/;
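
A minimal usage sketch as a perl one-liner (the quoted sample name is just a made-up input):
perl -e '$string = "xNA12878x"; $string =~ s/^.(.*).$/$1/; print "$string\n";'   # prints NA12878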

Scenario 8

Case: I have two lists, and I want to get intersection, union, and unique list.

Solution

#Getting intersection
comm -12 <(sort noMatchedTotal_inHouse) <(sort noMatch_intersect)
grep -xF -f set1 set2
sort set1 set2 | uniq -d

#Getting unique list
comm -23 <(sort noMatchedTotal_inHouse) <(sort noMatch_intersect) > noMatch_uniq_inHouse

comm -23 <(sort noMatchedTotal_SVA) <(sort noMatch_intersect) > noMatch_uniq_SVA
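
For the symmetric case, entries that appear only in the second list, comm -13 does it (a sketch with the two totals files; the output name is just made up):
comm -13 <(sort noMatchedTotal_inHouse) <(sort noMatchedTotal_SVA) > onlyIn_SVA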

#Getting union list
sort -n set1 set2 | uniq

cat set1 set2 | awk '!found[$1]++'

awk '!found[$1]++' set1 set2
awk '{print $0}' set1 set2 | sort -u

Scenario 9

Case: Working with two nested loops in a bash search

for k in `cat ~/tmp_sample_list`; do echo $k; for f in seqsata01 seqsata02 seqsata03 seqsata04 seqsata05 seqsata06 seqsata07 seqsata08 seqsata09 seqsata10 seqsata11 ; do echo /nfs/$f/ALIGNMENT/BWA_OUTPUT; ls -l /nfs/$f/ALIGNMENT/BWA_OUTPUT | grep "$k"; done; done;
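
If the shell is bash 4 or newer, the eleven seqsata names can come from a zero-padded brace expansion instead (a sketch of the same nested loop):
for k in `cat ~/tmp_sample_list`; do echo $k; for f in seqsata{01..11}; do echo /nfs/$f/ALIGNMENT/BWA_OUTPUT; ls -l /nfs/$f/ALIGNMENT/BWA_OUTPUT | grep "$k"; done; done;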

Scenario 10

Case: There was an error caused by losing a compute node ("compute2 was shut down abruptly"), so we had to fix all the "copy.sh" scripts. Here is the bash script Joe provided. Pretty cool!

for i in `ls -1 */Scripts/copy.sh | cut -d'/' -f1`; do sed 's/compute2/sva10/g' $i/Scripts/copy.sh > $i/Scripts/newcopy.sh && mv $i/Scripts/newcopy.sh $i/Scripts/copy.sh ; done;

Please be aware that this is "-1" (the digit one) after "ls", not the letter "l".
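
If GNU sed is available, its -i flag can do the whole replacement in place in one line (a sketch; note that no backup copies are kept):
sed -i 's/compute2/sva10/g' */Scripts/copy.sh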

Scenario 11

Case: I have a huge text file and I wanted to strip off the first 3 lines.

To exclude the first three lines of a file (output starts at line 4):
tail -n +4 file > newfile

To exclude/remove blank lines from a text file, use the following command:

$ sed '/^$/d' input.txt > output.txt
$ grep -v '^$' input.txt > output.txt

To remove the whitespace (and everything after it) following the ID in a genomic FASTA sequence, I used this trick. It seems to work pretty well.
perl -plane 's/\s+.+$//' < genome.fa > new_genome.fa

I have a huge text file and I wanted to strip off the last line; use the following GNU sed command:

sed -i '$ d' foo.txt
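
With GNU coreutils, head can also drop the last line without editing the file in place (a sketch):
head -n -1 foo.txt > newfile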

Scenario 12

Case: I want to eliminate empty lines in a file

Solution: use "sed" to process the file under Linux.

sed '/^$/d' myFile > tt
mv tt myFile

Another useful application of "sed" is to delete lines matching an expression:

sed '/^expression/ d' infile.txt > outfile.txt

To parse out the read depth (DP) information from a VCF-format file, this becomes quite convenient:

cut -f8 C3_M_fltVCF.vcf | cut -d ";" -f1 | sed 's/DP=//' | sed 's/INDEL//' | sed '/^$/d' > temp.txt
mv temp.txt C3_M_fltVCF.vcf.DP.txt
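
A variant that also picks up the DP value on INDEL lines, assuming GNU grep's -o option (a sketch on the same file; header lines are skipped first):
grep -v '^#' C3_M_fltVCF.vcf | cut -f8 | grep -o 'DP=[0-9]*' | sed 's/DP=//' > temp.txt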

Scenario 13

Having to seek faster SNV calling with samtools, I found that the mpileup function is very helpful.

Solution

With xargs, I parallelize the SNP calls by chromosome (-P is the number of parallel processes started).

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"

Then merge the results afterwards

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'

Scenario 14

I have a list of imprinted SNPs/genes across the mouse genome, and I want to know how many SNPs/genes are on each chromosome. Therefore, I need a bash loop to get a quick assessment.

Solution:

for ((i=1; i<=19; i++)); do echo chr$i; grep -w chr$i GSE27016_imprinted_genes.mm9.bed | wc -l ; echo; done ;
for i in {1..19}; do echo chr$i; done;
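
Assuming the chromosome is the first column of the BED file, sort | uniq -c gives the same per-chromosome counts in one pass (a sketch):
cut -f1 GSE27016_imprinted_genes.mm9.bed | sort | uniq -c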

Scenario 15

I have the Illumina exome capture region file and want to know how many bases are covered. This can be extended to ask for the sum of any numeric column.

Solution:

awk '{SUM +=$5} END {print SUM; }' TruSeq-Exome-65mb-BED-file_with_MT 
62085295

awk '{SUM +=$3-$2} END {print SUM; }' SureSelect_All_Exon_37mb_hg19_bed_with_MT 
38831632

awk '{SUM +=$3-$2} END {print SUM; }' SureSelect_All_Exon_50mb_hg19_bed_with_MT 
51663197

Scenario 16

Du, Ying asked me to help her append an extra unique ID column to a columnar (BED) file.

awk '{print $1"_"$2"_"$3}' TruSeq-Exome-65mb-BED-file_with_MT  > temp.txt
paste -d" " temp.txt TruSeq-Exome-65mb-BED-file_with_MT > temp2.txt
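
The same result should come out of a single awk call (a sketch; $0 keeps the original tab-delimited line, and the comma inserts the same single space that paste -d" " used):
awk '{print $1"_"$2"_"$3, $0}' TruSeq-Exome-65mb-BED-file_with_MT > temp2.txt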

Scenario 17

In our FPKM file, I noticed duplicate IDs. What are those duplicate records?

cd ~/project2012/ASE/FPKM_RNAseq/FPKM
cut -f1 C3_M_3.mm9.genes.fpkm_tracking.fpkm.txt | perl -ne 'print if $SEEN{$_}++'
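
An equivalent with sort | uniq -d (a sketch; it reports each duplicated ID once, whereas the perl version prints every repeated occurrence in the original order):
cut -f1 C3_M_3.mm9.genes.fpkm_tracking.fpkm.txt | sort | uniq -d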

Scenario 18

My ASE list has ^M at the end of each line, and I need to remove them in Unix.

To view it: cat -v ASE_gene_jyl.txt
cat ASE_gene_jyl.txt | sed 's/^M//g' > temp     (type the ^M as CTRL+V followed by CTRL+M)
mv temp ASE_gene_jyl.txt
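
An alternative that avoids typing the literal ^M altogether (a sketch; tr simply deletes every carriage return):
tr -d '\r' < ASE_gene_jyl.txt > temp
mv temp ASE_gene_jyl.txt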

Scenario 19

I am working on the structured SNP list requested by Dr. Fargo, and I need to pad the SNP numbers with leading zeros. The answer is not a unix script but a short perl snippet.

my $num = "123";
$num = sprintf("%06d", $num);   # zero-pads to six digits: "000123"
$num =~ tr/ /0/;                # only needed if the format were "%6d" (space padding)

Scenario 20

Justin from NCSU needs help modifying his BED file for Galaxy. The input is like this:

1	0	        3000236	0
1	3000236	3000272	1
1	3000272	3000311	0
1	3000311	3000347	1
1	3000347	3000628	0
1	3000628	3000645	1

And, he wanted:

chr1	0	        3000236	0
chr1	3000236	3000272	1
chr1	3000272	3000311	0
chr1	3000311	3000347	1
chr1	3000347	3000628	0
chr1	3000628	3000645	1

 

Solution

awk '{print "chr"$1}' fileIN.txt > modCol1 
cut -f2-4 fileIN.txt > reminders 
paste modCol1 reminders > newFile 
rm -f modCol1 
rm -f reminders 
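
The whole edit can also be done with one awk call (a sketch, assuming the output should stay tab-delimited as in the paste result):
awk 'BEGIN{OFS="\t"} {$1="chr"$1; print}' fileIN.txt > newFile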

Scenario 21: I wanted to extract the file name from a string containing a path
../rawFastq/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq

Solution: a bash script

myString="../rawFastq/ES565_121204_MARMOSET_64NDWAAXX.2.3191261.single.sanger.fastq"
OIFS="$IFS"
IFS='/'
myArray=($myString)
IFS=$OIFS
echo ${#myArray[@]}
i=0
while [ $i -lt ${#myArray[@]} ]
do
    echo "Element $i:${myArray[$i]}"
    (( i=i+1 ))
done
#print out file name only
lastIndex=${#myArray[@]}-1
echo "${myArray[$lastIndex]}"
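
For just the file name, bash parameter expansion (or basename) is a shorter sketch of the same idea:
echo "${myString##*/}"        # strips everything up to the last "/"
basename "$myString"          # same result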

Scenario 22: I created unique 22mers from Rick's microRNAseq project (7257 in total), and I wanted to get a FASTA-format version of them

Solution: pure bash script

The head of the file looks like the following:
"x"
"GAACCAGAAGGCTGTCTGAACT"
"AGAAGGCTGTCTGAACTTTTGA"
"AGGCTGTCTGAACTTTTGAAAC"
"GGAACGAGTGTGAGTCTGAAAC"
"GAACGAGTGTGAGTCTGAAACC"
"AACGAGTGTGAGTCTGAAACCA"

Step 1. Get rid of the first row: 
tail -n +2 overall_overlap.txt > temp.seq
mv temp.seq overall_overlap.txt

Step 2. Remove the "quotes":
cat overall_overlap.txt  | sed 's/\"//g' > temp.seq
mv temp.seq overall_overlap.txt

Step 3. Final script:
for ((i=1; i<=7257; i++)); do echo ">seq"$i; sed -n $i","$i"p" overall_overlap.txt ; done > overall_overlap.fa
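
After steps 1 and 2 the file is one clean sequence per line, so a single awk pass can also write the FASTA (a sketch):
awk '{print ">seq"NR; print}' overall_overlap.txt > overall_overlap.fa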

Scenario 23: There was a weird error with the Picard tool, and I needed to delete a few lines in a large .sam file

It appears that it is unhappy to find two header lines for Bismark in
the header (lines that start with "@PG     ID:Bismark"). Very weird.
If I take the first 1000 lines of your SAM file, put them in a new file,
remove one of the "@PG   ID:Bismark" lines, and run sort, then it doesn't
give an error. Interesting that I've never seen this error when working
with these files.
Anyway, if you get rid of all but one of the @PG lines, it should be
fine. That might be a pain. Sorry.
I think Picard has a "ReplaceHeader" utility (not sure of the exact
name). I've never used it, but this might be an ideal time to try it out!

The solution is a sed command, as follows.

I love this sed help doc.
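
A minimal sketch of such a sed one-liner, assuming GNU sed and that the duplicate lines start with "@PG", a tab, and "ID:Bismark" (keep the first one, delete the rest; the file names are placeholders):
sed '0,/^@PG\tID:Bismark/!{/^@PG\tID:Bismark/d}' input.sam > fixed.sam
awk '!(/^@PG\tID:Bismark/ && seen++)' input.sam > fixed.sam    # same idea in awk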

Scenario 24

I wanted to use regular expressions in a bash script, and further create a processing script.

I found some useful online help, so I wrote my script like the following:
#!/usr/bin/bash
bam=$1
re="([^/]+)\.bam\.sorted\.bam"
if [[ $bam =~ $re ]]; then
       root=${BASH_REMATCH[1]}
fi
bcf=$root.var.raw.bcf
vcf=$root.var.flt.vcf

echo "samtools mpileup -C50 -uf /ddn/gs1/home/li11/project2014/Copeland/reference/hs_mit_seq.fasta  $bam | bcftools view -bvcg - > $bcf"  
echo "bcftools view $bcf | ~/tools/samtools-0.1.18/bcftools/vcfutils.pl  varFilter -D100 > $vcf"
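
A usage sketch (the script name and bam file are hypothetical): called as below, the regex sets root to NA0001, and the two echoed commands come out with NA0001.var.raw.bcf and NA0001.var.flt.vcf filled in.
bash makeCalls.sh NA0001.bam.sorted.bam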

Scenario 25

Removing unnecessary characters in a file

  • It has been the case in the past that with "cat -v" all hidden characters are shown.
  • To remove them, here is the post:
  • perl -i -pe 's/\r//g' inputfile : remove unseen special characters
  • perl -i -pe 's/[^\x20-\x7f]//g' examplefile : remove ALL unseen special characters
  • Sometimes this affects database columns, and the file needs to be cleaned so that queries return the expected results.
  • This works better:
  • tr -cd '\11\12\40-\176' < input > myfile2

Okay, it did happen to me that the ctrl-M trick did NOT work.

For those of you saying this doesn't work, I think I might have your answer. Please note this line in the article:

"Type the following command (to get ^M type CTRL+V followed by CTRL+M):"

You can't just type the caret symbol and a capital M. You have to hit CTRL-V and then hit CTRL-M. I had quite a time figuring that out... but once you do it right, it works great.

Scenario 26

Find a match and get the next line(s) or previous line(s)

Get the next line

  • grep -A1 'pattern' some-file will do it

  • cat ../FastQC/MES1192_170504_M00421_B55HN.12-5224-1132-2.2.sanger_fastqc/fastqc_data.txt | grep -A2 "Overrepresented"

Get the previous line

  • cat adapter_condition.txt | grep -B1 ">>Adapter Content"
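
Get context on both sides

  • grep -C1 'pattern' some-file gives one line before and after each match (a sketch; -C is the GNU grep context option)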

 

Scenario 27 — different checksum method

There are at least three checksum utilities one may encounter. For example, run on the file Mus_musculus.GRCm38.cds.all.fa.gz from Ensembl, they give different results, which can cause confusion.

bash-4.2$ sum Mus_musculus.GRCm38.cds.all.fa.gz 
40501 17135
bash-4.2$ cksum Mus_musculus.GRCm38.cds.all.fa.gz 
414929315 17545827 Mus_musculus.GRCm38.cds.all.fa.gz
bash-4.2$ md5sum Mus_musculus.GRCm38.cds.all.fa.gz 
9af75c6fe212439e45c58357f4e1cdac  Mus_musculus.GRCm38.cds.all.fa.gz

Scenario 28 — securely copy files in Linux

I have done some analysis on the server and want to securely transfer the results onto my local Linux machine. The command is "scp", and an example looks like the following:

scp li11@scigate.niehs.nih.gov:project2019/RNAseqProj/results/DE/DE_adjp05.zip .
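
To pull a whole results directory instead of a single zip, scp -r recurses (same server; the directory path is just an example):
scp -r li11@scigate.niehs.nih.gov:project2019/RNAseqProj/results/DE .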