Claire McWhite's Site – projects, etc.

Personal reference for useful one-liners, short coding tasks, and anything I’ve ever had to google more than once.

Things that are now obvious, kept for posterity’s sake.

Linux Command Line

Blast a FASTA against another FASTA
 makeblastdb -in uniprot-proteome%3AUP000000763.fasta -dbtype prot -out uniprot-proteome%3AUP000000763.fasta.db
 blastp -num_threads 2 -evalue 1e-6 -use_sw_tback -outfmt 6 -query msu.rice.pep -db uniprot-proteome%3AUP000000763.fasta.db > result
Run a script on multiple files in a directory
for file in *2col; do echo $file; done
Copy terminal output to clipboard

 alias pbcopy='xclip -selection clipboard'
 alias pbpaste='xclip -selection clipboard -o'
 whatever | pbcopy

Add a .gitkeep file recursively so that empty directories are kept in the git structure


find . -type d -empty -not -path "./.git/*" -exec touch {}/.gitkeep \; 
Convert pdf to png using imagemagick

    for f in *.pdf; do
        convert -density 150 -antialias "$f" -append -resize 1024x -quality 100 "${f%.pdf}.png"
    done
Convert variable space separated table to tab separated table

    unexpand -a file.txt > file.tab

Size of a directory in human readable format (ex. Kb, Mb) - disk usage, summarize, human readable

    du -sh

Count occurrences of a string within one line

gsub counts the number of substitutions made, as a proxy for the number of matches.

    awk '{print gsub(/string/,"")}' file
Find files matching a pattern (*PROTEOME) in subdirectories and move them up a level

    find . -type f -name \*PROTEOME
    find . -type f -name \*PROTEOME -exec mv {} . \;
Running parallel

    ls *PROTEOME | ~/bin/parallel -j 3 bash {}
    # each *PROTEOME file is substituted for {} and run with bash
    # -j 3 is the number of simultaneous processes
See what processes are running in a directory
    lsof +D /path/to/directory/
Kill all processes from a user
    killall --user nameofuser
Place something between two sequential tabs (i.e. fill empty tab-separated fields), ex. NA

    awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "NA" }; 1' file.txt > filled.txt
Or statement in grep
    grep "cilia\|axone" genome.txt > ciliagenes.txt
Most recent 5 modified files

 ls -1t | head -5
Remove a line if it has a blank tab-separated field, then deduplicate lines

 awk '!/^\t|\t\t|\t$/' file.txt | awk '!a[$0]++' > out.txt
Delete the first line of a file
 sed -i '1d' file.txt
Create a backup while doing inplace sed
 sed -i.bak '1d' file.txt

Bash script

Loop through files in a directory

    for f in out_*.txt; do
        echo $f
    done

Append to a file

    python script.py >> holder.txt
Command line arguments

    $1    # first argument
    $2    # second argument
    $@    # all arguments
Current directory

    pwd    # or the $PWD variable
Get filename without the extension, ex. filename.txt -> filename

    base="${f%.*}"


Remove empty fastq entries


Matches and removes the pattern: @Illumina_header
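The one-liner itself didn’t survive the page conversion; as a stand-in, a small Python sketch of the same idea (assumes standard 4-line FASTQ records; the `@SIM:...` headers are made-up placeholders for real Illumina headers):

```python
# Drop FASTQ records whose sequence line is empty.
# Assumes well-formed 4-line records: @header, sequence, +, quality.
def remove_empty_fastq(lines):
    kept = []
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        if len(record) == 4 and record[1].strip():  # non-empty sequence line
            kept.extend(record)
    return kept

reads = ["@SIM:1:1101", "ACGT", "+", "IIII",   # kept
         "@SIM:1:1102", "", "+", ""]           # removed: empty sequence
print(remove_empty_fastq(reads))
```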




Vim

Indent block of text 4 spaces

Esc, V, j, then :le 4

In Visual mode, each additional j selects another line

Delete a block of text

in normal mode, type ma at the beginning of the block and d'a at the end. "mark a" then "delete to a"

Show line numbers in vim

 set nu

Show hidden characters

 set list
Visual mode things

esc v gets into visual mode

y copies

d cuts

p pastes

in normal mode:

$ goes to end of the line

shift-g to bottom of document

gg to top of document

e moves the cursor forward to the end of the next word

ctrl-f forward a page

ctrl-b back a page


Add /g to the end of the :s/a/b statement to replace multiple occurrences of the pattern in a line. Without /g, it will only replace the first occurrence.

Replace only the last occurrence of a string in a line

    :%s/.*\zsmcl/mclLECA/

The greedy .* plus \zs makes the match start at the last occurrence (here, the last "mcl" in each line becomes "mclLECA").

Replace within a range of lines

    :1,10s/a/b

Split two Ensembl identifiers between a number and a letter


ex. ENS0000001ENS0000002 -> ENS0000001 ENS0000002
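One way to do this split (a sketch, assuming the boundary is always a digit immediately followed by a letter):

```python
import re

s = "ENS0000001ENS0000002"
# insert a space wherever a digit is directly followed by a letter
split = re.sub(r"(\d)([A-Za-z])", r"\1 \2", s)
print(split)  # ENS0000001 ENS0000002
```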


Python

Read in a file as a single string

    r = open(filename, "r")
    rejectstring = r.read().replace("\n", "")
When python setup.py register doesn’t work…do sudo setup.py register

 claire2@cLAIRE-pc:~/test$ python setup.py register
 running register
 running egg_info
 deleting test.egg-info/requires.txt
 error: [Errno 13] Permission denied: 'test.egg-info/requires.txt'
 claire2@cLAIRE-pc:~/test$ sudo python setup.py register
 now it works

Get current working directory

    import os
    currentwd = os.getcwd()
Install pandas on TACC

 git clone the github repository
 python setup.py build_ext --inplace --force
Arcane TACC launcher error on stampede2

 expr: non-integer argument

 In the .sbatch script, make sure #SBATCH -n (number of tasks) is evenly divisible by #SBATCH -N (number of nodes).
 The expr error comes from line 87 of $LAUNCHER_DIR/paramrun.

 If n isn't divisible by N, $LAUNCHER_PPN gets a value like '2,1' or '3,2' instead of the single number it needs to be for the line 87 division to happen and LAUNCHER_NPROCS to be set.
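The divisibility requirement can be checked before submitting; a hypothetical helper (not part of the launcher):

```python
def check_slurm_counts(n_tasks, n_nodes):
    """Raise if #SBATCH -n is not evenly divisible by #SBATCH -N."""
    if n_tasks % n_nodes != 0:
        raise ValueError("#SBATCH -n (%d) is not evenly divisible by "
                         "#SBATCH -N (%d)" % (n_tasks, n_nodes))
    return n_tasks // n_nodes  # tasks per node, what LAUNCHER_PPN should be

print(check_slurm_counts(100, 5))  # 20 tasks per node
```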

Standard setup for TACC slurm sbatch parallel launcher script (commands file is one command per line)

#SBATCH -J runjob      # job name
#SBATCH -o runjob.o%j       # output and error file name (%j expands to jobID)
#SBATCH -n 100             # number of tasks
#SBATCH -N 5              # total number of nodes requested
#SBATCH -p normal     # queue (partition) -- normal, development, etc.
#SBATCH -t 2:00:00        # run time (hh:mm:ss) - 2 hours
#SBATCH --mail-type=begin  # email me when the job starts
#SBATCH --mail-type=end    # email me when the job finishes
#SBATCH -A A-cm10          # Specify allocation to charge against

module load launcher
export EXECUTABLE=$LAUNCHER_DIR/init_launcher
export LAUNCHER_WORKDIR=$( pwd )
Make a python package
 Use skeleton of flupan
Import the python 3 print function
from __future__ import print_function
Apply pairwise comparison function to all possible pair combinations of rows in a matrix
 # normally np.correlate([1, 2, 3], [0, 1, 0.5]) = array [3.5]
 dist.squareform( dist.pdist(df, lambda x,y: np.correlate(x,y)) )
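A toy version of the same pattern, using a plain dot product as the pairwise function (pdist expects the callable to return a scalar, which is why np.correlate's length-1 array can trip it up):

```python
import numpy as np
import scipy.spatial.distance as dist

df = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
# pdist applies the function to every pair of rows (condensed form);
# squareform expands that into a symmetric matrix with a zero diagonal
pairwise = dist.squareform(dist.pdist(df, lambda x, y: np.dot(x, y)))
print(pairwise)
```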
argparse Command line args
import argparse

parser = argparse.ArgumentParser(description='Whatever')

parser.add_argument('identified_elution', action="store", type=str)
parser.add_argument("--prots", action="store", dest="proteins", nargs='+', required=False)
parser.add_argument('--bla', action="store", type=str)
inputs = parser.parse_args()
sys.argv Command line arguments. Not using these anymore in favor of argparse
import sys
infile = sys.argv[1]
Skip first argv
    args = sys.argv[1:]
Check that there are the right number of command line arguments

    if len(sys.argv) != 2:
        print("Please provide infile as a command-line argument")
Check if one string contains another string

    if string1 in string2:
Get unique members of a list

    uniqlist = list(set(nonUniqList))
Make a modified filename out of another filename
    g1 = open("20112015%s" % infile, "w")


Many useful pandas things

Query a dataframe

    df.query("COL1 == 'a'")  # quote 'a' if it is a value rather than another column

Dataframe to list of lists

    ListA = dfA.values.tolist()
Column bind - equivalent to R cbind()

    pd.concat([a, b], axis = 1)

(axis = 0 for row bind)

Row sum

    df.sum(axis = 1)

(axis = 0 for column sum)

Move row index to a column

    df = df.reset_index()
Change csv to string to manipulate values (save pandas df with integers instead of floats)

 import StringIO
 s = StringIO.StringIO()
 df.to_csv(s, sep="\t")
 t = s.getvalue()
 t = t.replace(".0", "")
 t = t.replace(",0", ",")
 outfile = open("path/to/save", "w")
 outfile.write(t)
 outfile.close()
Get value from location in dataframe
 first_value_in_COL1 = df['COL1'].iloc[0] 
Save a pandas dataframe
 df.to_csv(filenameA, sep="\t", index=True)
Make an empty dataframe

 from pandas import DataFrame
 cols = ['hold']
 df = DataFrame(columns = cols)
Set two level column index
 df.columns = pd.MultiIndex.from_tuples([a, b])
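A small worked example (the level names "exp1", "x", "y" are made up):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["x", "y"])
# one (top, bottom) tuple per column
df.columns = pd.MultiIndex.from_tuples([("exp1", "x"), ("exp1", "y")])
print(df["exp1"]["x"].tolist())  # [1, 3]
```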
Take certain columns from a pandas dataframe
 cols = ['col1', 'col2']
 final = original[cols]


Git

Insufficient permissions error
    cd <path to repo>
    cd .git/objects
    sudo chown -R username:username *
Three main commands
    git add [file]
    git commit -a 
    git push
Clone a repository
    git clone <address from github>

BioPython
Get ORF from sequence

 from Bio.Seq import Seq

 def orf(s):
     # scan forward for the first start codon
     length = len(s)
     i = 0
     while i < length - 2:
         if s[i:i+3] == "atg":
             break
         i = i + 1
     # scan in-frame, three bases at a time, for the first stop codon
     j = i
     while j < length - 2:
         tri = s[j:j+3]
         j = j + 3
         if tri == "tga" or tri == "taa" or tri == "tag":
             break
     return Seq(s[i:j])  # alphabets (generic_dna) were removed in Biopython 1.78
Parse a FASTA file

from Bio import SeqIO

for record in SeqIO.parse("filename.fasta", "fasta"):
    print(record.id, len(record.seq))


R

Command line arguments

    args <- commandArgs(trailingOnly = TRUE)
Get to R studio server
In browser, go to http://<yourserverIP>:8787
Better R color palette
    cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
    # grey, orange, blue, forest green, banana yellow, navy blue, red, purplypink
My R color palette
 palette <- c("#FF0000","#0072B2","#E69F00","#009E24", "#979797","#5530AA", "#111111")
 #Red, Blue, Orange,Green, Grey, Purple,Black
Sort a vector by another vector function
sort_func <- function(x){
    y <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "T", "U", "V", "W", "X", "Y", "Z", "R", "S")
    z <- unlist(strsplit(x, ""))
    return(paste(z[order(match(z, y))], collapse=""))
}

 #sort_func("ZRACB") outputs "ABCZR"

 #Apply to a column
 df %>% 
       rowwise %>% 
       mutate(sortedCode = sort_func(Code))
Get r.squared from a linear regression

    fit <- summary(lm(a ~ b, data = d))
    fit$r.squared

Make r variables on the fly, and assign them values (very rare to do)

    ct <- 1
    assign(paste("value.", ct, sep=""), 5)
    ct <- ct + 1
    assign(paste("value.", ct, sep=""), 10)
    #value.1 = 5
    #value.2 = 10
Make a bar chart with colored groups

    r.vals <- c(value.1a, value.2a, value.1b, value.2b)  #values
    r.names <- c(r.1, r.2, r.3, r.4)  #names
    groups <- c("group1", "group2", "group1", "group2")  #categories
    df <- data.frame(r.square = r.vals, names = factor(r.names, levels=r.names), groups)
    cbbPalette <- c("#000000", "#E69F00") #black and orange
    a <- ggplot(aes(x = names, y = r.square), data = df) +
        geom_bar(stat = 'identity', aes(fill=groups)) +
        ylab(expression(paste("Variance Explained (R"^"2", ')', sep=''))) +
        xlab("Predictor Variables") + scale_fill_manual(values=cbbPalette) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
        scale_y_continuous(limits = c(0, 0.3))
Make binary black and white heatmap on table of 1’s and blanks
 df <- read.csv(gene, sep="\t", header=TRUE, row.names=1)
 m <- as.matrix(df, rownames.force=TRUE)
 class(m) <- "numeric"
 m[is.na(m)] <- 0  # "" becomes NA after the numeric conversion
 nr <- nrow(m)
 heatmap.2(m, scale="none", col=c("white", "black"), cexRow=0.2/log10(nr), trace="none", colsep=c(1,2,3,4,5,6,7,8,9,10), sepcolor="grey", sepwidth=0.01, key=FALSE, xlab="DATABASES", ylab="GENES", margins=c(15,10))
Heatmap with overlayed values

    myPanel <- function(x, y, z, ...) {
        panel.levelplot(x, y, z, ...)  # draw the heatmap cells
        panel.text(x, y, round(z,2))   # overlay the rounded values
    }
    colors <- colorRampPalette(c('red', 'seashell'))(256)

    # fullcorr is a numeric 15x15 matrix
    hmap <- print(levelplot(fullcorr[,15:1], xlab="", ylab="", panel = myPanel, col.regions=colors, scales=list(tck=0, x=list(rot=45, alternating=2))))

Make a multipanel figure with cowplot

    ab <- qplot(a, b)
    bc <- qplot(b, c)
    cd <- qplot(c, d)
    de <- qplot(d, e)
    allplotted <- plot_grid(ab, bc, cd,de, labels = c("A", "B", "C", "D"), ncol = 2)




MySQL

SQL is indexed from 1. IFNULL("colname", "None") replaces NULLs with a value; NOW() gives the current timestamp.

Selecting data

    SELECT * FROM <tablename>     -- select all columns from a table
    WHERE <columnname> = "Texas" AND <columnname2> > 5 AND <columnname3> IN ("USA", "MEX")     -- conditions
    ORDER BY <columnname> DESC;     -- order the rows descending
String functions

    select concat('a', '_', 'b');
    select replace('a_b_c', '_', '.');
    select trim('  123  ');

Numeric functions

    select round(78421/100, 2);
    -- % is modulus; if() is an if/else conditional
    select if(587 % 2 = 0, 'even', 'odd');

Date functions

    select now();
    select year(now()) as year, monthname(now()) as month;

Get data from UCSC genome browser

First, connect to the remote server. Then:

    select name, chrom, strand,
           cast(exonStarts as char(100)) as exon_starts_data,
           convert(exonEnds, char(100)) as exon_ends_data
    from sacCer3.sgdGene limit 200;

Make a view of the data

    create or replace view my_saved_view as
    select round(avg(GNP * 1000000 / Population), 0) as "Per capita GNP ($)",
           round(std(GNP * 1000000 / Population), 0) as "Std deviation ($)"
    from country;