Claire McWhite's Site – projects, etc.

Personal reference for useful one liners/short coding tasks/anything I’ve ever had to google more than once.

Things that are now obvious kept for posterity’s sake

Linux Command Line

Blast a FASTA against another FASTA
 makeblastdb -in uniprot-proteome%3AUP000000763.fasta -dbtype prot -out uniprot-proteome%3AUP000000763.fasta.db
 blastp -num_threads 2 -evalue 1e-6 -use_sw_tback -outfmt 6 -query msu.rice.pep -db uniprot-proteome%3AUP000000763.fasta.db > result
Run a script on multiple files in a directory
for file in *2col; do echo $file; done
Copy terminal output to clipboard

 alias pbcopy='xclip -selection clipboard'
 alias pbpaste='xclip -selection clipboard -o'
 whatever | pbcopy

Add a .gitkeep statement recursively so that empty directories are kept in git structure


find . -type d -empty -not -path "./.git/*" -exec touch {}/.gitkeep \; 
Convert pdf to png using imagemagick
  for f in *pdf
      convert -density 150 -antialias "$f" -append -resize 1024x -quality 100 "${f%.pdf}.png"
Convert variable space separated table to tab separated table
    unexpand -a file.txt > ##### Size of a directory in human readable format (ex. Kb, Mb) - disk usage -sum human

    du -sh ##### Count occurences of a string within one line gsub counts the number of substitutions made, as a proxy for number of matches. source:

    awk '{print gsub(/string/,"")}' file
find files matching a pattern (“*PROTEOME) in subdirectories and move them up a level
    find . -type f -name \*PROTEOME
    find . -type f -name \*PROTEOME -exec mv {} . \;
Running parallel
    ls *PROTEOME | ~/bin/parallel -j 3 bash {}
    # *PROTEOME is the argument to
    #3 is the number of processes
See what processes are running in a directory
    lsof +D /path/to/directory/
Kill all processes from a user
    killall --user nameofuser
Place something between two sequential tabs, Ex. NA. (Source)
    awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "NA" }; 1' >
Or statement in grep
    grep "cilia\|axone" genome.txt > ciliagenes.txt
Most recent 5 modified files
 ls -1t -l | head -5
remove a line if it has a blank
 awk '!/^\t|\t\t|\t$/' | awk '!a[$0]++' >
Delete the first line of a file
 sed -i '1d' file.txt
Create a backup while doing inplace sed
 sed -i.bak '1d' file.txt

Bash script

Loop through files in a directory
    for f in out_*.txt
           echo $f
    done ##### Append to a file
    python >> holder.txt
Command line arguments
Current directory
Get filename without the extension, ex. filename.txt -> filename


Remove empty fastq entries


Matches and removes the pattern: @Illumina_header




Indent block of text 4 spaces

Esc Vj:le 4

In Visual mode, each additional j selects another line

Delete a block of text

in normal mode, type ma at beginning of block and d’a at end of block. “mark a” then “delete to a”

Show lines in vim
set nu ##### Show hidden characters   
 set list
Visual mode things

esc v gets into visual mode

y copies

d cuts

p pastes

in command mode:

$ goes to end of the line

shift-g to bottom of document

gg to top of document

e moves cursor forward faster

ctrl-f forward a page

ctrl-g back a page


Add /g to the end of the :s/a/b statement to replace multiple occurences of pattern in line Without /g, it will only replace the first occurance

Replace only the last occurence of a string in a line %s/.*\zsmcl/mclLECA/ is a greedy replacement that replaces only the last occurence of a string source:

replace within a range of lines
:1,10s/a/b ##### Split two Ensembl identifiers between a number and a letter


ex. ENS0000001ENS0000002 -> ENS0000001 ENS0000002


Read in a file as a single string
    r = open(filename, "r")
    rejectstring ="\n","")
When python register doesn’t work…do sudo register
 claire2@cLAIRE-pc:~/test$ python register 
 running register 
 running egg_info 
 deleting test.egg-info/requires.txt
 error: [Errno 13] Permission denied: 'test.egg-info/requires.txt'
 claire2@cLAIRE-pc:~/test$ sudo python register   
 now it works ##### Get current working directory

    import os
    currentwd = os.getcwd()
Install pandas on TACC
 git clone the github repository
 python build_ext --inplace --force --user
Arcane TACC launcher error on stampede2
 expr: non-integer argument 
 In .sbatch script, make sure #SBATCH -n (number of tasks) is evenly divisible by #SBATCH -N (number of nodes).
 The expr error is from line 87 of $LAUNCHER_DIR/paramrun: 

 If n isn't divisible by N, $LAUNCHER_PPN gets a value like '2,1', or '3,2', instead of being a single number like it needs to be for the line 87 division to happen, and LAUNCHER_NPROCS to be set. 

Standard setup for TACC slurm sbatch parallel launcher script (commands file is one command per line)

#SBATCH -J runjob      # job name
#SBATCH -o runjob.o%j       # output and error file name (%j expands to jobID)
#SBATCH -n 100             # number of tasks
#SBATCH -N 5              # total number of nodes requested
#SBATCH -p normal     # queue (partition) -- normal, development, etc.
#SBATCH -t 2:00:00        # run time (hh:mm:ss) - 1.5 hours
#SBATCH --mail-type=begin  # email me when the job starts
#SBATCH --mail-type=end    # email me when the job finishes
#SBATCH -A A-cm10          # Specify allocation to charge against

module load launcher
export EXECUTABLE=$LAUNCHER_DIR/init_launcher
export LAUNCHER_WORKDIR=$( pwd )
Make a python package
 Use skeleton of flupan
Import python3 print functions
from __future__ import print_function
Apply pairwise comparison function to all possible pair combinations of rows in a matrix
 # normally np.correlate([1, 2, 3], [0, 1, 0.5]) = array [3.5]
 dist.squareform( dist.pdist(df, lambda x,y: np.correlate(x,y)) )
argparse Command line args
import argparse

parser = argparse.ArgumentParser(description='Whatever')

parser.add_argument('identified_elution', action="store", type=str)
parser.add_argument("--prots", action="store", dest="proteins", nargs='+', required=False)
parser.add_argument('--bla', action="store", type=str)
inputs = parser.parse_args()
sys.argv Command line arguments. Not using these anymore in favor of argparse
import sys
infile = sys.argv[1]
Skip first argv
    args = sys.argv[1:]
Check that there are the right number of command line arguments
    if len(sys.argv) != 2:
        print "Please provide infile as a command-line argument"
Check if one string contains another string
    if string1 in string 2:
Get unique members of a list
    uniqlist = list(set(nonUniqList)
Make a modified filename out of another filename
    g1 = open("20112015%s" % infile, "w")


Many useful pandas things
Query a dataframe
df.query("COL1==a") ##### Dataframe to list of lists

    ListA = dfA.values.tolist()
Column bind - equivalent to R cbind()
pd.concat([a], [b], axis = 1)

(axis = 0 for row bind)

Row sum

(axis = 0 for column sum)

Move row index to a column
Change csv to string to manipulate values (save pandas df with integers instead of floats)
 import csv
 import StringIO
 t = t.replace(".0", "")
 t = t.replace(",0", ",")
 filename=open(path/to/save, "w")
Get value from location in dataframe
 first_value_in_COL1 = df['COL1'].iloc[0] 
Save a pandas dataframe
 df.to_csv(filenameA, sep="\t", index=True)
Make an empty dataframe
 cols = ['hold']
 df = DataFrame(columns = cols)
Set two level column index
 df.columns = pd.MultiIndex.from_tuples([a, b])
Take certain columns from a pandas dataframe
 cols = ['col1', 'col2']
 final = original[cols]
Pandas apply returns weird format string

If the first row of a dataframe has no output returned from the applied function, then the entire output from the apply becomes a string in the second column of the final dataframe.

ENOG411DPYI,” ID1 ID2 pearsonR 0 NaN NaN NaN” ENOG411DPZG,” ID1 ID2 pearsonR 0 sp|A0A1D5XGW0|DMS1B_WHEAT sp|A0A1X9QHJ0|DMS1D_WHEAT 0.954413”

Instead of how it should be: “ID, ID1, ID2, pearsonR”

This is as side effect of flexible apply, which uses the first group to get formatting for the rest of the groups. If the first group has no output, the rest of the group’s output is misformatted.

To fix, filter out the problematic row before the apply grouped_df.filter(lambda x: len(x) > 1)

Chimera command line

Match pdb to a chain
    Open up reference and pdb to dock in 
    Tools -> Structure Comparison -> MatchMaker
Select an atom, format is #ChainID:residue@typeofatom
    Open the command line (Tools -> General Control -> Command line
    In the line, type #ChainID:residue@typeofatom
    sel #1:113@CA 
    Need to select just one atom from each residue (composed of multiple atoms)
    sel #1:114@CA #13:31@CA
    Make sure the bottom bar says "two atoms" 
    Open the Tools > Structure Analysis > Distances window
    Click create and a line will appear on the structure between the two residues
    Turn off label
Deselect all
    ~sel ##### Select all
Make transparent chains
   First define a transparent color
   colordef transred 1 0 0 0.3
   select *whatever you want to color*
   color transred sel


Insufficient permissions error
    cd <path to repo>
    cd .git/objects
    sudo chown -R username:username *
Three main commands
    git add [file]
    git commit -a 
    git push
Clone a repository
    git clone <address from github> ### <font color="red">BioPython</font>
get ORF from sequence
 def orf(s):
     length = len(s)
     i = 0
     while i<length-2:
        tri = s[i:i+3]
        if tri == "atg":
        i = i + 1
    j = i
    while j<length-2:
        tri = s[j:j+3]
        if tri == "tga" or tri == "taa" or tri == "tag":
        j = j + 3
    seq = Seq(s[i:j], generic_dna)
    return seq
Parse a FASTA file
handle = open("filename.fasta", "rU")
for record in SeqIO.parse(handle, "fasta"):


Command line arguments
Theme that always outputs text in size 8 fonts in images
theme_cowplot_consistent_text <- function (font_size = 8) {
theme_cowplot() %+replace%
    theme(strip.text = element_text(size = font_size),
         axis.text = element_text(colour = "black", size = font_size),
         plot.title = element_text(size = font_size),
         axis.title = element_text(size = font_size),
         legend.text = element_text(size = font_size),
         legend.title = element_text(size = font_size),
         legend.key.size = unit(0.5, "lines"),
         axis.title.x = element_text(margin = ggplot2::margin(t = 0, r = 0, b = 0, l = 0), vjust = 1),
         axis.text.x = element_text(margin = ggplot2::margin(t = 1, r = 0, b = 0, l = 0), vjust = 1),
         axis.title.y = element_text(margin = ggplot2::margin(t = 0, r = 0, b = 0, l = 0), angle = 90, vjust = 1),
         axis.text.y = element_text(margin = ggplot2::margin(t = 0, r = 1, b = 0, l = 0), hjust = 1)
theme_set(theme_cowplot_consistent_text(font_size = 8))
Get to R studio server
In browser, go to http://<yourserverIP>:8787
Better R color palette
    cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
    # grey, orange, blue, forest green, banana yellow, navy blue, red, purplypink
My R color palette
 palette <- c("#FF0000","#0072B2","#E69F00","#009E24", "#979797","#5530AA", "#111111")
 #Red, Blue, Orange,Green, Grey, Purple,Black
Sort a vector by another vector function
sort_func <- function(x){ 
        y <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "T", "U", "V", "W", "X", "Y", "Z", "R", "S") 
  z <- unlist(strsplit(x, ""))
  return(paste(z[order(match(z, y))], collapse=""))

 #sort_func("ZRACB") outputs "ABCZR"

 #Apply to a column
 df %>% 
       rowwise %>% 
       mutate(sortedCode = sort_func(Code))
Get r.squared from a linear regression
    a <- summary(lm(a ~ b), data =d)
Get more space between facet panel text and label
   + theme(strip.text = element_text(margin = margin(.1, 0, .2, 0, "cm"))
Make r variables on the fly, and assign them values (very rare to do)
    ct <- 1
    assign(paste("value.", ct, sep=""), 5)
    ct <- ct + 1
    assign(paste("value.", ct, sep=""), 10)
    #value.1 = 5
    #value.2 = 10
Make a bar chart with colored groups
    r.vals <- c(value.1a, value.2a, value.1b, value.2b)  #values
    r.names<-c(r.1, r.2, r.3, r.4)  #names
    groups <- c("group1", "group2", "group1", "group2" #categories
    df <- data.frame(r.square = vals, names = factor(r.names, levels=r.names), groups)
    cbbPalette <- c("#000000", "#E69F00") #black and orange
    a<-ggplot(aes(x = names, y = r.square), data = df) +
        geom_bar(stat = 'identity', aes(fill=rows)) +
        ylab(expression(paste("Variance Explained (R"^"2", ')', sep=''))) +
        xlab("Predictor Variables") + scale_fill_manual(values=cbbPalette) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
        scale_y_continuous(limits = c(0, 0.3))
Make binary black and white heatmap on table of 1’s and blanks
 df <- read.csv(gene, sep="\t", header=TRUE, row.names=1)
 m <- as.matrix(df, rownames.force=TRUE)
 class(m) <- "numeric"
 m[m==""] <- 0
 m[] <- 0
 nr <- nrow(m)
 heatmap.2(m, scale="none", col=c("white", "black"), cexRow=0.2/log10(nr), trace="none", colsep=c(1,2,3,4,5,6,7,8,9,10), sepcolor="grey", sepwidth=0.01, key=FALSE, xlab="DATABASES", ylab="GENES", margins=c(15,10))
Heatmap with overlayed values
    myPanel <- function(x, y, z, ...) {
    panel.text(x, y, round(z,2))
    colors <- colorRampPalette(c('red', 'seashell'))(256)

    # fullcorr is a numeric 15x15 matrix
    hmap <- print(levelplot(fullcorr[,15:1],  xlab="", ylab="",panel = myPanel, col.regions=colors ,scales=list(tck=0, x=list(rot=45,alternating=2)))) ##### Make a multipanel figure with cowplot

    ab <- qplot(a, b)
    bc <- qplot(b, c)
    cd <- qplot(c, d)
    de <- qplot(d, e)
    allplotted <- plot_grid(ab, bc, cd,de, labels = c("A", "B", "C", "D"), ncol = 2)




Indexed from 1 SELECT IFNULL(“colname”, “None”) NOW()

Selecting data
    SELECT * FROM <tablename> select all columns from a table
    WHERE <columname> = "Texas" AND <columnname2> > 5 AND <columname3> in ("USA", "MEX") conditions
    ORDER BY <columname> desc; order the rows descending
String functions (from

concat select concat(‘a’, ‘’, ‘b’); replace select replace(‘a_b_c’, ‘’, ‘.’); trim select trim(‘ 123 ‘);

Numeric functions (from

round select round(78421/100, 2); % (modulus) (If/Else conditional) select if( 587 % 2 = 0, ‘even’, ‘odd’);

Date functions (from

now select now(); year, monthname select year(now()) as year, monthname(now()) as month;

Get data from UCSC genome browser

First, connect to remote select name, chrom, strand, cast(exonStarts as char(100)) as exon_starts_data, convert(exonEnds, char(100)) as exon_ends_data from sacCer3.sgdGene limit 200;

Make a view of the data
    create or replace view my_saved_view as
   select round(avg(GNP * 1000000 / Population), 0) as "Per capita GNP ($)",
   round(std(GNP * 1000000 / Population), 0) as "Std deviation ($)"
    from country;