A site to help Biochemists learn R.

Starting points

Monday, 13 November 2017

Plotting some real chromatography data

Diane Hatziioanou who's working through R for Biochemists 101 from the Biochemical Society shared some of her chromatography data with me. 
Diane has published a similar plot in the journal, Anaerobe

Here is the plot made in ggplot2:


The plot is made without the data reformatting that I did in the previous example. Each line is added separately, scaled and text added and positioned manually. The is no legend. 

The curves are smoother because there is lots more data points. 

The secondary axis on the right hand side of the plot is in response to a question from Diane and is created using this code:

p +   scale_y_continuous(sec.axis = sec_axis(~./4, 
    name="Conductivity (mS/cm)"))

The tilda "~./4" part of this code modifies the numbers on the secondary axis. In this case, it divides them by four. We also supply a name for the axis. 

The data is downloaded from Github and is supplied with Diane's permission - thanks Diane. 

Here is the script that makes it and the intermediate plots:

START
library(readxl)
library(ggplot2)

# Hatziioanou, D, Mayer MJ, Duncan, SH, Flint, HJ & Arjan Narbad, A 
# (2013) Anaerobe 23:5-8
# http://dx.doi.org/10.1016/j.anaerobe.2013.07.006

link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/20100212roseburiab.xls"
download.file(url=link, destfile="file.xls", mode="wb")

data_2 <- read_excel("file.xls", skip = 2)

colnames(data_2) <- c("vol_ml", "UV_mAU", "Cond_ml_11_1", "conduct_mS_cm",
  "Cond_11_%_ml", "percent_%", "Conc_11_ml", "percentB_1", "percentB_2",
  "Pressure_ml", "Pressure_ MPa", "Flow_ml", "Flow_ml_min", "Temp_ml", 
  "Temp_C")

# published data is column 2 vs column 7
ggplot(data_2, aes(x = Conc_11_ml, y = UV_mAU)) + geom_line()
# Warning message - missing values

# row 602 gives a volume of almost 20 mls so limit the plot to that
data_2[602,7]
# 19.993

# subset the data - just 602 rows
ggplot(data_2[1:602,], aes(x = Conc_11_ml, y = UV_mAU)) + geom_line()
# output looks good 
# very much like in the paper. 


# manual overlay and text in lots of layers to create a nice graph
# make a canvas and set the x axis
p <- ggplot(data_2[1:602,], aes(x = Conc_11_ml)) 

# add our first line with y = UV which is a measure of protein 
p <- p + geom_line(aes(y = UV_mAU)) 
p


# add our second line with %B - this is a chromatography concept
# we change the scale of the line by multiplying by 10
# define colour as red
p <- p + geom_line(aes(y = percentB_2*10), colour = "red") 
p



# add conductivity as our third line in blue
# we change the scale of the line by multiplying by 4
# define colour as blue
p <- p + geom_line(aes(y = conduct_mS_cm*4), colour = "blue")
p

# add coloured text to explain the lines
# x and y define the places the text arrive and label gives the text
p <- p +  geom_text(mapping = aes(x = 2, y = 400, label = "UV (mAU)"))
p <- p + geom_text(mapping = aes(x = 4, y = 50, label = "0% B"), colour = "red")
p <- p + geom_text(mapping = aes(x = 17.5, y = 1050, label = "100% B"), colour = "red")
p <- p + geom_text(mapping = aes(x = 7, y = 950, label = "Conductivity (mS/cm)"), colour = "blue")
p

# this gives the secondary axis on the right hand side
# the function sec_axis() with a correction factor to change the number
# in this case the correction factor is "/4" meaning divide by four
# we multiplied by four when we plotted the blue line. 
# we also give the scale a name - "Conductivity" and units. 
p <- p +   scale_y_continuous(sec.axis = sec_axis(~./4, 
    name="Conductivity (mS/cm)"))
p


# add a theme to change the background
p <- p + theme_bw()
p

# add labels, titles with a source for the data...
p <- p + labs(x = "Volume (ml)",
    y = "[Protein] (UV mAU)",
    title = "Chromatogram of FPLC purification",
    subtitle = "Source: Hatziioanou et al, (2013) Anaerobe 23:5e8")
p
END



Some Resources:




Thursday, 9 November 2017

Using drawProteins with BioMart to draw schematic of human MAP kinases...

Last week, I submitted my drawProteins R package to Bioconductor. One of the recommended things to do as a Bioconductor developer is to ensure that the package can work with other Bioconductor packages.

I've created this blog post to show how it's possible to use bioMart to pull out the UniProt accession numbers for the Gene Ontology (GO) term, "MAP kinase activity". This has a GO number of GO:0004707.

My script borrows heavily on the biomaRt users guide written by Steffen Durinck, Wolfgang Huber, Mike Smith. My thanks to them.

Using the R script below, I created this protein schematic:



START
# install bioMart if you haven't used it before
# remove the hash tags...
# source("http://www.bioconductor.org/biocLite.R ")
# biocLite()
# biocLite("biomaRt")
library(biomaRt)

# steps
# 1 choose a Mart with useMart()
# 2 choose a dataset with useMart()
# 3. make query with getBM()
# with
# A. filter  -  restriction on the query
# e.g. Interpro ID(s) [e.g. IPR000001]
# e.g. chromosome_name
# B. attributes  -  values we are interested in to retrieve.
# C. values - values you want to

listMarts()
# from https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html

# chosing a database = MART and a dataset - gets more focussed each step...
ensembl = useMart("ensembl",
                  dataset="hsapiens_gene_ensembl")

# Retrieve all entrezgene identifiers and HUGO gene symbols of genes which have
# a “MAP kinase activity” GO term associated with it.
# this is the GO:0004707
getBM(attributes = c('entrezgene','hgnc_symbol'),
      filters = 'go',
      values = 'GO:0004707',
      mart = ensembl)

# this is 14 proteins....

# create output in a dataframe and add uniprotswissprot
# which is the UniProt ID
output <- getBM(attributes = c('unigene',
                               'uniprotswissprot',
                               'hgnc_symbol'),
                filters = 'go',
                values = 'GO:0004707',
                mart = ensembl)


# returns a dataframe... pull out uniprotIDs for drawing...
uniprotIDs <- output$uniprotswissprot
# get rid of blank entries - turn into NA
uniprotIDs[uniprotIDs==""] <- NA
# remove NA
uniprotIDs <- na.omit(uniprotIDs)
# make the IDs characters
uniprotIDs <- as.character(uniprotIDs)
# just the unique ones
uniprotIDs <- unique(uniprotIDs)
# combine into one element
uniprotIDs <- paste(uniprotIDs, collapse = " ")
# this can now be used in drawProteins

# now get features from Uniprot
library(magrittr)

# install drawProteins from Github
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)

uniprotIDs %>%
  drawProteins::get_features() %>%
  drawProteins::feature_to_dataframe() ->
  prot_data
# data frame with 722 observations

library(ggplot2)
# basic drawing
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
draw_repeat(p, prot_data)



p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 4)

# background and y-axis
p <- p + theme_bw(base_size = 20) +  # white background and change text size
  theme(panel.grid.minor=element_blank(),
    panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(),
    axis.text.y = element_blank()) +
  theme(panel.border = element_blank())

# add titles
p <- p + labs(title = "Schematic of human MAP kinases",
  subtitle = "circles = phosphorylation sites\nsource:Uniprot")


# move legend to top
p <- p + theme(legend.position="top") + labs(fill="")

p
# AND it works!!

END





Resources and citations:

Monday, 6 November 2017

Making a chromatography plot with an axis on the right hand side...

While working through our online course, R for Biochemists 101 from the Biochemical Society, Diane Hatziioanou made the following comment:
Could you point me in the direction of help to make a plot with different axes? For example for a character series “fractions” make one plot with two sets of data, one plotted against the right and one against the left y axes where the two y axes could be something like absorbance and activity. Thanks!

This inspired me to create this visualisation:

If you more information, leave a comment below.

Here is the script:
START
#https://stackoverflow.com/questions/43942002/geom-line-with-different-y-axis-scale


library(magrittr)
library(tidyr)

sample_data <- data.frame(fracs = rep(1:20),
  Absorbance = c(0.1, 0.1, 0.1, 0.2, 0.4, 0.1, 0.1, 0.5, 1.0, 0.1,
    0.1, 0.1, 0.5, 0.1, 0.1, 0.2, 0.3, 0.5, 1.0, 0.1),
  Activity = c(32, 33, 35, 150, 287, 39, 35, 40, 42, 32,
    30, 31, 32, 45, 57, 32, 33, 40, 41, 40))

# reformat the data into long format to allow plotting
data <- sample_data %>% gather(y, value, Absorbance:Activity)
data$fracs <- as.integer(data$fracs)

# trial plot for Activity (enzyme activity)
ggplot(data, aes(fracs, value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Activity")) 




# trial plot for Absorbance
ggplot(data, aes(fracs, value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Absorbance"))



# plot the two lines together
ggplot(data, aes(fracs, value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Absorbance")) + 
  geom_line(data=data %>% filter(y=="Activity"))
# but different values makes Absorbance line flat. 





# change the scale for second line
max(sample_data$Absorbance)
max(sample_data$Activity)
scaleby = max(sample_data$Activity)/max(sample_data$Absorbance)

ggplot(data, aes(colour=y)) + 
  geom_line(data=data %>% filter(y=="Activity"),
    aes(x = fracs, y = value)) + 
  geom_line(data=data %>% filter(y=="Absorbance"),
    aes(x = fracs, y=value*scaleby))
# this looks better...


# putting it together with labels etc... 
p <- ggplot() + 
  geom_line(data=data %>% filter(y=="Activity"),
    aes(x = fracs, y = value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Absorbance"),
    aes(x = fracs, y=value*scaleby, colour=y))

# limit for y axis
p <- p + ylim(0,300)

# adding our second scale on the right hand side of the plot
p <- p + scale_y_continuous(sec.axis = sec_axis(~./scaleby, 
    name="Absorbance (570nM)"))
  
# label the x and y axis...
p <- p + ylab("Enzyme activity (units)")
p <- p + xlab("Chromatography fractions")


# This is salt chromatography so adding a line showing 
# the NaCl concentration. 
salt_conc <- data.frame(fracs = rep(1:20),
  conc = c(c(10), seq(from = 10, to = 250, by= 13)))

# salt curve labels
salt_label <- data.frame(x = c(1,20), 
  y = c(10, 250), 
  text = c("[10nM]\nNaCl", "[250nM]\nNaCl"))

# check out the line
ggplot(salt_conc, aes(fracs, conc)) + 
  geom_line() + 
  geom_text(data = salt_label,
    aes(x = x,
      y = y,
      label = text))



# linetype=2 gives a dashed line
p <- p + geom_line(data = salt_conc, aes(fracs, conc),
  linetype = 2)
# add some labels to show the starting and ending concentrations
p <- p + geom_text(data = salt_label,
              aes(x = x,
                y = y,
                label = text),
                size = 2.5)

p <- p + theme_bw()
p <- p + ggtitle("Chromatography plot")
p <- p + theme(legend.position="top", legend.direction="horizontal")
p <- p + theme(legend.title=element_blank())
p





END

Thursday, 2 November 2017

Using drawProteins for draw NF-kappaB proteins...

Update,  3 Nov 2017: I've renamed the functions e.g. geom_chains is now draw_chains and added a new function draw_canvas which allows one to draw without the chain, if you want.

Building on what I did previously, more work done on the drawProteins package allows the simplier script below to draw this protein schematic:


SCRIPT START
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)
library(ggplot2)

# accession numbers of five NF-kappaB proteins
"Q04206 Q01201 Q04864 P19838 Q00653" %>%
  drawProteins::get_features() %>%
  drawProteins::feature_to_dataframe() ->
  prot_data

p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p <- draw_repeat(p, prot_data)
p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 8)

# background and y-axis
p <- p + theme_bw(base_size = 20) +  # white background and change text size
  theme(panel.grid.minor=element_blank(),
        panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  theme(panel.border = element_blank())

# add titles
p <- p + labs(x = "Amino acid number",         # label x-axis
              y = "",  # label y-axis
              title = "Schematic of human NF-kappaB proteins",
              subtitle = "circles = phosphorylation sites\nRHD = Rel Homology Domain\nsource:Uniprot")


# move legend to top
p <- p + theme(legend.position="top") + labs(fill="")

p
SCRIPT END

Resources

Tuesday, 31 October 2017

Chance of dying for males and females using the UK Life Tables

I am currently answering comments to R for Biochemists 101 - online training that I've created for the Biochemical Society. One of the participants asked about overlaying the different chances of dying for male and females inspired by the blog piece Exploring the chance of dying using the UK Life Tables.

As is often the case in R, there is a quick and simple answer which works ok if the data is very self explanatory. However, it's difficult to draw a legend.

There is also a better way that builds on the strengths of ggplot2 but it requires reformatting the data. I've used the melt() function from the reshape package.

Here is the graph made with the better answer:


The code below also shows the quick way...

START
library(readxl)
library(ggplot2)

# The UK National Life Tables
# these are available through the Office of National Statistics
# as an excel file from this page:
# https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/lifeexpectancies/datasets/nationallifetablesunitedkingdomreferencetables
# the can be downloaded from here:
ons_url <- c("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/lifeexpectancies/datasets/nationallifetablesunitedkingdomreferencetables/current/nltuk1315reg.xls")
# to guarantee undistrubed access, I've put the file on github
github_url <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/nltuk1315reg.xls"

# the download.file() function downloads and saves the file with the name given
download.file(url=ons_url,destfile="file.xls", mode="wb")
# if this doesn't work try replacing the url with the github_url
# then we can open the file and extract the data using the read_excel() function.
data<- read_excel("file.xls", sheet = 5, skip = 6)

# need to remove four bottom rows - these are all blank.
data <- data[1:101,]

colnames(data)

# x = age in years
# first set is males and second set is females
# mx is the central rate of mortality, defined as the number of deaths at age x last birthday in the three year period to which the National Life Table
# relates divided by the average population at that age over the same period.

# qx is the mortality rate between age x and (x +1), that is the probability that a person aged x exact will die before reaching age (x +1).

# to allow us to separate the male and female data, I am relabelling the column names

colnames(data) <- c("age",
                    "m.mx", "m.qx", "m.lx", "m.dx", "m.ex",
                    "f.mx", "f.qx", "f.lx", "f.dx", "f.ex")


# the life table allow us to draw some interesting curves
# the first one I want to explore is my chance of dying this year.

# qx is the key value.
# the chance of my dying before my next birthday.

# the ending point from the previous graph
# from http://rforbiochemists.blogspot.co.uk/2017/03/exploring-chance-of-dying-using-uk-life.html

p <- ggplot(data = data,
       aes(x = age, y = m.qx)) +
  geom_line(colour = "red") +
  geom_point(colour = "red", size = 0.75) +
  ylab("Chance of dying") +
  xlab("Current Age") +
  ggtitle("Chance of Men Dying before next Birthday") +
  scale_y_log10(
    breaks = c(0.0001, 0.001, 0.01, 0.1, 0.5),  # where to put labels
    labels = c("1/10,000", "1/1000", "1/100", "1/10", "1/2")) +  # the labels
  theme_bw()

# show the object
p





# quick way
# add geom_line() and geom_point() with more data
p +
  geom_line(data = data,
                aes(x = age, y = f.qx),
                colour = "blue") +
  geom_point(data = data,
             aes(x = age, y = f.qx),
             colour = "blue", size = 0.75) +
  # and change title
ggtitle("Chance of Dying before next Birthday")





# However, it's really difficult to add a proper legend

# Longer way - reformat the data into 'long' format
# using melt() function from reshape() package

library(reshape2)
# subset the columns we want - just three
data_s <- data.frame(cbind(data$age, data$m.qx, data$f.qx))
# rename columns
colnames(data_s) <- c("age", "male", "female")

# create new object with melt in long format
data_long <- melt(data_s, id.vars = "age")
# rename columns...
colnames(data_long) <- c("age", "gender", "chance_die")

# now draw the object
# using the colour aesthetics
# you get a legend automatically.
ggplot(data = data_long,
       aes(x = age,
           y = chance_die,
           colour = gender)) +
  geom_line() +
  geom_point() +
  ylab("Chance of dying") +
  xlab("Current Age") +
  ggtitle("Chance of Dying before next Birthday") +
  scale_y_log10(
    breaks = c(0.0001, 0.001, 0.01, 0.1, 0.5),
    labels = c("1/10,000", "1/1000", "1/100", "1/10", "1/2")) + 
  theme_bw()

END of SCRIPT

Resources

Tuesday, 15 August 2017

Drawing NFkappaB protein schematics with ggplot2..

Update (2 Nov 2017): This still works but has been simplified with drawProteins package. See this blog page for the better example.

I've been busy with work and preparing materials for our R for Biochemists 101 online training course for the Biochemical Society. The materials are being tested now and the course should be available in October 2017. Then I had a short holiday...

So, it's taken a little longer than I hoped but here is the code to draw protein schematics with ggplot2. The schematic shows protein schematics of the five human NF-kappaB proteins. Here is the visualisation:




The code uses my drawProteins package which is on Github. This can be installed using the function install_github() from the package devtools. The required syntax is devtools::install_github("brennanpincardiff/drawProteins")

I've gone through the code step by step. It's quite long and I am planning to create some more functions that will shorten the code.

Here is the step by step code that uses drawProteins and ggplot2 to create a protein schematic visualisation.  The ggplots along the way are shown too. To see the plots best requires a plot window of about 900 pixels.

# START
# working with rel proteins as I have been studying these for 10 year
# using ggplot2 to draw the plots.
# proteins are aligned to the N-termini
# alternative alignment options would be worth considering

library(httr)
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)
library(ggplot2)

# Uniprot accession numbers of five human NF-kappaB family members
proteins_acc <- c("Q04206 Q01201 Q04864 P19838 Q00653")

# accession numbers need to be combined with "%2C" for Uniprot API
proteins_acc_url <- gsub(" ", "%2C", proteins_acc)

# this is the baseurl for a multiple features query
baseurl <- "https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession="

# make url to GET features for multiple accession numbers
url <- paste0(baseurl, proteins_acc_url)

# basic function is GET() which accesses the API
# accept_json() argument gives a JSON object
prots_feat <- GET(url,
                  accept_json())

status_code(prots_feat)  # returns a 200 - good

# extract content() - gives a list with length = number of acc no
prots_feat_red <- httr::content(prots_feat)


################
# loop to work through the API object and convert to data.frame
# probably there is a better way to do this
features_total_plot <- NULL
for(i in 1:length(prots_feat_red)){
  # the extract_feat_acc() function takes features into a data.frame
  features_temp <- drawProteins::extract_feat_acc(prots_feat_red[[i]])
  features_temp$order <- i  # this order is needed for plotting later
  features_total_plot <- rbind(features_total_plot, features_temp)
}

# look at the data.frame
View(features_total_plot)  # 319 observations


### use ggplot2::geom_rect to draw moleucles
### in this script each type of feature is drawn separately.

## step 1: set up the canvas in ggplot (no data added)
# want to put text on the left in the area of negative numbers
plot_start <- -max(features_total_plot$end)*0.2

# need it to be longer than the longest molecule plus a bit...
plot_end <- max(features_total_plot$end) + max(features_total_plot$end)*0.1

p <- ggplot() +
  ylim(0.5, max(features_total_plot$order)+0.5) +
  xlim(plot_start, plot_end)
p # show the object





## step 2: draw the CHAINS in grey (narrower)
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "CHAIN",],
                   mapping=aes(xmin=begin,
                               xmax=end,
                               ymin=order-0.2,
                               ymax=order+0.2),
                   colour = "black",
                   fill = "grey")
p



## step 3: annotate with names
p <- p +  annotate("text", x = -10, y = features_total_plot$order,
                   label = features_total_plot$entryName,
                   hjust = 1,
                   size = 4)
# check the object
p


## step 4: plot domains fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "DOMAIN",],
                   mapping=aes(xmin=begin,
                               xmax=end,
                               ymin=order-0.25,
                               ymax=order+0.25,
                               fill=description))
p


# label domains with descriptions
feat_domain <- features_total_plot[features_total_plot$type == "DOMAIN", ]
p <- p + geom_label(data = feat_domain,
                    aes(x = begin + (end-begin)/2,
                        y = order,
                        label = description))
p
# this works well with geom_label()


## step 5 plot motifs fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "MOTIF",],
                   mapping=aes(xmin=begin,
                               xmax=end,
                               ymin=order-0.25,
                               ymax=order+0.25,
                               fill=description))

p
# not going to label these as they're in the description and they are small


## step 6 plot repeats fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "REPEAT",],
                   mapping=aes(xmin=begin,
                               xmax=end,
                               ymin=order-0.25,
                               ymax=order+0.25),
                   fill = "lightblue")
p


# label repeats (for this they are ANK but remove digits)
p <- p + geom_text(data = features_total_plot[features_total_plot$type == "REPEAT",],
                    aes(x = begin + (end-begin)/2,
                        y = order,
                        label = gsub("\\d", "", description)),
                   size =2)
p


# add titles
p <- p + labs(x = "Amino acid number",         # label x-axis
              y = "",  # label y-axis
              title = "Schematic of five human NF-kappaB proteins",
              subtitle = "circles = phosphorylation sites\nRHD: Rel Homology Domain\nsource:Uniprot")
p



# white background and remove y-axis
p <- p + theme_bw() +  # white background
         theme(panel.grid.minor=element_blank(),
         panel.grid.major=element_blank()) +
         theme(axis.ticks = element_blank(),
               axis.text.y = element_blank()) +
         theme(panel.border = element_blank())
p


# add phosphorlation sites
# extract info about sites using phospho_site_info() function
phospho_sites <- drawProteins::phospho_site_info(features_total_plot)

# draw phosphorylation sites on the protein with geom_point()
p <- p + geom_point(data = phospho_sites,
               aes(x = begin,
                   y = order+0.25),
               shape = 21,
               colour = "black",
               fill = "yellow", size = 4, stroke = 1)

# to see it properly you need to zoom...

# I have to admit that I am pleased with this diagram.
# The colours are better, the labels seem to work.

p + ggsave("five_human_nfkappab_proteins_20170814.pdf")
# END


The output looks best as a PDF, I think so that's why I've saved it.
There are lots of options for altering text size and the like.
I am still working on visualising protein schematics in R so if you have feedback, it would be most welcome.



Resources