A site to help Biochemists learn R.

R for Biochemists is attending and speaking at Battle of the Beards, Cardiff, March 29, 2017 and planning R training in Namibia as part of Project Phoenix


Wednesday, 28 June 2017

Using the Uniprot API to illustrate protein domains...

I'm a member of the Cardiff R User group and we're using R to explore APIs (application programming interfaces), which allow R to access data. I chose to explore the API that allows me to download data from Uniprot, "a comprehensive, high-quality and freely accessible resource of protein sequence and functional information".

Tomorrow, we have our show and tell about what we have learned about using APIs with R. So, in preparation and inspired by Stef Locke, the lead organiser of the Cardiff R User group, I have created my first R package. The package, called drawProteins, uses the API output to create a schematic of protein domains.

Here are two examples of the output. The first is a schematic of a single protein using Base R plotting (script below). The second example is a schematic of three proteins generated using ggplot2 (script in a little while).

Here is the script to plot a schematic of the transcription factor p65 (top panel):

# OK so I have to write some kind of a script to use the new 
# Uniprot API
# This is R-UserGroup homework for Thursday....

# references: https://www.ebi.ac.uk/proteins/api/doc/#!/proteins/search
# references: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkx237

# needs the package httr
# install.packages("httr")
library(httr)

# my package with functions for extracting and graphing
# might require: install.packages("devtools")
# then: devtools::install_github("brennanpincardiff/drawProteins")

# so I want to use the Protein API 
uniprot_acc <- c("Q04206")  # change this for your fav protein

# Get UniProt entry by accession 
acc_uniprot_url <- c("https://www.ebi.ac.uk/proteins/api/proteins?accession=")

comb_acc_api <- paste0(acc_uniprot_url, uniprot_acc)

# basic function is GET() which accesses the API
# requires internet access
protein <- GET(comb_acc_api)

status_code(protein)  # a return of 200 means it worked

# use content() function from httr to give us a list 
protein_json <- content(protein) # gives a Large list
# with 14 primary parts and lots of bits inside
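Under the hood, content() uses jsonlite to turn the JSON reply into nested R lists, and the same navigation works on any JSON string. A minimal sketch with a made-up two-field example (invented fields, not the real UniProt structure):

```r
library(jsonlite)

# a tiny stand-in for an API reply (invented fields, not the UniProt schema)
json_text <- '{"accession": "Q04206", "sequence": {"length": 551}}'
parsed <- fromJSON(json_text, simplifyVector = FALSE)

# nested lists are navigated with $ just like the protein_json object
parsed$accession        # "Q04206"
parsed$sequence$length  # 551
```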

# function from my package to extract names of protein
names <- drawProteins::extract_names(protein_json)

# I like a visualisation so I want to draw the features of this molecule
# https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=Q04206

# Get features 
feat_api_url <- c("https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=")

comb_acc_api <- paste0(feat_api_url, uniprot_acc)

# basic function is GET() which accesses the API
prot_feat <- GET(comb_acc_api)

status_code(prot_feat)  # returns a 200. 

# so to process this into a data.frame
# the pipe needs library(magrittr); flatten() comes from library(purrr)
library(magrittr)
library(purrr)

prot_feat %>%
  content() %>%
  flatten() %>%
  drawProteins::extract_feat_acc() ->
  features

# clean up for plotting
# focus on those that we want to plot...
features_plot <- features[features$length > 0, ]  
features_plot <- features_plot[complete.cases(features_plot),]
features_plot <- features_plot[features_plot$type!= "CONFLICT",]
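The three filters above can be checked on a toy data frame (made-up types and lengths, standing in for the real UniProt features):

```r
# invented stand-in for the features data frame
feats <- data.frame(
  type   = c("CHAIN", "DOMAIN", "CONFLICT", "MOD_RES"),
  length = c(551, 180, 1, NA),
  stringsAsFactors = FALSE
)

keep <- feats[feats$length > 0, ]        # the NA length becomes an all-NA row...
keep <- keep[complete.cases(keep), ]     # ...which complete.cases() removes
keep <- keep[keep$type != "CONFLICT", ]  # drop sequence conflicts
keep$type  # "CHAIN" "DOMAIN"
```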

# add colours 
library(randomcoloR)   # might need install
colours <- distinctColorPalette(nrow(features_plot)-1)
features_plot$col <- c("white", colours)

# now draw this...  with BaseR
# here is a function that does that....
drawProteins::draw_mol_horiz(names, features_plot)

# add phosphorylation sites
# get sites first. 
features %>%
  drawProteins::phospho_site_info() %>%
  drawProteins::draw_phosphosites(5, "yellow")  # 5 circle radius
text(250,0, "Yellow circles = phosphorylation sites", cex=0.8)



Thursday, 8 June 2017

Making a Namibia choropleth

According to Wikipedia, a choropleth map uses differences in shading, colouring, or the placing of symbols within predefined areas to indicate the average values of a particular quantity in those areas. I thought it would be interesting to make one of these in R for Namibia. I'm going to Namibia on Saturday as part of Project Phoenix, a partnership project between Cardiff University and University of Namibia.
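The idea can be sketched in miniature before touching real boundary data: two invented square "regions", each with a single value that drives the fill colour (all names and numbers here are made up):

```r
library(ggplot2)

# corner coordinates of two toy square "regions"
toy_map <- data.frame(
  long  = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c("A", "B"), each = 4)
)
# one value per region - this is what the shading encodes
toy_vals <- data.frame(group = c("A", "B"), value = c(20, 80))
toy_plot_data <- merge(toy_map, toy_vals, by = "group")

ggplot(toy_plot_data, aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(colour = "black") +
  scale_fill_gradient(low = "white", high = "red")
```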

I have used some data about urbanisation in Namibia which I downloaded from the Open Data for Africa website. This data shows some nice variation across the regions of Namibia.

Here is an example of what can be done:

Here is the code and maps made along the way:

## map urbanisation data onto Namibia administrative regions. 

## step 1: download the map data

# packages needed below
library(rmapshaper)  # for ms_simplify()
library(broom)       # for tidy()
library(ggplot2)
library(RCurl)       # for getURL()

## download the file...
link <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/NAM_adm1.rds"
download.file(url=link, destfile="file.rda", mode="wb")
gadmNab <- readRDS("file.rda")

## step 2: simplify the level of detail...
gadmNabS <- ms_simplify(gadmNab, keep = 0.01)

## step 3: turn it into a ggplot object and plot
namMapsimple <- tidy(gadmNabS, region="NAME_1")
ggplot(data = namMapsimple,
       aes(x = long, y = lat, group = group)) + 
       geom_polygon(fill = "red")  # fill outside aes() so the polygons are actually red

# check region names
# the exclamation mark before Karas will cause a problem. 
namMapsimple$id <- gsub("!", "", namMapsimple$id)
# check it worked
# now no exclamation mark
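merge() matches regions by their text names, so a single stray character makes a region silently fail to join. A quick sketch of the problem and the fix (invented region vector):

```r
ids <- c("!Karas", "Erongo", "Khomas")  # "!Karas" won't match "Karas" in the data
ids <- gsub("!", "", ids)               # remove the exclamation mark
ids  # "Karas" "Erongo" "Khomas"
```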

## step 4: let's get our data and bind it...
# download the data about urbanisation
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/learnR/master/data/Namibia_region_urbanRural.csv")
data <- read.csv(text = x)
# exclude whole country
data_regions <- data[!data$region=="Namibia",]
# so extract 2011 data from the data_regions
data_regions2011 <- data_regions[data_regions$Date == 2011, ]
# just extract urban
data_regions2011_urban <- data_regions2011[data_regions2011$residence == "Urban",]
# change colnames so that we can merge by id...
data_col_names <- colnames(data_regions2011_urban)
data_col_names[1] <- c("id")
colnames(data_regions2011_urban) <- data_col_names
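With both data frames sharing an "id" column, merge() can line a region's value up with every one of its map coordinates. A toy sketch with invented coordinates and values:

```r
# a few invented map rows (several coordinates per region)
map_rows <- data.frame(
  id   = c("Erongo", "Erongo", "Khomas"),
  long = c(14.1, 14.2, 17.1),
  lat  = c(-22.1, -22.2, -22.6)
)
# one invented value per region
region_vals <- data.frame(id = c("Erongo", "Khomas"), Value = c(50, 95))

joined <- merge(map_rows, region_vals, by = "id")
# every coordinate row now carries its region's Value
```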

# bind the urbanisation data into the map with merge
mapNabVals <- merge(namMapsimple, data_regions2011_urban,
                    by = 'id')

p <- ggplot() +     # sometimes don't add any aes...
     geom_polygon(data = mapNabVals,
                  aes(x = long, y = lat,
                  group = group,
                  fill = Value))

# plot the outlines of the administrative regions
p <- p + geom_path(data=namMapsimple, 
              aes(x=long, y=lat, group=group))

p <- p + scale_fill_gradient("Urban\n pop\n  %",
                      low = "white", high = "red",
                      limits=c(0, 100))

p <- p +   coord_map()  # plot with correct mercator projection

# add titles 
p <- p + labs(x = "",         # label x-axis
                y = "",  # label y-axis
                title = "Urbanisation in Namibia 2011",
                subtitle = "source: http://namibia.opendataforafrica.org")
p + theme_bw()

# maybe get rid of long, lat etc..

p <- p + theme_bw() +
  theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) + 
  theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank()) + 
  theme(panel.border = element_blank())
p # show the object

## add names of regions
# calculate centroids with gCentroid() from the rgeos package
library(rgeos)   # might need install
trueCentroids = gCentroid(gadmNabS, byid=TRUE)
# make a data frame
name_df <- data.frame(gadmNab@data[,'NAME_1'])
centroids_df <- data.frame(trueCentroids)
name_df <- cbind(name_df, centroids_df)
colnames(name_df) <- c("region", "long", "lat")

# label the regions
p <- p + geom_text(data = name_df,
                     aes(label = region,
                         x = long,
                         y = lat), size = 2)

# change different colours
p + scale_fill_gradient("Urban\n population\n  %",
                        low = "white", high = "blue")

p + scale_fill_gradient("Urban\n pop\n  %",
                        low = "white", high = "green")

# If you want to save this... export or use the ggsave() function
ggsave("Namibia_Urbanisation_Map_2011.pdf", plot = p)




Sunday, 30 April 2017

Three maps of Namibia...

Two colleagues, Dr Mark Kelson and Dr David Gillipse, and I are going to Namibia as part of Project Phoenix. We are planning three days of R training. The materials are being developed and shared on Github as an open source teaching and learning resource.

Before I travel, I'm trying to learn a little more about where I'm going. As a first step, I've been drawing some maps of Namibia. Here are some of the maps:

Colourful map made with library(maps) - map 1 below

Map of regions of Namibia made with library(sp) - map 2 below

Map made with library(ggplot2) - map 3 below
I like using ggplot2 but, of course, there is a lot of personal preference. My next step is to add some data to a map... 


# exploring various ways to draw Namibia in R 
# three ways....

## MAP 1 with library(maps)
library(maps)       # Provides functions that let us plot the maps
library(mapdata)    # Contains the hi-resolution points that mark out the countries.

# https://cran.r-project.org/web/packages/maps/maps.pdf
# gives a nice outline
map('worldHires', 'Namibia')

# add some context by adding more countries
map('worldHires',
    c('Namibia', 'South Africa', 'Botswana', 'Angola', 
      'Zambia', 'Zimbabwe', 'Mozambique'))

# show all of Africa
# map the world but limit by longitude and latitude
map('worldHires',
    xlim=c(-20,60),  # longitude
    ylim=c(-35,40))  # latitude

# focus on countries in Southern Africa
# add colour and a title
map('worldHires',
    c('Namibia', 'South Africa', 'Botswana', 'Angola', 
      'Zambia', 'Zimbabwe', 'Mozambique'),
    fill = TRUE,
    col = c("red", "palegreen", "lightblue", "orange", "pink"))
title(main = "Simple map of Southern Africa")

## MAP 2 with library(sp) and downloaded GADM data 
# this allows us to plot administrative regions inside Namibia
# https://www.students.ncl.ac.uk/keith.newman/r/maps-in-r-using-gadm
# http://biogeo.ucdavis.edu/data/gadm2.8/rds/NAM_adm1.rds


# download the file 
link <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/NAM_adm1.rds"
download.file(url=link, destfile="file.rda", mode="wb")
gadmNab <- readRDS("file.rda")  # special read file of format RDA

# plot with colour filled in, title and label. 
plot(gadmNab, col = 'lightgrey', border = 'darkgrey')
title(main = "Map of admin regions inside Namibia")
points(17.0658, -22.5609,col=2,pch=18)
text(17, -22, "Windhoek")
## MAP 3 with library(ggplot2)
# the advantage is that we can add layers to our plot
# it gives us control
# use the map_data() function to create a data.frame containing the outline of Namibia
nab_map<-map_data("worldHires", region = "Namibia")

# create the object m with the map in it
# colour Namibia grey
m <- ggplot() +
  geom_polygon(data=nab_map,   # geom_polygon draws the shape fill
               aes(x=long,     # longitude
                   y=lat,      # latitude
                   group=group),  # one group per polygon
               fill="grey")
m  # show the object

# modify the object to add a thick black line
m <- m + geom_path(data=nab_map,    # geom_path draw shape outline 
              aes(x=long, y=lat, group=group),
              colour="black",  # colour of line
              lwd = 1.5)       # thickness
m    # show it again

# add a point at the longitude and latitude of Windhoek
# note the data comes from a different data.frame to the map
location <- data.frame(c(17), c(-22), c("Windhoek"))
colnames(location) <- c("longitude", "latitude", "location")

# add a point with the location
m <- m +  geom_point(data=location, 
                     aes(x=longitude, y=latitude), 
                     colour="red", size = 5) 
m  # show the object

# add text at the same point
m <- m + geom_text(data = location, 
                   aes(x = longitude, y = latitude,
                       label = location))  # label has the text

# change the theme to make the map nicer
m <- m + theme_bw()

# add a title
m + ggtitle("Map of Namibia in ggplot")

# see here for more about colours in R
# http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf


Thursday, 27 April 2017

Scraping and visualising global vaccination recommendations with R...

As part of my job, I'm required to check the TravelHealthPro website which contains information about the recommended vaccinations when visiting other countries. For example, for a visit to Namibia, TravelHealthPro advises that most visitors have vaccinations to:
  • Hepatitis A
  • Tetanus
  • Typhoid
When the Cardiff R User Group decided to discuss and work on web scraping, I decided to scrape the TravelHealthPro web site and prepare some maps with selected recommended vaccinations. 

Here are some of the world maps I've prepared:

The script involved some webscraping, some data reformatting and then the visualisations. Here is all the R script involved in making these maps. It was originally prepared as separate scripts to keep it easier for me to understand but I have combined them all here in parts. 

For just the mapping part, it is possible to start halfway with a download from Github of the scraped data (as of April 26th, 2017). To do that start at Part 3.

Feedback welcome as always. 

library("tools")  # uses the toTitleCase() function
library("rvest")  # tidyverse webscraping package
library("RCurl")  # for downloading data from github
library("curl")   # for curl() and new_handle() in the scraping function
library("rworldmap")  # for the maps

## PART 1: scrape the list of countries from TravelPro Home page
# scrape the page.
# this is the URL
url <- c("http://travelhealthpro.org.uk/countries")

# readLines() is a base R function which allows reading html from pages
data <- readLines(url)
print(paste(url, "has just been scraped"))

# identify where the country names are in the scraped data
country_numbers <- grep("travelhealthpro.org.uk/country/", data)
# 317 long... sounds like a good length

# function for removing html tags
# based on http://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r
extractCountry <- function(htmlString) {
  htmlString <- gsub("<.*?>", "", htmlString)
  htmlString <- gsub("\t", "", htmlString)
  return(htmlString)
}

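The non-greedy pattern "<.*?>" matches each tag separately rather than swallowing everything between the first "<" and the last ">". A quick check of the same idea on a made-up list item (hypothetical URL):

```r
# self-contained version of the tag-stripping regex used above
strip_tags <- function(x) gsub("<.*?>", "", x)

strip_tags('<li><a href="http://example.org/country/1">Afghanistan</a></li>')
# "Afghanistan"
```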

# function for extracting the url for each country
extractCountryUrl <- function(htmlString) {
  htmlString <- gsub("\\t<li><a href=", "", htmlString)
  htmlString <- gsub("</a></li>", "", htmlString)
  htmlString <- gsub("\"", "", htmlString)
  htmlString <- gsub(">.*", "", htmlString)
  return(htmlString)
}
# test the function

# loop through the character vector and create a vector of countries
# and a vectors of URLs
countryList <- NULL
countryUrls <- NULL
for(i in 1:length(country_numbers)){
  reqd_url <- extractCountryUrl(data[country_numbers[i]])
  country <- extractCountry(data[country_numbers[i]])
  countryList <- c(countryList, country)
  countryUrls <- c(countryUrls, reqd_url)
}

# I have a list of countries and a list of URLs
countryList # up to 276 looks ok - Zimbabwe should be last
# truncate at Zimbabwe
zimb <- grep("Zimbabwe", countryList)
countryList <- countryList[1:zimb]
countryUrls <- countryUrls[1:zimb]

## PART 2: with list of countries, extract vaccination info... 
# I am attempting to scrape vaccinations recommendations
# from a Travel Health Pro website on a particular country.
# the example here is travel health pro page about afghanistan
# this is the page
# http://travelhealthpro.org.uk/country/1/afghanistan#Vaccine_recommendations

# scraping the webpage is all in this function
extractVacs <- function(x){
  # scrape the page  
  web_content <- read_html(curl(x, handle = new_handle("useragent" = "Chrome")))
  # handle is required as extra data and curl package is required too 
  print(paste(x, "has just been scraped"))
  # extracting the data using pipes 
  vac_list_most <- web_content %>% 
    html_node(".accordion") %>%     # Note this is html_node - first one (Most Travellers)
    html_nodes(".accordion-item")  %>%
    html_node("p") %>%
    html_text(trim = FALSE)
  vac_list_some <- web_content %>% 
    html_nodes(".accordion") %>%     # Note this is html_nodes - all vaccinations!
    html_nodes(".accordion-item")  %>%
    html_node("p") %>%
    html_text(trim = FALSE)
  # using gsub to remove the spaces and the line break symbol
  vac_list_most <- gsub("\n", "", vac_list_most)
  vac_list_most <- gsub("  ", "", vac_list_most)
  vac_list_some <- gsub("\n", "", vac_list_some)
  vac_list_some <- gsub("  ", "", vac_list_some)
  # subtract vac_list_most from vac_list_some
  vac_list_some <- setdiff(vac_list_some, vac_list_most)
  countryName <- gsub("http://travelhealthpro.org.uk/country/", "", x)
  countryName <- gsub("[[:digit:]]+", "", countryName)
  countryName <- gsub("/", "", countryName)
  countryName <- toTitleCase(countryName)
  # make the list with country name and the two vaccinations lists...
  vac_list <- list(country = countryName, vac_most = vac_list_most, vac_some = vac_list_some)
  # this works and returns a list
  return(vac_list)
}
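The setdiff() step is what separates the two recommendation levels: the first accordion holds the "most travellers" vaccinations, the full set covers everything, and the difference is the "some travellers" list. A sketch with invented vaccination lists:

```r
# invented example lists
vacs_most <- c("Hepatitis A", "Tetanus")
vacs_all  <- c("Hepatitis A", "Tetanus", "Rabies", "Typhoid")

vacs_some <- setdiff(vacs_all, vacs_most)
vacs_some  # "Rabies" "Typhoid"
```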

# test the code
vac_list_example <- extractVacs(countryUrls[2])
# works

# for demo purposes - just scrape four 
output_demo <- lapply(countryUrls[7:10], extractVacs)

# to scrape all the countries, remove the hash tag and run the whole thing
# output <- lapply(countryUrls, extractVacs)

# it's a good plan to save this file so that it doesn't have to be scraped again
# save(output, file = "vaccinationList_scraped20170426")
# and save the country list 
# save(countryList, file = "countryList_scraped20170426")

## PART 3: If you don't want to scrape the data - download it...

link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/countryList_scraped20170426.rda"
download.file(url=link, destfile="file.rda", mode="wb")
load("file.rda")  # .rda files made with save() are restored with load(), not readRDS()

link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/vaccinationList_scraped20170426.rda"
download.file(url=link, destfile="file.rda", mode="wb")
load("file.rda")  # restores the object called output

# add "Ivory Coast" to the country List for future binding into rworldmap
countryList <- c(countryList, "Ivory Coast")
# add a new element to list with Ivory Coast 
ivory <- output[grep("Ivory", output)]
ivory[[1]]$country <- c("Ivory Coast")
output <- c(output, ivory[1])  # just one square bracket N.B.

# so we now have countryList and output (the vaccinations list), each 277 long

# next step - turn these two objects into something we can visualise

## PART 4: Clean up vaccinations list

# getting the full unique list of vaccinations/drugs is not trivial
# because of inconsistent labelling on the site

vac_req_mostTrav <- unique(unlist(lapply(output, '[[', 2)))
# white space issue for Hepatitis A

vac_req_someTrav <- unique(unlist(lapply(output, '[[', 3)))
# trailing white space and capitalisation issues for TBE
# "Tick-Borne Encephalitis (TBE)"
# "Tick-Borne Encephalitis (TBE) "
# "Tick-borne encephalitis (TBE)"
# "Tick-borne encephalitis (TBE) " 
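A quick way to confirm these four variants are really one label is to strip the whitespace and case differences (trimws() and tolower() here are just a check; the actual cleanup below uses toTitleCase() and gsub()):

```r
variants <- c("Tick-Borne Encephalitis (TBE)",
              "Tick-Borne Encephalitis (TBE) ",
              "Tick-borne encephalitis (TBE)",
              "Tick-borne encephalitis (TBE) ")
length(unique(variants))                   # 4 - all treated as different
length(unique(tolower(trimws(variants))))  # 1 - really one vaccination
```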

# apply toTitleCase and some gsubs to each element in output
# toTitleCase will change "Hepatitis A" to "Hepatitis a" so change it back...
for(i in 1:length(output)){
  output[[i]] <- lapply(output[[i]], toTitleCase)
  output[[i]]$country <- gsub("-", " ", output[[i]]$country)
  output[[i]]$vac_most <- gsub("Hepatitis a", "Hepatitis A", output[[i]]$vac_most)
  output[[i]]$vac_some <- gsub("Hepatitis a", "Hepatitis A", output[[i]]$vac_some)
  output[[i]]$vac_some <- gsub("Tick-Borne Encephalitis (TBE) ", 
                               "Tick-Borne Encephalitis (TBE)", 
                               output[[i]]$vac_some,
                               fixed = TRUE)  # fixed = TRUE so the brackets match literally
}

## PART 5: Pull out countries with vaccinations we're interested in...
# want a data frame with country in one column
# then I want a column entitled Hep A, Polio etc...

vac_df <- data.frame(countryList, stringsAsFactors = FALSE)
colnames(vac_df) <- c("country")

# go through each element in the list (lapply)
# use a function to see if the vaccination I want is required...

# example: Polio
polioStatus <- lapply(1:length(output), function(x){"Polio" %in% output[[x]]$vac_most})
vac_df$polio <- unlist(polioStatus)
vac_df$polio_val <- as.numeric(vac_df$polio)

# Yellow Fever
yellFevStatus <- lapply(1:length(output), function(x){"Yellow Fever" %in% output[[x]]$vac_most})
vac_df$yellFev <- unlist(yellFevStatus)
vac_df$yellFev_val <- as.numeric(vac_df$yellFev)

# Tick-Borne Encephalitis (TBE)
tbeStatus <- lapply(1:length(output), function(x){"Tick-Borne Encephalitis (TBE)" %in% output[[x]]$vac_some})
vac_df$tbe <- unlist(tbeStatus)
vac_df$tbe_val <- as.numeric(vac_df$tbe)
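Each of the three blocks above follows the same pattern: ask %in% whether the vaccination appears in a country's list, then turn TRUE/FALSE into 1/0 for the map colouring. A compact sketch on an invented two-country list:

```r
# invented stand-in for the scraped output list
output_toy <- list(
  list(country = "Atlantis", vac_most = c("Polio", "Tetanus")),
  list(country = "Erewhon",  vac_most = c("Tetanus"))
)

polio <- sapply(output_toy, function(x) "Polio" %in% x$vac_most)
as.numeric(polio)  # 1 0
```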

## PART 6: join the data into the countries in rworldmap package
#join to a coarse resolution map
spdf <- joinCountryData2Map(vac_df, joinCode="NAME", nameJoinColumn="country")

# 211 codes from your data successfully matched countries in the map
# 66 codes from your data failed to match with a country code in the map
# 32 codes from the map weren't represented in your data

# where polio vaccination is recommended to most travellers 
mapCountryData(spdf, nameColumnToPlot="polio_val",
               addLegend = FALSE,
               mapTitle = "Polio Vaccination Recommended")
text(0,-90, "Source: http://travelhealthpro.org.uk/")

# where yellow fever vaccination is recommended to most travellers 
mapCountryData(spdf, nameColumnToPlot="yellFev_val",
               addLegend = FALSE,
               mapTitle = "Yellow Fever Vaccination Recommended")
text(0,-90, "Source: http://travelhealthpro.org.uk/")

# where Tick-Borne Encephalitis vaccination is recommended to some travellers
mapCountryData(spdf, nameColumnToPlot="tbe_val",
               addLegend = FALSE,
               mapTitle = "Tick-Borne Encephalitis \nVaccination Recommended (some travellers)")
text(0,-90, "Source: http://travelhealthpro.org.uk/")

# zoom on Africa for Yellow Fever data
mapCountryData(spdf, nameColumnToPlot="yellFev_val",
               mapRegion = "Africa",
               addLegend = FALSE,
               mapTitle = "Yellow Fever Vaccination Recommended")
text(10,-35, "Source: http://travelhealthpro.org.uk/")

Useful Resources and further reading