# install the package
install.packages("xlsx", repos = "https://cran.rstudio.com/")
# CRAN is the repository by default but it can be optionally indicated
R4: Functions in R
1 R packages
As you already know, R is a ‘GNU Software’ with a GPL license. As a a freely available language it has a great community of users from diverse background and interests. This community has developed a myriad of applications for R, called R Packages. The idea behind R packages is that the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them (help), and some sample data to perform tests.
Packages can be installed from one of the public R repositories. In this course we will mainly use two R repos, CRAN and Bioconductor. CRAN stands for “the Comprehensive R Archive Network”, and it contains a huge variety of packages free to use. On the other hand, as we will see in more detail later on the Lesson R9, Bioconductor is a repository of software devoted to bioinformatics or computational biology applications. As for June 2023, the CRAN package repository features 19,622 available packages whereas Bioconductor release 3.16 contains 2,183 packages.
A full list of CRAN packages can be found here and a list sorted by topic here.
1.1 Installing and use R packages
As an example, we are going to install and use xlsx
, a handy package to import/export tables in MS Excel format. You may also check related packages, such as readxl
or pzfx
.
Some R packages may have additional requirements. This is something that is not very common and is outside of R or RStudio, as it depends on how the package was developed. Therefore, there is no general rule or recommendation here.
After testing with students in class, I realized that the xlsx
package requires the path of your Java executable. There are a fewwaysyou could fix it, but my suggestion would be to use a package that does not have depend on Java. Now that you know how to install and load packages, you can also try other packages for reading and writing XLS files, such as openxlsxor readxl.
You must load the package before using it.
# two ways of loading the packages:
# 1
library(xlsx)
# 2
require(xlsx)
# help
help(xlsx)
# you can also write in the console ??xlsx
# you can use an if statement to install package only if not installed
if (!require(xlsx)) {
install.packages("xlsx")
library(xlsx)
}# unload the package (optionally)
detach("package:xlsx", unload = TRUE)
Optionally, if you want to unload a package from memory, you can use detach()
. This can be useful for long scripts.
In the above code we have used require()
and library()
functions to call for package loading. These are very similar functions that are often interchangeable. The main difference is that when you use require()
and it cannot be executed, you will get a warning (see below for how to use warning), but not an error. Therefore, your code will always continue to execute if possible. In the following example, we tried to load an unknown package called uam
.
library(uam)
Error in library(uam): there is no package called 'uam'
require(uam)
Loading required package: uam
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'uam'
In the next example, we are going to read the file coli_genomes_renamed.csv, that we used in the lesson R3 and export it ready for MS Excel using the package xlsx
.
# open it again
<- read.csv(file = "data/coli_genomes_renamed.csv")
coli_genomes # save
library(xlsx)
write.xlsx(coli_genomes, "data/coli_genomes.xlsx")
Many packages in CRAN also contain a reference manual and some of them also a vignette. A vignette is practical guide to each package. You can see all the installed vignettes with browseVignettes()
or just vignette()
for the primary vignette. You can find a bunch of tutorials and tricks about how to use popular packages, but the vignette is an official and complete reference that is always helpful.
browseVignettes("xlsx")
starting httpd help server ... done
browseVignettes("seqinr")
No vignettes found by browseVignettes("seqinr")
browseVignettes("ggplot2")
Note that in some cases, as in the package seqinr, there is no official vignette at CRAN. However, as R is an open access language, you will easily find information about any package or function in other websites, such as rdrr.io, rdocumentation.org or stackoverflow.com, among others.
1.2 Too many packages
As there are a lot of packages sometimes it takes some time to test different packages for the same analysis. Also, some packages become obsolete with a new R or RStudio version. Thus, is also a good habit to uninstall packages, with the function remove.package()
.
Sometimes, you can come into two different namesake functions from independent packages. Also, to reduce the memory load you may want to use a function without loading the package. In those cases, as in the examples below, there is a trick to call a specific function, the prefix package::
. However, it should be noted that in some packages the syntax is complex and you need to call more than one function to actually use a major function.
# package getwiki
install.packages("getwiki")
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
vignette("getwiki")
<- getwiki::search_wiki("DNA")
DNA str(DNA)
'data.frame': 20 obs. of 2 variables:
$ titles : chr "A-DNA" "Circular DNA" "Complementary DNA" "DNA" ...
$ content: chr "A-DNA is one of the possible double helical structures which DNA can adopt. A-DNA is thought to be one of three"| __truncated__ "Circular DNA is DNA that forms a closed loop and has no ends. Examples include:Plasmids, mobile genetic elemen"| __truncated__ "In genetics, complementary DNA (cDNA) is DNA synthesized from a single-stranded RNA (e.g., messenger RNA (mRNA)"| __truncated__ "Deoxyribonucleic acid ( ; DNA) is a polymer composed of two polynucleotide chains that coil around each other t"| __truncated__ ...
$titles DNA
[1] "A-DNA" "Circular DNA" "Complementary DNA"
[4] "DNA" "DNA (disambiguation)" "DNA extraction"
[7] "DNA ligase" "DNA methylation" "DNA polymerase"
[10] "DNA profiling" "DNA repair" "DNA replication"
[13] "DNA sequencing" "DNA virus" "DNA²"
[16] "DNA–DNA hybridization" "DnaA" "Environmental DNA"
[19] "Mitochondrial DNA" "Recombinant DNA"
In the four examples below, you will see that ggplot()
function can be used if you call the package with the prefix ggplot::
. However, to make a plot you need several functions and you need to either explicit the prefix for all the required functions (option #3) or load the package, as in #4. Check the Lesson R7 for more info on ggplotting.
# package ggplot2 1: no package load or calling with prefix
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) + geom_point()
Error in ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)): could not find function "ggplot"
# 2: call the package with the prefix at the begining of the line
::ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) + geom_point() ggplot2
Error in aes(Petal.Length, Petal.Width, colour = Species): could not find function "aes"
# 3: prefix in all functions
::ggplot(iris, ggplot2::aes(Petal.Length, Petal.Width, colour = Species)) +
ggplot2::geom_point() ggplot2
# 4: load the package
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) + geom_point()
Note that in the example we have plotted some data from the dataset named iris. Did you load that data? No need. R has some built-in datasets that can be used as examples for plots or analysis. You can check all of them using the function data()
.
1.3 Quick exercise (I)
- Install the package
report
and check its vignette and the info at the package site: https://easystats.github.io/report/
install.packages("report")
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(report)
# look for the vignette
vignette(report)
Error in topic[1L]: object of type 'closure' is not subsettable
# no vignette for report! Check also the CRAN site for further
# documentation.
- As we mentioned in the example with
ggplot
above, R has some built-in datasets that can be used as examples. Another of those datasets is called DNase and it contains data of several experiments of DNase treatment. Explore data and use thereport
package documentation and use it to test correlation between the DNase variablesconc
anddensity
.
report(DNase)
The data contains 176 observations of the following 3 variables:
- Run: 11 levels, namely 10 (n = 16, 9.09%), 11 (n = 16, 9.09%), 9 (n = 16,
9.09%), 1 (n = 16, 9.09%), 4 (n = 16, 9.09%), 8 (n = 16, 9.09%), 5 (n = 16,
9.09%), 7 (n = 16, 9.09%), 6 (n = 16, 9.09%), 2 (n = 16, 9.09%) and 3 (n = 16,
9.09%)
- conc: n = 176, Mean = 3.11, SD = 4.06, Median = 1.17, MAD = 1.56, range:
[0.05, 12.50], Skewness = 1.46, Kurtosis = 0.84, 0% missing
- density: n = 176, Mean = 0.72, SD = 0.60, Median = 0.53, MAD = 0.62, range:
[0.01, 2.00], Skewness = 0.58, Kurtosis = -1.07, 0% missing
report(cor.test(DNase$conc, DNase$density))
Effect sizes were labelled following Funder's (2019) recommendations.
The Pearson's product-moment correlation between DNase$conc and DNase$density
is positive, statistically significant, and very large (r = 0.93, 95% CI [0.91,
0.95], t(174) = 33.64, p < .001)
2 Create user-written functions
2.1 Why? How?
We have discussed throughout the last lectures how R can help you saving time when you need to analyze and plot the data from your experiment. However, many times, particularly in Bioinformatics, you won’t have the data from one single experiment but from many of them.
Creating you own function will be very useful for automation of repetitive analyses or to encapsulate a sequence of expressions that need to be executed numerous times, perhaps under slightly different conditions. Functions are also often written when code must be shared with others or the public.
In R, functions are also considered as objects. That means that (1) they can be nested, so you can define a function inside another function and (2) you can use functions as arguments to other functions. We will see very useful examples of this latter feature in Lesson R6, using custom functions as arguments for aggregate()
, xtabs()
, lapply()
or sapply()
.
The overall scheme of an R function is the following:
my_function <- function(argument1, argument2,...){
statements
return(object)
}
2.2 My first simple functions
We are going to learn with some examples from a good online tutorial. First, a quite simple function can simply help with calculations:
# my first function
<- function(x) {
myFunction <- x^2 * 4 + x/3
f return(f)
}# we try it
myFunction(4)
[1] 65.33333
myFunction(0)
[1] 0
myFunction(22)
[1] 1943.333
myFunction(3)
[1] 37
We can include conditions, loops… Another example can be a function to identify even/odd numbers:
# A simple R function to check whether x is even or odd
<- function(x) {
evenOdd if (x%%2 == 0) {
return("even")
else {
} return("odd")
}
}# test
evenOdd(4)
[1] "even"
evenOdd(3)
[1] "odd"
# no need the curly braces!
<- function(x) {
evenOdd2 if (x%%2 == 0)
return("even") else return("odd")
}evenOdd2(4)
[1] "even"
evenOdd2(3)
[1] "odd"
evenOdd2(7)
[1] "odd"
evenOdd2(8)
[1] "even"
In the above example, we found out that curly braces can be omitted sometimes in if statements or loops (see https://www.learnbyexample.org/r-for-loop/#for-loop-without-curly-braces). This modification makes the code handier, but also riskier, use it carefully. Remember, a great power entails a great responsibility. The same applies to functions. Thus, sometimes when creating an R script, you want to create a small function and use it just once. That happens usually when you want to use your own functions to parse data within an apply
family function (see Lesson R6). To deal with those situations, you can use the inline function. To create an inline function you have to use the function command with the argument x and then the expression of the function.
Example:
# inline functions
<- function(x) x^2 * 4 + x/3
f
f(4)
[1] 65.33333
f(0)
[1] 0
f(22)
[1] 1943.333
2.3 Quick exercise (II)
Write an inline function that transforms concentrations units: molar (M) into micromolar (µM). Test it with your own R session.
Select the right answer2.4 Functions with multiple arguments
Now, we will create a function in R Language that will take multiple inputs and gives us one output.
# A simple R function to calculate area and perimeter of a rectangle
<- function(length, width) {
area = length * width
area # you may format the output
print(paste("The area of the rectangle is", length, "x", width, "=", area,
"cm²"))
}
area(2, 3) # call the function
[1] "The area of the rectangle is 2 x 3 = 6 cm²"
Notice that the output also can be a vector or a list:
# Now we calculate area and perimeter of a rectangle
<- function(length, width) {
Rectangle = length * width
area = 2 * (length + width)
perimeter
# create an object called result which is a list of area and perimeter
= list(Area = area, Perimeter = perimeter)
result return(result)
}
Rectangle(2, 3)
$Area
[1] 6
$Perimeter
[1] 10
Like in any R function, you can call the arguments by position or by name. Thus, if add the names of the variables when calling the function you can switch the order of the arguments. Also, you can add some default values when you define the function.
# A simple R code to demonstrate passing arguments to a function
<- function(length = 5, width = 4) {
Rectangle = length * width
area return(area)
}
# Case 1:
Rectangle(2, 3)
[1] 6
# Case 2: If you do not want to follow any order, you can include the name
# of the arguments
Rectangle(width = 8, length = 4)
[1] 32
# Case 3: default's values
Rectangle()
[1] 20
2.5 Including checkpoints
Now we are going to try a longer code to solve the following example.
How many arguments should have this function?
Now let’s see a possible way to do it:
# we need to arguments
<- function(samples, category) {
price_calculator <- c(1, 1.15, 2)
categories names(categories) = c("normal", "priority", "urgent")
if (samples < 10) {
<- 19 * samples * categories[which(names(categories) == category)]
price else if (samples < 50) {
} <- 14 * samples * categories[which(names(categories) == category)]
price else if (samples >= 50) {
} <- 10 * samples * categories[which(names(categories) == category)]
price
}paste("El precio es de", price, "euros.")
}price_calculator(samples = 10, category = "normal")
[1] "El precio es de 140 euros."
price_calculator(10, "urgent")
[1] "El precio es de 280 euros."
price_calculator(10, "urgnt")
[1] "El precio es de euros."
Note that which()
gives you a vector with the index or indices of the vector, not the actual values. That’s why we are using categories[]
to obtain the desired subset; in this case, the value of the vector element that accomplish the condition indicated with which.
In the following example, we used a more straightforward approach. We just switched the strings (“normal”, “priority” and “urgent”) for the numeric values, using the function switch()
. You can see more examples of the use of switch()
in this link.
It seems to work, but it is not detecting wrong data input. How could we introduce checkpoints for the introduced arguments?
# new version with checkpoints
<- function(samples, category = "normal" | "priority" | "urgent") {
price_calculator2 <- switch(category, normal = 1, priority = 1.15, urgent = 2)
category if (samples < 10) {
<- 19 * samples * category
price else if (samples < 50) {
} <- 14 * samples * category
price else if (samples >= 50) {
} <- 10 * samples * category
price
}ifelse(length(price) > 0, return(price), stop("Prioridad incorecta. No se ha podido calcular el precio"))
}price_calculator2(10, "normal")
[1] 140
price_calculator2(10, "urgent")
[1] 280
price_calculator2(10, "urgnt")
Error in ifelse(length(price) > 0, return(price), stop("Prioridad incorecta. No se ha podido calcular el precio")): Prioridad incorecta. No se ha podido calcular el precio
price_calculator2(5.3, "normal")
[1] 100.7
# WTF?
We just noticed that the function calculated the price for 5.3 samples, which is nonsense. We should then introduce a new checkpoint for the format of the introduced value for the variable samples.
# alternative with checkpoint for number of samples
<- function(samples, category = "normal" | "priority" | "urgent") {
price_calculator3 <- switch(category, normal = 1, priority = 1.15, urgent = 2)
category if (abs(floor(samples)) != samples) {
# check that number of samples is an integer number
stop("Número de muestras incorrecto")
}if (samples < 10) {
<- 19 * samples * category
price else if (samples < 50) {
} <- 14 * samples * category
price else if (samples >= 50) {
} <- 10 * samples * category
price
}ifelse(length(price) > 0, return(price), stop("Prioridad incorecta. No se ha podido calcular el precio"))
}
# test again
price_calculator3(50, "urgente")
Error in ifelse(length(price) > 0, return(price), stop("Prioridad incorecta. No se ha podido calcular el precio")): Prioridad incorecta. No se ha podido calcular el precio
price_calculator3(50, "urgent")
[1] 1000
price_calculator3(-5, "normal")
Error in price_calculator3(-5, "normal"): Número de muestras incorrecto
price_calculator3(5.2, "normal")
Error in price_calculator3(5.2, "normal"): Número de muestras incorrecto
2.6 Quick exercise (III)
When creating functions, you can include any R functionality, including reading and parsing external data files. Let’s do a short exercise, within the context of molecular biology.
Can you write a function that convert R into a molecular biology dogma interpreter?
You can use the table genetic_code.csv, located in the data
folder that contains the conversion between all the nucleic acid sequence codons and its encoded amino acid in IUPAC one letter code.
Try it with some examples, including right and wrong codons, like “ATG”, “TAA”, “CAT”, or “AXG”.
# the molecular biology dogma with R
<- function(inputCodon) {
codon2aa <- read.csv2("data/genetic_code.csv", stringsAsFactors = FALSE)
code <- code$AA[code$Codon == inputCodon]
aa return(aa)
}# now let's try it
codon2aa("ATG")
[1] "M"
codon2aa("TAA")
[1] "*"
codon2aa("CAT")
[1] "H"
codon2aa("AXG")
character(0)
Can you also check the value of the variable aa
?
#'aa'??
print(aa)
Error in eval(expr, envir, enclos): object 'aa' not found
What just happened? There are a few things worth to comment here:
If the function cannot find the right value to return, the output is empty:
character(0)
The variable
aa
seems nonexistent! Variables defined in a function are only local variables and cannot be called outside the function.
However, proteins are made up of more than one amino acid, so it’d be great if the input could be a vector of several codons instead a single codon.
Can you add checkpoints and decide how to handle errors?
# version 2
<- function(codons) {
codon2aa_2 <- c()
aa <- read.csv2("data/genetic_code.csv", stringsAsFactors = FALSE)
code for (i in 1:length(codons)) {
# loop over all the elements of the vector 'codons'
stopifnot(`Uno o más de los codones no es correcto. No se ha podido traducir ;)` = codons[i] %in%
$Codon) #check for correct values
code<- code$AA[code$Codon == codons[i]]
aa[i]
}return(aa)
}
# let's try it
codon2aa_2(c("ATG", "TGA"))
[1] "M" "*"
codon2aa_2(c("ARG", "TGA"))
Error in codon2aa_2(c("ARG", "TGA")): Uno o más de los codones no es correcto. No se ha podido traducir ;)
codon2aa_2(c("ATG", "CAT", "CAT", "AAA", "TAA"))
[1] "M" "H" "H" "K" "*"
In this second example, aa
is not a numeric variable (=vector of 1 element), but a “normal vector”, so we need to define it before using in the loop. Also, we have used the function stopifnot()
to check for the codons. This function is a shortcut for a standard if{}
or ifelse{}
to check the codons and a stop if they are not found.
2.7 Quick exercise (IV)
Write a function that calculate your approximate age in months.
`?`(as.Date)
`?`(difftime)
# See some examples:
# https://www.geeksforgeeks.org/calculate-time-difference-between-dates-in-r-programming-difftime-function/
# quickest way I found
<- function(x) {
age if (!is.na(as.Date(as.character(x), format = "%Y-%m-%d"))) {
<- as.numeric(Sys.Date() - as.Date(x, format = "%Y-%m-%d")) #result in days
age print(paste("You are about", round(age/30, 1), "months old"))
else {
} stop("The input date must be quoted and in format YYYY-MM-DD")
}
}# test
age(18)
Error in age(18): The input date must be quoted and in format YYYY-MM-DD
age("2004-11-17")
[1] "You are about 231.5 months old"
# alternative using interactive input & difftime()
<- function(x) {
df if (missing(x)) {
# check for input date if missing, we can ask the user for input
<- readline(prompt = "Introduce your birthdate in the format YYYY-MM-DD: ")
x
}if (!is.na(as.Date(as.character(x), format = "%Y-%m-%d"))) {
<- difftime(Sys.Date(), as.Date(x, format = "%Y-%m-%d"))
df print(paste0("If you were born on ", x, ", you are about ", round(as.numeric(df)/30,
1), " months old."))
else {
} stop("The input date must be in format YYYY-MM-DD")
}
}df("2004-11-17")
[1] "If you were born on 2004-11-17, you are about 231.5 months old."
df(18)
Error in df(18): The input date must be in format YYYY-MM-DD
df()
Introduce your birthdate in the format YYYY-MM-DD:
Error in df(): The input date must be in format YYYY-MM-DD
3 References
R packages, https://r-pkgs.org/index.html
R programming for data science, https://bookdown.org/rdpeng/rprogdatascience/
Creating functions in Programming in R Swcarpentry, http://swcarpentry.github.io/r-novice-inflammation/02-func-R/index.html
Functions in R programming in GeeksforGeeks: https://www.geeksforgeeks.org/functions-in-r-programming/
Learn R by examples: https://www.learnbyexample.org/ and https://www.learnbyexample.org/r-functions/
4 Extra short exercises
Check the following links for more short exercises about functions in R.
- https://www.r-bloggers.com/2016/02/functions-exercises/
- http://mathcenter.oxford.emory.edu/site/math117/probSetRFunctions/
5 Review Exercises
Write a function that translate any nucleotide sequence into a protein sequence.
Hints:
You need to check for the number of nucleotides (i) and also the number of codons (j = i / 3) in the sequence.
You may use the function
substr()
(check the help) to divide the sequence into codons. To do so, you may use a loop that split the nucleotides in groups of three.You would add a
warning()
call when 1-2 nucleotides at the end of the sequence are not used.Use the function
readLines()
to read the file lacz.fa, which contains the nucleotide sequence of E. coli lacZ gene in fasta format, and obtain the translated sequence.
6 Session Info
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Madrid
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] report_0.5.7 ggplot2_3.4.4 xlsx_0.6.5 formatR_1.14
[5] knitr_1.44 webexercises_1.1.0
loaded via a namespace (and not attached):
[1] gtable_0.3.4 jsonlite_1.8.7 dplyr_1.1.3 compiler_4.3.0
[5] tidyselect_1.2.0 scales_1.2.1 yaml_2.3.7 fastmap_1.1.1
[9] R6_2.5.1 labeling_0.4.3 generics_0.1.3 htmlwidgets_1.6.2
[13] tibble_3.2.1 insight_0.19.6 munsell_0.5.0 pillar_1.9.0
[17] rlang_1.1.1 utf8_1.2.4 getwiki_0.9.0 xfun_0.40
[21] parameters_0.21.2 datawizard_0.9.0 cli_3.6.1 withr_2.5.1
[25] magrittr_2.0.3 digest_0.6.33 grid_4.3.0 rstudioapi_0.15.0
[29] rJava_1.0-6 effectsize_0.8.6 lifecycle_1.0.3 vctrs_0.6.4
[33] evaluate_0.22 glue_1.6.2 farver_2.1.1 xlsxjars_0.6.1
[37] fansi_1.0.5 colorspace_2.1-0 bayestestR_0.13.1 rmarkdown_2.25
[41] tools_4.3.0 pkgconfig_2.0.3 htmltools_0.5.6.1