A brief introduction to R

R is a general-purpose programming language, particularly suited to analyzing large datasets. It is often called the "R environment" because it was created as an interactive shell for data exploration. On Linux and Unix systems this interactive shell is started by typing R into the terminal. One can also run R programs ("scripts") non-interactively with the Rscript command.

R type system

From the user's point of view, R has a particularly simple type system. The atomic types most often encountered are vectors of integers, doubles, logical values or character strings, and everything else is built up using lists. For example, a table (referred to as a data.frame in R) is simply a list of vectors of the same length. Lists look like linear arrays and let you retrieve entries by index. The only difference from atomic vectors is that lists hold values of any type, including another list.
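A minimal sketch of these building blocks, using only base R. The variable names are made up for illustration.

```r
# Atomic vectors: every element has the same type
i <- c(1L, 2L, 3L)              # integer
d <- c(1.5, 2.5, 3.5)           # double
s <- c("a", "b", "c")           # character

# A list can hold values of any type, including another list
l <- list(i, d, s, list(1, "nested"))

# A data frame is a list of equal-length vectors
df <- data.frame(id=i, value=d, label=s)

print(typeof(i))                # "integer"
print(typeof(df))               # "list"
print(is.list(df))              # TRUE
print(l[[4]][[2]])              # "nested"
```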

Building everything on vectors is quite brilliant because vector operations such as addition or multiplication do not need explicit loops and are performed by fast internal R functions.
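As a small illustration, the sketch below computes the same result with a single vector expression and with an explicit loop. The vectors x and y are arbitrary example data.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Element-wise arithmetic on whole vectors, no loop needed
v <- x*y + 1

# The same computation written as an explicit loop
w <- numeric(length(x))
for (k in seq_along(x)) {
        w[k] <- x[k]*y[k] + 1
}

print(v)                        # 11 41 91 161
print(identical(v, w))          # TRUE
```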

Every object, such as a vector or list, can also have a set of named values called "attributes". These describe some special properties of the object, such as its class name, array dimensions or the names of its entries.
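The following sketch shows attributes in action with base R functions attr() and attributes(). The objects m, v and df are illustrative.

```r
m <- 1:6

# Setting the "dim" attribute turns the vector into a 2x3 matrix
attr(m, "dim") <- c(2L, 3L)

# names() is a shortcut for reading and writing the "names" attribute
v <- c(10, 20, 30)
names(v) <- c("a", "b", "c")

# The class name of a data frame is stored as an attribute too
df <- data.frame(x=1:2)

print(attributes(m))            # a list with the entry dim
print(attributes(v))            # a list with the entry names
print(attr(df, "class"))        # "data.frame"
```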

A list or vector with the names attribute allows you to retrieve entries not only by numeric index but also by a character string matching one of the values stored in names. Such a named list behaves like an associative array or a dictionary.
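A short sketch of lookup by name, with made-up names and values:

```r
v <- c(alpha=1.5, beta=2.5, gamma=3.5)
print(v[2])                     # by numeric index
print(v["beta"])                # by name, same entry

l <- list(n=10, label="test", values=1:3)
print(l[["label"]])             # "test"
print(l$values)                 # $ is a shorthand for lookup by name

# Assigning to a new name adds an entry, as in a dictionary
l[["new"]] <- TRUE
print(names(l))                 # "n" "label" "values" "new"
```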

Other system types, such as those that describe functions, are primarily for internal use.

R syntax

R uses a C-like syntax, with the primary difference being the assignment operator, which is usually written <- instead of =. This is a historical convention, but not a very surprising one: R has some symbolic computation facilities, and symbolic computation languages often have several assignment operators with different properties, denoted by different symbols. R is no exception: = also works as assignment at the top level but is mainly used to name function arguments, while <<- assigns to a variable in an enclosing, typically global, environment. I only need the latter very rarely and then prefer to use the R assign() function instead.
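A minimal sketch of how assignment interacts with scope, using <- inside a function, the <<- operator and assign(). The names counter, total, bump_local and bump_enclosing are made up for illustration.

```r
counter <- 0

bump_local <- function() {
        # <- creates a new variable local to the function call
        counter <- counter + 1
        return(counter)
}

bump_enclosing <- function() {
        # <<- modifies the variable found in an enclosing environment
        counter <<- counter + 1
        return(counter)
}

r1 <- bump_local()
print(counter)                  # still 0
r2 <- bump_enclosing()
print(counter)                  # now 1

# assign() makes the target environment explicit
assign("total", 100, envir=globalenv())
print(get("total", envir=globalenv()))  # 100
```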

Let's look at an example of R code:

N <- 10

f2 <- function(x, y=1) {
        return(x*x + y)
}

print(f2(1:N))

We see an assignment of 10 to the variable N. A function is created with the function keyword and then simply assigned to a variable. It is perfectly possible to override system functions this way, but this is not good practice.

R does not require a semicolon at the end of every line, though you can use one to separate statements.

A colon is used to produce sequences of integers such as 1:N. If you need more sophisticated sequences, use the built-in function seq().
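A few sequences as a sketch. The arguments by and length.out belong to base R seq(), and the variable names are illustrative.

```r
a <- 1:5
b <- seq(0, 1, by=0.25)
d <- seq(10, 1, by=-3)
e <- seq(0, 1, length.out=3)

print(a)                        # 1 2 3 4 5
print(b)                        # 0.00 0.25 0.50 0.75 1.00
print(d)                        # 10 7 4 1
print(e)                        # 0.0 0.5 1.0

# seq_len(n) is safer than 1:n in loops because 1:0 counts down to 0
print(length(seq_len(0)))       # 0
print(1:0)                      # 1 0
```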

We also see that vector operations allow adding (or multiplying) vectors of different lengths. What happens is that the shorter vector is repeated ("recycled") to the length of the longer vector. So, in the code above, y is promoted to a constant vector of ones.
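Recycling can be seen directly in a small sketch with arbitrary example vectors:

```r
x <- 1:6

# A single number is recycled to the full length, as y is in f2()
p <- x + 1
print(p)                        # 2 3 4 5 6 7

# A vector of length 2 is repeated three times
q <- x * c(1, 10)
print(q)                        # 1 20 3 40 5 60

# R warns when the longer length is not a multiple of the shorter one
r <- suppressWarnings(x + c(1, 2, 3, 4))
print(r)                        # 2 4 6 8 6 8
```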

R libraries

R has many libraries, or packages, that extend its functionality. There is a convenient system for installing packages from a central repository. The default repository is called CRAN. There is also Bioconductor, which focuses on bioinformatics.

To use the Gravitational wave atlas or Gaia data you need the RMVL package. You can install it with the following command:

install.packages("RMVL")

Many other packages are available, especially for statistical analysis.

R scripts

R source code is typically stored in files with the extension .R. These are called "R scripts" because they can be executed directly, without a separate compilation step. You can run a script from within R like this:

source("view_summary.R")

pdf("ul_plot.pdf", width=8, height=8)
plot_gw(570, 600, "snr")
dev.off()

Here we source one of the examples from the gravitational wave atlas and then use the function plot_gw() defined in that example to make a plot.

As you explore data, R keeps a history of the commands you entered, which can be displayed with history(Inf). These commands can then be pasted into a new .R text file. You now have an R script that repeats your analysis.

Data representation in R

There are several ways to represent data in R:

Vectors, arrays and matrices: these store data of the same type in a linear format, one item after another. Both matrices and arrays are also vectors, with their dimensions stored as an attribute. Matrices are always two-dimensional, while arrays can have as many dimensions as you wish, which is convenient for representing tensors. You find the length of a vector, array or matrix with the function length(), and you retrieve the dimensions of an array or matrix with the function dim(), which returns a vector of integers. The product of those integers is equal to the length. You can transpose matrices with the function t() and multiply them with the operator %*%. There are many linear algebra functions, in particular the singular value decomposition svd() and the linear solver solve(), which is also used to invert matrices.
Data frames: these are lists of vectors of equal length. Crucially, the vectors can have different types, so some columns can serve as labels or indices for the others. There are many functions that take advantage of this feature, most notably merge(), aggregate() and reshape(). A data frame is very much like a table in SQL. There are convenient functions to plot data from data frames, most notably xyplot() from the lattice library. There are analysis functions as well; for example, lm() and glm() perform regression.
MVL objects: an MVL object is simply an R object stored in an MVL file and accessed via the RMVL library. Vectors, arrays, matrices, data frames and lists can be stored this way. Storing data in an MVL file speeds up loading and allows sharing between R processes. If you have an MVL object X that is a matrix or data frame, you can retrieve some of its rows with X[10:20,]; other forms of indexing are supported as well. Many functions, such as lm(), require a pure R object, which you can obtain with the function mvl2R(). The MVL file format was designed to store very large datasets.
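The matrix and data frame functions mentioned above can be tried out with a small sketch like the one below. It uses only base R. The data and the names A, B, obs, info and so on are made up for illustration, and MVL objects are left out.

```r
# A 2x3 matrix is a vector of length 6 with dimensions c(2, 3)
A <- matrix(1:6, nrow=2)
print(length(A))                # 6
print(dim(A))                   # 2 3

# Transpose and matrix product give a 2x2 symmetric matrix
B <- A %*% t(A)

# solve() with one argument inverts, with two it solves B %*% z = b
Binv <- solve(B)
z <- solve(B, c(1, 2))
print(all.equal(B %*% Binv, diag(2)))   # TRUE

# Singular values of A
print(svd(A)$d)

# A three-dimensional array
T3 <- array(0, dim=c(2, 3, 4))
print(prod(dim(T3)) == length(T3))      # TRUE

# Two data frames sharing the column "group"
obs <- data.frame(id=c(1, 2, 3, 4), group=c("a", "a", "b", "b"),
        x=c(1, 2, 3, 4), y=c(2.1, 3.9, 6.2, 7.8))
info <- data.frame(group=c("a", "b"), site=c("north", "south"))

# merge() joins on the shared column, much like a SQL join
joined <- merge(obs, info, by="group")

# aggregate() computes per-group summaries, much like GROUP BY
means <- aggregate(y ~ group, data=obs, FUN=mean)
print(means)

# lm() fits a linear model using columns of the data frame
fit <- lm(y ~ x, data=obs)
print(coef(fit))
```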

R help system

R includes a built-in help system that is accessed via the help() function or by typing ? followed by the function name. For example, ?lm provides documentation for the linear regression function. You can search the index of all entries with two question marks, like this: ??bessel. Installing a library adds its entries to this index, and loading it makes its documentation available to ? as well.

You can view the source code of R functions to understand how they work. For example, print(lm) will print the source of the function lm(), which performs regression. This is possible because in R you create a named function by assigning code to a variable.
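The same mechanism works for your own functions. The sketch below reuses the function f2() from the syntax example and inspects it with base R functions formals() and body().

```r
f2 <- function(x, y=1) {
        return(x*x + y)
}

# Printing a function shows its source
print(f2)

# The pieces of a function are ordinary R objects
print(names(formals(f2)))       # "x" "y"
print(body(f2))

# Library functions such as lm() are objects of the same kind
print(class(lm))                # "function"
```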

R plots

R offers several libraries for making plots. The one that is included by default has functions plot(), lines(), image(), legend() and many others. It is great for constructing a plot one element at a time, and there are many parameters to adjust to make the plot publication quality.
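Here is a minimal sketch of building a plot one element at a time with plot(), lines() and legend(). The data, the colors and the output file name are arbitrary choices, and the file is written to a temporary directory.

```r
x <- seq(0, 2*pi, length.out=50)

out <- file.path(tempdir(), "base_plot.pdf")
pdf(out, width=8, height=8)

# plot() starts the figure and draws the first element as points
plot(x, sin(x), xlab="x", ylab="value", main="Sine and cosine")

# lines() adds a curve to the existing plot
lines(x, cos(x), col="red", lty=2)

# legend() describes both elements
legend("topright", legend=c("sin(x)", "cos(x)"), col=c("black", "red"),
        pch=c(1, NA), lty=c(NA, 2))

dev.off()
```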

I also use the library lattice, which is very convenient when exploring data, especially in data frames. A key function from lattice is xyplot(). The trade-off for this ease of exploration is that it is more difficult to force a specific plot layout.
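A small sketch of xyplot() on a data frame, assuming the lattice package is installed (it ships with most R distributions). The data frame and the output file name are made up for illustration.

```r
library(lattice)

df <- data.frame(x=rep(1:10, 2),
        group=factor(rep(c("a", "b"), each=10)))
df$y <- ifelse(df$group == "a", df$x, 2*df$x)

out <- file.path(tempdir(), "lattice_plot.pdf")
pdf(out, width=8, height=8)

# The formula y ~ x | group draws one panel per value of group.
# In a script the lattice object must be printed explicitly.
print(xyplot(y ~ x | group, data=df, type=c("p", "l")))

dev.off()
```

At the interactive prompt the print() call is not needed, because R prints the result of xyplot() automatically.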

For visualizing large datasets, you can plot millions of points quickly by specifying a fast-rendering plotting character, such as pch="+" or pch=".", as an additional named argument to plot() or xyplot().
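As a sketch, the code below plots one million simulated points with pch=".". The random data and the output file name are arbitrary, and the actual speed depends on the graphics device.

```r
n <- 1e6
x <- rnorm(n)
y <- x + rnorm(n)

out <- file.path(tempdir(), "large_plot.pdf")
pdf(out, width=8, height=8)

# pch="." draws each point as a single small dot
plot(x, y, pch=".", xlab="x", ylab="y")

dev.off()
```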

Don't forget to use help() to discover more plotting functions and parameters!