R is a programming language and software environment for statistical computing and graphics. The R language has become a de facto standard among statisticians for developing statistical software, and is widely used for statistical software development and data analysis. [Wikipedia]

After working with R as my main programming language for several years, I have met many challenges and searched for information many times. Through this web page I hope to collect some of my experience and share it with other programmers, new or experienced.

Working efficiently with R

  • A complete computing environment with source code, the R Console, workspace, history, files, plots, packages and help pages visible at the same time (cross platform, freeware) is available through RStudio.
  • For Windows users, Notepad++ and NppToR in combination with R give colour coded source code and code sourcing from the editor.
  • Linux users are often more familiar with Eclipse, which also has support for R (cross platform, freeware).

Web resources

  • The main resource for R, including R downloads, packages, documentation, task views and search engines, is The R Project for Statistical Computing and its subdomain CRAN (the Comprehensive R Archive Network).
  • Among the many blogs concerning R, the one to rule them all is R-bloggers. It is a vast resource for tips, tricks and code.

Quick programs

One of my favourite activities in front of the television in the evenings is making quick programs and efficient solutions for large or repetitive problems. An example can be found over at R-Forge, where I have used Rcpp to combine C++ and R in the Needleman-Wunsch package. It is one solution for computing the similarity between two sequences using a global or semi-global alignment. Using C++ ensures minimal overhead in the computations, and reducing a matrix problem to a double vector problem with extensive reuse of memory ensures a small memory footprint.
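The memory trick can be sketched in plain R. This is not the code from the R-Forge package (which is written in C++ through Rcpp) but a minimal illustration, assuming that the "double vector" idea means keeping only two rows of the scoring matrix at a time. The function name nw_score and the scoring values (match 1, mismatch -1, gap -1) are my own assumptions, and only the global alignment score is computed, not the alignment itself or the semi-global variant.

```r
# Global alignment score (Needleman-Wunsch) using two reusable row vectors
# instead of the full (length(a) + 1) x (length(b) + 1) scoring matrix.
# Hypothetical helper; the scoring defaults are illustrative assumptions.
nw_score <- function(a, b, match_score = 1, mismatch_score = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(y)

  prev <- gap * (0:n)      # row 0: prefixes of b aligned against gaps only
  curr <- numeric(n + 1)   # buffer that is overwritten for every row

  for (i in seq_along(x)) {
    curr[1] <- gap * i     # column 0: prefix of a aligned against gaps only
    for (j in seq_len(n)) {
      s <- if (x[i] == y[j]) match_score else mismatch_score
      curr[j + 1] <- max(prev[j] + s,        # diagonal: match or mismatch
                         prev[j + 1] + gap,  # from above: gap in b
                         curr[j] + gap)      # from the left: gap in a
    }
    # Swap the two buffers so no new memory is allocated per row
    tmp <- prev
    prev <- curr
    curr <- tmp
  }
  prev[n + 1]              # bottom-right cell of the implicit matrix
}

nw_score("GATTACA", "GATTACA")
nw_score("AC", "A")
```

Memory use grows with the length of the second sequence only, rather than with the product of the two lengths. The price is that the alignment path cannot be traced back from two rows alone.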

Quick functions

Though I usually work with wide data matrices having from a few hundred to tens of thousands of columns, I sometimes have to handle tall matrices. One such problem involved a little over 18 million milking records from more than 3 million cows. Associated with the cows were around 4 million health registrations that needed to be looked up in the cow table to assign additional attributes. After programming the lookup as a double for loop and testing it on a small subset of cows and registrations, I calculated that it would take around 29 days to complete the whole lookup on my fairly quick computer. After scratching my noodle and searching the web I stumbled upon the match() function. This does exactly what I needed: for each element in the first vector it returns the index of the first exact match in the second vector, or NA if there is none. The difference from using the double loop was a reduction in time from an estimated 29 days to 0.89 seconds for the whole job.
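The lookup can be illustrated on a toy example. The data frames and column names below (cows, health, cow_id, breed, diagnosis) are made up for the illustration and are not the original data. The double loop is included only to show that match() gives the same answer.

```r
# Hypothetical toy tables; all names and values are illustrative assumptions.
cows <- data.frame(
  cow_id = c(101L, 102L, 103L, 104L),
  breed  = c("A", "B", "C", "A"),
  stringsAsFactors = FALSE
)
health <- data.frame(
  cow_id    = c(103L, 101L, 999L, 103L),   # 999 is not in the cow table
  diagnosis = c("mastitis", "ketosis", "mastitis", "lameness"),
  stringsAsFactors = FALSE
)

# Slow approach: compare every registration with every cow
breed_loop <- rep(NA_character_, nrow(health))
for (i in seq_len(nrow(health))) {
  for (j in seq_len(nrow(cows))) {
    if (health$cow_id[i] == cows$cow_id[j]) {
      breed_loop[i] <- cows$breed[j]
      break   # stop at the first match, as match() does
    }
  }
}

# Fast approach: one vectorised call gives the row of each cow in the table
idx <- match(health$cow_id, cows$cow_id)
health$breed <- cows$breed[idx]   # NA index gives NA for unknown cows

identical(health$breed, breed_loop)
```

Registrations with no matching cow get NA, so they are easy to find afterwards with is.na(idx).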
