Tuesday, December 14, 2010

Why Use R?

I use R very frequently and take for granted much that it has to offer.  I forget how R is different from similar tools, so I have trouble communicating the benefits of using R.  The goal of this post is to highlight R's main strengths, but first... my story.

How I got started with R

I was introduced to R while I was working as a Research Analyst at the Federal Reserve Bank of St. Louis.  I wanted to do statistical analysis at home but the tools I used at work (GAUSS and SAS) were expensive, so I started doing my analysis in Excel.

But as my analysis became more complex, the Excel files became large and cumbersome.  The files also did not document my thought process, which made it difficult to revisit analysis I had started several months earlier.  I asked my fellow analysts for advice, and one introduced me to R and the book Modern Applied Statistics with S.  Thus began my autodidactic journey with R.

Why should you use R?

R is the leading tool for statistics, data analysis, and machine learning.  It is more than a statistical package; it’s a programming language, so you can create your own objects, functions, and packages.
Speaking of packages, there are over 2,000 cutting-edge, user-contributed packages available on CRAN (not to mention Bioconductor and Omegahat).  To get an idea of what packages are out there, take a look at the CRAN Task Views, which group packages by topic.  Many packages are submitted by prominent members of their respective fields.
Like any program, an R script explicitly documents each step of your analysis, which makes the analysis easy to reproduce or update later.  That means you can quickly try many ideas or correct issues.
You can easily use it anywhere.  It runs on Windows, Mac OS X, Linux, and other Unix-like systems, so you can use it on essentially any operating system you are likely to encounter.  And it's free, so you can use it at any employer without having to persuade your boss to purchase a license.
Not only is R free, but it's also open-source.  That means anyone can examine the source code to see exactly what it’s doing.  It also means that you, or anyone else, can fix bugs or add features rather than waiting for a vendor to do so, at its discretion, in a future release.
R allows you to integrate with other languages (C/C++, Java, Python) and enables you to interact with many data sources: ODBC data sources (including Excel and Access) and files from other statistical packages (SAS, Stata, SPSS, Minitab).
Explicit parallelism is straightforward in R (see the High Performance Computing Task View): several packages allow you to take advantage of multiple cores, either on a single machine or across a network.  You can also build R against a custom, optimized BLAS library to speed up linear algebra.
R has a large, active, and growing community of users.  The mailing lists provide access to many users and package authors who are experts in their respective fields.  Additionally, there are several R conferences every year.  The most prominent and general is useR!.  Finance-related conferences include the Rmetrics Workshop on Computational Finance and Financial Engineering in Meielisalp, Switzerland, and R/Finance: Applied Finance with R in Chicago, USA.
I hope that's a helpful overview of some benefits of using R.  I'm sure I have forgotten some things, so please add them in the comments.


Phil Rack said...

I really like this post because you show a progression of stat tools you have used. I think that's important for those who are considering using a new product. Everyone wants to see some common ground and how they've evolved their analytics stack over time.

Unknown said...

One thing I can't seem to figure out in R is how you would use multiple timeframes. Let's say you want to look at both daily and intraday data as part of the same study (daily to trigger entry, and intraday to figure out if your limit or stop is hit first). The only way I can figure out how to do this is using loops to keep daily and intraday "synchronized" - and that seems to slow everything to a crawl. Is there an easy solution to this that maintains R's high speed time series without resorting to loops? I'm admittedly an R newbie. Thanks!

Alex Rad said...

R has many idiosyncrasies and it does not optimize well for certain matrix operations. The documentation is also quite poor.

In terms of performance, numpy+python can knock down the best optimized programs. And in terms of comprehension, numpy+python wins again.

Joshua Ulrich said...

Phil, thanks for the comment. I'm glad you liked the post.

Scott, I'm not sure I understand your question. You will probably get some good answers if you send an example of what you want to the R-SIG-Finance mailing list (follow the posting guide to get the best answers).

Alex, which matrix operations does R not optimize well? Using which BLAS? How is the documentation poor? Can you cite some performance comparison examples? How long have you used R? Python? Thanks for the comment.