The R programming language is the result of a collaborative effort with contributions from all over the world. Initially written by Robert Gentleman and Ross Ihaka of the University of Auckland in the 1990s, it is popular and in wide use. It has been featured in the New York Times. Even estimates that are several years old have put the number of users above a quarter of a million, and the current number is certainly much higher. One popular LinkedIn group has 30,000 members. Polls on KDNuggets.com have placed its popularity even higher than the two players that have dominated statistical computing for decades: SPSS Statistics and SAS. Its open source nature, and its corresponding price, are extremely attractive to academics and students. Critically, it is also very powerful.
So what’s the catch? Even its fans admit to a learning curve. It is a programming language, so there is no built-in graphical user interface to get you quickly up to speed. Software environments have been created to support working in R, and many of them are popular, but there is still some effort to be spent on getting started. On the upside, it is widely recognized as having fine graphics capability, and measured solely in terms of sheer volume, no commercial package can compete with the number of algorithms and methods available in R.
But there is a downside. Volunteers write the R ‘packages’, so some are better than others, and novices might struggle to know where to start. Documentation is also better for some packages than for others. The most popular packages get attention, and with that comes quality and documentation. However, there is no escaping the fact that you have to be self-reliant to a larger degree than you would be as a consumer of a commercial package like SPSS Modeler.
There is an amazing feature in SPSS Modeler 16 that takes the guesswork out of the decision: use both! Modeler 16 has three new nodes that allow you to use R for transformation, output, or modeling. Modeler already has great ETL capability, so why use R for that? It would be much more labor intensive. Modeler allows you to combine multiple data tables in minutes, for example. The advice is to let each contribute its strengths.
Modeler is the most comprehensive data mining workbench among the commercial packages, but R has many more algorithms. If there is a time-tested, well-documented algorithm in R that you want, you can simply add it to Modeler. The older option of a so-called CLEF node (Component-Level Extension Framework) required more computer programming experience. Early indications are that development with the R option is at least an order of magnitude faster. With lots of functionality already written in R, what would have taken days or even weeks can now be made available for use in Modeler in a matter of hours.
Modeler also allows you to do simple exploration of your data very, very quickly. But what about presentation-quality graphics? Modeler graphics are really for data exploration, not for final presentation. R graphics, on the other hand, are famous for their quality and for the comprehensiveness of their options. Why not simply add additional graphics capability to Modeler? It has only been a few months, but my colleagues have already incorporated R into how they think about and use Modeler. QueBIT is even starting to offer training on the subject. It is easier than you think.