R an SQL Server are a match made in heaven. You don't need anything special to get started beyond the basic instructions. Once you have jumped the hurdle of reliably and quickly transferring data between R and SQL Server you are ready to discover the power of a relational database when when combined with statistical computing and graphics.
In this article I will describe a way to couple SQL Server together with R, and show how we can get a good set of data mining possibilities out of this fusion. First I will introduce R as a statistical / analytical language, then I will show how to get data from and to SQL Server, and lastly I will give a simple example of a data analysis with R.
What is R and what noticeable features does it have
R is an open source software environment which is used for statistical data analysis. All operations are performed in memory, which means that it is very fast and flexible as long as there is enough memory available.
R does not have a storage engine of its own other than the file system, however it uses libraries of drivers to get data from, and send data to, different databases.
It is very modular in that there are many libraries which can be downloaded and used for different purposes. Also, there is a rapidly growing community of developers and data scientists which contribute to the library development and to the methods for exploring data and getting value from it.
Another great feature is that it has built-in graphical capabilities. With R it takes couple of lines of code to import data from a data source and only one line of code to display a plot graph of the data distribution. An example of this graphical representation will be given shortly. Of course, aside from the built-in graphics, there are libraries which are more advanced in data presentation (ggplot2, for example) and there are even libraries which enable interactive data exploration.
For more details on R features and on how to install it, refer to the R Basics article, which was recently published on Simple-talk.
Connecting to SQL Server from R
This part assumes that the reader has already gained some familiarity with the R environment and has the R and RStudio installed.
As mentioned, R does not have its own storage engine, but it relies on other systems to store the analyzed data. In this section we will go through some simple examples on how to couple R with SQL Serverís storage engine and thereby read data from, and write data to, SQL Server.
There are several options to connect to SQL Server from R and several libraries we can use: RODBC, RJDBC, rsqlserver for example. For the purpose of this article, however, we will just use the RODBC package
Letís get busy and setup our R environment.
In order to get the connectivity to SQL Server working, first we need to install the packages for the connection method and then we need to load the libraries.
To install and load the RODBC package, do the following:
Open the RStudio console (make sure the R version is at least 3.1.3: If it isnít, then use the updateR() function)
Run the following command: install.packages("RODBC")
Run the following command: library(RODBC)
Note: the R packages are usually available from the CRAN site, and depending on the server setup, they may not be directly accessible from the R environment, but instead it may be needed to be downloaded manually and installed manually. Here is the link to the package page: RODBC: http://cran.r-project.org/web/packages/RODBC/index.html
Exploring the functions in a package
R provides useful ways of exploring the functions of the R packages, If, for example, we wanted to list all functions in a specific package we would use a function similar to this:
lsp <- function(package, all.names = FALSE, pattern)
package <- deparse(substitute(package))
pos = paste("package", package, sep = ":"),
all.names = all.names,
pattern = pattern
And then we would call it like this:
Typing ??RODBC at the command prompt will bring out some help topics about the RODBC package.
Further, typing ? before any of the functions will bring out the help information about a function. For example,
For the purpose of this exercise, we will be using the AdventureWorksDW database (it can be downloaded from here).
Letís say we are intereste
d in calculating of the correlation coefficient between the annual sales and the actual reseller sales for each reseller. First we will create the following view:
CREATE VIEW [dbo].[vResellerSalesAmountEUR]
SELECT fact.ResellerKey ,
SUM(fact.SalesAmount) / 1000 AS SalesAmountK ,
dimR.AnnualSales / 1000 AS AnnualSalesK
FROM [dbo].[FactResellerSales] fact
INNER JOIN dbo.[DimReseller] dimR ON fact.ResellerKey = dimR.ResellerKey
WHERE fact.CurrencyKey = 36 -- Euro
GROUP BY fact.ResellerKey ,
I have divided the Sales and the Annual Sales amounts by a 1000, so it is easier to work with the numbers later on. We will go in details in the statistical analysis later on; letís start by getting connected.
First we need to create a variable with our connection string (assuming we have already loaded the library by running library(RODBC) ):