If you've accepted Codd’s rules and you lead a purposeful life, then your concept of data is relatively simple compared to the heathens who still worship at the altar of the flat file. Your schemas are BCNF or better and your SQL isn't polluted with RBAR; just JOINs, DRI, and a view or two.
But, there's more to data than the RM (Relational Model). If data is your addiction of choice, I've got something to tempt you... statistics, embedded with relational databases. A currently prominent meme is “data science", and the publishers are printing to a fare-thee-well. It happens that there can be an integration of the base science of mathematical statistics and relational databases to make a more integrated manifestation of “data science”. It was recently reported that R and SQL are the most-used languages of those practicing data science. In practice, that means the analyst dumps some rows from a database, imports them into a session with the statistical package, generates some output, massages this for use by a client program, puts the output in some place, rinse, repeat. “There are more things in heaven and earth, Horatio...”
My goal is to impel you to consider data science as an extension of the relational database, and even to explore the world of inferential statistics. By ‘inferential’, I mean the kind of statistics that support predictions, as opposed to descriptive statistics, which merely tell you the state of the world you're looking at. I offer not a "how to", but a "can do... and you can do even more".
The way to achieve this is by using statistical packages, or ‘stat packs’. Stat packs are to the world of the mathematical statistician what DDL generators are to the professional database developer—a means for the ignorant to wreak great havoc. Having been party to ISO-9000 and Six Sigma implementations, I've seen the results first hand. I don't follow their lead, anointing acolytes as statistician-lites; Instead, I encourage readers to invest some time to really understand what goes on in the field of predicting from data.
And if you're going to do predictions from data, do it from the database... I'm not talking about vanilla stat packs talking *to* some database, for they all can do that: Rather, I’m talking about the database calling stat pack routines as it does any other engine function; different, and very intriguing .
Later in this article, I'll be providing an example scenario using PostgreSQL (more commonly known as Postgres), since it has integration support with a stat pack. The stat pack, which has been a decade-long overnight success, is R. The name is just R. It's an open source alternative to S, from the 1970s. Kudos to Joe Conway for writing PL/R, the R procedural language for Postgres. Note that I'm not going to advance tutorials on the constituent parts of the example (mathematical statistics, PostgreSQL, PL/R and R); there are many and the bibliography at the end of this article contains links to resources.
But before that, I’ll begin with some background information on the tools we’ll use.
All of the four most-known stat packs today have roots at least back to the 70s:
SAS (originally, ‘Statistical Analysis System’) is the big “kahuna” of the field. SAS is expensive, and by design, intended to be a total data management experience. Because of a decision early on to make the product a one-stop data system, it is not likely to be integrated into any particular RDBMS. SAS Institute Inc is a private company that is highly unlikely to be acquired.
SPSS (originally, ‘Statistical Package for the Social Sciences’) was bought by IBM, and is being integrated. I expect to see it being promoted for use directly in DB2 at some point. SPSS is known to be more SQL friendly than SAS.
Both SAS and SPSS emerged in the late 1960's on mainframes (not just IBM; there really were others back then), and (along with BMDP Statistical Software), were used primarily in academic research. As an aspiring econometrician, I used them all.
Minitab is the stat pack of choice for ISO-9000 and Six Sigma contract instructors. It provides Shewhart control charts and the like. It is not what we're looking for in our example.
There isn't a SQL Server-centric stat pack that I'm aware of, although XLeratorDB from WestClinTech does provide some support (I've not used it).
So, off we go to R. R is an open source implementation of S (which is commercially offered as S-Plus), which was first developed at Bell Labs in the mid-1970s. What makes R so popular with the data analysis crowd is a very large, and growing, set of functions (available as entities called packages) written mostly in R and easily available. The main repository is CRAN. This is one case where "free" hasn't been the impetus for adoption; it's the large community of users/developers.
The principle difference between R and the others can be summed up, thus: SAS/SPSS/etc. are statistical command languages, while R is a statistical programming language. This difference gives R an advantage when integrating to a RDBMS.
Historically used by engineers and scientists, its host language tends strongly to Python (sorry, C#-users). R can read SQL databases with RODBC and RJDBC R-side drivers. In addition, there are R-specific database drivers, but for (mostly) open source databases: RMySql, RPostgreSQL, RSQLite, and ROracle; no RSQLServer as yet (sorry, SQL Server users). The database drivers allow more or less direct read/write of the database from R under the R-specific protocol DBI. Connecting to the database from an R client session is the usual method of integrating. In our example, we will want to run R functions against the data from the database.
There are a host of online tutorials and suchlike; some are listed in the bibliography, along with the texts I've used. There are a couple of other R books out, but not all that many, curiously.
Here is a review of the code base of R and the contributions: How much of R is written in R: Part 2 Contributed Packages.