Letís be honest: working with big databases is a lot of fun. Thereís something cool about dealing with tables that have hundreds of millions or billions of rows in them, loading huge amounts of data, building star and snowflake schemas for data warehouses/marts, optimizing query performance, and all that jazz. Yes, working with big databases is a lot of fun.
On the other hand, letís be honest: working with big databases is not a lot of fun. Thereís a lot of pain in dealing with tables that have hundreds of millions or billions of rows in them, waiting for huge amounts of data to be loaded only to have the load job toss its cookies and fail when itís 99% done, building special schemas that you wonder whether make any difference at all, and trying to figure out why just a simple two-way join query has been hanging for over an hour. Yes, working with big databases is not a lot of fun.
But hereís the thing: big databases arenít going away; in fact, thereís more of them than ever before and they just keep multiplying and getting bigger. I could quote a lot of industry statistics and TDWI reports on this fact, but that would just bore you because you likely know all this stuff already. You also know that there are a number of database vendors out there who can take on big data, but they also come with big price tags.
Surveys done by MySQL show that data warehousing is the fifth most common use case for MySQL users, so itís understandable that many try and get around the sticker shock of proprietary data warehouse DBís with MySQL. One of the challenges is scaling MySQL for very large data volume sizes or for complex/analytic queries with fast response times. Moreover, one of the drawbacks that most generic RDBMSís have is a row-based architecture, whereas most data warehouses and read-intensive query situations are selective column-based use cases.
Lately, the idea of column-oriented databases has been catching new wind in its sails. I say Ďnew windí because Sybase IQ did the column database thing when columns werenít that cool. But now there are a number of column-oriented databases that are really taking off, however most still are proprietary and cost quite a bit. But what happens when you marry a column-oriented database thatís modular in design, can scale both up and out, with MySQL and open source? You get Calpontís InfiniDB.
To be frank, Iím as skeptical as they come when claims are made that say how this or that different Ďtypeí of database is better than a general purpose RDBMS. When object-oriented databases came out, I scoffed. When OLAP/cube databases were hogging all the tech press, I smirked. I believed that neither could really challenge the good olí relational model, which isnít perfect by any means, but hey: it still has enough punch to relegate those two other challengers to the very boutique marketplace. That being the case, why should we entertain a relational database that is vertical vs. horizontally architected?
Because, for querying and processing medium to huge amounts of data, they actually work. And, many times, they work better at finishing the SQL query horse race than a general purpose RDBMS. That's why.
Before column databases came out, experienced data modelers creating data warehouses, marts, and reporting databases knew that by designing vertically partitioned tables that were subsets of an overall table, they could sometimes buy better physical query performance because the rows were shorter and thus I/O could somewhat be reduced. Using views, theyíd then tie the vertically partitioned tables together to form one logically-based table that could be referenced. And sometimes this practice worked. But oftentimes, it didnít make the grade because (1) even the vertically partitioned tables were still row-based and I/O wasnít always reduced by much; (2) it was really hard to predict what columns would be consistently queried together; (3) the practice didnít reduce the need for indexes (and in fact made the matter worse sometimes because you needed indexes to smartly do joins between the vertically partitioned tables in the view definitions); (4) like with the difficulty in predicting what columns would be queried together, it was difficult to do index design because userís query patterns could change at the drop of a hat and suddenly queries would do full scans instead of index searches. And on and on it went.
Column-oriented databases designed especially for analytics overcome these limitations by storing, managing, and querying data based on columns rather than rows. Because only the necessary columns in a query are accessed rather than entire rows, the column can act as the index and I/O activities as well as overall query response times can oftentimes be reduced. The end result is the ability to interrogate and return query results against either moderate or large amounts of data in a pretty darn efficient manner. On average, a row-based system can do up to 5-10x the physical I/O that a column-based database does to retrieve the same information. This is why column bigot Philip Howard (Research Director for Bloor Research) says: "Columns provide better performance at a lower cost with a smaller footprint: it is difficult to understand why any company seriously interested in query performance would not consider a column-based solution."
Nice quote, but should you really check out a column database? Can they really help you? Read onÖ
Whatís it Have?
Column databases can be a help for sure in the right situation, but thereís more thatís needed to crunch through large data volumes than just having a vertical data design. In addition to being column-oriented, the open source version of InfiniDB provides the following core feature set, which adds a number of needed items to the mix:
* Column-oriented architecture: mentioned already...
* Multi-threaded design: the engine is multi-threaded and uses todayís modern multi-CPU/core architectures to split a single query up across the available processing units.
* Automatic vertical and horizontal partitioning: Being column-oriented, the engine uses vertical partitioning to store data but it also uses a form of logical horizontal range partitioning that does not require special storage placements or design. Using both vertical and logical horizontal range partitioning allows it to reduce I/O in both directions (column and row). Both vertical and horizontal partitioning are automatically handled so you donít have to manually do anything.
* High user concurrency: There are no real limits as far as concurrency is concerned.
* High-speed data loader: To effectively load lots of data, a high-speed load utility is made available. Iíll show you how this works in just a second.
* DML support: In addition to supporting the high-speed bulk loading of data, the engine supports full DML (insert, update, delete) operations as well.
* Transactional support: ACID-compliant transactional support is provided in the engine. Commit, rollback, deadlock detection, all that good stuff.
* Crash recovery: the engine has full crash recovery capabilities.
* MVCC design: who doesnít like Ďsnapshot readí? I do, and the engine has it.
* No need for indexing: Because the engine uses both vertical and logical horizontal partitioning, there is no need for indexing. In essence, the data is the index. In addition, the engine automatically maintains a small, important structure called the Extent Map (explained later), which is used to reduce I/O. Besides no indexes, you also donít have to worry about dealing with materialized views, summary tables (of course, in MySQL, you donítí have these anywayÖ), or manual data partitioning.
* Logical Data compression: the engine doesnít have real physical/file compression yet (comingÖ) so for now, it uses transparent logical data compression to store data that benefits from being compressed.
* ALTER TABLE is supported (add, drop).
* Performance diagnostics: To help troubleshoot those queries from hell, InfiniDB supplies a nice tracing/diagnostic utility that helps figure out exactly what a query is doing above the normal EXPLAIN output.
InfiniDB also uses MySQL as its front end, so if youíre familiar with MySQL, youíll be right at home with InfiniDB. MySQL is used mainly in InfiniDB for security, SQL parsing, and initial query plan output.
In addition to the above open source product feature set, a commercial/Enterprise version of InfiniDB will be offered that has a pay-for scale out option where multiple machines can be used in a massive parallel processing (MPP) configuration. However, for many data warehousing, data mart, and analytic database tasks, the open source version of InfiniDB provides plenty of power to plow through big databases.
The architecture of InfiniDB is modular and consists of three basic components:
* User Module: The User Module is made up of a small MySQL instance and a number of InfiniDB processes that handle concurrency scaling. The User Module is also responsible for breaking down SQL requests and distributing the various parts to one or more Performance Modules that actually retrieve requested data from either memory caches or disk. Finally, the User Module assembles all the query results from the various participating Performance Modules to form the complete query result set that is returned to the user. Note there can more than one User Module in an InfiniDB Enterprise configuration so you can do concurrency scale out by adding more User Module nodes.
* Performance Module: The Performance Module is responsible for retrieving and managing data, and passing it back to the User Module(s) to satisfy query requests. The Performance Module selects data from disk and caches it in a shared nothing data cache that is part of the server on which the Performance Module resides. InfiniDB Enterprise accomplishes its MPP benefits by allowing the user to configure as many Performance Modules as they would like; each additional Performance Module adds more cache to the overall database as well as more processing power.
* Storage: InfiniDB can use either local storage or shared storage (e.g. SAN) to store data. A user can have everything configured and running on one server. In the commercial edition of InfiniDB, they can scale out with multiple servers and configure either a shared disk (currently supported) or shared nothing (coming in future release) architecture.