Why Should I Check Out a MySQL-Based Column Database?

By: Robin Schumacher

Some technologies arrive on the information technology landscape and stay, providing long-lasting benefits, whereas others are short-term fads that ultimately disappear because the value they supplied was too niche-oriented and/or they were quickly supplanted by a better technology. Recently, articles, blogs, analyst reports, and other media outlets have been noting the rise and usage of column-oriented databases in data warehousing, analytics, and other business intelligence/read-intensive situations. And on the MySQL front, there are a couple of column DBs now available for you to use.

Are column-oriented databases a technology that is destined to stay and provide long-term benefits, or will they be relegated to the forgotten pile of software that came on the scene quickly and then disappeared?

Let’s look at three key questions that are consistently asked about column-oriented databases and see how the technology stacks up:
1. How do column-oriented databases work?
2. Do column-oriented databases really make a difference?
3. What learning curve (application/database development, etc.) is involved with column-oriented databases?

How Do Column-Oriented Databases Work?
The legacy relational databases offered today were, and still are, designed primarily to handle online transaction processing (OLTP) workloads. A transaction (e.g., an online order for a book through Amazon or another Web-based book dealer) typically maps to one or more rows in a relational database, and all traditional RDBMS designs are based on a per-row paradigm. For transaction-based systems, this architecture is well suited to handling incoming data.

However, for applications that are very read-intensive and selective in the information being requested, the OLTP database design isn’t a model that typically holds up well. Whereas transactions are row-based, most database queries are column-based. Inserting and deleting transactional data are well served by a row-based system, but selective queries that are only interested in a few columns of a table are handled much better by a column-oriented architecture. On average, a row-based system does 5-10x the physical I/O that a column-based database does to retrieve the same information. Taking into account that physical I/O is typically the slowest part of a query, and that an analytical query typically touches significantly more rows of data than a typical transactional database operation, the performance gap between row-oriented and column-oriented architectures often widens as the database grows.
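As a rough illustration of the I/O argument above, here is a minimal back-of-the-envelope sketch in Python. Everything in it is an assumption made for the example: the table has ten hypothetical 8-byte columns, the query needs two of them, and compression, indexes, caching, and page-level reads are ignored. It models only the basic idea that a row store fetches whole rows while a column store fetches just the requested columns; it does not reproduce or validate the 5-10x figure quoted above.

```python
# Hypothetical table: ten columns of 8 bytes each (assumed widths, not from the article).
COLUMN_WIDTHS = {f"col{i}": 8 for i in range(10)}


def row_store_bytes(num_rows, needed_columns):
    """A row store fetches whole rows, whichever columns the query needs."""
    return num_rows * sum(COLUMN_WIDTHS.values())


def column_store_bytes(num_rows, needed_columns):
    """A column store fetches only the columns the query names."""
    return num_rows * sum(COLUMN_WIDTHS[c] for c in needed_columns)


def io_ratio(num_rows, needed_columns):
    """How many times more bytes the row layout reads for the same query."""
    return row_store_bytes(num_rows, needed_columns) / column_store_bytes(
        num_rows, needed_columns
    )


if __name__ == "__main__":
    needed = ["col0", "col1"]  # the query touches 2 of the 10 columns
    for rows in (1_000_000, 100_000_000):
        extra = row_store_bytes(rows, needed) - column_store_bytes(rows, needed)
        print(f"{rows:>11,} rows: ratio {io_ratio(rows, needed):.1f}x, "
              f"{extra:,} extra bytes read by the row layout")
```

In this simplified model the ratio stays constant as the table grows, but the absolute number of extra bytes read grows with the row count, which is one way to picture the widening gap in elapsed time.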

To get around their selective query inefficiencies, row-based RDBMSs utilize indexing, horizontal partitioning, materialized views, summary tables, and parallel processing, all of which can provide benefits for intensive queries, but each comes with its own set of drawbacks as well. For example, while indexes can certainly help queries complete faster in some cases, they also require more storage, impede insert/update/delete and bulk load operations (because the indexes must be maintained as well as the underlying table), and can actually degrade performance when they become heavily fragmented. Moreover, in business intelligence/analytic environments, the ad-hoc nature of the queries makes it nearly impossible to predict which columns will need indexing, so tables end up either over-indexed (which causes load and maintenance issues) or under-indexed, in which case many queries run much slower than desired.
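The index maintenance cost described above can be sketched with a toy model. The `IndexedTable` class below is hypothetical and written only for this illustration: it uses sorted Python lists as a stand-in for real index structures such as B-trees, and it simply counts how many structures each insert has to modify. It is not how any particular RDBMS implements indexes.

```python
import bisect


class IndexedTable:
    """Toy table with sorted-list secondary indexes (a stand-in for B-trees)."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: [] for col in indexed_columns}
        self.writes = 0  # number of structures modified by inserts

    def insert(self, row):
        row_id = len(self.rows)
        self.rows.append(row)
        self.writes += 1  # the write to the table itself
        for col, index in self.indexes.items():
            # Every index must also be kept in sorted order on each insert.
            bisect.insort(index, (row[col], row_id))
            self.writes += 1

    def lookup(self, col, value):
        """Find matching rows through the index instead of scanning the table."""
        index = self.indexes[col]
        pos = bisect.bisect_left(index, (value, -1))
        matches = []
        while pos < len(index) and index[pos][0] == value:
            matches.append(self.rows[index[pos][1]])
            pos += 1
        return matches


if __name__ == "__main__":
    plain = IndexedTable([])
    heavy = IndexedTable(["a", "b", "c"])
    for i in range(100):
        row = {"a": i % 7, "b": i % 11, "c": i}
        plain.insert(row)
        heavy.insert(row)
    print(plain.writes, heavy.writes)
    print(heavy.lookup("c", 42))
```

With three indexed columns, each insert in this model touches four structures instead of one. The lookup is fast only for columns that happen to be indexed, which is the prediction problem ad-hoc analytic queries create.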

Those not familiar with column-oriented databases might wonder exactly what they are and what actual benefits they deliver over a legacy RDBMS. It’s important to note that, on the surface, a column-oriented database appears exactly like a traditional relational database: the logical concepts of tables and rows are the same, SQL commands are used to interact with the system, and most other RDBMS paradigms (e.g., security, backup/recovery, etc.) remain unchanged.

But a column-oriented database specifically designed for analytics overcomes the query limitations that exist in traditional RDBMSs by storing, managing, and querying data based on columns rather than rows. Because only the columns a query needs are accessed rather than entire rows, I/O activity as well as overall query response times can be reduced. In other words, if you don’t have to read an entire row to get the data you need, why do it?
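To make the storage difference concrete, the following is a minimal sketch that serializes the same small table in a row layout and in a column layout, then sums one column from each. The orders table, its column order, and the fixed 8-byte integer encoding are all assumptions chosen for the example; real column stores add compression, metadata, and other techniques not shown here.

```python
import struct

# Hypothetical orders table: (order_id, customer_id, quantity, amount_cents)
ROWS = [
    (1, 101, 2, 1999),
    (2, 102, 1, 4500),
    (3, 101, 5, 1250),
]
ROW_FORMAT = "<4q"  # four 8-byte little-endian integers per row
ROW_SIZE = struct.calcsize(ROW_FORMAT)


def to_row_layout(rows):
    """Store each row's values next to each other, one row after another."""
    return b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)


def to_column_layout(rows):
    """Store each column's values together in its own buffer."""
    count = len(rows)
    return [struct.pack(f"<{count}q", *col) for col in zip(*rows)]


def sum_from_rows(buf, col_index):
    """Sum one column; every full row must be read to reach the value."""
    total = bytes_read = 0
    for offset in range(0, len(buf), ROW_SIZE):
        row = struct.unpack_from(ROW_FORMAT, buf, offset)
        bytes_read += ROW_SIZE
        total += row[col_index]
    return total, bytes_read


def sum_from_columns(columns, col_index):
    """Sum one column; only that column's buffer is read."""
    buf = columns[col_index]
    values = struct.unpack(f"<{len(buf) // 8}q", buf)
    return sum(values), len(buf)


if __name__ == "__main__":
    # Equivalent in spirit to: SELECT SUM(amount_cents) FROM orders
    print(sum_from_rows(to_row_layout(ROWS), 3))
    print(sum_from_columns(to_column_layout(ROWS), 3))
```

Both functions return the same total, but the row layout reads all four values of every row while the column layout reads a single contiguous buffer a quarter of the size.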

The end result for column databases is the ability to query either moderate amounts of information (tens or hundreds of GB) or large amounts of data (one or more terabytes) and return results in much less time than standard RDBMSs can.
