DBA > Articles

MySQL 5.1 partitions in practice

By: Giuseppe Maxia
To read more DBA articles, visit http://dba.fyicenter.com/article/

Purpose
This article explains how to test the performance of a large database with MySQL 5.1, showing the advantages of using partitions. The test database uses data published by the US Bureau of Transportation Statistics. Currently, the data consists of ~ 113 million records (7.5 GB data + 5.2 GB index).

Getting and loading the data
The data for this exercise comes from the Bureau of Transportation Statistics. The data is provided as CSV files, and they are available from July 1987 to July 2007 (at the time of writing). The details of getting the data, setting the data structure and loading the data are explained on MySQL Forge. Problem specification

The test affects a database that is bigger than the amount of RAM in the server, and also the indexes are bigger than the RAM. The server used for this exercise has 4 GB of RAM, and the size of indexes is over 5 GB.

The reasoning behind this specification is that data warehouses have data collections that are far beyond any reasonable amount of RAM that you can possibly install, occupying several terabytes of storage.

In normal database storage, especially for OLTP, the indexes are cached in RAM, to allow for fast retrieval of records. When the data reaches sizes that can't be contained in the available RAM, we need to use a different approach.

One of MySQL 5.1 main features is partitioning, a technique that divides a table into logical portions to speed-up retrievals.

Using MySQL 5.1 partitions looks simple in principle, but there are some tricky points to be aware of while setting the data for maximum performance.

This article will examine the risks and offer some practical advice to achieve the best performance.
Partitioning overview

The current implementation of partitioning in MySQL 5.1 is quite simple. You can partition your data by
* range
* list
* hash
* key

Depending on your needs, you may choose different partitioning types. In this article we concentrate on range partitioning, which is perhaps the most interesting for data warehousing.

MySQL partitioning has some constraints that you must be aware of if you want to use this feature effectively.
* the partitioning value must be an integer;
* if the table has a unique/primary key, the partitioning column must be part of that key.

The first limitation is the one that has the biggest impact on your design decisions. If the column that you need to use for partitioning is not an integer, you need to use a function to transform it. Some additional constraints apply to partitions, as described in the manual, but we are not concerned about it here.

Partitioning gotchas
Using date columns

What is relevant in this context is the usage of date columns for partitioning. Since the native data type is not supported, we must convert the date into an integer. In addition to the list of allowed functions, we must take into account the fact that only two date functions can trigger the partition pruning. Thus, if we have to deal with a date column, we need to use one of them (YEAR or TO_DAYS).

When using the YEAR() function, partitioning is easy, readable, and straightforward.


CREATE TABLE by_year (
   d DATE
)
PARTITION BY RANGE (YEAR(d))
(
PARTITION P1 VALUES LESS THAN (2001),
PARTITION P2 VALUES LESS THAN (2002),
PARTITION P3 VALUES LESS THAN (2003),
PARTITION P4 VALUES LESS THAN (MAXVALUE)
)

Partitioning by month is trickier. You can't use the MONTH() for two reasons:
* you would be limited to 12 partitions, because MONTH does not include the year;
* the MONTH function is not optimized for partition pruning, thus performances would be horrible.
Thus, you need to use the other function that is optimized for partition pruning, TO_DAYS.

CREATE TABLE by_month (
   d DATE
)
PARTITION BY RANGE (TO_DAYS(d))
(
PARTITION P1 VALUES LESS THAN (to_days('2001-02-01')), -- January
PARTITION P2 VALUES LESS THAN (to_days('2001-03-01')), -- February
PARTITION P3 VALUES LESS THAN (to_days('2001-04-01')), -- March
PARTITION P4 VALUES LESS THAN (MAXVALUE)
)

That's already less clear to read than the one partitioned by year. What's worse is that the server won't retain your values, but it will only save the corresponding integers. This is what you get for the above table:

show create table by_month\G
*************************** 1. row ***************************
       Table: by_month
Create Table: CREATE TABLE `by_month`
  `d` date DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 /*!50100 
PARTITION BY RANGE (TO_DAYS(d)) 
(
PARTITION P1 VALUES LESS THAN (730882) ENGINE = MyISAM, 
PARTITION P2 VALUES LESS THAN (730910) ENGINE = MyISAM, 
PARTITION P3 VALUES LESS THAN (730941) ENGINE = MyISAM, 
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = MyISAM) */

It is advisable to save a copy of the script used to create tables partitioned by month, if you want to have a readable reference of what each partition means.

Full article...


Other Related Articles

... to read more DBA articles, visit http://dba.fyicenter.com/article/