Evolutionary Database Design

By: Martin Fowler
To read more DBA articles, visit http://dba.fyicenter.com/article/

Over the last few years we've developed a number of techniques that allow a database design to evolve as an application develops. This is a very important capability for agile methodologies. The techniques rely on applying continuous integration and automated refactoring to database development, together with a close collaboration between DBAs and application developers. The techniques work in both pre-production and released systems.

* Dealing with Change
o Limitations
* The Practices
o DBAs collaborate closely with developers
o Everybody gets their own database instance
o Developers frequently integrate into a shared master
o A database consists of schema and test data
o All changes are database refactorings
o Automate the refactorings
o Automatically Update all Database Developers
o Clearly separate all database access code
* Variations
o Keeping multiple database lineages
o You don't need a DBA
* Tools to Help
* Further Steps and Further Information

In the last few years, we've seen the rise of a new breed of software methodologies, the agile methodologies. These make some new and significant demands on database design. One of the most central of these demands is the idea of evolutionary design. On an agile project you assume that you cannot fix the requirements of the system up-front. As a result having a detailed design phase at the beginning of a project becomes impractical. The design of the system has to evolve through the various iterations of the software. Agile methods, in particular extreme programming (XP), have a number of practices that make this evolutionary design practical.

Many people have questioned whether evolutionary design can be applied to a system with a large database component. Indeed many people told us that it was impossible - a troubling thought as ThoughtWorks embarked on a large database-oriented project using many agile and XP techniques.

This article describes the practices that we've used to allow us to do this impossible thing. We won't say that we've completely solved the database evolution problem, but we do think we've demonstrated a set of techniques that many people will find useful. Dealing with Change

One of the primary features of agile methods is their attitude towards change. Most of the thinking about software process is about understanding requirements early, signing off on these requirements, using the requirements as a basis for design, signing off on that, and then proceeding with construction. This is a plan-driven cycle, often referred to (usually with derision) as the waterfall approach

Such approaches look to minimize changes by doing extensive up-front work. Once the early work is done, changes cause significant problems. As a result such approaches run into trouble if requirements are changing, and requirements churn is a big problem for such processes.

Agile processes approach change differently. They seek to embrace change, allowing changes to occur even late in a development project. Changes are controlled, but the attitude of the process is to enable change as much as possible. Partly this is in response to the inherent instability of requirements in many projects, partly it is to better support dynamic business environments by helping them change with the competitive pressures.

In order to make this work, you need a different attitude to design. Instead of thinking of design as a phase, which is mostly completed before you begin construction, you look at design as an on-going process that is interleaved with construction, testing, and even delivery. This is the contrast between planned and evolutionary design. One of the vital contributions of agile methods is that they have come up with practices that allow evolutionary design to work in a controlled manner. So instead of the common chaos that often happens when design isn't planned up-front, these methods provide techniques to control evolutionary design and make them practical.

An important part of this approach is iterative development, where you run the entire software life-cycle many times during the life of a project. Agile processes run complete life cycles in each iteration, completing the iteration with working, tested, integrated code for a small subset of the requirements of the final product. These iterations are short, usually running between a week and a couple of months, with a preference towards shorter iterations.

While these techniques have grown in use and interest, one of the biggest questions is how to make evolutionary design work for databases. Most people consider that database design is something that absolutely needs up-front planning. Changing the database schema late in the development tends to cause wide-spread breakages in application software. Furthermore changing a schema after deployment result in painful data migration problems.

Over the course of the last three years we've been involved in a large project (called Atlas) that has used evolutionary database design and made it work. The project involved almost 100 people in multiple sites world-wide (US, Australia, and India). It is around half a million lines of code and has over 200 tables. The database evolved during a year and a half of initial development and continues to evolve even though it's in production for multiple customers. During this project we started with iterations of a month, but after a few months changed to two week iterations which worked better. The techniques we describe here are the ones that we (or more accurately Pramod) used to make this work.

Since that project got going we've spread these techniques over more of our projects, gaining more experience from more cases. We've also found inspiration, ideas, and experience from other agile projects.

Limitations

Before we dive into the techniques, it's important to state that we haven't solved all the problems of evolutionary database design. In particular:
* We developed an application database for a single application rather than an integration database that tries to integrate multiple databases.
* We don't have to keep the production databases up 24/7

We don't consider these problems to be inherently unsolvable, after all many people believed we couldn't solve this one. But until we do, we won't claim we can solve them either.

The Practices
Our approach to evolutionary database design depends on a handful of important practices. DBAs collaborate closely with developers

One of the tenets of agile methods is that people with different skills and backgrounds need to collaborate very closely together. They can't communicate mainly through formal meetings and documents. Instead they need to be out talking with each other and working with each other all the time.Everybody is affected by this: analysts, PMs, domain experts, developers... and DBAs.

Every task that a developer works on potentially needs a DBA's help. Both the developers and the DBA need to consider whether a development task is going to make a significant change to the database schema. If so the developer needs to consult with the DBA to decide how to make the change. The developer knows what new functionality is needed, and the DBA has a global view of the data in the application.

To make this happen the DBA has to make himself approachable and available. Make it easy for a developer to just pop over for a few minutes and ask some questions. Make sure the DBAs and developers sit close to each other so they can easily get together. Ensure that application design sessions are known about so the DBA can pop in easily. In many environments we see people erecting barriers between the DBA and application development functions. These barriers must come down for an evolutionary database design process to work.

Everybody gets their own database instance
Evolutionary design recognizes that people learn by trying things out. In programming terms developers experiment with how to implement a certain feature and may make a few attempts before settling down to a preferred alternative. Database design can be like that too. As a result it's important for each developer to have their own sandbox where they can experiment, and not have their changes affect anyone else.

Many DBA experts see multiple databases as anathema, too difficult to work in practice, but we've found that you can easily manage a hundred or so database instances. The vital thing is to have to tools to allow you to manipulate databases much as you would manipulate files.

Developers frequently integrate into a shared master

Although developers can experiment frequently in their own area, it's important to bring the different approaches back together again frequently. An application needs a shared master database that all work flows from. When a developer begins a task they copy the master into their own workspace, manipulate, and then integrate their changes back into the master. As a rule of thumb each developer should integrate once a day.

Let's take an example where Mike starts a development task at 10am (assuming he actually comes in that early). As part of this task he needs to change the database schema. If the change is easy, like adding a column, he just decides how to make the change himself, Mike also makes sure the column he wants to add does not already exist in the database, with the help of the data dictionary (discussed later). If it's more complicated then he grabs the DBA and talks over the likely changes with him.

Once he's ready to begin he takes a copy of the database master and can modify both the database schema and code freely. As he's in a sandbox any changes he makes don't impact anyone else's. At some point, say around 3pm, he's pretty comfortable that he knows what the database change needs to be, even though he's not completely done with his programming task. At that point he grabs the DBA, and tells him about the change. At this point the DBA can raise any issues that Mike hasn't considered. Most of the time all is well and the DBA goes off and makes the change (by applying one or more database refactorings, which we'll come to below). The DBA makes the changes right away (unless they are destructive changes - again more on that below). Mike can continue to work on his task and commit his code any time he likes once the DBA has applied these changes to the master.

You may well recognize this principle as similar to the practice of Continuous Integration, which is applied to source code management. Indeed this is really about treating the database as another piece of source code. As such the master database is kept under configuration management in much the same way as the source code. Whenever we have a successful build, the database is checked into the configuration management system together with the code, so that we have a complete and synchronized version history of both.

Full article...