Building a Hybrid Data Warehouse Model
By: James Madison
As suggested by this reference implementation, in some cases blending the relational and dimensional models may be the right approach to data warehouse design.
Published April 2007
Relational and dimensional modeling are often used separately, but they can be successfully incorporated into a single design when needed. Doing so starts with a normalized relational model and then adds dimensional constructs, primarily at the physical level. The result is a single model that can provide the strengths of its parent models fairly well: it represents entities and relationships with the precision of the traditional relational model, and it processes dimensionally filtered, fact-aggregated queries with speed approaching that of the traditional dimensional model.
Real-world experience was the motivation for this analysis: on three separate data warehousing projects where I worked as programmer, architect, and manager, respectively, I found a consistent pattern of data/database behavior that lent itself far more to a hybrid combination of dimensional and relational modeling than to either one alone.
This article discusses the hybrid design and provides a fully functional reference implementation. The system runs on Oracle Database 10g. It contains all code needed to build the database schemas, generate sample data, load it into the schemas, build the indexes and materialized views, run the sample queries, capture the runtimes, and provide statistics on the runtimes.
The hybrid model is not a one-size-fits-all solution. Many projects are best served by either using only one of the traditional models or using both models separately with a feed between them. But if the objective is to create a single database that can both store data in its properly normalized form and run aggregation queries with good performance, the hybrid model is a design pattern to consider.
Sample Business Domain
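The sample business domain is in the insurance industry and uses the following entities, named here after the tables in the relational schema: account, policy, vehicle, coverage, and premium.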
The sample business questions used to analyze the performance of the system have some parallel with reality but also cover extremes of behavior: scanning the fact table for many rows, retrieving a tiny percentage of fact rows, restricting to only the top table, restricting to every table, restricting to only the lower tables, and so on. They are the kinds of questions business users ask of dimensional models, not the kinds of questions that are typically asked of relational models. The relational model questions are not addressed, because it is assumed that the relational model will outperform the dimensional model for questions of a relational nature, such as "Show me all the vehicles on this policy." The questions used in this analysis are the following:
The three models are presented in Figures 1, 2, and 3. The hybrid model is based on the relational model, with two changes that derive from dimensional modeling practices: (1) Create a relationship from the PREMIUM table to each table in the upper portion of the hierarchy, and (2) Add the time dimension.
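The two changes can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module rather than Oracle, and the column names, the exact hierarchy (account, policy, vehicle, coverage, premium), and the sample rows are assumptions made for illustration; they are not taken from the reference implementation. The point is only that the direct relationships from PREMIUM let a dimensional-style question skip the intermediate joins while returning the same answer as the fully normalized path.

```python
import sqlite3

# Illustrative only: table and column names are assumptions, not the article's DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account  (account_id  INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE policy   (policy_id   INTEGER PRIMARY KEY,
                       account_id  INTEGER REFERENCES account(account_id));
CREATE TABLE vehicle  (vehicle_id  INTEGER PRIMARY KEY,
                       policy_id   INTEGER REFERENCES policy(policy_id));
CREATE TABLE coverage (coverage_id INTEGER PRIMARY KEY,
                       vehicle_id  INTEGER REFERENCES vehicle(vehicle_id));

-- Change (2): the added time dimension.
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Change (1): PREMIUM keeps its relational parent (coverage_id) and also
-- carries a direct key to every table higher in the hierarchy.
CREATE TABLE premium (
    premium_id  INTEGER PRIMARY KEY,
    coverage_id INTEGER REFERENCES coverage(coverage_id),
    vehicle_id  INTEGER REFERENCES vehicle(vehicle_id),
    policy_id   INTEGER REFERENCES policy(policy_id),
    account_id  INTEGER REFERENCES account(account_id),
    time_id     INTEGER REFERENCES time_dim(time_id),
    amount      INTEGER
);

INSERT INTO account  VALUES (1, 'CT'), (2, 'NY');
INSERT INTO policy   VALUES (10, 1), (20, 2);
INSERT INTO vehicle  VALUES (100, 10), (200, 20);
INSERT INTO coverage VALUES (1000, 100), (2000, 200);
INSERT INTO time_dim VALUES (1, 2006, 1), (2, 2006, 2);
INSERT INTO premium  VALUES
    (1, 1000, 100, 10, 1, 1, 50),
    (2, 1000, 100, 10, 1, 2, 60),
    (3, 2000, 200, 20, 2, 1, 70);
""")

# Relational path: walk the whole hierarchy from PREMIUM up to ACCOUNT.
relational_rows = conn.execute("""
    SELECT a.state, SUM(p.amount)
    FROM premium p
    JOIN coverage c ON c.coverage_id = p.coverage_id
    JOIN vehicle  v ON v.vehicle_id  = c.vehicle_id
    JOIN policy  po ON po.policy_id  = v.policy_id
    JOIN account  a ON a.account_id  = po.account_id
    GROUP BY a.state ORDER BY a.state
""").fetchall()

# Hybrid path: join PREMIUM straight to ACCOUNT, as a fact table joins a dimension.
hybrid_rows = conn.execute("""
    SELECT a.state, SUM(p.amount)
    FROM premium p
    JOIN account a ON a.account_id = p.account_id
    GROUP BY a.state ORDER BY a.state
""").fetchall()

print(relational_rows)
print(hybrid_rows)
```

Both queries return the same totals per state. In the hybrid schema the second form is the one that resembles a star query, which is what allows dimensional optimizations to apply.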
Largely standard techniques were used to convert the models into their physical implementation in database schemas. The relational schema was created with normalized modeling techniques, and the dimensional schema was built according to Ralph Kimball's work. Creating the hybrid meant copying the relational schema and then layering the dimensional constructs on top of it. (The "File Descriptions" sidebar lists the most important files in the implementation, including those with the DDL, the system validation, the queries, and the automated analysis used to generate the sample code.)
Because only three nonkey attributes are used, a SIZING attribute is added to each table, with a type of CHAR(100) to make the row size more realistic.
Certain database parameters must be set so that star joins will occur and materialized views will be used. The important parameters are shown here:
NAME                           VALUE
------------------------------ --------------------
compatible                     10.2.0.1.0
optimizer_features_enable      10.2.0.1
optimizer_mode                 first_rows
pga_aggregate_target           83886080
query_rewrite_enabled          true
query_rewrite_integrity        stale_tolerated
sga_target                     167772160
star_transformation_enabled    true

Verifying that a star join is occurring is done with EXPLAIN PLAN, as detailed in the Oracle documentation.
All three schemas were loaded with the same data. The best evidence of consistent data loading is that all three schemas produce the same answers for the sample queries.
The volume of data used for the analysis is shown below.
OWNER  TABLE_NAME     NUM_ROWS AVG_ROW_LEN LAST_ANALYZED
------ ------------ ---------- ----------- -------------------
DIM    ACCOUNT_DIM        2000         128 2006-01-14:19-51-56
       COVERAGE_DIM        900          17 2006-01-14:19-51-57
       POLICY_DIM         6000         128 2006-01-14:19-51-58
       PREMIUM_FACT    1371183          23 2006-01-14:19-52-14
       TIME_DIM           3600          21 2006-01-14:19-52-39
       VEHICLE_DIM       24000         130 2006-01-14:19-52-39
HYB    ACCOUNT            2000         128 2006-01-14:19-53-42
       COVERAGE         144000          28 2006-01-14:19-53-47
       POLICY             6000         142 2006-01-14:19-53-53
       PREMIUM         1373463          49 2006-01-14:19-54-41
       TIME_DIM           3600          21 2006-01-14:19-55-08
       VEHICLE           24000         144 2006-01-14:19-55-10
REL    ACCOUNT            2000         124 2006-01-14:19-39-22
       COVERAGE         144288          27 2006-01-14:19-39-30
       POLICY             6000         138 2006-01-14:19-39-31
       PREMIUM         1389963          29 2006-01-14:19-40-08
       VEHICLE           24000         139 2006-01-14:19-40-13

The goal was to provide a volume large enough to prevent the optimizer from taking shortcuts that would undermine the analysis, such as reading entire tables instead of using indexes. According to Oracle Database Data Warehousing Guide 10g Release 2 (10.2), Schema Modeling Techniques, a star transformation might not occur if the optimizer finds "tables that are too small for the transformation to be worthwhile."
A fairly arbitrary goal of the implementation was to have at least 1 million rows in the fact table. Given that all dimensional and hybrid query plans generated by QUERIES.SQL meet the criteria of star joins, the data volume used appears to be sufficient for the current analysis.
The number of COVERAGE_DIM rows in the dimensional schema is much smaller than the number of COVERAGE rows in the other two schemas because of the way a weak entity has to be represented in the dimensional schema.
Here is the amount of space consumed by the various schemas:
OWNER           TOTAL_SIZE
--------------- ----------------
DIM                  129,499,136
HYB                  244,056,064
REL                  130,023,424
Because the hybrid schema is a combination of the relational and the dimensional, it follows that it should be roughly the size of both, minus any common elements, and the numbers bear this out.
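The arithmetic behind that statement can be checked directly from the sizes listed above. The short sketch below only restates the published figures; treating the difference as the footprint of the "common elements" follows the reasoning in the text and is not a separately measured value.

```python
# Schema sizes as reported in the TOTAL_SIZE listing above.
sizes = {"DIM": 129_499_136, "HYB": 244_056_064, "REL": 130_023_424}

# Size of the two parent schemas added together.
combined = sizes["DIM"] + sizes["REL"]

# What the hybrid saves relative to holding both parents side by side.
shared = combined - sizes["HYB"]

# Hybrid size as a fraction of the combined parents.
ratio = sizes["HYB"] / combined

print(f"DIM + REL       = {combined:,}")
print(f"DIM + REL - HYB = {shared:,}")
print(f"HYB / (DIM+REL) = {ratio:.1%}")
```

The hybrid comes out somewhat smaller than the sum of its parents, which is consistent with the two schemas sharing some elements.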
Running the System
Each of the queries was run 21 times, and the median runtime was used as the representative value, as shown below.
EVENT WINNER_TIME          RNR_UP_TIME          LOSER_TIME
----- -------------------- -------------------- --------------------
1.    DIM = 00:00:06.049   REL = 00:00:09.023   HYB = 00:00:09.644
2.    DIM = 00:00:04.186   HYB = 00:00:07.961   REL = 00:00:08.092
3.    DIM = 00:00:03.415   HYB = 00:00:04.938   REL = 00:00:05.428
4.    DIM = 00:00:00.140   HYB = 00:00:00.190   REL = 00:00:06.990
5.    HYB = 00:00:00.131   DIM = 00:00:00.651   REL = 00:00:05.418
6.    DIM = 00:00:00.530   HYB = 00:00:01.392   REL = 00:00:05.478
7.    DIM = 00:00:00.520   HYB = 00:00:01.572   REL = 00:00:07.971
8.    DIM = 00:00:00.461   HYB = 00:00:00.731   REL = 00:00:01.882
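The measurement approach, repeating each query an odd number of times and taking the median, can be sketched in a few lines. This is a generic timing harness, not the automated analysis shipped with the reference implementation; `median_runtime` and `run_query` are hypothetical names, and `run_query` stands for any callable that executes one query against one schema.

```python
import statistics
import time


def median_runtime(run_query, repetitions=21):
    """Run a callable `repetitions` times; return (median seconds, all timings)."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # hypothetical: executes one sample query
        timings.append(time.perf_counter() - start)
    # With an odd number of runs the median is an actually observed value.
    return statistics.median(timings), timings


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a database.
    median_seconds, all_timings = median_runtime(lambda: sum(range(100_000)))
    print(f"median of {len(all_timings)} runs: {median_seconds:.6f} s")
```

Using the median rather than the mean keeps a single slow run, for example one that hits a cold cache, from distorting the representative value.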