Constructing a Data Warehouse

By: Tom Moulder
To read more DBA articles, visit http://dba.fyicenter.com/article/

In today’s business world, data warehouses are increasingly being used to help companies make strategic business decisions. To understand how a warehouse can benefit you and what is required to manage a warehouse, you must first understand how a data warehouse is constructed and established. Some of the challenges associated with implementing a data warehouse are understood by answering these questions:

+ What is the quality of data in the warehouse?
+ What are the definitions of the data elements in the warehouse?
+ How current should the warehouse data be?
+ How can you correlate warehouse data to operational data?
+ What is the physical size of a warehouse?
+ What hardware and software will be used for the warehouse?

What is the Quality of Data in the Warehouse?
Data in a warehouse needs to be accurate and of high quality if business decisions will be made using it. Some important questions to ask in determining the quality of the data are:
+ What is the source of the data?
+ Does it come from an operational application?
+ If so, what are the edit rules and criteria placed on the data before it is allowed to enter the operational application? + What are the incentives for individuals to enter the data correctly?

* This is an important question especially if the information is taken from the Internet. There are noted cases where data from the Internet is replete with errors because individuals like to protect their privacy. E-commerce sites that charge to individual credit cards usually have a higher degree of accuracy in personal information because names can be verified with the credit card companies. Additionally, e-commerce sites that ship to addresses have an even higher degree of information accuracy.

The challenge for the data warehouse is to cleanse the data so that it can be correlated and analyzed appropriately.

Individual names can be problems because many people use name derivations, such as legal names, middle names and nicknames. The same problem exists with address information garnered through operational applications. Those who live in apartments can enter the same information in different formats, which can cause problems. For example, an individual may see the following entries as the same:

Gary T. Moulder
3000 Woodland Way Apt. # 3500
Houston, Texas 77042

Gary Thomas Moulder
3000 Woodland Way
Apt. 3500
Houston, TX 77042

How can this data be cleansed? Programs could be written to look for similar data, make correlations that would determine if similar addresses were for the same individual and format one record for insertion into the warehouse. This would assume that the schema for the warehouse has a column named “address1” and “address2.” Or perhaps, the schema could have columns like “street_number,” “street_name,” “street_type” and “apartment_suite_number.” With this latter situation, the source would have to be broken down into the same components that make up the schema. Though perhaps not as feasible, if the operational application had the same columns, the porting of the data from the operational system to the warehouse would be greatly simplified.

Each business organization must decide for itself how data will be ported to the warehouse. The time and effort required to port the data contrasted with the time and effort to change operational systems to make them more warehouse-friendly are factors involved in that decision. The emergence of software that automates the “data scrubbing” process and integrates this process into the movement of data from an operational system to a warehouse will also affect the business decision. What are the Definitions of the Data Elements in the Warehouse?

The decision of how to port data also involves defining the data contained in the warehouse. Warehouse users may not have a highly technical point of view and may need some definition of the data.

Using the example of the similar addresses, some questions should be asked about the potential queries that will be executed against the warehouse.

Would your users be interested in knowing the business information pertaining to only certain addresses on a particular street? If so, placement of the street number and the street name in separate columns could facilitate this type of query.

Would your users be interested in targeting apartment dwellers with a particular marketing campaign? Placement of the apartment number in a separate column could help in checking for a value that is not null and in verifying the assumption that the person lives in an apartment. This assumption could be validated or invalidated based on data transformation and cleansing. If the data could not always support that assumption, the meaning of the column could be clarified to document the values that would not indicate an apartment dweller. This would allow for the use of an “in-list” in an SQL statement when the users want to limit the query to apartment dwellers.

There are many more possibilities of what users could do when querying a warehouse; such queries show why it is important for data in the warehouse to be easily defined. The more intelligent the query, the more useful the business information revealed will be. Additionally, the warehouse schema should allow the greatest number of questions and answers to be posed, up to the limits of the database management system. How Current Should the Warehouse Be?

Routine maintenance of the warehouse data should be scheduled based on the volatility of the data. An address entered into the system 12 months ago may no longer be current for the targeted individual. However, the warehouse user would not know if the information were correct. Some analysis outside your operational system might determine that in a particular city people who live in apartments move every 18 months on average. This might give you the basis for verifying the customer address information every month or every three months based on the cost of verification. For purposes of targeted marketing and focused campaigns, the data may be perfectly acceptable if it is current to a given week or perhaps even a given month.

Many elaborate schemes have been devised to maintain warehouse data as near to real-time as possible. For instance, when the warehouse is only seconds behind, operational data might provide interesting information for users querying stock transactions. However, day traders may be the only ones interested in this information.

There is a tremendous cost associated with keeping a warehouse current with operational data, and for most companies, the use of the warehouse does not require that the data absolutely match the operational data. The cost is usually not worth the difference in accuracy for user queries. Therefore, if someone is interested in having data that is current within minutes or seconds, some method should be devised to allow users who need real-time updates to the warehouse to bear the cost of the hardware and software required to provide that level of data currency.

Another issue is the storage of time-variable information in a warehouse. For example, a business analyst could use historical data for individuals to correlate the places they have lived in the past to the purchases they have made while living in each location. In this situation, the database must contain a record of every address for every customer and correlate these addresses to the detailed purchase information that is also stored in the warehouse. This is valuable information for a business analyst, but it creates great complexities for the data base administrator designing the data warehouse. How Can You Correlate Warehouse Data to Operational Data?

For all data users to get the maximum benefit from the data warehouse, organizations must decide on a common nomenclature for data elements. For instance, a company could have separate applications for sales and support. In the sales system, the customer name might be stored in a column named “cust_name.” However, in the support application, the customer name might be stored in a column named “company_name.”

Data with inconsistent nomenclature across multiple applications does not develop overnight. It is the result of years of uncoordinated application development across departments and company business lines. Likewise, these problems cannot be resolved overnight or fixed without costs. Many organizations successfully use a data analyst to create a single company-wide view of the business data and then develop a repository of business data that describes all entities, used by the company and their meanings.

In DB2®, one possible solution to minimize the impact of these differences is to implement the use of synonyms so users can refer to business data with the same name they have always used. While this makes it easy on the warehouse user, it adds complexity to the database schema that must be managed by the database administrator. Integration of synonym information into the warehouse applications also allows users to have easy access to their familiar choices of names and avoids the retraining required if many of the element names are changed. What is the Physical Size of a Warehouse?

The questions asked of a warehouse typically require detailed data to arrive at answers. The answers become more accurate as the time span for the query increases. The result is a warehouse of company information that is enormous in size. Currently, hundreds of gigabyte warehouses are common and many companies have multi-terabyte warehouses. This produces unprecedented challenges to the business environment.

From a planning perspective, growth of the warehouse and the projection of future disk drive requirements are a significant challenge. Fortunately, there has been a great deal of competition in the disk drive market resulting in relatively low-cost data storage. The quality of data has increased due to major improvements in the reliability of disk drives and with RAID technology fault tolerance.

However, there are still two important requirements: the physical requirement for storage and the business requirement of just-in-time disk drives that keep excess capacity to a minimum and reduce operating costs for data centers.

As the size of the warehouse increases, there is more pressure to archive data to reduce the warehouse size. This leads to the ultimate questions of:
+ How much data is required online?
+ How much of this data is actually being used for queries?

From the beginning of disk drives, the placement of datasets was a foremost priority for performance. With the new RAID drives, some question whether data placement is as critical as before. A warehouse will use all the capabilities of RAID devices. However, there is still a need for dataset placement depending on the partitioning scheme a customer uses as well as the frequency of access to the data. For example, if the data is partitioned on a monthly scheme and the majority of queries target the last quarter or six months, then it would be advantageous to provide those partitions the fastest access possible. An administrator will have to devise a plan, with the least disruption to the user, to move the data from fast access to slower access when the data is three or six months old. If the partition scheme is monthly, then once per month a partition will cross that threshold. The plan will grow in complexity as the amount of data in the warehouse increases.

A company that has these types of queries – that primarily access the last three to six months – will invariably question whether it needs data that is over six months old. However, a user may want to query the last 24 months of sales information for a targeted marketing campaign immediately after the older data is archived. To manage this, creative administrators may use a mechanism to age the data in DB2 tables with different owners. The critical issue is whether this data will be available when this aging process takes place.

Another issue is the storage of aggregated data in the warehouse. By using summarized information for each day, week or month, the size of the warehouse can be reduced. However, user queries should be specifically coded and tuned to take advantage of this summarized data. And users must realize that summarization will eliminate their ability to answer any question that requires the detailed data. What Hardware and Software Will Be Used for the Warehouse?

There are many choices of hardware and software for data warehouses. Because warehouses grow, the hardware and software chosen should be highly scalable. This will prevent the need for conversion after the warehouse has grown for two years. The use of GUI query tools that hide the underlying database and its server can help somewhat with growth problems. However, it is still necessary to ensure that the query tool will support all the databases that might be required as your data requirements grow. It is also important for the query tool to mask the differences in SQL syntax from database to database.

Full article...