Many companies have implemented big data applications. These applications consist of a very large data store, hybrid hardware and software to store and access the data, and a sophisticated software interface that accepts the queries of business analysts, accesses the data store, and provides answers that can be used to understand customer needs, simplify business transactions, and increase profitability.
As success stories (and failures) have appeared in the news and technical publications, several myths have emerged about big data. This article explores a few of the more significant myths, and how they may negatively affect your own big data implementation.
Myth #1: Big Data Applications can Stand Alone.
False. Your big data application certainly contains a lot of data. However, of equal importance is the analytics software used to query the data. Analyzing business data is common, especially in companies that already have a data warehouse. The data warehouse contains time-dependent snapshots of operational data, and your current data marts and analytical reports depend upon dimensions in the warehouse.
Dimensions are entities by which an analyst would subset or categorize information. These include time, geography, customer type, store, department, and so forth. A query that sums customer purchases of electronic items for retail stores in several states during the Christmas holiday season includes dimensions of product type (electronic items), stores, geography (state), and time (Christmas holidays). Each dimension gives a different way to summarize data, and may provide clues regarding customer preferences, item availability in stores, or profitability.
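The holiday query described above can be sketched in a few lines. This is a minimal illustration, not a production design: the table names (sales_fact, store_dim, product_dim), the column names, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def build_demo_warehouse():
    """Create a tiny, hypothetical star schema in memory and load sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, state TEXT);
        CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_type TEXT);
        CREATE TABLE sales_fact  (store_id INTEGER, product_id INTEGER,
                                  sale_date TEXT, amount REAL);
    """)
    conn.executemany("INSERT INTO store_dim VALUES (?, ?)",
                     [(1, "TX"), (2, "OK"), (3, "CA")])
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                     [(10, "electronics"), (20, "apparel")])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
        (1, 10, "2023-12-20", 200.0),
        (1, 10, "2023-12-24", 50.0),
        (1, 20, "2023-12-24", 75.0),   # apparel, filtered out by product type
        (2, 10, "2023-12-22", 120.0),
        (2, 10, "2023-11-01", 999.0),  # outside the holiday date range
        (3, 10, "2023-12-23", 300.0),  # store in a state we do not ask about
    ])
    return conn


def holiday_electronics_by_state(conn, states, start, end):
    """Sum electronics purchases per state for the given states and date range."""
    placeholders = ",".join("?" for _ in states)
    sql = f"""
        SELECT s.state, SUM(f.amount) AS total
        FROM sales_fact AS f
        JOIN store_dim   AS s ON s.store_id = f.store_id      -- store, geography
        JOIN product_dim AS p ON p.product_id = f.product_id  -- product type
        WHERE p.product_type = 'electronics'
          AND s.state IN ({placeholders})
          AND f.sale_date BETWEEN ? AND ?                     -- time
        GROUP BY s.state
        ORDER BY s.state
    """
    return conn.execute(sql, [*states, start, end]).fetchall()


if __name__ == "__main__":
    conn = build_demo_warehouse()
    print(holiday_electronics_by_state(conn, ["TX", "OK"],
                                       "2023-12-15", "2023-12-31"))
```

Each join or filter in the query corresponds to one dimension from the example: product type, store, geography, and time. Swapping the GROUP BY column (for instance, grouping by store instead of state) gives a different summary of the same facts.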
Big data applications require such dimensions as well. Since this data is already stored and maintained in your data warehouse, it is natural to integrate the data models of your warehouse and your big data application.
A natural outcome of this integration is that you will be upgrading your data warehouse so that analytical queries can encompass the warehouse data. A good enterprise data model and a comprehensive data dictionary are a necessity.
Warehouse upgrades will include adding new dimensions, including data from new operational systems, and storing large objects such as scanned images and XML. This last item is especially important, and it also bears on the budgeting discussion under Myth #2. Large, complex objects may not be directly analyzable by your business intelligence (BI) software package, but basic information about them may be stored in the data warehouse. For example, XML documents can be decoded by some database management systems and stored in a database as tables. This table data may then be analyzed by the BI software.
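As a rough illustration of turning XML into table data, the sketch below shreds a small document into rows and then aggregates them with SQL. It does the decoding in application code using Python's standard library rather than inside the database, so it is a stand-in for the DBMS feature and not a description of any particular product. The document layout, the order_item table, and the function names are assumptions made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical order document; real operational XML will look different.
ORDER_XML = """
<orders>
  <order id="1001" customer="C-17">
    <item sku="TV-55" qty="1" price="499.00"/>
    <item sku="HDMI-2M" qty="2" price="9.50"/>
  </order>
  <order id="1002" customer="C-42">
    <item sku="TV-55" qty="1" price="499.00"/>
  </order>
</orders>
"""


def shred_orders(xml_text):
    """Flatten the nested order/item structure into one row per item."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for order in root.findall("order"):
        for item in order.findall("item"):
            rows.append((
                int(order.get("id")),
                order.get("customer"),
                item.get("sku"),
                int(item.get("qty")),
                float(item.get("price")),
            ))
    return rows


def load_and_summarize(xml_text):
    """Load the shredded rows into a table and total quantity and revenue per SKU."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE order_item
                    (order_id INTEGER, customer TEXT, sku TEXT,
                     qty INTEGER, price REAL)""")
    conn.executemany("INSERT INTO order_item VALUES (?, ?, ?, ?, ?)",
                     shred_orders(xml_text))
    return conn.execute("""SELECT sku, SUM(qty), SUM(qty * price)
                           FROM order_item
                           GROUP BY sku
                           ORDER BY sku""").fetchall()


if __name__ == "__main__":
    print(load_and_summarize(ORDER_XML))
```

Once the rows are in a table, ordinary BI-style queries apply to them even though the original XML documents would not have been analyzable directly.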
Myth #2: The Only New Budget Items are Big Data Hardware and Software.
False. Despite some vendors' claims, any IT enterprise implementing a big data application will incur significant costs beyond the investment in big data hardware and software.
First, plan for the near future. Your big data application must have the ability to scale up. This refers to the ability of the system to handle larger volumes of data, faster data transmission speeds, and increasing numbers of users of data-consuming applications. The first symptoms of a system that is not scaling up will be slower perceived response times, long job run times, and elongated transaction times.
For many applications these issues would be perceived as capacity-related, and the response would be to add more CPUs, more memory, and more disk storage. However, in a big data environment more power may not be the answer. Most vendor-provided hybrid hardware and software for big data depends on proprietary data storage methods, including data compression, massively parallel processing, and coordination with the base database management system (DBMS). Scaling up in this environment requires rethinking the way that your data is architected and stored, including possible denormalization of data, logical partitioning, more intelligent query rewrite, and closer attention to SQL performance analysis.
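To make one of these ideas concrete, here is a toy sketch of logical partitioning. It is not how any vendor's appliance stores data. The class name, the choice of month as the partition key, and the sample figures are all assumptions for illustration. The point is only that a date-range query can skip whole partitions instead of scanning everything, which is a different kind of gain from adding hardware.

```python
from collections import defaultdict
from datetime import date


class MonthPartitionedSales:
    """Toy sales store that keeps rows in one bucket per (year, month)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, sale_date, amount):
        """Route the row to the partition for its month."""
        key = (sale_date.year, sale_date.month)
        self._partitions[key].append((sale_date, amount))

    def total_between(self, start, end):
        """Return (total, partitions_scanned) for start <= sale_date <= end."""
        first = (start.year, start.month)
        last = (end.year, end.month)
        total = 0.0
        scanned = 0
        for key, rows in self._partitions.items():
            if key < first or key > last:
                continue  # partition pruning: this month cannot match
            scanned += 1
            total += sum(amount for day, amount in rows if start <= day <= end)
        return total, scanned


if __name__ == "__main__":
    store = MonthPartitionedSales()
    for day, amount in [
        (date(2023, 10, 5), 10.0),
        (date(2023, 11, 20), 20.0),
        (date(2023, 12, 24), 30.0),
        (date(2023, 12, 26), 40.0),
        (date(2024, 1, 2), 50.0),
    ]:
        store.insert(day, amount)
    # Only the December 2023 partition is read for this query.
    print(store.total_between(date(2023, 12, 20), date(2023, 12, 25)))
```

With the sample data, four monthly partitions exist but the December query reads only one of them. Choosing the partition key to match the most common query filters is the design decision that matters.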
Next, plan for the medium term by budgeting for scaling out. Big data stores are fed from operational systems, and such systems today consist of far more than simple character and numeric data. Some systems contain complex data types such as Extensible Markup Language (XML) documents, audio and video data, scanned images, and other large objects (LOBs). Your big data application may need to analyze these data types while doing aggregations and other operations.
To implement this, you must budget for staff time. Of primary importance is an enterprise data model that spans hardware architectures, as well as data integration across your enterprise.