
Tame Big Data using Oracle Data Integration


Purpose
This demo illustrates how you can move and transform all your data using Oracle Data Integration - whether that data resides in Oracle Database, Hadoop, third-party databases, applications, files, or a combination of these sources. The "Design once, run anywhere" paradigm allows you to focus on the logical rules of data transformation and movement while the choice of implementation is separate and can evolve as use cases change and new technologies become available.
This demo is based on the virtual machine Big Data Lite 4.1, which is downloadable from http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.
Time to Complete
Approximately one hour
Introduction
This tutorial is divided into the following sections:
Review the Scenario
Ingesting Data Using Sqoop and Oracle GoldenGate
Transforming Data Using Hive
Loading Data to Oracle DB using Oracle Loader for Hadoop
Accessing Hadoop Data from Oracle using Big Data SQL
Transforming Data using Pig
Execute all Steps Using an ODI Package


Scenario
Oracle MoviePlex is an online movie streaming company. Its web site collects every customer interaction in massive JSON formatted log files. It also maintains movie data in a MySQL source database. By unlocking the information contained in these sources and combining it with enterprise data in its data warehouse, the company will be able to enrich its understanding of customer behavior, the effectiveness of product offers, the organization of web site content, and more.

The company is using Oracle's Big Data Management System to unify its data platform and facilitate these analyses. This is achieved by implementing a Data Reservoir pattern, in which both structured and unstructured data are collected and staged in a Big Data instance for further analysis and loading into target databases.

(Figure: Oracle Big Data Management System)

Oracle Data Integration provides a unified, tool-driven approach to declaratively define the integration of all data. The concept of "Design once, run anywhere" means that users can define integration processes regardless of the implementation language and run them in different environments. A transformation that is executed today on an RDBMS can be reused to run on the Hadoop cluster and utilize future Hadoop languages and optimizations as they become available.

For Oracle MoviePlex, a variety of mechanisms is showcased. Data is loaded from a source database into Hive tables, both in bulk using Sqoop through Oracle Data Integrator (ODI) and through change data capture using Oracle GoldenGate (OGG). Data is transformed by joining, filtering, and aggregating with Hive or Pig in ODI, and the resulting data can be loaded into a target Oracle DB using the optimized Oracle Loader for Hadoop (OLH) or Oracle SQL Connector for HDFS (OSCH). Hadoop data can also be queried from Oracle DB using Big Data SQL, where ODI transparently generates the necessary external tables to expose Hive tables in the Oracle DB for use in SQL queries.
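For illustration, a Big Data SQL external table over a Hive table looks roughly like the following sketch. The table name, column list, and Hive schema here are hypothetical and are not the exact objects ODI generates in the demo:

    -- Hypothetical Big Data SQL external table exposing a Hive table
    -- (column names and the Hive schema "default" are assumptions).
    CREATE TABLE movieapp_log_ext (
      custid   NUMBER,
      movieid  NUMBER,
      activity NUMBER,
      rating   NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY default_dir
      ACCESS PARAMETERS (
        com.oracle.bigdata.tablename: default.movieapp_log_avro
      )
    )
    REJECT LIMIT UNLIMITED;

Once such a table exists, the Hive data can be joined with warehouse tables in ordinary Oracle SQL queries without moving the data out of Hadoop.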

Let's begin the tutorial by reviewing how data is moved and transformed using Oracle Data Integration for Big Data.
Resetting the demo
You can reset the demo environment from a previously run demo or hands-on lab by executing the script:
/home/oracle/movie/moviework/odi/reset_ogg_odi.sh
Please note that this script will erase any changes to the ODI repository or target tables.

Part 1 - Ingesting Data Using Sqoop and Oracle GoldenGate
In this section, you will learn how to ingest data from external sources into Hadoop, using Sqoop for bulk load and Oracle GoldenGate for change data capture.
(Figure: Oracle Big Data Management System)
Ingest Bulk Data Using Sqoop in Oracle Data Integrator
Ingest Change Data using Oracle GoldenGate for Big Data
Merge Updates using Hive in Oracle Data Integrator
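The last item above, merging GoldenGate change records into a base Hive table, typically follows a pattern like the minimal HiveQL sketch below. Table and column names are illustrative assumptions, not the demo's exact schema, and the code ODI generates differs in detail:

    -- Sketch: rebuild a merged table from the base table plus the
    -- captured change rows (movie_updates is a hypothetical delta table).
    INSERT OVERWRITE TABLE movie_merged
    SELECT u.movie_id, u.title, u.year
    FROM (
      -- base rows that have no captured change
      SELECT b.movie_id, b.title, b.year
      FROM movie b
      LEFT OUTER JOIN movie_updates d ON b.movie_id = d.movie_id
      WHERE d.movie_id IS NULL
      UNION ALL
      -- changed or new rows delivered by GoldenGate
      SELECT d.movie_id, d.title, d.year
      FROM movie_updates d
    ) u;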

Part 2 - Transforming Data using Hive
In this section, you will use Oracle Data Integrator to transform the movie data previously loaded with Sqoop and GoldenGate. The use case is to join the table movie with customer activity event data that was previously loaded into an Avro-encoded HDFS file using Flume and is now exposed as the Hive table movieapp_log_avro. The activity data contains rating actions; we will calculate an average rating for every movie and store the result in the Hive table movie_rating. ODI will generate Hive SQL and execute it in the Hive server engine.
Please note that with ODI you can create logical mappings declaratively without considering any implementation details; those can be added later in the physical design.
(Figure: Oracle Big Data Management System)
Transform Hive Movie Data using ODI Mapping
Open and review the Hive mapping
Open ODI Studio. See Part 1 for information on how to start and log into ODI Studio.
Select the left-hand Designer navigator, open the Projects accordion and navigate to the mapping Big Data Demo > Demo > Mappings > C - Calc Ratings (Hive). Double-click to open the mapping.

The mapping logical view shows the transformations used for the mapping: Source table movie provides information about each movie, while source table movieapp_log_avro contains raw customer activities.
FILTER is used to keep only activity = 1 events, which are rating events.
AGGREGATE is used to group all ratings based on movieid and calculate an average of the movie ratings.
JOIN is used to join base movie information from table movie with aggregated events to write to the target table movie_rating.
Switch to the Physical View of the mapping. Since the transformation is within Hive, both source and target datastores are in the same execution unit.
Select the target MOVIE_RATING and review the Properties window, Integration Knowledge Module tab. The IKM Hive Control Append has been selected with default settings except for TRUNCATE=True, which allows repeated execution of the mapping.
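Conceptually, the Hive code that ODI generates for this mapping is equivalent to a single insert like the sketch below. The column names are simplified assumptions, and the actual generated code differs in detail (for example, it may use staging objects). Because TRUNCATE=True replaces the target's contents on each run, the sketch overwrites rather than appends:

    -- Sketch: average the rating events per movie and join back to movie.
    INSERT OVERWRITE TABLE movie_rating
    SELECT m.movie_id,
           m.title,
           r.avg_rating
    FROM movie m
    JOIN (
      SELECT a.movieid      AS movie_id,
             AVG(a.rating)  AS avg_rating
      FROM movieapp_log_avro a
      WHERE a.activity = 1            -- keep only rating events
      GROUP BY a.movieid
    ) r
      ON m.movie_id = r.movie_id;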

Execute the Hive mapping
Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

Click OK in the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will appear; press OK there as well.

Note: The execution can take several minutes depending on the environment. To view the generated code only, you can select the Simulation checkbox; in this case the generated session is displayed in a dialog window and no execution is shown in the Operator.
To review the execution, go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished yet, it will show the Run icon for the ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

Once the load is complete, the Operator will show all tasks of the session as successful. You can double-click the task 50 - Insert (new) rows to review the generated code in a task editor.

Go to the Designer navigator, open Models, and right-click HiveMovie.movie_rating. Select View Data from the menu to see the loaded rows.

A Data editor appears with all rows of the movie_rating table in Hive.

Part 3 - Loading Data to Oracle DB using Oracle Loader for Hadoop

In this task we load the results of the prior Hive transformation from the resulting Hive table into the Oracle DB data warehouse. We are using the Oracle Loader for Hadoop (OLH) bulk data loader, which uses load mechanisms specifically optimized for Oracle DB.
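For orientation, the target warehouse table that receives the movie ratings could be as simple as the sketch below. This is a hypothetical definition; the demo's actual MOVIE_RATING target table may have additional columns, data types, and constraints:

    -- Hypothetical Oracle target table for the Hive movie_rating results.
    CREATE TABLE movie_rating (
      movie_id   NUMBER,
      title      VARCHAR2(200),
      avg_rating NUMBER
    );

OLH itself runs as a MapReduce job on the Hadoop cluster, preparing and formatting the data there before loading it into the target table, which keeps most of the load overhead off the database.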
(Figure: Oracle Big Data Management System)
Load Movie data to Oracle DB using ODI Mapping and Oracle Loader for Hadoop
