Data Migration From RDBMS To Hadoop: Platform Migration Approach
Table of Contents
1 Project Description
1.1 Project Abstract
1.2 Competitive Information
1.3 Relationship to Other Applications/Projects
1.4 Assumptions and Dependencies
1.5 Future Enhancements
2 Technical Description
2.1 Project/Application Architecture
2.2 Project/Application Information flows
2.3 Interactions with other Projects
2.4 Interactions with other Applications
2.5 Capabilities
2.6 Risk Assessment and Management
3 Project Requirements
3.1 Identification of Requirements
3.2 Operations, Administration, Maintenance and Provisioning (OAM&P)
4 Project Design Description
5 Project Internal/external Interface Impacts and Specification
6 Functional overview
6.1 Impact
7 Open Issues
1 Project Description
Oracle, IBM, Microsoft and Teradata hold a large share of the world's stored data. Run a
query almost anywhere in the world and you are likely reading data from a database
supplied by one of them. Moving large volumes of data from one of these systems to
another, for example from Oracle to DB2, is a challenging task for a business. The
arrival of Hadoop and NoSQL technology was a seismic shift that shook the RDBMS market
and offered organizations an alternative. The database vendors moved quickly to position
themselves in Big Data, and the reverse happened as well. Nearly every vendor now has its
own big data technology, such as Oracle NoSQL and MongoDB. There is a large market for a
high-performance data migration tool that can copy data stored in RDBMS databases to
Hadoop or NoSQL databases. The current data resides in RDBMS databases such as Oracle,
SQL Server, MySQL and Teradata. We plan to migrate this RDBMS data to a big data
platform that supports NoSQL databases and holds a variety of data from the existing
systems. Migrating petabytes of data takes substantial resources and time, so time and
resources may constrain the migration process.
A traditional RDBMS is designed to handle relational data. Hadoop works well with
structured as well as unstructured data, and supports various serialization and data
formats, for example text, JSON, XML and Avro. There are problems for which SQL
databases are a perfect choice: if your data size permits it and your data is
relational, the RDBMS approach is fine. It has worked well in the past, it is a mature
technology, and it has its uses. Where the data size or type is such that you cannot
store it in an RDBMS, consider solutions like Hadoop. One such example is a product
catalog: a car has different attributes than a television, and it is impractical to
create a new table per product type. Another example is machine-generated data, where
the data volume puts heavy pressure on a traditional RDBMS; that is a classic Hadoop
problem. Document indexing is another. There are many such examples.
In CDH 3, all of the Hadoop API implementations were confined to a single JAR file
(hadoop-core) plus a few of its dependencies. It was relatively straightforward to make sure
that classes from these JAR files were available at runtime.
CDH 4 and CDH 5 are more complex: they bundle both MRv1 and MRv2 (YARN). To
simplify things, CDH 4 and CDH 5 provide a Maven-based way of managing client-side
Hadoop API dependencies that saves you from having to figure out the exact names and
locations of all the JAR files needed to provide Hadoop APIs.
In CDH 5, Cloudera recommends that you use a hadoop-client artifact for all clients,
instead of managing JAR-file-based dependencies manually.
The big area for Hadoop improvement is modularity, pluggability, and coexistence, on
both the storage and application execution tiers. For example:
Greenplum/MapR and Hadapt both think you should have HDFS file
management and relational DBMS coexisting on the same storage nodes. (I
agree.)
Part of what Hortonworks calls “Phase 2” sets out to ensure that Hadoop can
properly manage temp space and so on next to HDFS.
Perhaps HBase won’t always assume HDFS.
DataStax thinks you should blend HDFS and Cassandra.
Meanwhile, Pig and Hive need to come closer together. Often you want to stream data
into Hadoop. The argument that MPI trumps MapReduce does, in certain use cases,
make sense. Apache Hadoop “Phase 2” and beyond are charted to accommodate some
of those possibilities too.
Hadoop is an open-source software framework for storing data and running applications
on clusters of commodity hardware. It provides massive storage for any kind of data,
enormous processing power and the ability to handle virtually limitless concurrent tasks
or jobs.
Massive storage. The Hadoop framework breaks big data into blocks, which are
stored on clusters of commodity hardware.
Processing power. Hadoop concurrently processes large amounts of data using
multiple low-cost computers for fast results.
2 Technical Description
The application information flow uses a three-step strategy to move data from the RDBMS
to Hive on Hadoop.
Depending on whether direct access is available to the RDBMS source system, you
may opt for either a File Processing method (when no direct access is available) or
RDBMS Processing (when database client access is available).
Regardless of the ingest option, the processing workflow in this document requires:
1. One-time, initial load to move all data from source table to HIVE.
For this document, we assume that a file or set of files within a folder will have a delimited
format and will have been generated from a relational system (i.e. records have unique
keys or identifiers).
Files will need to be moved into HDFS using standard ingest options:
Once the initial set of records are moved into HDFS, subsequent scheduled events can
move files containing only new Inserts and Updates.
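The initial load followed by incremental inserts and updates implies a reconcile step: for
each unique key, keep the most recent version of the record. The following is a minimal
Python sketch of that idea, not the project's actual implementation. The field names `id`
and `modified` are assumptions made for illustration; the source only states that records
have unique keys.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def reconcile(base: Iterable[Record], incremental: Iterable[Record],
              key: str = "id", ts: str = "modified") -> List[Record]:
    """Merge base and incremental records, keeping the newest row per key."""
    latest: Dict[object, Record] = {}
    for row in list(base) + list(incremental):
        current = latest.get(row[key])
        # Keep the row with the greater timestamp; on a tie the later row wins
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical sample data: one updated record and one new record
base = [
    {"id": 1, "name": "alice", "modified": "2015-01-01"},
    {"id": 2, "name": "bob", "modified": "2015-01-01"},
]
delta = [
    {"id": 2, "name": "robert", "modified": "2015-01-02"},  # update
    {"id": 3, "name": "carol", "modified": "2015-01-02"},   # insert
]
merged = reconcile(base, delta)
```

In Hive the same logic would typically be expressed as a query over the base and
incremental tables; the in-memory version here only illustrates the rule being applied.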
SQOOP is the JDBC-based utility for integrating with traditional databases. A SQOOP
Import allows for the movement of data into either HDFS (a delimited format can be
defined as part of the Import definition) or directly into a Hive table.
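As a rough illustration of the two import targets, the sketch below assembles the argument
list for a `sqoop import` command in Python. The flags used are standard Sqoop 1 import
arguments, but the helper name `build_sqoop_import`, the JDBC URL, the user name and the
paths are all hypothetical and would need to match your environment.

```python
from typing import List, Optional


def build_sqoop_import(jdbc_url: str, username: str, password_file: str,
                       table: str,
                       target_dir: Optional[str] = None,
                       hive_table: Optional[str] = None,
                       delimiter: str = ",",
                       mappers: int = 4) -> List[str]:
    """Return the argument list for a `sqoop import` into HDFS or Hive."""
    if (target_dir is None) == (hive_table is None):
        raise ValueError("specify exactly one of target_dir or hive_table")
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--username", username,
            "--password-file", password_file,
            "--table", table,
            "--fields-terminated-by", delimiter,
            "--num-mappers", str(mappers)]
    if hive_table is not None:
        # Load straight into a Hive table
        args += ["--hive-import", "--hive-table", hive_table]
    else:
        # Write delimited files into an HDFS directory
        args += ["--target-dir", target_dir]
    return args


# Hypothetical connection details, for illustration only
cmd = build_sqoop_import("jdbc:mysql://dbhost/sales", "etl",
                         "/user/etl/.password", "orders",
                         hive_table="orders")
```

On a machine with Sqoop installed, the resulting list could be passed to
`subprocess.run(cmd, check=True)`.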
In the background, the source file is split into HDFS blocks, the size of which is
configurable (the default is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later).
For fault tolerance, each block is automatically replicated by HDFS. By default, three
copies of each block are written to three different DataNodes; this replication factor
is user-configurable.
The DataNodes are servers which are physical machines or virtual machines/cloud
instances. DataNodes form the Hadoop cluster into which you write your data and on
which you run your MapReduce/Hive/Pig/Impala/Mahout/etc. programs. The DataNodes
are the workers of the Hadoop cluster; the NameNodes are the masters.
When a file is to be written into HDFS, the client writing the file obtains from the
NameNode a list of DataNodes that can host replicas of the first block of the file. The
client arranges a pipeline through which all bytes of data from the first block of the
source file will be transmitted to all participating DataNodes. The pipeline is formed from
client to first DataNode to second DataNode to final (third in our case) DataNode. The
data is split into packets for transmission, and each packet is tracked until all DataNodes
return acks to indicate successful replication of the data. The packets are streamed to the
first DataNode in the pipeline, which stores the packet and forwards it to the second
DataNode, and so on. If one or more replications fail, the infrastructure automatically
constructs a new pipeline and retries the copy.
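The pipelined write can be pictured with a toy model. The Python sketch below is a
deliberate simplification and not the HDFS client protocol: node names, the `failed` set
and the function `write_block` are hypothetical, acknowledgements are implicit, and a
failure simply drops the bad node and restarts the block with a new pipeline.

```python
from typing import Dict, FrozenSet, List


def write_block(packets: List[bytes], candidates: List[str],
                replication: int = 3,
                failed: FrozenSet[str] = frozenset()) -> Dict[str, List[bytes]]:
    """Stream packets through a chain of DataNodes, retrying on failure."""
    remaining = list(candidates)
    while len(remaining) >= replication:
        pipeline = remaining[:replication]
        stored: Dict[str, List[bytes]] = {node: [] for node in pipeline}
        bad = None
        for packet in packets:
            # client -> first DataNode -> second DataNode -> third DataNode
            for node in pipeline:
                if node in failed:
                    bad = node
                    break
                stored[node].append(packet)  # store, then forward downstream
            if bad is not None:
                break
        if bad is None:
            return stored  # every node stored every packet
        remaining.remove(bad)  # build a new pipeline without the bad node
    raise IOError("not enough healthy DataNodes for the requested replication")


result = write_block([b"p1", b"p2"], ["dn1", "dn2", "dn3", "dn4"],
                     failed=frozenset({"dn2"}))
```

Here `dn2` fails on the first packet, so the second attempt uses `dn1`, `dn3` and `dn4`.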
When all three DataNodes confirm successful replication, the client will advance to the
next block, again request a list of host DataNodes from the NameNode, and construct a
new pipeline. This process is followed until all blocks have been copied into HDFS. The
final block written may be smaller than the configured block size, but all blocks from the
first to the penultimate block will be of the configured block size.
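The block arithmetic is easy to check by hand. The sketch below assumes a 128 MB block
size and a replication factor of three; both are configurable, and the function names are
illustrative only.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 128 * MB) -> List[int]:
    """Return the size in bytes of each HDFS block for a file."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # only the final block may be smaller
    return sizes


def raw_storage(file_size: int, replication: int = 3) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


# A 300 MB file yields two full 128 MB blocks and one 44 MB block
blocks = split_into_blocks(300 * MB)
```

With three replicas, the same 300 MB file occupies 900 MB of raw cluster storage.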
The tables in Hive are similar to tables in a relational database, and data units are
organized in a taxonomy from larger to more granular units. Databases are comprised of
tables, which are made up of partitions. Data can be accessed via a simple query language
and Hive supports overwriting or appending data.
Within a particular database, data in the tables is serialized and each table has a
corresponding Hadoop Distributed File System (HDFS) directory. Each table can be
subdivided into partitions that determine how data is distributed within subdirectories
of the table directory. Data within partitions can be further broken down into buckets.
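The directory layout can be sketched as follows. This assumes Hive's default warehouse
root `/user/hive/warehouse`, a non-default database (which gets a `.db` suffix), and the
usual `column=value` naming for partition directories. The bucket function is a
simplification that only holds for non-negative integer keys; Hive's real bucketing hashes
the column value first. Database, table and column names are hypothetical.

```python
from typing import Dict


def partition_dir(warehouse: str, database: str, table: str,
                  partition: Dict[str, str]) -> str:
    """Build the HDFS directory for one partition of a managed table."""
    parts = "/".join(f"{col}={val}" for col, val in partition.items())
    return f"{warehouse}/{database}.db/{table}/{parts}"


def bucket_for(key: int, num_buckets: int) -> int:
    """Pick a bucket for a non-negative integer key (simplified)."""
    return key % num_buckets


path = partition_dir("/user/hive/warehouse", "sales", "orders",
                     {"year": "2015", "month": "06"})
bucket = bucket_for(10, 4)
```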
Hive supports all the common primitive data types such as BIGINT, BINARY,
BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING,
TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to
form complex data types, such as structs, maps and arrays.
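For a migration, each source column type has to be mapped to one of these Hive types. The
table below is a hypothetical starting point, not a vetted mapping: precision and scale
are dropped, unknown types fall back to STRING, and real projects would need
vendor-specific rules. The `struct_type` helper shows the syntax Hive uses for a struct.

```python
from typing import Iterable, Tuple

# Hypothetical mapping from generic source types to Hive primitive types
HIVE_TYPE_FOR = {
    "INTEGER": "INT",
    "SMALLINT": "SMALLINT",
    "BIGINT": "BIGINT",
    "NUMERIC": "DECIMAL",
    "DECIMAL": "DECIMAL",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "DATE": "TIMESTAMP",
    "TIMESTAMP": "TIMESTAMP",
    "BLOB": "BINARY",
    "BOOLEAN": "BOOLEAN",
}


def to_hive_type(source_type: str) -> str:
    """Map a source column type such as 'VARCHAR(40)' to a Hive type."""
    base = source_type.split("(")[0].strip().upper()
    # Fall back to STRING so an unknown type is never silently dropped
    return HIVE_TYPE_FOR.get(base, "STRING")


def struct_type(fields: Iterable[Tuple[str, str]]) -> str:
    """Render a Hive STRUCT type from (name, type) pairs."""
    inner = ",".join(f"{name}:{hive_type}" for name, hive_type in fields)
    return f"STRUCT<{inner}>"


address = struct_type([("street", "STRING"), ("zip", "INT")])
```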
Ability to create and manage tables and partitions (create, drop and alter).
Ability to support various Relational, Arithmetic and Logical Operators.
Ability to do various joins between two tables.
Ability to evaluate functions like aggregations on multiple “group by” columns in a
table.
Ability to store the results of a query into another table.
Ability to download the contents of a table to a local directory.
Ability to create an external table that points to a specified location within HDFS
Ability to store the results of a query in an HDFS directory.
Ability to plug in custom scripts using the language of choice for custom map/reduce
jobs.
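One of these capabilities, an external table over files already in HDFS, is central to
the file-based ingest path. The sketch below renders the HiveQL for such a table from
Python. The helper name, table name, columns and location are hypothetical, and the
function does no escaping, so it is only suitable for trusted inputs.

```python
from typing import List, Tuple


def external_table_ddl(table: str, columns: List[Tuple[str, str]],
                       location: str, delimiter: str = ",") -> str:
    """Render HiveQL for an external table over delimited files in HDFS."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    "orders",
    [("id", "BIGINT"), ("customer", "STRING"), ("total", "DECIMAL")],
    "/data/orders",
)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves
the files in HDFS untouched.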
Hive aims to provide acceptable (but not optimal) latency for interactive data browsing,
queries over small data sets or test queries.
Hive Applications:
Log processing
Text mining
Document indexing
Customer-facing business intelligence (e.g., Google Analytics)
Predictive modeling, hypothesis testing
Data is the new currency of the modern world. Businesses that successfully maximize its
value will have a decisive impact on their own value and on their customers’ success. As
the de-facto platform for big data, Apache Hadoop allows businesses to create highly
scalable and cost-efficient data stores. Organizations can then run massively parallel and
high-performance analytical workloads on that data, unlocking new insight previously
hidden by technical or economic limitations. Hadoop offers data value at unprecedented
scale and efficiency -- in part thanks to Apache Tez and YARN.
Analytic applications perform data processing in purpose-driven ways that are unique to
specific business problems or vendor products. There are two prerequisites to creating
purpose-built applications for Hadoop data access. The first is an "operating system"
(somewhat akin to Windows or Linux) that can host, manage, and execute these
applications in a shared Hadoop environment. Apache YARN is that data operating
system for Hadoop. The second prerequisite is an application-building framework and a
common standard that developers can use to write data access applications that run on
YARN.
Apache Tez meets this second need. Tez is an embeddable and extensible framework that
enables easy integration with YARN and allows developers to write native YARN
applications that bridge the spectrum of interactive and batch workloads. Tez leverages
Hadoop's unparalleled ability to process petabyte-scale datasets, allowing projects in the
Apache Hadoop ecosystem to express fit-to-purpose data processing logic, yielding fast
response times and extreme throughput. Tez brings unprecedented speed and scalability
to Apache projects like Hive and Pig, as well as to a growing field of third-party software
applications designed for high-speed interaction with data stored in Hadoop.
Those familiar with MapReduce will wonder how Tez is different. Tez is a broader, more
powerful framework that maintains MapReduce’s strengths while overcoming some of its
limitations. Tez retains the following strengths from MapReduce:
Fault tolerance and recovery from inevitable and common failures in distributed systems
Secure data processing using built-in Hadoop security mechanisms
But Tez is not an engine by itself. Rather, Tez provides common primitives for building
applications and engines -- thus, its flexibility and customizability. Developers can write
MapReduce jobs using the Tez library, and Tez comes with a built-in implementation of
MapReduce, which can be used to run any existing MapReduce job with Tez efficiency.
MapReduce was (and is) ideal for Hadoop users that simply want to start using Hadoop
with minimal effort. Now that enterprise Hadoop is a viable, widely accepted platform,
organizations are investing to extract the maximum value from data stored in their
clusters. As a result, customized applications are replacing general-purpose engines such
as MapReduce, bringing about greater resource utilization and improved performance.
Tez models application logic as a dataflow graph. Once the logic has been defined via
this graph, Tez parallelizes it and executes it in Hadoop. If a data-processing
application can be modeled in this manner, it
can likely be built with Tez. Extract-Transform-Load (ETL) jobs are a common form of
Hadoop data processing, and any custom ETL application is a perfect fit for Tez. Other
good matches are query-processing engines like Apache Hive, scripting languages like
Apache Pig, and language-integrated, data processing APIs like Cascading for Java and
Scalding for Scala.
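To make the graph idea concrete, the sketch below models a small ETL job as a directed
acyclic graph and groups its steps into waves that could run in parallel. It uses Python's
standard `graphlib` module (Python 3.9 or later) and is only an illustration of the
concept; it is not the Tez API, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each key is a processing step; its value is the set of steps it depends on
dag: Dict[str, Set[str]] = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "aggregate": {"join"},
    "load": {"aggregate"},
}


def execution_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; steps in one wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dag)
```

The two extract steps land in the first wave, so a scheduler would be free to run them
concurrently before the join.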
2.5 Capabilities
As the amount of data continues to grow exponentially, data scientists increasingly need
the ability to perform full-fidelity analysis of that data at massive scale. Cloudera
recognized the importance of the Python language in modern data engineering and data
science and how, thanks to its use of more complex workflows, it has become a primary
language for data transformation and interactive analysis. Python development has been
confined to local data processing and smaller data sets, requiring data scientists to make
many compromises when attempting to work with big data. Using Ibis, a new open
source data analysis framework, Python users will finally be able to process
data at scale without compromising user experience or performance.
The initial version of Ibis provides an end-to-end Python experience with comprehensive
support for the built-in analytic capabilities in Impala for simplified ETL, data wrangling,
and analytics. Upcoming versions will allow users to leverage the full range of Python
packages as well as express efficient custom logic using Python. By integrating with
Impala, the leading MPP database engine for Hadoop, Ibis can achieve the interactive
performance and scalability necessary for big data.
“With its usability, extensibility and robust third-party library ecosystem, it’s easy to
understand why Python is the open source language of choice for so many data scientists.
However, we recognize its limitation – where it’s unable to achieve high performance at
Hadoop-scale,” said Wes McKinney. “With Ibis, our vision is to provide a first-class
Python experience on large scalable architectures like Hadoop, with full access to the
ecosystem of Python tools.”
XML processing
While many traditional databases provide XML support, the XML content must first be
loaded into the database. Because Hive tables can be linked to a collection of XML files
or document fragments stored in the Hadoop file system, Hadoop is much more flexible
in analyzing XML content.
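Path-based extraction from XML fragments can be illustrated outside Hive with Python's
standard library, which supports a subset of XPath. The document and element names below
are hypothetical, and this is a stand-in for the idea rather than Hive's own XPath
functions.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical XML fragment resembling a small product catalog
doc = """<catalog>
  <product type="car"><name>Sedan</name><price>20000</price></product>
  <product type="tv"><name>LED 40</name><price>400</price></product>
</catalog>"""


def extract(xml_text: str, path: str) -> List[Optional[str]]:
    """Return the text of every element matching a simple XPath expression."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(path)]


names = extract(doc, "./product/name")
tv_prices = extract(doc, "./product[@type='tv']/price")
```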
In addition to XPath operators, the Hive query language offers several ways to work with
common web and text data. Tableau exposes the following functions that you can use in
calculated fields:
On-the-fly ETL
Custom SQL gives you the flexibility of using arbitrary queries for your connection,
which allows complex join conditions, pre-filtering, pre-aggregation and more.
Traditional databases rely heavily on optimizers, but they can struggle with complex
Custom SQL and lead to unexpected performance degradation as you build views. The
batch-oriented nature of Hadoop allows it to handle layers of analytical queries on top of
complex Custom SQL with only incremental increases to query time.
Because Custom SQL is a natural fit for the complex layers of data transformations seen
in ETL, a Tableau connection to Hive based on Custom SQL is essentially on-the-fly
ETL. Refer to the Hadoop and Tableau Demo on the Tableau blog to see how you can
use Custom SQL to unpivot nested XML data directly in the Custom SQL connection,
yielding views built from the unpivoted data.
Initial SQL
Tableau supports initial SQL for Hadoop Hive connections, which allows you to define a
collection of SQL statements to perform immediately after the connection is established.
For example, you can set Hive and Hadoop configuration variables for a given
connection from Tableau to tune performance characteristics. Refer to the Designing for
Performance article for more information.
Similarly, Tableau allows you to take advantage of custom UDFs and UDAFs built by
the Hadoop community or by your own development team. Often these are built as JAR
files that Hadoop can easily copy across the cluster to support distributed computation.
To take advantage of JAR files or scripts, inform Hive of the location of these files and
Hive will take care of the rest.
3 Project Requirements
Hive looks very much like traditional database code with SQL access. However, because
Hive is based on Hadoop and MapReduce operations, there are several key differences.
The first is that Hadoop is intended for long sequential scans, and because Hive is based
on Hadoop, you can expect queries to have a very high latency (many minutes). This
means that Hive would not be appropriate for applications that need very fast response
times, as you would expect with a database such as DB2. Finally, Hive is read-based and
therefore not appropriate for transaction processing that typically involves a high
percentage of write operations.
Hive has three main functions: data summarization, query and analysis. It supports
queries expressed in a language called HiveQL, which automatically translates SQL-like
queries into MapReduce jobs executed on Hadoop. In addition, HiveQL allows custom
MapReduce scripts to be plugged into queries. Hive also enables data
serialization/deserialization and increases flexibility in schema design by including a
system catalog called the Hive Metastore.
Hive supports text files (also called flat files), SequenceFiles (flat files consisting of
binary key/value pairs) and RCFiles (Record Columnar Files, which store a table's
columns in a column-oriented layout).
Finally, one of the fraud prevention problems is latency. Agencies want to react to an
event as soon as possible, often within a few minutes of the event. Yahoo recently
reported that it can adjust its behavioral model in response to a user click event within
5-7 minutes across several hundred million customers and billions of events per
day. Cloudera has developed a tool, Flume, that can load billions of events into HDFS
within a few seconds for analysis with MapReduce.
Often fraud detection is akin to “finding a needle in a haystack”. One has to go through
mountains of relevant and seemingly irrelevant information, build dependency models,
evaluate the impact and thwart the fraudster actions. Hadoop helps with finding patterns
by processing mountains of information on thousands of cores in a relatively short
amount of time.
As fraud and security breaches are becoming more frequent and sophisticated, traditional
security solutions are not able to protect company assets. MapR enables organizations to
analyze unlimited amounts and types of data in real time, widen the scale and accelerate
the speed of threat analysis, and improve risk assessment by building sophisticated
machine learning models. Specific use cases include protection against infrastructure
risks as well as consumer-oriented risks across different industries:
Security Information and Event Management (SIEM): Analyze and correlate large
amounts of real-time data from network and security devices to manage external and
internal security threats, improve incident response time and compliance reporting.
Application Log Monitoring: Improve analysis of application log data to better
manage system resource utilization, security issues, and diagnose and preempt
production application problems.
Network Intrusion Detection: Monitor and analyze network traffic to detect, identify,
and report on suspicious activity or intruders.
Fraud Detection: Use pattern/anomaly recognition on larger volumes and variety of
data to detect and prevent fraudulent activities by external or internal parties.
MapR Advantages:
Easy data ingestion: Copying data to and from the MapR cluster is as simple as
copying data to a standard file system using Direct Access NFS™. Applications can
therefore ingest data into the Hadoop cluster in real time without any staging areas or
separate clusters just to ingest data.
Existing applications work: Due to the MapR platform's POSIX compliance, any non-
Java application works directly on MapR without undergoing code changes. Existing
toolsets, custom utilities and applications are good to go on day one.
Multi-tenancy: Support multiple user groups, any and all enterprise data sets, and
multiple applications in the same cluster. Data modelers, developers and analysts can
all work in unison on the same cluster without stepping on each other's toes.
Business continuity: MapR provides integrated high availability (HA), data protection,
and disaster recovery (DR) capabilities to protect against both hardware failure as well
as site-wide failure.
High scalability: Scalability is key to bringing all data together on one platform so the
analytics are much more nuanced and accurate. MapR is the only platform that scales
all the way to a trillion files without compromising performance.
High performance: The MapR Distribution for Hadoop was designed for high
performance, with respect to both high throughput and low latency. In addition, a
fraction of servers are required for running MapR versus other Hadoop distributions,
leading to architectural simplicity and lower capital and operational expenses.
4 Project Design Description
The data flow uses a three-step strategy to move data from the RDBMS to Hive on Hadoop.
Low Cost:
Hadoop is a free, open-source framework that uses commodity hardware to store
large amounts of data. Hadoop also offers a cost-effective storage solution for
businesses' exploding data sets. The problem with traditional relational database
management systems is that it is extremely cost-prohibitive to scale them far enough
to process such massive volumes of data. In an effort to reduce costs, many companies
in the past would have had to down-sample data and classify it based on certain
assumptions as to which data was the most valuable. The raw data would be deleted, as
it would be too cost-prohibitive to keep. While this approach may have worked in the
short term, it meant that when business priorities changed, the complete raw data set
was not available, as it was too expensive to store. Hadoop, on the other hand, is
designed as a scale-out architecture that can affordably store all of a company's data
for later use. The cost savings are striking: instead of costing thousands to tens of
thousands of pounds per terabyte, Hadoop offers computing and storage capabilities for
hundreds of pounds per terabyte.
Flexible:
Hadoop enables businesses to easily access new data sources and tap into different
types of data (both structured and unstructured) to generate value from that data.
This means businesses can use Hadoop to derive valuable business insights from data
sources such as social media and email conversations. In addition, Hadoop can be used
for a wide variety of purposes, such as log processing, recommendation systems, data
warehousing, market campaign analysis and fraud detection.
Computing power:
Hadoop is a distributed processing model, so it can process vast amounts of data.
The more computing nodes you use, the more processing power you have.
Fast:
Hadoop’s unique storage method is based on a distributed file system that basically
‘maps’ data wherever it is located on a cluster. The tools for data processing are often on
the same servers where the data is located, resulting in much faster data processing. If
you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently
process terabytes of data in just minutes, and petabytes in hours.
Storage Flexibility:
Unlike traditional relational databases, you don’t have to preprocess data before storing
it. And that includes unstructured data like text, images and videos. You can store as
much data as you want and decide how to use it later.
Inherent data protection and self-healing capabilities:
Data and application processing are protected against hardware failure. If a node goes
down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. And it automatically stores multiple copies of all data.