Data Migration From RDBMS To Hadoop: Platform Migration Approach

Table of Contents
1 Project Description
1.1 Project Abstract
1.2 Competitive Information
1.3 Relationship to Other Applications/Projects
1.4 Assumptions and Dependencies
1.5 Future Enhancements
2 Technical Description
2.1 Project/Application Architecture
2.2 Project/Application Information flows
2.3 Interactions with other Projects
2.4 Interactions with other Applications
2.5 Capabilities
2.6 Risk Assessment and Management
3 Project Requirements
3.1 Identification of Requirements
3.2 Operations, Administration, Maintenance and Provisioning (OAM&P)
4 Project Design Description
5 Project Internal/external Interface Impacts and Specification
6 Functional overview
6.1 Impact
7 Open Issues
1 Project Description
1.1 Project Abstract
Oracle, IBM, Microsoft and Teradata hold a large share of the world's data. If you run a
query almost anywhere in the world, it is likely that you are reading data from a
database owned by one of them. Moving large volumes of data from Oracle to DB2, or
between other systems, is a challenging task for a business. The arrival of Hadoop and
NoSQL technology was a seismic shift that shook the RDBMS market and offered
organizations an alternative. The database vendors moved quickly to position themselves
in Big Data, and Big Data vendors moved in the opposite direction. Nearly everyone now
has a big data technology of their own, such as Oracle NoSQL and MongoDB. There is a
huge market for a high-performance data migration tool that can copy the data stored in
RDBMS databases to Hadoop or NoSQL databases. Current data resides in RDBMS
databases such as Oracle, SQL Server, MySQL and Teradata. We plan to migrate this
RDBMS data to a big data platform that supports NoSQL databases and holds a variety of
data. Migrating petabytes of data from the existing system takes considerable resources
and time, so time and resources may be constraints on the current migration process.
1.2 Relationship to Other Applications/Projects

NoSQL databases such as HBase and Cassandra are commonly used with Hadoop. Using
these databases with Hadoop is merely a matter of configuration; no connecting program
is needed. There are a few other reasons for choosing NoSQL databases in place of SQL
databases. One is size: NoSQL databases provide great horizontal scalability, which
makes it easy to store petabytes of data. Traditional systems can also be scaled, but only
vertically. Another reason is the complexity of the data. The places where these
databases are used mostly handle highly unstructured data, such as sensor data and log
data, which is not easy to deal with using traditional systems.
This raises the question of why Sqoop exists and why SQL data cannot simply be used
directly on Hadoop.
Although Hadoop is very good at handling Big Data needs, it is not the solution to every
need. In particular, it is not suitable for real-time workloads. Suppose you are an online
transaction company with a very large dataset. You find that you can process this data
easily using Hadoop, but you cannot serve the real-time needs of your customers with
Hadoop. This is where Sqoop comes into the picture. It is an import/export tool that
moves data between a SQL database and Hadoop. You can move your data into your
Hadoop cluster, process it there, and then push the results back into your SQL database
using Sqoop to serve the real-time needs of your customers.
A traditional RDBMS is used to handle relational data. Hadoop works well with
structured as well as unstructured data, and supports various serialization and data
formats, for example text, JSON, XML and Avro. There are problems for which SQL
databases are a perfect choice. If your data size permits it and your data is relational,
the RDBMS approach is fine: it has worked well in the past, it is a mature technology
and it has its place. Where the data size or type is such that you cannot store it in an
RDBMS, go for solutions like Hadoop. One such example is a product catalog. A car has
different attributes than a television, and it is tough to create a new table per product
type. Another example is machine-generated data, where the data size puts heavy
pressure on a traditional RDBMS; that is a classic Hadoop problem. Document indexing
is another, and there are various other such examples.
1.3 Assumptions and Dependencies
In CDH 3, all of the Hadoop API implementations were confined to a single JAR file
(hadoop-core) plus a few of its dependencies. It was relatively straightforward to make sure
that classes from these JAR files were available at runtime.
CDH 4 and CDH 5 are more complex: they bundle both MRv1 and MRv2 (YARN). To
simplify things, CDH 4 and CDH 5 provide a Maven-based way of managing client-side
Hadoop API dependencies that saves you from having to figure out the exact names and
locations of all the JAR files needed to provide Hadoop APIs.
In CDH 5, Cloudera recommends that you use a hadoop-client artifact for all clients,
instead of managing JAR-file-based dependencies manually.
1.4 Future Enhancements
The big area for Hadoop improvement is modularity, pluggability, and coexistence, on
both the storage and application execution tiers. For example:
• Greenplum/MapR and Hadapt both think you should have HDFS file
management and relational DBMS coexisting on the same storage nodes. (I
agree.)
• Part of what Hortonworks calls “Phase 2” sets out to ensure that Hadoop can
properly manage temp space and so on next to HDFS.
• Perhaps HBase won’t always assume HDFS.
• DataStax thinks you should blend HDFS and Cassandra.
Meanwhile, Pig and Hive need to come closer together. Often you want to stream data
into Hadoop. The argument that MPI trumps MapReduce does, in certain use cases,
make sense. Apache Hadoop “Phase 2” and beyond are charted to accommodate some
of those possibilities too.
1.5 Definitions and Acronyms
ETL.. Extract, Transform and Load
SQL.. Structured Query Language
HDFS.. Hadoop Distributed File System
XML.. Extensible Markup Language
GUI.. Graphical User Interface
HSQLDB.. Hyper SQL Database
NFS.. Network File System
CLI.. Command Line Interface
API.. Application Programming Interface
JDBC.. Java Database Connectivity
ODBC.. Open Database Connectivity
SIEM.. Security Information and Event Management
VM.. Virtual Machine
Hadoop is an open-source software framework for storing data and running applications
on clusters of commodity hardware. It provides massive storage for any kind of data,
enormous processing power and the ability to handle virtually limitless concurrent tasks
or jobs.
• Open-source software. Open-source software is created and maintained by a
network of developers from around the globe. It's free to download, use and
contribute to, though more and more commercial versions of Hadoop are
becoming available.
• Framework. In this case, it means that everything you need to develop and run
software applications is provided: programs, connections, etc.
• Massive storage. The Hadoop framework breaks big data into blocks, which are
stored on clusters of commodity hardware.
• Processing power. Hadoop concurrently processes large amounts of data using
multiple low-cost computers for fast results.
2 Technical Description
Using Sqoop, we will import data from a relational database system into HDFS. The
input to the import process is a database table. Sqoop reads the table row by row into
HDFS. The output of this import process is a set of files containing a copy of the
imported table. The import process is performed in parallel, so the output is spread
across multiple files. These files may be delimited text files or binary files. A by-product
of the import process is a generated Java class that can encapsulate one row of the
imported table. This class is used during the import process by Sqoop itself. After
manipulating the imported records with Hive, we will have a result data set that can then
be exported back to the relational database. Sqoop's export process reads a set of
delimited text files from HDFS in parallel, parses them into records, and inserts them as
new rows in a target database table, for consumption by external applications or users.
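One way to picture the parallel import is as a division of the table's key range among several workers, each of which copies its own slice into its own output file. The sketch below is a simplified illustration, not Sqoop's actual code. It assumes an integer key column whose minimum and maximum values are already known, and the function name `split_key_range` is hypothetical.

```python
def split_key_range(min_key: int, max_key: int, num_workers: int) -> list[tuple[int, int]]:
    """Divide the inclusive key range [min_key, max_key] into contiguous slices.

    Assumption: the key is an integer column; each returned (low, high) pair
    would be read by one parallel worker.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if max_key < min_key:
        return []  # empty table: nothing to import

    total = max_key - min_key + 1
    workers = min(num_workers, total)      # never create empty slices
    base, extra = divmod(total, workers)   # the first `extra` slices get one more key

    ranges = []
    start = min_key
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    for low, high in split_key_range(1, 100, 4):
        print(f"worker reads rows with key between {low} and {high}")
```

Each slice maps naturally to a bounded query on the key column, which is why the import produces several output files rather than one.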
2.1 Project/Application Architecture

2.2 Project/Application Information flows
The application information flow follows a three-step strategy to move data from an
RDBMS to Hive on Hadoop.
Depending on whether direct access is available to the RDBMS source system, you
may opt for either a File Processing method (when no direct access is available) or
RDBMS Processing (when database client access is available).
Regardless of the ingest option, the processing workflow in this document requires:
1. One-time, initial load to move all data from the source table to Hive.
2. On-going, “Change Only” data loads from the source table to Hive.
Below, both File Processing and Database-direct (Sqoop) ingest will be discussed.
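The two load types can be illustrated with a small reconciliation routine: the initial load supplies the base rows, and each “Change Only” load supplies inserts and updates that replace older versions of the same record. This is a minimal in-memory sketch, not the Hive workflow itself. The column names `id` and `modified_at` and the function name `apply_changes` are assumptions made for the example.

```python
def apply_changes(base_rows, change_rows, key="id", modified="modified_at"):
    """Return the base table with inserts and updates applied.

    For each key, the row with the newest `modified` value wins.
    Assumption: every row is a dict carrying both the key and the
    modification marker.
    """
    latest = {row[key]: row for row in base_rows}
    for row in change_rows:
        current = latest.get(row[key])
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row  # new insert, or a newer version of the record
    return sorted(latest.values(), key=lambda r: r[key])


if __name__ == "__main__":
    initial_load = [
        {"id": 1, "name": "alice", "modified_at": 1},
        {"id": 2, "name": "bob", "modified_at": 1},
    ]
    change_only_load = [
        {"id": 2, "name": "robert", "modified_at": 2},  # update
        {"id": 3, "name": "carol", "modified_at": 2},   # insert
    ]
    for row in apply_changes(initial_load, change_only_load):
        print(row)
```

The sketch relies on the source records having unique keys, which is the same property the file-based ingest assumes.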
2.3 File Processing
Step 1: Converting RDBMS Data into Hadoop
For this document, we assume that a file or set of files within a folder will have a
delimited format and will have been generated from a relational system (i.e. records have
unique keys or identifiers).
Files will need to be moved into HDFS using standard ingest options:
• WebHDFS: Primarily used when integrating with applications, a Web URL
provides an Upload end-point into a designated HDFS folder.
• NFS: Appears as a standard network drive and allows end-users to use standard
Copy-Paste operations to move files from standard file systems into HDFS.
Once the initial set of records are moved into HDFS, subsequent scheduled events can
move files containing only new Inserts and Updates.
SQOOP is the JDBC-based utility for integrating with traditional databases. A SQOOP
Import allows for the movement of data into either HDFS (a delimited format can be
defined as part of the Import definition) or directly into a Hive table.
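The two import targets mentioned above (a delimited HDFS directory or a Hive table) can be sketched as a helper that assembles the Sqoop command line. This is a sketch under assumptions: the helper name `sqoop_import_args`, the JDBC URL, the paths and the table names are hypothetical, and the flags are those documented for Sqoop 1, so verify them against the installed version before use.

```python
def sqoop_import_args(jdbc_url, table, username, password_file,
                      target_dir=None, hive_table=None,
                      delimiter=",", num_mappers=4):
    """Assemble an argument list for a Sqoop import.

    Exactly one destination is used: a Hive table if hive_table is given,
    otherwise a delimited HDFS directory at target_dir.
    """
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--num-mappers", str(num_mappers),   # degree of parallelism
    ]
    if hive_table is not None:
        args += ["--hive-import", "--hive-table", hive_table]
    elif target_dir is not None:
        args += ["--target-dir", target_dir,
                 "--fields-terminated-by", delimiter]
    else:
        raise ValueError("either target_dir or hive_table is required")
    return args


if __name__ == "__main__":
    # Placeholder connection details for illustration only.
    command = sqoop_import_args(
        "jdbc:mysql://db.example.com/sales", "orders", "etl",
        "/user/etl/.db-password", target_dir="/data/raw/orders")
    print(" ".join(command))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting.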
Step 2: Store file into Hadoop cluster
In the background, the source file is split into HDFS blocks, the size of which is
configurable (64 MB by default in older Hadoop releases, 128 MB in Hadoop 2 and later).
For fault tolerance, each block is automatically replicated by HDFS. By default, three
copies of each block are written to three different DataNodes; the replication factor is
user-configurable.
The DataNodes are servers, which may be physical machines or virtual machines/cloud
instances. DataNodes form the Hadoop cluster into which you write your data and on
which you run your MapReduce/Hive/Pig/Impala/Mahout/etc. programs. The DataNodes
are the workers of the Hadoop cluster; the NameNodes are the masters.
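The block arithmetic above can be worked through with a short calculation. The sketch below assumes a 128 MB block size and a replication factor of three as defaults; both are configurable in a real cluster, and the function name `hdfs_block_layout` is hypothetical.

```python
MB = 1024 * 1024


def hdfs_block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Describe how a file of the given size would be divided into blocks.

    Every block except possibly the last has the configured block size.
    raw_bytes is the total space consumed once each block is replicated.
    """
    if block_size_bytes <= 0 or replication < 1 or file_size_bytes < 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    block_sizes = [block_size_bytes] * full_blocks
    if remainder:
        block_sizes.append(remainder)  # the final, smaller block
    return {
        "block_sizes": block_sizes,
        "num_blocks": len(block_sizes),
        "raw_bytes": file_size_bytes * replication,
    }


if __name__ == "__main__":
    layout = hdfs_block_layout(300 * MB)
    print(layout["num_blocks"], "blocks;",
          layout["raw_bytes"] // MB, "MB of raw storage across the cluster")
```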
When a file is to be written into HDFS, the client writing the file obtains from the
NameNode a list of DataNodes that can host replicas of the first block of the file. The
client arranges a pipeline through which all bytes of data from the first block of the
source file will be transmitted to all participating DataNodes. The pipeline is formed from
client to first DataNode to second DataNode to final (third in our case) DataNode. The
data is split into packets for transmission, and each packet is tracked until all DataNodes
return acks to indicate successful replication of the data. The packets are streamed to the
first DataNode in the pipeline, which stores the packet and forwards it to the second
DataNode, and so on. If one or more replications fail, the infrastructure automatically
constructs a new pipeline and retries the copy.
When all three DataNodes confirm successful replication, the client will advance to the
next block, again request a list of host DataNodes from the NameNode, and construct a
new pipeline. This process is followed until all blocks have been copied into HDFS. The
final block written may be smaller than the configured block size, but all blocks from the
first to the penultimate block will be of the configured block size.
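The write pipeline can be simulated in a few lines to show how packets travel from the client through each DataNode in turn. This is a toy model under stated assumptions: the `DataNode` class and `write_block` function are hypothetical, the packet size is tiny for readability, acknowledgements are returned synchronously, and the failure handling and pipeline reconstruction described above are left out.

```python
class DataNode:
    """Toy DataNode that stores packets and forwards them downstream."""

    def __init__(self, name):
        self.name = name
        self.packets = []
        self.downstream = None

    def receive(self, packet):
        # Store locally, forward to the next node, and collect acknowledgements.
        self.packets.append(packet)
        acks = [self.name]
        if self.downstream is not None:
            acks += self.downstream.receive(packet)
        return acks


def write_block(block: bytes, nodes, packet_size=4):
    """Stream one block through a pipeline of nodes, packet by packet.

    Returns True when every node holds a complete copy of the block.
    """
    if not nodes:
        raise ValueError("at least one DataNode is required")
    for upstream, downstream in zip(nodes, nodes[1:]):
        upstream.downstream = downstream  # client -> first -> second -> third
    expected = [node.name for node in nodes]
    for offset in range(0, len(block), packet_size):
        packet = block[offset:offset + packet_size]
        if nodes[0].receive(packet) != expected:
            raise IOError("a DataNode failed to acknowledge the packet")
    return all(b"".join(node.packets) == block for node in nodes)


if __name__ == "__main__":
    pipeline = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    print("replicated:", write_block(b"0123456789", pipeline))
```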
Step 3: Read Data from HIVE
The tables in Hive are similar to tables in a relational database, and data units are
organized in a taxonomy from larger to more granular units. Databases are comprised of
tables, which are made up of partitions. Data can be accessed via a simple query language
and Hive supports overwriting or appending data.
Within a particular database, data in the tables is serialized and each table has a
corresponding Hadoop Distributed File System (HDFS) directory. Each table can be
subdivided into partitions that determine how data is distributed within subdirectories of
the table directory. Data within partitions can be further broken down into buckets.
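The mapping from tables, partitions and buckets to HDFS directories can be sketched as follows. The layout shown (a warehouse root, a `<database>.db` directory, and `column=value` partition subdirectories) reflects the common Hive convention as understood here, so treat the exact paths as assumptions. The bucket function uses CRC32 purely as a stand-in hash; Hive uses its own hashing, and both function names are hypothetical.

```python
import zlib


def partition_dir(warehouse, database, table, partition):
    """Build the directory that would hold one partition of a table.

    `partition` maps partition column names to values, in nesting order.
    """
    parts = "/".join(f"{column}={value}" for column, value in partition.items())
    return f"{warehouse}/{database}.db/{table}/{parts}"


def bucket_index(value, num_buckets):
    """Assign a value to a bucket by hashing it (stand-in hash, not Hive's)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


if __name__ == "__main__":
    path = partition_dir("/user/hive/warehouse", "sales", "orders",
                         {"year": 2015, "month": 6})
    print(path)
    print("bucket for customer 42:", bucket_index(42, 8))
```

Because partition values are encoded in the path, a query that filters on the partition columns only needs to read the matching subdirectories.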
Hive supports all the common primitive data types such as BIGINT, BINARY,
BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING,
TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to
form complex data types, such as structs, maps and arrays.
2.4 Interactions with other Applications

Hive is a data warehousing infrastructure built on top of Apache Hadoop.
Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and
processing (using the MapReduce programming paradigm) on commodity hardware.
Hive enables easy data summarization, ad-hoc querying and
analysis of large volumes of data. It is best used for batch jobs over
large sets of immutable data (like web logs).
It provides a simple query language called Hive QL, which is based on SQL and which
enables users familiar with SQL to easily perform ad-hoc querying, summarization and
data analysis.
At the same time, Hive QL also allows traditional MapReduce programmers to plug in
their custom mappers and reducers to do more sophisticated analysis that may not be
supported by the built-in capabilities of the language.
Hive Query Language capabilities:
Hive query language provides the basic SQL like operations. These operations work on
tables or partitions.
• Ability to create and manage tables and partitions (create, drop and alter).
• Ability to support various Relational, Arithmetic and Logical Operators.
• Ability to do various joins between two tables.
• Ability to evaluate functions like aggregations on multiple “group by” columns in a
table.
• Ability to store the results of a query into another table.
• Ability to download the contents of a table to a local directory.
• Ability to create an external table that points to a specified location within HDFS.
• Ability to store the results of a query in an HDFS directory.
• Ability to plug in custom scripts using the language of choice for custom map/reduce
jobs.
Major Components of Hive and its interaction with Hadoop:

Hive provides external interfaces like command line (CLI) and web UI, and application
programming interfaces (API) like JDBC and ODBC. The Hive Thrift Server exposes a
very simple client API to execute HiveQL statements. Thrift is a framework for cross-
language services, where a server written in one language (like Java) can also support
clients in other languages.
The Metastore is the system catalog. All other components of Hive interact with the
Metastore.
The Driver manages the life cycle of a HiveQL statement during compilation, optimization
and execution.
The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler
translates this statement into a plan which consists of a DAG of map/reduce jobs.
The driver submits the individual map/reduce jobs from the DAG to the Execution
Engine in a topological order. Hive currently uses Hadoop as its execution engine.
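The idea of submitting jobs from a DAG in topological order can be shown with Python's standard `graphlib` module (Python 3.9 or later). This illustrates only the ordering principle, not Hive's driver. The job names and the function `submission_order` are hypothetical.

```python
from graphlib import TopologicalSorter


def submission_order(dependencies):
    """Return job names in an order that respects their dependencies.

    `dependencies` maps each job to the set of jobs that must finish first.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical plan for a query that joins two tables and then aggregates.
    plan = {
        "join": {"scan_orders", "scan_customers"},
        "aggregate": {"join"},
    }
    for job in submission_order(plan):
        print("submit", job)
```

Any order produced this way guarantees that a job is submitted only after the jobs it reads from, which is the property the execution engine relies on.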
What Hive is NOT
Hive is not designed for online transaction processing and does not offer real-time queries
or row-level updates.
Hive aims to provide acceptable (but not optimal) latency for interactive data browsing,
queries over small data sets or test queries.
Hive Applications:
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g., Google Analytics)
• Predictive modeling, hypothesis testing
Data is the new currency of the modern world. Businesses that successfully maximize its
value will have a decisive impact on their own value and on their customers’ success. As
the de-facto platform for big data, Apache Hadoop allows businesses to create highly
scalable and cost-efficient data stores. Organizations can then run massively parallel and
high-performance analytical workloads on that data, unlocking new insight previously
hidden by technical or economic limitations. Hadoop offers data value at unprecedented
scale and efficiency -- in part thanks to Apache Tez and YARN.
Analytic applications perform data processing in purpose-driven ways that are unique to
specific business problems or vendor products. There are two prerequisites to creating
purpose-built applications for Hadoop data access. The first is an "operating system"
(somewhat akin to Windows or Linux) that can host, manage, and execute these
applications in a shared Hadoop environment. Apache YARN is that data operating
system for Hadoop. The second prerequisite is an application-building framework and a
common standard that developers can use to write data access applications that run on
YARN.
Apache Tez meets this second need. Tez is an embeddable and extensible framework that
enables easy integration with YARN and allows developers to write native YARN
applications that bridge the spectrum of interactive and batch workloads. Tez leverages
Hadoop's unparalleled ability to process petabyte-scale datasets, allowing projects in the
Apache Hadoop ecosystem to express fit-to-purpose data processing logic, yielding fast
response times and extreme throughput. Tez brings unprecedented speed and scalability
to Apache projects like Hive and Pig, as well as to a growing field of third-party software
applications designed for high-speed interaction with data stored in Hadoop.
Hadoop in a post-MapReduce world
Those familiar with MapReduce will wonder how Tez is different. Tez is a broader, more
powerful framework that maintains MapReduce’s strengths while overcoming some of its
limitations. Tez retains the following strengths from MapReduce:
• Horizontal scalability with increasing data size and compute capacity
• Resource elasticity to work both when capacity is abundant and when it’s limited
• Fault tolerance and recovery from inevitable and common failures in distributed systems
• Secure data processing using built-in Hadoop security mechanisms
But Tez is not an engine by itself. Rather, Tez provides common primitives for building
applications and engines -- thus, its flexibility and customizability. Developers can write
MapReduce jobs using the Tez library, and Tez comes with a built-in implementation of
MapReduce, which can be used to run any existing MapReduce job with Tez efficiency.
MapReduce was (and is) ideal for Hadoop users that simply want to start using Hadoop
with minimal effort. Now that enterprise Hadoop is a viable, widely accepted platform,
organizations are investing to extract the maximum value from data stored in their
clusters. As a result, customized applications are replacing general-purpose engines such
as MapReduce, bringing about greater resource utilization and improved performance.
The Tez design philosophy

Apache Tez is optimized for such customized data-processing applications running in
Hadoop. It models data processing as a data flow graph, so projects in the Apache
Hadoop ecosystem can meet requirements for human-interactive response times and
extreme throughput at petabyte scale. Each node in the data flow graph represents a bit of
business logic that transforms or analyzes data. The connections between nodes represent
movement of data between different transformations.
Once the application logic has been defined via this graph, Tez parallelizes the logic and
executes it in Hadoop. If a data-processing application can be modeled in this manner, it
can likely be built with Tez. Extract-Transform-Load (ETL) jobs are a common form of
Hadoop data processing, and any custom ETL application is a perfect fit for Tez. Other
good matches are query-processing engines like Apache Hive, scripting languages like
Apache Pig, and language-integrated, data processing APIs like Cascading for Java and
Scalding for Scala.
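The data flow model described above can be mimicked in miniature: each vertex is a function that transforms data, and each edge passes one vertex's output to the next. The sketch below is a Python analogy only and does not use the Tez API, which is Java-based. The function `run_dataflow` and the vertex names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dataflow(vertices, edges, source_data):
    """Execute a small data flow graph.

    vertices: maps a vertex name to a function taking one argument per input edge.
    edges:    list of (producer, consumer) pairs.
    Vertices without incoming edges receive source_data.
    """
    predecessors = {name: [] for name in vertices}
    for producer, consumer in edges:
        predecessors[consumer].append(producer)

    results = {}
    graph = {name: set(preds) for name, preds in predecessors.items()}
    for name in TopologicalSorter(graph).static_order():
        inputs = [results[p] for p in predecessors[name]] or [source_data]
        results[name] = vertices[name](*inputs)
    return results


if __name__ == "__main__":
    # A three-stage ETL flow: parse lines, keep large values, count them.
    etl = {
        "extract": lambda lines: [line.split(",") for line in lines],
        "transform": lambda rows: [r for r in rows if int(r[1]) > 10],
        "load": lambda rows: len(rows),
    }
    flow = [("extract", "transform"), ("transform", "load")]
    print(run_dataflow(etl, flow, ["a,5", "b,20", "c,30"])["load"])
```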
2.5 Capabilities
As the amount of data continues to grow exponentially, data scientists increasingly need
the ability to perform full-fidelity analysis of that data at massive scale. Cloudera
recognized the importance of the Python language in modern data engineering and data
science and how, thanks to its use of more complex workflows, it has become a primary
language for data transformation and interactive analysis. Python development has been
confined to local data processing and smaller data sets, requiring data scientists to make
many compromises when attempting to work with big data. Using Ibis, a new open
source data analysis framework, Python users will finally be able to process
data at scale without compromising user experience or performance.
The initial version of Ibis provides an end-to-end Python experience with comprehensive
support for the built-in analytic capabilities in Impala for simplified ETL, data wrangling,
and analytics. Upcoming versions will allow users to leverage the full range of Python
packages as well as express efficient custom logic using Python. By integrating with
Impala, the leading MPP database engine for Hadoop, Ibis can achieve the interactive
performance and scalability necessary for big data.
“With its usability, extensibility and robust third-party library ecosystem, it’s easy to
understand why Python is the open source language of choice for so many data scientists.
However, we recognize its limitation, where it’s unable to achieve high performance at
Hadoop-scale,” said Wes McKinney. “With Ibis, our vision is to provide a first-class
Python experience on large scalable architectures like Hadoop, with full access to the
ecosystem of Python tools.”
XML processing
While many traditional databases provide XML support, the XML content must first be
loaded into the database. Because Hive tables can be linked to a collection of XML files
or document fragments stored in the Hadoop file system, Hadoop is much more flexible
in analyzing XML content.
Web and text processing
In addition to XPath operators, the Hive query language offers several ways to work with
common web and text data. Tableau exposes the following functions that you can use in
calculated fields:
• JSON Objects: GET_JSON_OBJECT retrieves data elements from strings containing
JSON objects.
• URLs: Tableau offers PARSE_URL to extract the components of a URL such as
the protocol type or the host name. Additionally, PARSE_URL_QUERY can
retrieve the value associated with a given query key in a key/value parameter list.
• Text Data: The regular expression find and replace functions in Hive are available
in Tableau for complex text processing.
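To show what kind of values these functions extract, the sketch below reproduces the same operations with Python's standard library. It is an analogy, not the Hive or Tableau implementation. The helper names `get_json_field` and `url_parts` are hypothetical, and the sample JSON and URL are made up for the example.

```python
import json
from urllib.parse import parse_qs, urlparse


def get_json_field(json_text, *path):
    """Follow a sequence of keys into a JSON document and return the value found."""
    value = json.loads(json_text)
    for key in path:
        value = value[key]
    return value


def url_parts(url):
    """Split a URL into its protocol, host and query parameters."""
    parsed = urlparse(url)
    return {
        "protocol": parsed.scheme,
        "host": parsed.hostname,
        # keep the first value for each query key
        "query": {k: v[0] for k, v in parse_qs(parsed.query).items()},
    }


if __name__ == "__main__":
    print(get_json_field('{"user": {"id": 7, "name": "ann"}}', "user", "name"))
    print(url_parts("https://example.com/search?q=hadoop&page=2"))
```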
On-the-fly ETL
Custom SQL gives you the flexibility of using arbitrary queries for your connection,
which allows complex join conditions, pre-filtering, pre-aggregation and more.
Traditional databases rely heavily on optimizers, but they can struggle with complex
Custom SQL and lead to unexpected performance degradation as you build views. The
batch-oriented nature of Hadoop allows it to handle layers of analytical queries on top of
complex Custom SQL with only incremental increases to query time.
Because Custom SQL is a natural fit for the complex layers of data transformations seen
in ETL, a Tableau connection to Hive based on Custom SQL is essentially on-the-fly
ETL. Refer to the Hadoop and Tableau Demo on the Tableau blog to see how you can
use Custom SQL to unpivot nested XML data directly in the Custom SQL connection,
yielding views built from the unpivoted data.
Initial SQL
Tableau supports initial SQL for Hadoop Hive connections, which allows you to define a
collection of SQL statements to perform immediately after the connection is established.
For example, you can set Hive and Hadoop configuration variables for a given
connection from Tableau to tune performance characteristics. Refer to the Designing for
Performance article for more information.
Custom analysis with UDFs and Map/Reduce
Similarly, Tableau allows you to take advantage of custom UDFs and UDAFs built by
the Hadoop community or by your own development team. Often these are built as JAR
files that Hadoop can easily copy across the cluster to support distributed computation.
To take advantage of JAR files or scripts, inform Hive of the location of these files and
Hive will take care of the rest.
2.6 Risk Assessment and Management

The following are the disadvantages of Hadoop:
• Big data platforms such as Hadoop are not well suited to small businesses.
• Encryption is missing at the storage and network levels.
• Hadoop has a number of stability issues.
3 Project Requirements
3.1 Identification of Requirements

Hive, originally created at Facebook, allows SQL developers to write Hive Query
Language (HQL) statements that are similar to standard SQL statements. HQL is limited
in the commands it understands, but it is still quite useful. HQL statements are broken
down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
Hive looks very much like traditional database code with SQL access. However, because
Hive is based on Hadoop and MapReduce operations, there are several key differences.
The first is that Hadoop is intended for long sequential scans, and because Hive is based
on Hadoop, you can expect queries to have a very high latency (many minutes). This
means that Hive would not be appropriate for applications that need very fast response
times, as you would expect with a database such as DB2. Finally, Hive is read-based and
therefore not appropriate for transaction processing that typically involves a high
percentage of write operations.
Hive has three main functions: data summarization, query and analysis. It supports
queries expressed in a language called HiveQL, which automatically translates SQL-like
queries into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom
MapReduce scripts to be plugged into queries. Hive also enables data
serialization/deserialization and increases flexibility in schema design by including a
system catalog called the Hive Metastore.
Hive supports text files (also called flat files), SequenceFiles (flat files consisting of
binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of
a table in a columnar layout).
3.2 Security and Fraud Prevention

Since fraud is so hard to prove in court, most organizations and individuals try to
prevent fraud from happening through blanket measures. These include limiting the
amount of damage a fraudster can inflict on the organization as well as early detection of
fraud patterns. For example, credit card companies can cut credit card limits across the
board in anticipation of a few fraud cases. Advertisers can block advertising campaigns
with a low number of qualifying events. And anti-terrorism agencies can prevent people
with bottles of pure water from boarding planes. These actions often run counter to a
company's efforts to attract more customers and result in general dissatisfaction. To the
rescue come technologies like Hadoop, Influence Diagrams and Bayesian Networks,
which are computationally expensive (NP-hard in computer science terminology) but
more accurate and predictive.
Hadoop is known for its raw crunching power. Nothing can compare with the throughput
of thousands of machines, each of which has multiple cores. As reported at the Hadoop
Summit 2010, the largest installations of Hadoop have 2,000 to 4,000 computers with 8
to 12 cores each, amounting to up to 48,000 active threads looking for a pattern at the
same time. This allows either (a) looking through longer periods of time to incorporate
events across a larger time frame or (b) taking more sources of information into account.
It is quite common among social network companies to comb through Twitter and blogs
in search of relevant data.
Finally, one of the fraud prevention problems is latency. Agencies want to react to an
event as soon as possible, often within a few minutes of the event. Yahoo recently
reported that it can adjust its behavioral model in response to a user click event within
5-7 minutes across several hundred million customers and billions of events per
day. Cloudera has developed a tool, Flume, that can load billions of events into HDFS
within a few seconds so that they can be analyzed using MapReduce.
Often fraud detection is akin to “finding a needle in a haystack”. One has to go through
mountains of relevant and seemingly irrelevant information, build dependency models,
evaluate the impact and thwart the fraudster actions. Hadoop helps with finding patterns
by processing mountains of information on thousands of cores in a relatively short
amount of time.
As fraud and security breaches are becoming more frequent and sophisticated, traditional
security solutions are not able to protect company assets. MapR enables organizations to
analyze unlimited amounts and types of data in real time, widen the scale and accelerate
the speed of threat analysis, and improve risk assessment by building sophisticated
machine learning models. Specific use cases include protection against infrastructure
risks as well as consumer-oriented risks across different industries:
• Security Information and Event Management (SIEM): Analyze and correlate large
amounts of real-time data from network and security devices to manage external and
internal security threats, improve incident response time and compliance reporting.
• Application Log Monitoring: Improve analysis of application log data to better
manage system resource utilization, security issues, and diagnose and preempt
production application problems.
• Network Intrusion Detection: Monitor and analyze network traffic to detect, identify,
and report on suspicious activity or intruders.
• Fraud Detection: Use pattern/anomaly recognition on larger volumes and variety of
data to detect and prevent fraudulent activities by external or internal parties.
• Risk Modeling: Improve risk assessment and associated scoring by building
sophisticated machine learning models on Hadoop that can take into account
hundreds or even thousands of indicators.
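
The pattern/anomaly recognition mentioned in the use cases above can be illustrated with a minimal sketch. It is written in plain Python in a map/reduce style and does not use Hadoop itself. The event layout (account, amount), the z-score rule and the threshold of 2.0 are assumptions chosen for illustration only; they are not taken from this project or from any MapR product.

```python
from collections import defaultdict
from statistics import mean, pstdev


def map_phase(events):
    """Map step: emit (account, amount) pairs, one per transaction event."""
    for account, amount in events:
        yield account, amount


def reduce_phase(pairs, threshold=2.0):
    """Reduce step: group amounts by account and flag amounts whose
    z-score (distance from the account mean in standard deviations)
    exceeds the threshold. The threshold is a hypothetical choice."""
    grouped = defaultdict(list)
    for account, amount in pairs:
        grouped[account].append(amount)

    flagged = []
    for account, amounts in grouped.items():
        mu = mean(amounts)
        sigma = pstdev(amounts)
        if sigma == 0:
            continue  # all amounts identical, nothing stands out
        for amount in amounts:
            if abs(amount - mu) / sigma > threshold:
                flagged.append((account, amount))
    return flagged


# Hypothetical sample data: one unusually large payment on acct-1.
events = [
    ("acct-1", 20.0), ("acct-1", 22.0), ("acct-1", 19.0),
    ("acct-1", 21.0), ("acct-1", 20.0), ("acct-1", 500.0),
    ("acct-2", 50.0), ("acct-2", 52.0),
]
flagged = reduce_phase(map_phase(events))
print(flagged)
```

On a real cluster the map and reduce steps would run in parallel across many nodes, and a production model would use far more indicators than a single amount per account.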
MapR Advantages:
• Easy data ingestion: Copying data to and from the MapR cluster is as simple as
copying data to a standard file system using Direct Access NFS™. Applications can
therefore ingest data into the Hadoop cluster in real time without any staging areas or
separate clusters just to ingest data.
• Existing applications work: Due to the MapR platform's POSIX compliance, any non-
Java application works directly on MapR without undergoing code changes. Existing
toolsets, custom utilities and applications are good to go on day one.
• Multi-tenancy: Support multiple user groups, any and all enterprise data sets, and
multiple applications in the same cluster. Data modelers, developers and analysts can
all work in unison on the same cluster without stepping on each other's toes.
• Business continuity: MapR provides integrated high availability (HA), data protection,
and disaster recovery (DR) capabilities to protect against both hardware failure as well
as site-wide failure.
• High scalability: Scalability is key to bringing all data together on one platform so the
analytics are much more nuanced and accurate. MapR is the only platform that scales
all the way to a trillion files without compromising performance.
• High performance: The MapR Distribution for Hadoop was designed for high
performance, with respect to both high throughput and low latency. In addition, a
fraction of servers are required for running MapR versus other Hadoop distributions,
leading to architectural simplicity and lower capital and operational expenses.
4 Project Description
The data flow follows a three-step strategy to move data from the RDBMS to Hive on Hadoop.
Step 1: Convert all the data into files by using Sqoop
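
As a rough illustration of Step 1, the sketch below assembles a `sqoop import` command line in Python. The JDBC URL, user name, password file, table name and target directory are hypothetical placeholders, not values from this project. Note also that Sqoop normally writes its output files under `--target-dir` in HDFS, so the exact hand-off to the next step depends on how the cluster is set up.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, mappers=4):
    """Return a `sqoop import` command as an argument list.

    All argument values are supplied by the caller; the ones used in the
    example below are hypothetical placeholders.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,               # JDBC URL of the source RDBMS
        "--username", username,
        "--password-file", password_file,    # file holding the DB password
        "--table", table,                    # source table to export
        "--target-dir", target_dir,          # output directory for the files
        "--fields-terminated-by", ",",       # write comma-delimited text
        "--num-mappers", str(mappers),       # number of parallel map tasks
    ]


cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://dbhost.example.com/sales",   # hypothetical
    username="etl_user",                                # hypothetical
    password_file="/user/etl_user/.db_password",        # hypothetical
    table="customers",                                  # hypothetical
    target_dir="/user/etl_user/staging/customers",      # hypothetical
)
print(shlex.join(cmd))
# On a machine with Sqoop installed, the command could be executed with:
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when it is later passed to `subprocess.run`.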
Step 2: Store the file in the Hadoop cluster
Connect to the server.

Move the file into HDFS by using the command below
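
The original command is not reproduced in this text. As a hedged sketch of what such a step commonly looks like, the Python snippet below builds the two `hdfs dfs` commands typically used to create a target directory and copy a local file into it. The local path and the HDFS directory are hypothetical placeholders.

```python
import shlex


def build_hdfs_put(local_path, hdfs_dir):
    """Return the commands that create an HDFS directory and upload a file.

    `hdfs dfs -mkdir -p` creates the directory (and parents) if missing;
    `hdfs dfs -put` copies the local file into that directory.
    """
    mkdir_cmd = ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]
    put_cmd = ["hdfs", "dfs", "-put", local_path, hdfs_dir]
    return [mkdir_cmd, put_cmd]


commands = build_hdfs_put(
    "/tmp/export/customers.csv",        # hypothetical local file
    "/user/hive/staging/customers",     # hypothetical HDFS directory
)
for command in commands:
    print(shlex.join(command))
# On the Hadoop server each command could be executed with:
#   subprocess.run(command, check=True)
```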
Step 3: Read data from Hive

Connect to Hive as the root user.
Create table command:
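
The original statement is not reproduced in this text. The sketch below shows one plausible shape for it: a small Python helper that generates a HiveQL `CREATE EXTERNAL TABLE` statement over comma-delimited files already stored in HDFS. The table name, column list and HDFS location are hypothetical and must be adapted to the actual source table.

```python
def build_create_table(table, columns, location, delimiter=","):
    """Return a HiveQL statement that defines an external table over
    delimited text files stored at `location` in HDFS."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


ddl = build_create_table(
    "customers",                                              # hypothetical
    [("id", "INT"), ("name", "STRING"), ("city", "STRING")],  # hypothetical
    "/user/hive/staging/customers",                           # hypothetical
)
query = "SELECT * FROM customers LIMIT 10"

print(ddl + ";")
print(query + ";")
# Each statement could be run from the shell with: hive -e "<statement>"
```

An external table is used here so that dropping the Hive table leaves the underlying files in HDFS untouched; a managed table would be an equally valid choice if Hive should own the data.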
5 Project Internal/External Interface Impacts and Specification

Hadoop has become one of the most important technologies. The key factors of Hadoop
are as follows:
Low Cost:
Hadoop is a free, open-source framework that uses commodity hardware to store
large amounts of data. Hadoop also offers a cost-effective storage solution for
organizations' exploding data sets. The problem with traditional relational database
management systems is that it is extremely cost prohibitive to scale them to the
degree needed to process such massive volumes of data. In an effort to reduce
costs, many companies in the past had to down-sample data and classify it based
on assumptions about which data was the most valuable. The raw data would be
deleted, as it was too cost prohibitive to keep. While this approach may have
worked in the short term, it meant that when business needs changed, the
complete raw data set was no longer available, because it had been too expensive
to store. Hadoop, on the other hand, is designed as a scale-out architecture that can
affordably store all of an organization's data for later use. The cost savings are
staggering: instead of costing thousands to tens of thousands of pounds per
terabyte, Hadoop offers computing and storage capabilities for hundreds of
pounds per terabyte.
Flexible:
Hadoop enables organizations to easily access new data sources and tap
into different types of data (both structured and unstructured) to generate value
from that data. This means organizations can use Hadoop to derive valuable
business insights from data sources such as social media and email
conversations. Moreover, Hadoop can be used for a wide variety of purposes,
such as log processing, recommendation systems, data warehousing,
marketing campaign analysis and fraud detection.
Computing power:
Hadoop uses a distributed processing model, so it can process vast amounts of
data. The more computing nodes you use, the more processing power you have.
Fast:
Hadoop’s unique storage method is based on a distributed file system that basically
‘maps’ data wherever it is located on a cluster. The tools for data processing are often on
the same servers where the data is located, resulting in much faster data processing. If
you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently
process terabytes of data in just minutes, and petabytes in hours.
Storage Flexibility:
Unlike traditional relational databases, you don’t have to preprocess data before storing
it. And that includes unstructured data like text, images and videos. You can store as
much data as you want and decide how to use it later.
Inherent data protection and self-healing capabilities:
Data and application processing are protected against hardware failure. If a node goes
down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. And it automatically stores multiple copies of all data.
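
The effect of storing multiple copies can be shown with a small simulation. This is a simplified sketch, not HDFS code: replicas are placed round-robin, whereas real HDFS placement is rack-aware. The node and block names are hypothetical. The replication factor of 3 matches the usual HDFS default.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the chosen nodes are distinct.
    This is a simplification of the rack-aware policy HDFS really uses.
    """
    placement = {}
    for start, block in enumerate(blocks):
        placement[block] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["n1", "n2", "n3", "n4"]     # hypothetical data nodes
blocks = ["b0", "b1", "b2", "b3"]    # hypothetical file blocks
placement = place_replicas(blocks, nodes)

# With three copies of each block, losing one or even two nodes
# leaves every block readable from a surviving replica.
print(readable_blocks(placement, {"n2"}))
print(readable_blocks(placement, {"n1", "n2"}))
# Only when all three holders of a block fail is that block lost.
print(readable_blocks(placement, {"n1", "n2", "n3"}))
```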
