
Demystifying Big Data Analytics – From data to outcomes

What is Big Data?


“Big data” is data that ranges in size from terabytes to petabytes. It includes structured and unstructured data and
may need to be processed in real time.

Big data has the following identifiable properties:

 High Volume – The data volumes are massive and a challenge to handle, for example terabytes to petabytes.
 High Velocity – The data changes rapidly and arrives in real time or near real time.
 High Variety – The data can come in many different formats (e.g., images, videos, free text, locations, GPS
information, etc.).
 Value – Analytics insight is required in a timely fashion.
 Verification – Some of the data will be bad, and diverse data brings diverse levels of quality and security across
its users.

Big Data Analytics Components


Big data solutions are built from a number of layers of abstraction, from data sourcing and ingestion through to data analytics.

Figure – 1: Major components of Big Data (Source → Ingest → Store → Transform → Serve)

The major components of Big Data Analytics solutions are as follows:

 Big data sources
 Data massaging and store layer (Ingest & Store)
 Analysis layer (Transform)
 Consumption layer (Serve)

Big data sources - Big data solutions source data from diverse channels. The main drivers for selecting a data
source are the type of data it holds and the type of analysis to be performed on it. The data may vary in format
(structured, semi-structured, or unstructured) and in origin. The other drivers for selecting a data source are listed
below.

 Velocity and volume— The speed with which data arrives and the rate at which it is delivered vary from one
data source to another.
 Collection point— Where the data is collected, directly or through data providers, in real time or in batch
mode. The data can come from a primary source, such as Twitter, or it can come from a secondary source,
such as a data provider with access to the Twitter firehose (like Gnip).
 Location of data source— Data sources can be inside the enterprise or external to it. Identify the data to which
you have limited access, since access to data affects the scope of data available for analysis.

Data massaging and store layer - This is the layer where data is acquired from the data sources and converted to a
format that can be stored in a Hadoop Distributed File System (HDFS) store, a NoSQL database (MongoDB,
CouchDB, HBase), or a Relational Database Management System (RDBMS) warehouse for further processing.
Compliance regulations, data privacy, and governance policies dictate the appropriate storage for different types of
data.
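
To make the ingest step concrete, below is a minimal sketch (Python) of landing a raw file in HDFS by shelling out to the standard hdfs dfs commands. The local file path and HDFS landing directory are hypothetical placeholders; a real pipeline would add scheduling, validation, and error handling on top of this.

import subprocess

# Hypothetical paths; adjust them to your cluster and data layout.
LOCAL_FILE = "/data/raw/clickstream-2014-01-01.json"
HDFS_LANDING_DIR = "/landing/clickstream/2014/01/01"

def ingest_to_hdfs(local_path, hdfs_dir):
    """Copy one raw file into an HDFS landing directory via the hdfs CLI."""
    # Create the target directory (and any missing parents).
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
    # Upload the file, overwriting a previous copy of the same name if present.
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir])

if __name__ == "__main__":
    ingest_to_hdfs(LOCAL_FILE, HDFS_LANDING_DIR)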

Analysis layer - This is the layer that “reads” the data stored in the data store. In some cases, the analysis layer
accesses the data directly from the data source. Designing the analysis layer requires careful forethought and
planning: decisions must be made about how to organize and manage the tasks that produce the desired analytics.

Consumption layer - This layer consumes the output provided by the analysis layer. The consumers can be
visualization applications, human beings, business processes, or services. It can be challenging to visualize the
outcome of the analysis layer. Sometimes it's helpful to look at what competitors in similar markets are doing.
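
The four layers can also be pictured end to end with a toy, in-memory Python sketch. The events, field names, and aggregation below are invented purely for illustration; a real solution replaces the Python lists with HDFS or NoSQL storage and distributed processing.

import json

def source():
    """Big data sources: raw events from logs, sensors, social feeds, etc."""
    return [
        {"user": "alice", "amount": 120.0},
        {"user": "bob", "amount": 75.5},
        {"user": "alice", "amount": 30.0},
    ]

def ingest_and_store(events):
    """Data massaging and store layer: validate and persist the raw events."""
    return [e for e in events if "user" in e and e.get("amount", 0) >= 0]

def transform(stored):
    """Analysis layer: aggregate spend per user."""
    totals = {}
    for e in stored:
        totals[e["user"]] = totals.get(e["user"], 0.0) + e["amount"]
    return totals

def serve(result):
    """Consumption layer: expose results to dashboards, people, or services."""
    return json.dumps(result, indent=2)

if __name__ == "__main__":
    print(serve(transform(ingest_and_store(source()))))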

What is big data analytics?


Big Data Analytics is the science of intelligently managing highly volatile, massive, and diverse volumes of data
within an actionable time span.

Why collect and store terabytes of data if you can't analyze it in full context? Or if you have to wait hours or days to
get results?

How is Big Data Analytics different from enterprise data warehouses (EDWs) and from using business intelligence
(BI) tools to report on the business?

Big Data predictive analytics is different; it uses advanced statistical, data mining, and machine learning algorithms
that dig deeper to find patterns that traditional BI tools may not reveal. Many of these techniques are not new, but
advanced algorithms that can handle more data can mean more, and better, predictive models. Big data is the
fuel and predictive analytics is the engine that firms need to discover, deploy, and profit from the knowledge they
gain.

Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations, and
other useful information that can be used to make better decisions. It comprises software and/or hardware
solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data
sources to improve business performance or mitigate risk.
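
As a minimal sketch of what discovering a predictive model looks like in practice, the Python example below fits a logistic regression on synthetic customer data with scikit-learn and scores it on a hold-out set. The features and labels are fabricated for illustration only; a real model would be trained on big data sources such as those described earlier.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Fabricated data: each row is a customer (recency, frequency, spend);
# each label marks whether that customer later churned.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)   # discover a pattern
scores = model.predict_proba(X_test)[:, 1]           # score unseen customers
print("hold-out AUC:", round(roc_auc_score(y_test, scores), 3))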

Predictive Analytics – A Continuous process


Predictive analytics uses algorithms to find patterns in data that might predict similar outcomes in the future. But
this isn’t a one-time operation; it’s a continuous process of making sure that new models are still effective and of
responding to changes in customer desires and in the competitive landscape.
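
One hedged way to picture this continuous cycle in code is a small Python routine, building on the scikit-learn model sketched above, that scores the deployed model on recent labeled data and retrains it when performance drops below a chosen floor. The threshold value and the idea of running the check on a schedule (e.g., nightly) are assumptions made for illustration.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

AUC_FLOOR = 0.70  # hypothetical threshold below which the model counts as stale

def monitor_and_refresh(model, X_recent, y_recent):
    """Score the deployed model on fresh labeled data; retrain if it degrades."""
    auc = roc_auc_score(y_recent, model.predict_proba(X_recent)[:, 1])
    if auc < AUC_FLOOR:
        # Optimization step: rediscover the pattern on recent data and redeploy.
        model = LogisticRegression().fit(X_recent, y_recent)
    return model, auc

Run on a schedule, a routine like this keeps the cycle of model discovery, deployment, and optimization turning as customer behavior and competitors change.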

Market Overview of Predictive Analytics Solutions
The main purpose of big data predictive analytics solutions is to facilitate the predictive analytics process and ease
the burden of this never-ending, continuous cycle of model discovery, deployment, and optimization, which can be
applied to most industries and business domains.

Vertical or horizontal solutions - Many vendors provide solutions that focus on specific industries or on horizontal
domains such as customer analytics; examples include Fair Isaac (FICO) and Pitney Bowes.

Embedded solutions - Other platforms increasingly embed predictive analytics capabilities. BI platforms such as
Alteryx and Pentaho include embedded predictive analytics features in addition to BI functionality. Business process
management (BPM) platforms such as Pegasystems and Rage Frameworks also offer predictive analytics capabilities.

Database analytics - Relational database management systems (RDBMS), EDWs, NoSQL, Hadoop, and other
data-focused hardware and software have some predictive analytics capabilities.

Market Leaders for Big Data Analytics solutions

 SAS - SAS, with its 36-year history of providing analytics software, is a Leader; its SAS Enterprise Miner
tool is easy to learn and can run analyses in-database or on distributed clusters to handle big data.

 IBM – IBM’s big data predictive analytics solution combines SPSS, Netezza, and Vivisimo with
complementary offerings, such as InfoSphere Streams and Decision Management, that strengthen its
appeal for firms that wish to integrate predictive analytics throughout their organization.

 SAP - SAP is a newcomer to big data predictive analytics but is a Leader due to a strong architecture and
strategy. SAP also differentiates itself by putting its SAP HANA in-memory appliance at the center of its
offering, including an in-database predictive analytics library (PAL), and by offering a modeling tool.

Real World Big Data Solutions

Most big data warehouse solutions are about building on an existing data warehouse infrastructure and leveraging
big data technologies to ‘augment’ its capabilities. There are three key types of architecture:

Pre-Processing - Using big data capabilities as a “landing zone” before determining what data should be moved
to the data warehouse.

Offloading - Moving infrequently accessed data from data warehouses into enterprise-grade Hadoop.

Exploration - Using big data capabilities to explore and discover new high value data from massive amounts of
raw data and free up the data warehouse for more structured, deep analytics.
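
As one illustration of the offloading pattern above, here is a hedged Python sketch that invokes a Sqoop import to move an infrequently accessed warehouse table into HDFS. The JDBC URL, credentials file, table name, and target directory are hypothetical and would come from your warehouse and data governance teams.

import subprocess

# Hypothetical connection details and table; not a prescription for your setup.
SQOOP_IMPORT = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://warehouse-host/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_password",
    "--table", "orders_archive",                      # rarely queried history table
    "--target-dir", "/warehouse/offload/orders_archive",
    "-m", "4",                                        # four parallel map tasks
]

if __name__ == "__main__":
    subprocess.check_call(SQOOP_IMPORT)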

A complete big data analytics solution needs Big Data-specific tools to manage Big Data (Hadoop) clusters, integrate
data, clean/enrich data, model data, perform predictive analytics, and visualize data. Data flows between the big
data system and the data warehouse to create a unified foundation for analytics. BI tools such as MicroStrategy,
Tableau, IBM Cognos, and others provide business users with direct access to data warehouse insights.

Common business use cases which require Big Data solutions


Some of the common business cases for big data are:

Financial services – Firms can use social media sensing for risk analysis and fraud management. Social media data
can help with risk modeling and with rating a company’s risk profile. It can also support compliance, fraud detection,
and regulatory reporting by showing who is saying what in the public domain and whether it aligns with the firm’s
regulatory or ethics policies.

E-commerce and online retail - E-retailers like eBay are constantly creating targeted offers to boost customer lifetime
value. They need recommendation engines to increase average order size, or next-best-offer models based on
predictive analysis, for cross-selling and for tailored interactions across multiple channels.

Conclusion:
Analyzing new and diverse digital data streams can reveal new sources of economic value, provide fresh insights into
customer behavior and identify market trends early. But this influx of new data makes it challenging for IT
departments to derive real business value from big data. IT departments need the right tools to capture and
organize a wide variety of data types from different sources, and to be able to easily analyze that data within the
context of all their enterprise data.

Big data analytics has opened up a whole new paradigm of data analytics, architecture, and engineering, giving
businesses new dimensions of data with which to make the most informed decisions and to move from being
reactive to being proactive.

Appendix:

What is Apache Hadoop?


Big data is often associated with Hadoop. What is Apache Hadoop? It’s an open source framework for Big Data
solutions on commodity hardware.

What’s the big deal about Apache Hadoop?


Hadoop changes the economics and the dynamics of large-scale computing because it is built on an open source
framework and runs on commodity hardware. It has four salient characteristics.

Scalable– New nodes can be added as needed without needing to change data formats, how data is loaded, how
jobs are written, or the applications on top.

Cost effective– Hadoop brings massively parallel computing to commodity servers. The result is a sizeable
decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.

Flexible– Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources.
Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one
system can provide.

Fault tolerant– When you lose a node, the system redirects work to another location of the data and continues
processing without missing a beat.

Hadoop enables businesses to gain insight from massive amounts of structured and unstructured
data quickly in a cost-effective and fault tolerant way.

Components of Apache Hadoop


Hadoop has two main components: a framework that understands and assigns work to the nodes in a cluster, and
a file system that spans all the nodes and stores the data. It is supplemented by an ecosystem of many open
source projects/frameworks which extend its value and improve its usability.

Main components of Hadoop.


MapReduce - The framework that understands and assigns work to the nodes in a cluster. It processes large amounts
of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-
tolerant manner.

HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems
on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by
replicating data across multiple nodes.
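
To give a feel for the MapReduce model without writing Java, below is a minimal word-count sketch runnable through Hadoop Streaming, which lets any executable act as the mapper and reducer. The input and output paths and the streaming jar location shown in the comment are illustrative.

#!/usr/bin/env python
# Word count via Hadoop Streaming; run this file as both mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar -input /data/text -output /data/wordcount \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Emit one "word<TAB>1" line per word; the framework sorts and groups by key.
    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

def reducer():
    # Input arrives grouped by key, so a running total per word is enough.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()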

Key supplement components of Hadoop.


Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for
expressing data analysis programs, paired with the MapReduce framework for processing these programs.

Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-
hoc queries via an SQL-like interface for large datasets stored in HDFS (a minimal query sketch appears after this
list of components).

HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big
data for user applications.

Flume – Flume allows Big Data systems to efficiently aggregate and move large amounts of log data from many
different sources to Hadoop.

Sqoop - Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable
parallel load for various, popular enterprise data sources.

Knox – It provides a single point of authentication and access for Apache Hadoop services in a cluster.

Zookeeper – A system for coordinating distributed processes, used, for example, to store and mediate updates to
important configuration information.

Oozie – Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs
sequentially into one logical unit of work.
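
As referenced in the Hive entry above, here is a hedged sketch of running an ad-hoc Hive query from Python using the third-party PyHive client. The host, port, table, and column names are hypothetical; Hive can just as well be queried from its own shell or over JDBC/ODBC.

from pyhive import hive  # third-party client for HiveServer2

# Hypothetical connection details and table; adjust for your cluster.
conn = hive.Connection(host="hive-server.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Ad-hoc, SQL-like summarization over a large dataset stored in HDFS.
cursor.execute(
    "SELECT user_id, SUM(amount) AS total_spend "
    "FROM sales_events GROUP BY user_id ORDER BY total_spend DESC LIMIT 10"
)
for user_id, total_spend in cursor.fetchall():
    print(user_id, total_spend)

cursor.close()
conn.close()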

The Ecosystem of open source frameworks for Hadoop

Commercial Big data distributers for enterprise


The Big Data solutions landscape is very varied and breaks down as follows.

 Apache open source - The open source frameworks can be downloaded from the Apache Hadoop website at
www.hadoop.apache.org. The project includes the core modules Hadoop Common, Hadoop Distributed File
System (HDFS), Hadoop YARN, and Hadoop MapReduce.
 Pure-play Hadoop distribution vendors - Cloudera, Hortonworks, and MapR Technologies are
venture-backed firms that focus solely on developing, supporting, and marketing unique Hadoop distributions,
add-on innovations, and services. These vendors sell their solutions directly to customers but also have an
aggressive channel strategy of selling through partners, such as large enterprise software vendors.
 Enterprise software vendors that also offer Hadoop distributions - All of the big enterprise
software vendors have a Hadoop strategy because it is an essential data management technology. Many of
them partner with one or more pure-play vendors: for example, Oracle partners with Cloudera, while SAP
partners with both Intel and Hortonworks. Others, like IBM, Microsoft, Pivotal, and Teradata, follow suit:
Microsoft partners with Hortonworks and has used this as a base to create HDInsight for Windows Azure,
and Teradata also partners with Hortonworks.

 Hadoop in the cloud - Amazon Web Services offers Elastic MapReduce (EMR) and hosts MapR Technologies’
distribution in the cloud. Microsoft offers HDInsight in its Azure cloud. Most of the other distributions have
some cloud deployment options.
 Hadoop accessories that build out the ecosystem – Hadoop is only one part of a complete big data
analytics solution. Many firms offer Hadoop-specific tools to manage Hadoop clusters, integrate data, model
data, perform predictive analytics, and visualize data in Hadoop. Here are just a few of the many firms, both
new and established, that offer Hadoop-based tools: Actian, Compuware, DataTorrent, Global IDs, Pentaho,
Platfora, Revelytix, Revolution Analytics, SAS Institute, Software AG, Talend, and Zettaset.

Market Leaders for Big Data solutions

 IBM with InfoSphere BigInsights

 Amazon Web Services for the cloud

 Cloudera

 Hortonworks with open source frameworks

Strong Leaders for Big Data solutions

 Intel

 Microsoft’s Windows Azure HDInsight


Glossary:
Accumulo - Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a
scalable implementation of Google’s BigTable design that works on top of Apache Hadoop.

Ambari - Ambari is a completely open operational framework for provisioning, managing and monitoring Apache
Hadoop clusters.

HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big
data for user applications.

HCatalog - HCatalog is a table and storage management layer for Hadoop that enables users with different data
processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the
grid.

Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-
hoc queries via an SQL-like interface for large datasets stored in HDFS.

Falcon - Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop. It
enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and
data retention use cases.

Flume – Flume allows Big Data systems to efficiently aggregate and move large amounts of log data from many
different sources to Hadoop.

Knox – It provides a single point of authentication and access for Apache Hadoop services in a cluster.

Oozie – Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs
sequentially into one logical unit of work.

Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for
expressing data analysis programs, paired with the MapReduce framework for processing these programs.

Sqoop - Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable
parallel load for various, popular enterprise data sources.

Solr - Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its
major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic
clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.

Storm - Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds
reliable real-time data processing capabilities to Apache Hadoop.

Tez – Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG
(directed acyclic graph) of tasks.

Zookeeper – A system for coordinating distributed processes, used, for example, to store and mediate updates to
important configuration information.
