Big Data and You
May 2015 Edition
Objectives
This document is designed to introduce Big Data and Analytics. Rather than a deep-dive technical paper or a product portfolio, it is a friendly educational presentation (easy and quick to read) for specialists, architects, PMs and managers (*). It has one simple goal (though a complex and time-consuming exercise): if you read this paper, you learn something, and then you want more details to become an expert. Yes, you can.
Big Data
Table of Contents
1. Introduction
2. Definition
3. BI principles
4. Chronology
5. Hadoop I
6. Hadoop II
7. Hadoop Ecosystem
8. BI vs Big Data
9. Hadoop patterns
10. Hadoop Market
11. BD&A vendors
12. Competition
13. In Memory
14. Streams
15. BigInsights
16. Architecture
17. Positioning
18. Why Power ?
19. Contacts
20. New !
Introduction
2012 was the year of big data marketing buzz, 2013 the year of big data technical enablement, and 2014 the year of big data projects. Now European customers are massively deploying big data (and still analytics) projects. It is time to become an expert, to guide our customers, to talk with the Big Data ecosystem and to fill the Big Data skills gap.
(*) This paper does not claim to be exhaustive on the Big Data subject, nor is it intended to recommend a precise and specific architecture for architects, performance and technical details for specialists, or a marketing campaign. It assumes little or no prior knowledge of Big Data.
Author # Christophe.menichetti@fr.ibm.com
IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com
What is Data Analysis ? Why Analysing Data ?
Analysis of data is a process of inspecting, cleaning, transforming,
and modelling data with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making.
Data analysis has multiple facets and approaches, encompassing
diverse techniques under a variety of names, in different business,
science, and social science domains, such as :
Business Intelligence/Analytics
Data Mining / predictive Tools
Big Data
Data integration/ Data visualisation
And so on …
IT technologies and computer sciences are evolving. Yesterday, when IBM, Honeywell, Sperry, ICL, Xerox, Digital or Olivetti were the IT leaders, CPU and memory were the key differentiators. Today, when IBM, Google, SAP and Oracle are the IT leaders, the ultimate differentiator is being able to make more informed choices with confidence, to anticipate and shape business outcomes.
As company and industry leaders, you absolutely need deeper insight from your information to beat your competitors:
• Which customers are thinking of leaving?
• Which transactions are fraudulent?
• Which life-threatening conditions can be detected in time to intervene?
Let’s make it simpler – An example
Analytics = transforming data into (sexy) information to make (intelligent) decisions
Weather forecast: you have to decide which boots to take to go to Paris. You are not an expert at all (temperature, pressure, cyclone = RAW data), but you can decide based on the weather map (report/analysis)
!message : Data is the new oil, requiring mining, refining and delivering
Definition
What is Business Intelligence ?
Business analytics (BA) refers to the skills, technologies,
practices for continuous iterative exploration and investigation of
past business performance to gain insight and drive business
planning.
Business analytics focuses on developing new insights and
understanding of business performance based on data and
statistical methods.
In contrast, business intelligence (BI) traditionally focuses on
using a consistent set of metrics to both measure past
performance and guide business planning, which is also based on
data and statistical methods
What is Big Data ?
Big Data is a broad term for data sets so large or complex that they are difficult (or too expensive) to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.
!message : Big Data creates new opportunities to extend Analytics for higher value
4th V: Value
5th V: Veracity
For more information/technical details, feel free to contact us
OLTP versus OLAP
BI reference Architecture
Reporting solutions
Display data in either a synthesized or a detailed view, easy to understand for the end user (data mining: discovering interesting/useful patterns and relationships in large volumes of data, analyzing the past to predict the future)
Data warehouse
Central database in which data are stored and can be restructured to answer business needs.
ETL
Unifies data from heterogeneous data sources (extracting the useful data) and consolidates them into a unique destination database (cleansing, modifying the data according to the desired output)
Good to know !
People very often associate BI with reporting/data mining tools, because this is the “visible” part of the iceberg. But this is a misnomer: BI refers to the full set of tools, such as reporting, data warehouse and ETL. For your information, ~70% of the costs and efforts in BI projects go into the data warehouse, the most important (but hidden) part of the “iceberg”.
Star Schema
Optimized for SQL read requests. Fact table (metrics of the reports) in the middle, surrounded by dimension tables (Y axis) = On Line Analytical Processing (OLAP)
3NF Schema
Optimized for flexibility and storage space savings = On Line Transactional Processing (OLTP)
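To make the star schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date) and the sample figures are invented for illustration and are not taken from the presentation. It shows a typical OLAP-style read: join the central fact table to its dimension tables, then aggregate the metric.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'boots'), (2, 'umbrellas');
    INSERT INTO dim_date    VALUES (10, 2014), (11, 2015);
    INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 150.0),
                                   (2, 10, 40.0),  (2, 11, 60.0), (1, 11, 50.0);
""")

def sales_by_category_and_year(connection):
    """OLAP-style read: join the fact table to its dimensions and aggregate."""
    query = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d    ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return connection.execute(query).fetchall()

report = sales_by_category_and_year(conn)
for category, year, total in report:
    print(category, year, total)
```

The fact table only holds keys and metrics, so the report is built by grouping on attributes that live in the dimension tables.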
How does Analytics work ? What does OLAP mean ?
!message : BI/Analytics is the way to transform raw data into decision/information
First steps - 1950s
IBM Journal : Article "A Business Intelligence System" (Hans Peter Luhn)
Birth of the wording “Business intelligence”
First tools for automatic methods, providing alert services (for scientists)
1970
First MIS solutions – Management Information System
Static, non-flexible
No analysis features
1980
First EIS software – Executive Information System
More sophisticated MIS: simulations, reports, forecasts
1990
BI concepts are officially formalized by Howard Dresner, Gartner Group analyst
Birth of Business Performance Management (BPM / EPM)
2005 – 2010
Strong BI market consolidation – major IT acquisitions
Oracle acquired Siebel (Report - 6B$), Hyperion (EPM - 4B$), Sunopsis (ETL - 1B$)
SAP acquired Business Objects (Report – 7B$), Sybase (DW – 6B$), Fuzi (ETL)
IBM bought Cognos (Report – 5B$), Netezza (DW – 2B$), Ascential (ETL – 1B$)
-
Yahoo and Google faced terrible performance issues with DW architecture – need to rethink the data analysis approach – birth of Hadoop
2012 and +
Birth of Big data
A little bit of history ?
!message : Analytics has evolved from business initiative to business imperative
Why Hadoop ?
Performance issue : Consider that over the past decade :
- CPU speed performance has increased 8 to 10 times
- DRAM speed performance has increased 7 to 9 times
- Network speed performance has increased 100 times
- Bus speed performance has increased 8 to 10 times
- Hard disk drive speed performance has increased ONLY 1.2 times
NoSQL: Not Only SQL
Mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.
• Motivations for this approach include simplicity of design, horizontal scaling, finer control over availability and, most importantly, COST
!message : Hadoop meets the need for new scalable architectures, providing business efficiency and flexibility over the existing relational data model
How does it work ?
Apache Hadoop is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets
(Big Data) on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce). Hadoop splits files into large
blocks and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop Map/Reduce transfers code (specifically Jar files) to nodes that
have the required data, which the nodes then process in parallel.
This approach takes advantage of data locality to allow the data to be processed faster and more efficiently via distributed processing than by using a more
conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking
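The map / shuffle / reduce flow described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the programming model, not Hadoop's actual Java API: the function names and the two sample "blocks" are assumptions made for the example, and on a real cluster each block would be mapped on the node that stores it.

```python
from collections import defaultdict

def map_phase(block):
    """Mapper: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(blocks):
    # Each loop iteration stands in for one node processing its local block.
    mapped = []
    for block in blocks:
        mapped.extend(map_phase(block))
    return reduce_phase(shuffle(mapped))

blocks = ["big data and you", "big data big value"]
counts = word_count(blocks)
print(counts)
```

Only the small (word, count) pairs travel between the phases; the raw text stays where it is, which is the data locality argument made above.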
Would you like to sound like an expert ?
HDFS default replication: 3x; HDFS default block size = 128 MB; HDFS sits on top of a native Linux filesystem (ext4, ext3); slave nodes: HDFS (= data node), MapReduce (= task tracker); master nodes: HDFS (= name node), MR (= job tracker); the secondary name node periodically checkpoints the name node metadata (it is not a hot standby for High Availability)
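As a quick worked example of these defaults, the sketch below estimates how many blocks and replicas a file produces. The helper name hdfs_footprint and the 1000 MB file are illustrative assumptions. The calculation relies on the fact that HDFS does not pad the last block to the full block size.

```python
import math

BLOCK_SIZE_MB = 128   # default block size quoted above
REPLICATION = 3       # default replication factor quoted above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block is not padded, so raw usage is file size x replication.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

A 1000 MB file is therefore split into 8 blocks, stored as 24 block replicas, and consumes about 3000 MB of raw disk across the cluster.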
!message : Volume and Variety challenges have led to the creation of new data
processing : Map Reduce and HDFS
YARN, “the Hadoop 2”, decouples MapReduce's resource management and scheduling capabilities from data processing, enabling Hadoop to support more varied processing approaches/applications (interactive SQL, real-time streaming, batch processing)
Flume was created to allow you to flow data from a source into your Hadoop® environment.
ZooKeeper provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments, like configuration information, hierarchical naming space …
HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases
Some folks at Facebook developed Hive™, allowing SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements
Oozie simplifies workflow and coordination between jobs. It provides users with the ability to define actions and dependencies between actions.
Pig, initially developed at Yahoo!, allows people to focus more on analyzing large data sets and spend less time having to write mapper and reducer programs.
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop
Mahout takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the Map Reduce model
Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters
!message : The HDFS file system is not restricted to MapReduce jobs. It can be used
for other applications, many of which are under development at Apache
Different Approaches
Don’t get us wrong: there is no bad approach or good approach, and there is no magical approach. There are different approaches, for different needs and results.
With the BI approach, business users determine what question to ask (business hypothesis) and the IT team structures the data (specific selected data loaded into a data warehouse) to answer the question.
With the Big Data approach, IT delivers a platform (holding all the data) to enable creative discovery, and business users explore what questions could be asked
Different Architectures
BI architecture: the application server and the database server are separated, with the network in the middle. The data have to go through the network.
Big Data architecture: the analysis program runs where the data are. The functions have to go through the network. This is highly scalable and flexible by design
Different Objectives
Hadoop is one of the multiple facets of Big Data. This facet (Hadoop) is designed to run huge (Volume) “read” batches, with extreme cost savings, on unstructured data (Variety)
!message : Do not compare apples and oranges : you will (still) need both
Technical Hadoop Patterns
Big Data Exploration
Find, visualize, understand all big data to improve decision making
Enhanced 360° View of the Customer
Extend existing customer views (MDM, CRM, etc) by incorporating additional internal and external information sources
Operations Analysis
Analyze a variety of machine data for improved business results
Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension
Lower risk, detect fraud and monitor cyber security in real time
Big Data Business Use Cases
Keep in Mind
The term Big Data is a bit of a misnomer. Big data does not only refer to huge volumes of data or to Hadoop; there are many other patterns using streams or in-memory solutions
!message : Big Data Analytics is applied across all industries, with different use cases
Hadoop has been most rapidly adopted by the government, banking, finance, IT and ITES, and insurance sectors
Geographical analysis of the market seems to suggest that North America is the leading revenue-generating market and will remain so until 2020.
Hadoop hardware-based solution providers have been the highest receivers of venture capital funding. Recent times have witnessed a steep demand for real-time, operational analytics
!message : In the 1990s, new high-performance hardware was the differentiator for companies to compete. Nowadays big data is the key competitive differentiator
Hortonworks study – 2014 wikibon figures - 2013
The market for Big Data & Analytics solutions has exploded
The race is hot and complex:
• Every vendor is jumping in
• Alternatives from everywhere
• Startups proliferate
• Partnerships
No other vendor has what IBM has
– Software / Hardware
– Services / Research
– Cloud, Mobile, Social
Yet just having ‘everything’ does not make for a market leader
Based primarily on 2012 Wikibon report/forecast http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017
!message : The race is hot, every vendor is jumping in, alternatives come from everywhere, startups proliferate: how do we differentiate in such a crowded market?
4 major distributions of Hadoop have spawned ecosystems of partners developing data management and analytic solutions for Big Data
!message : IBM is a global Big Data and Analytics leader, with the industry’s most comprehensive, enterprise-class solutions and the broadest portfolio
In-Memory - good timing for an old idea
Largely driven by the big data phenomenon, in-memory computing is a powerful, transformative IT trend to meet high-performance analytics expectations and data visualization needs. An in-memory solution should not be confused with a conventional DBMS storing data in disk blocks cached in memory.
In-memory database technology has been around for over a decade. Traditionally, in-memory technology was used in a limited number of operational application workloads (FSS trading, telco billing, HPC, embedded devices), but 2011 was an inflection point: increased focus and ‘push’ by SAP.
With an in-memory database, all information is initially loaded into memory. This eliminates the need for optimized databases, indexes, aggregates and the design of cubes and star schemas.
The arrival of column-centric databases, which store similar information together, allowed data to be stored more efficiently, with greater compression and faster read access, reducing the amount of memory needed to perform a query and increasing processing speed. That is why column-based technology is very often associated with in-memory technology.
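The compression benefit of storing similar values together can be illustrated with a small sketch. The sample table and the use of run-length encoding are assumptions chosen for illustration; real columnar engines combine several encoding schemes. The point is only that a column of repeated values compresses well, and that a query on one column does not have to touch the others.

```python
from itertools import groupby

# Hypothetical table, stored row by row.
rows = [
    ("FR", "boots", 100),
    ("FR", "boots", 150),
    ("FR", "umbrellas", 40),
    ("DE", "umbrellas", 60),
    ("DE", "umbrellas", 50),
]

def to_columns(table_rows):
    """Pivot a row-oriented table into one list per column."""
    return [list(column) for column in zip(*table_rows)]

def run_length_encode(column):
    """Compress a column into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

country, product, amount = to_columns(rows)
encoded_country = run_length_encode(country)
encoded_product = run_length_encode(product)
total_amount = sum(amount)  # a query on one column reads only that column

print(encoded_country, encoded_product, total_amount)
```

Five country values shrink to two pairs here; on millions of rows with few distinct values, the same effect is what reduces the memory needed per query.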
Column Based Technology
Volume: as users and data increase, the RAM needed also increases = hardware costs
Velocity: real-time analytics, operational analytics
!message : Big Data analytics can benefit from these very large in-memory systems for velocity (since memory has become cheaper)
• Deal with terabytes of data each second
• Work with application, sensor and internet data, video/audio
• Deliver insight in microseconds to analytical applications
• Support complex scenarios using C++ or Java code
Streams is tailor-made for companies that need to process huge volumes of data from non-traditional sources and need results very, very quickly, integrated with existing analytics investments
• Stream computing is a different paradigm. The left of the figure shows the traditional way data is accessed, using queries to pull the data from a data storage device such as a data warehouse or database, which is still valid for many requirements
• The new stream computing paradigm brings data to the query: data is pushed or flows through the analytics. This is required for many new use cases in big data
• Here is a little more on how Streams works and what you can do with it.
• Each square in the figure represents an operator. The data (input stream) passes through each operator, where some action is performed on it, producing an output stream
• You can fuse data from multiple streams, modify it, annotate it, classify it or perform an analytics operation on it.
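The operator idea can be sketched with Python generators, where each function consumes an input stream and yields an output stream. This is an illustration of the paradigm only, not the IBM Streams API or its SPL language: the operator names, the sensor readings and the 40.0 threshold are all invented for the example.

```python
def source(readings):
    """Input stream: yield sensor readings one at a time as they 'arrive'."""
    for reading in readings:
        yield reading

def annotate(stream, unit):
    """Operator: attach a unit to every tuple flowing through."""
    for sensor, value in stream:
        yield {"sensor": sensor, "value": value, "unit": unit}

def classify(stream, threshold):
    """Operator: label each tuple as 'alert' or 'normal'."""
    for item in stream:
        item["status"] = "alert" if item["value"] > threshold else "normal"
        yield item

def keep_alerts(stream):
    """Operator: filter, passing on only the alerts."""
    return (item for item in stream if item["status"] == "alert")

readings = [("t1", 21.5), ("t2", 48.0), ("t1", 22.0), ("t2", 51.3)]
pipeline = keep_alerts(classify(annotate(source(readings), "C"), threshold=40.0))
alerts = [(item["sensor"], item["value"]) for item in pipeline]
print(alerts)
```

Nothing is stored and queried later: each reading is processed as it flows through the chain of operators, which is the "data to the query" principle described above.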
!message : Velocity challenges have led to the creation of a new data computing paradigm and solution: streaming, to deliver effective real time within microseconds
Hadoop is an open source implementation and, although it is very well maintained and does the “job”, using it implies a risk for companies. As with Linux, major IT companies provide Hadoop distributions.
IBM took Hadoop and ruggedized it for enterprises, adding enterprise features such as performance, resilience and IBM experience (BigSheets, BigSQL, GPFS…) while remaining 100% compliant with open standards. We call it BigInsights, and it runs on x86, Power Systems and mainframe (Linux).
2 editions: Basic Edition (100% open source, free) and Enterprise Edition
BigSheets - a big data visualization capability that enables end users to collect, explore and uncover actionable insights through a commonly understood spreadsheet experience (drag and drop, clicks, without any Java or Hadoop skills)
Adaptive Map Reduce – Already proven product from Platform Computing (HPC acquisition), rewriting the Map Reduce paradigm in C++ (no garbage collection, faster memory management), allowing :
• Optimized shuffle, map sort
• Resource management and scheduling of jobs are separated
• Leverage shared memory across JVMs, eliminating data movement
BigSQL – SQL on Hadoop is challenging (wide variety of data, MR is batch oriented). BigSQL provides native, fully compliant SQL access to data stored in BigInsights, real JDBC/ODBC drivers, and optimization based on Massively Parallel Processing (MPP) architecture, from DB2 experience
Spectrum Scale – GPFS FPO (file placement optimizer): a scalable, high performance and highly reliable product with 20+ years of experience, which has many advantages over HDFS:
• POSIX compliant
• No single point of failure
• Multi tenant
• HA/DR solutions
IBM BigInsights for Apache Hadoop v4 has just been released, based on the ODP initiative
Version 3.0 – Enterprise Edition
!message : IBM Hadoop strategy : better analytics tooling that is easier to use +
commitment to Hadoop open source (ODP initiative)
How are leading companies transforming their data and analytics environment
to take advantage of Big Data and provide faster, better insights at reduced
costs within their existing Enterprise Data Warehouses ?
!message : The foundational schematic to bring analytics to all stages in the data
lifecycle can be overlaid with specific products that provide the functions
!message : Transformational benefit / business outcomes come from integration of new data sources with traditional corporate data to find new insights
Systems of Record
Structured data from operational systems
Systems of Engagement
Data that “connects” companies with their customers, partners and employees
Systems of Insight
Diverse data types that combine structured and unstructured data for business insight
In Memory
Hadoop
EDW
Appliance
Important to keep in mind
Big Data (BigInsights, Cognos, SPSS, …) can run on IBM System z. Customers can take advantage of co-locating business data and OLAP data, managing high-speed transactions and complex queries for real-time operational analytics on a single integrated platform, and benefit from the performance, resiliency and quality of service of the IBM mainframe for critical businesses, as many banking and insurance customers do.
!message : The infrastructure is a foundational piece to IBM’s perspective of
delivering capabilities and offerings for BD&A
Hadoop is Linux – Linux is Power
Hadoop is cheap – Power is cheap
Hadoop ecosystem – PowerLinux market acceptance
Power advantages for Big Data
Linux on Power runs the same commands as Linux on x86, and versions are released on the same date
Linux on Power accounts for 17.6% of the top 500 most powerful Linux systems (with 5 in the top 10)
POWER8 increases the performance, reliability and availability lead over Intel, as an alternative to Intel
The OpenPOWER Foundation brings rapid innovation to the Power platform for open Linux
Little Endian support makes porting Linux on x86 applications even easier
The POWER8 design point is big data (more threads, more cache, more bandwidth, CAPI …)
The Intel design point is multiple markets (smartphone, tablet, desktop PC, servers …)
IBM Big Data Resources
WW Competency Centers
Big Data Analytics Links
Web sites
ibm.com/Hadoop
Information Management Acceleration Zone
PowerLinux Big Data
IBM communities
IBM Systems Big Data and Analytics
BDSC practitioner wiki
IBM Analytics Global
Big Data& Analytics Clients References
IBM developerWorks
https://www.ibm.com/developerworks/analytics/
Please, Please
Help us in improving this document – if any comments / ideas please feel free to send an email
http://bigdatauniversity.com/
http://wikibon.org/wiki/v/Category:Big_Data
http://en.wikipedia.org/wiki/Apache_Hadoop
http://www.slideshare.net/search/slideshow?searchfrom=header&q=big+data
[INFO] Based on 3 years of experience with big data projects, and after many weeks of intensive work compiling several presentations given to customers or at conferences and synthesizing concepts, the objective of this educational paper is to clarify some of the concepts and solutions around Big Data in order to better understand the related challenges and opportunities. There may be (many) typing errors, mistakes, misleading words and missing concepts, so please be kind.
NEW AND/OR HOT !!! OPEN DATA PLATFORM
What is Open Data Platform (ODP) ?
> It is an open-source, non-profit entity, focused on and committed to evolving the current state of the platform, and delivering a Foundation-certified, packaged and tested Reference Distribution
Why Open Data Platform (ODP) ?
> The current ecosystem is challenged and slowed by fragmented and duplicated efforts. The ODP Core will take the guesswork out of the process and accelerate many use cases by running on a common platform, freeing up enterprises and ecosystem vendors to focus on building business-driven applications.
Where to position ODP vs Apache ?
> ODP supports the Apache (ASF) mission
> ASF provides a governance model around individual projects without looking at the ecosystem
> ODP aims to provide a vendor-led consistent packaging model for core Apache components as an ecosystem
Why IBM is involved in ODP ?
> Strong history of leadership in open source & standards : IBM has always been a believer in standardization of interfaces to components of IT and application infrastructure (SQL, Eclipse, OpenPower …)
> Supports our commitment to open source currency in all future releases
> Accelerates IBM innovation within Hadoop & surrounding applications
> Expecting Hortonworks, Pivotal distribution adoption on PowerLinux
!message : ODP is clearly a major and strategic choice in the open community to accelerate Hadoop adoption and grow the BigInsights and PowerLinux ecosystem / ISV
!message : IBM fundamental cloud strategy : Complete cloud offering, mixed between
control and simplicity.
NEW AND /OR HOT !!! Big Data/Analytics and Cloud
Customer Data
Center (On-Premises)
Cloud Data Center
(Off Premises)
SIMPLICITY
CONTROL
PureData for analytics
DB2 BLU
Infosphere Biginsights
Cloudant
DashDB
Softlayer
Cloudant
Distributed NoSQL “Data Layer”, powering web, mobile & IoT since 2009
Available as a fully-managed DBaaS, managed by you on-premises, or hybrid
Transactional JSON “document” database
Spreads data across data centers & devices
Ideal for apps that require:
> Massive, elastic scalability
> High availability
> Geo-location services
> Full-text search
> Occasionally connected users
DashDB
Data warehouse and analytics as a service on the cloud
dashDB keeps data warehouse infrastructure out of your way, allowing you to benefit from :
• Next Generation In-Memory
• Columnar
• SIMD Hardware Acceleration
• Actionable Compression
• Support for OLAP SQL extensions
• Connect common 3rd party BI tools
!message : Spark is positioned as a fast and general engine for Big Data. It generalizes the MapReduce model and is (possibly) poised to replace MapReduce
NEW AND/OR HOT !!! SPARK
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based
MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a
cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.
Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in this scenario, Spark runs on a single machine with one executor per CPU core.
Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects
!message : From an application point of view, the data lake challenge is to be a unique and unified data repository, queryable like a black box
NEW AND /OR HOT !!! DATA LAKE ARCHITECTURE
IDC in late 2014 stated “By 2017 unified data platform architecture will become the foundation of BDA strategy. The unification will occur across information management, analysis, and search technology.”
• A data reservoir is a data lake that provides data to an organization for a variety of analytics processing, including:
- Discovery and exploration of data
- Simple ad hoc analytics
- Complex analysis for business decisions
- Reporting
- Real-time analytics
• It is possible to deploy analytics into the data reservoir to generate additional insight from the data loaded into it.
• A data reservoir manages shared repositories of information for analytical purposes.
• Each data reservoir repository is optimized for a particular type of processing:
- Real-time analytics, deep analytics (such as data mining), exploratory analytics, OLAP, reporting, …
Example – Creating a logical warehouse
Information virtualization hides the complexities of where the
data is located. Here different repositories are being used to
host different workloads, but this complexity is hidden by the
information virtualization layer.

  • 4. OLTP versus OLAP # 3 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com BI reference Architecture Reporting solutions display data in a either synthesized or detailed view, easy to understand for the end user (data mining: discovering Interesting/useful patterns /relationships in large volumes of data – analyzing the past to predict the future) Data warehouse central database in which data are stored and can be restructured to answer Business needs. ETL Unifies data from heterogeneous data sources (extracting the useful data) Consolidates them into a unique destination database (cleansing, modifying the data according to the desired output) Good to know ! People, very often, associate BI with reporting/data mining tool, because this is the “visible” part of the iceberg. But This is an misnomer, BI refers to the full set of tools, such as Reporting, Data warehouse and ETL. For your information, ~70% of the costs and efforts in BI projects is about the data warehouse, the most important (but hidden) part of the “iceberg”. Star Schema Optimized for SQL read requests. Fact table (metrics of the reports) in the middle, surrounded by dimension tables (Y axis) = On Line Analytical Processing (OLAP) 3NF Schema Optimized for flexibility and storage space savings = On Line Transactional Processing (OLTP) How does Analytics work ? What does OLAP mean ? !message : BI/Analytics is the way to transform raw data into decision/information Definition BI Principles Hadoop IHadoop IIuction Hadoop EcosystemChronology BI vs B Big data and YouAny Analytics Projects/ questions ? Do not hesitate to contact us
  • 5. First steps - early1950 IBM newspaper : Article " A Business Intelligence System" (Hans Peter Luhn) Birth of the wording “Business intelligence” First tools for automatic methods, providing alert services (for scientists) 1970 First MIS solutions – Management Information System Static, non flexible No analysis features 1980 First EIS software – Executive Information System More sophisticated MIS: simulations, report, forecast, 1990 BI concepts, is officially formalized by Howard Dresner, Gartner Group analyst Birth of Business Performance Management (BPM / EPM) 2005 – 2010 BI market strong consolidation – big major IT acquisitions Oracle acquired Siebel (Report - 6B$), Hyperion (EPM- 4B$), Sunopsis (ETL- 1 B$) SAP acquired Business Objects (Report – 7B$), Sysbase (DW – 6B$), Fuzi (ETL), IBM bought Cognos (Report – 5B$), Netezza (DW – 2B$), Ascential (ETL – 1B$) - Yahoo and Google faced terrible performance issues with DW architecture – Need of rethinking data analysis approach – birth of Hadoop 2012 and + Birth of Big data # 4 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com A little bit of history ? !message : Analytics has evolved from business initiative to business imperative Definition BI Principles Hadoop I Hadoop IIHadoop EcosystemChronology BI vs BigData Hadoop Big data and You
  • 6. IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com Why Hadoop ? 1 2 Performance issue : Consider that over the past decade : - CPU speed performance has increased 8 to 10 times - DRAM speed performance has increased 7 to 9 times - Network speed performance has increased 100 times - Bus speed performance has increased 8 to 10 times - Hard disk drive speed performance has increased ONLY 1.2 times NoSQL: Not Only SQL Mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.  Motivations for this approach include simplicity of design, horizontal scaling, finer control over availability and most importantly COST !message : Hadoop meets the need of new scalable architectures providing a business Efficiency and flexibility over the existing relational data model ciples Hadoop I Hadoop II Hadoop EcosystemChronology BI vs BigData Hadoop Pattern Hadoop Market # 5 Big data and YouWould like to bench/test ? Go to MOP Client Center
  • 7. IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com How does it work ? Apache Hadoop is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop Map/Reduce transfers code (specifically Jar files) to nodes that have the required data, which the nodes then process in parallel. This approach takes advantage of data locality to allow the data to be processed faster and more efficiently via distributed processing than by using a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking Would like to appear like an expert ? HDFS default replication : 3 x, HDFS default blocks size = 128 MB, HDFS sits on top of a native Linux filesytem (ext4, ext3), Slave nodes : HDFS (= data node), MapReduce (= task tracker) , Master nodes : HDFS (= name node), MR (= job tracker), secondary name node is for High Availability !message : Volume and Variety challenges have led to the creation of new data processing : Map Reduce and HDFS Hadoop I Hadoop II Hadoop EcosystemChronology BI vs BigData Hadoop Pattern Hadoop Market BD&A # 6 Big data and YouWould like briefing ? Go to MOP Client Center
  • 8. YARN, “the hadoop 2 “ decouples MapReduce's resource management and scheduling capabilities, enabling Hadoop to support more varied processing approaches/applications (interactive SQL, real-time streaming, batch processing) # 7 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com Flume was created to allow you to flow data from a source into your Hadoop® environment. ZooKeeper provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments like configuration information, hierarchical naming space … HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases Some folks at Facebook developed Hive™, allowing SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements Oozie simplifies workflow and coordina¬tion between jobs. It provides users with the ability to define actions and dependencies between actions. Pig initially developed at Yahoo! allows people to focus more on analyzing large data sets and spend less time having to write mapper and reducer programs. Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop Mahout takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the Map Reduce model Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters !message : The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache Hadoop II Hadoop Ecosystem BI vs BigData Hadoop Pattern Hadoop Market BD&A Vendors Competition Big data and You
  • 9. # 8 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com Different Approaches Don’t take us wrong : there is no bad approach or good approach, there is no magical approach. There are different approaches, for different needs and results. With BI approach, Business Users determine what question to ask (business hypothesis) and IT team structures the data (specific selected data into data warehouse) to answer to the question. With Big Data approach, IT delivers (all data) a platform to enable creative discovery and Business Users Explores what questions could be asked Different Architectures BI architecture: Application server and Database server are separated, Network is still in the middle, Data have to go through the network. Big Data architecture: Analysis Program runs where are the data : Functions have to go through the network. This is highly scalable and flexible by design Different Objectives Hadoop is one of the multiple facets of Big Data. This facet (Hadoop) is designed to run huge (Volume) “read” batch, in extreme costs savings way for unstructured data (Variety) !message : Do not compare apples and oranges : you should (still) need both Hadoop Ecosystem BI vs BigData Hadoop Pattern Hadoop Market BD&A Vendors Competition In Memory Big data and YouFor more information/technical details, feel free to contact us
  • 10. # 9 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com Technical Hadoop Patterns Big Data Exploration Find, visualize, understand all big data to improve decision making Enhanced 360o View of the Customer Extend existing customer views (MDM, CRM, etc) by incorporating additional internal and external information sources Operations Analysis Analyze a variety of machine data for improved business results Data Warehouse Augmentation Integrate big data and data warehouse capabilities to increase operational efficiency Security/Intelligence Extension Lower risk, detect fraud and monitor cyber security in real- time Big Data Business Use Cases Keep in Mind The term Big Data is a bit of a misnomer. Big data is not only referring to huge volume of data or Hadoop, there are many others patterns using streams or in memory solutions !message : Big Data Analytics are applied Across all Industries, different use cases BI vs BigData Hadoop Pattern Hadoop Market BD&A Vendors Competition In Memory Streams BigInsights Big data and You
  • 11. # 10 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com Hadoop has been most rapidly adopted by the government,banking,finance,IT and ITES, and insurance sectors Geographical analysis of the market seems to suggest that North Americais the leadingrevenuegenerating market and will continue to remain so till 2020. Hadoop hardware-based,solution providershave been the highest receivers of venture capital funding.The recent times have witnessed a steep demandfor real-time,operationalanalytics !message : In 1990’s new performing hardware was the differentiator for companies to compete. Nowadays big data is the key competitive differentiator Hadoop Pattern Hadoop Market BD&A Vendors Competition In Memory Streams BigInsights Architecture Big data and You Hortonworks study – 2014 wikibon figures - 2013
  • 12. # 11 IBM Montpellier Client Center The market for Big Data & Analytics solutions has exploded The race is hot and complex:  Every vendor is jumping in  Alternatives from everywhere  Startups proliferate  Partnerships No other vendor has what IBM have – Software/ Hardware – Services / Research – Cloud, Mobile, Social Yet just having ‘everything’ does not make for a market leader Based primarily on 2012 Wikibon report/forcast http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017 !message : The race is hot, Every vendor is jumping in, Alternatives from everywhere, Startups proliferate, how do we differentiate in such a crowded market? Hadoop Market BD&A Vendors Competition In Memory Streams BigInsights Architecture Positioning Big data and YouAny competitive big data questions ? feel free to contact us
  • 13. # 12 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com 4 major distributions of Hadoop have spawned ecosystems of partners developing data management and analytic solutions for Big Data !message : IBM is a global Big data and Analytics leaders, industry’s most comprehensive and enterprise class solutions, broadest portfolio BD&A Vendors Competition In Memory Streams BigInsights Architecture Positioning Why Power? Big data and YouAny competitive big data questions ? feel free to contact us
  • 14. # 13 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com In-Memory - good timing for an old idea Largely driven by the big data phenomenon, In-memory computing is a powerful, transformative IT trend to meet high-performance analytics expectations and data visualization needs. In memory solution should not be confused with conventional DBMS storing data in disk blocks cached in memory. In-Memory” Database technology has been around for over a decade. Traditionally in-memory technology was used in a limited number of operational applications workloads (FSS trading, Telco Billing, HPC, embedded devices) but in 2011 we saw Inflection Point : Increased focus and ‘push’ by SAP With in-memory database, all information is initially loaded into memory. This eliminates the need for optimized databases, indexes, aggregates and designing of cubes and star schemas. The arrival of column centric databases which stored similar information together allowed storing data more efficiently with greater compression and faster read access , reducing the amount of memory needed to perform a query and increasing processing speed. That’s why column- based technology is very often associated to in memory technology Column Based Technology Volume: users /data increase, RAM needed also increases = hardware costs Velocity : real time analytics, operational analytics !message : Big Data analytics can benefit from these very large in memory Systems for velocity (since Memory has become cheaper) dors Competition In Memory Streams BigInsights Architecture Positioning Why Power? Contacts/info Big data and YouDo you need Big Data Analytics Briefing ? Come to us in MOP
  • 15. # 14 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com  Deal with Terabytes of data each second  Work with application, sensor and internet data, video/audio  Deliver insight in microseconds to analytical applications  Support complex scenarios using C++ or Java code Streams is tailor made for companies who need to process data from non-traditional sources, with huge volumes of data, and need results very, very quickly, integrated with existing analytics investments  Stream computing is a different paradigm – the left shows the traditional way data is accessed using queries to pull the data from a data storage device such as a data warehouse or database – which is still valid for many requirements  The new stream computing paradigm brings data to the query – data is pushed or flows through the analytics. This is required for many new use cases in big data  Here’s a little more on how streams works and what you can do with it.  Each of these square represents an operator. The data passes (input stream) through each operator where some action is being performed on the data (output stream)  You can fuse data form multiple streams, you can modify it, annotate it, perform an analytics operation on it, fuse multiple streams or classify it. !message : Velocity challenges have led to the creation of new data computing paradigm and solution: streaming to bring microseconds effective real time In Memory Streams BigInsights Architecture Positioning Why Power? Contacts/info Big data and YouDo you need Big Data Analytics Briefing ? Come to us in MOP
  • 16. # 15 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com Hadoop is an Open Source implementation and although very well maintained, doing the “job” for companies it implies a risk. Like Linux, major IT companies provide Hadoop distributions. IBM took this Hadoop and ruggedized it for enterprises, adding enterprises features such as performance, resilience and IBM experiences, (bigsheets, bigsql,gpfs…) while maintaining the open standards 100%. We call it Biginisghts, running on x86, Power Systems and Mainframe (linux) 2 editions : basic edition (100% open source – free) and Enterprise Edition BigSheets - a big data visualization capability that enables end users to collect, explore and uncover actionable insights through a commonly understood spreadsheet experience (drag and drop, clicks without any Java or Hadoop skills) Adaptive Map Reduce – Already proven product from Platform Computing (HPC acquisition) , rewriting Map Reduce paradigm in C++ (No garbage collection, faster memory management), allowing : • Optimized Shuffle, map sort • Resource management and scheduling of jobs is separated • leverage shared memory across JVMs, eliminating data movement BigSQL – SQL on Hadoop is challenging (wide variety of data, MR is batch oriented), BigSQL provides Native full compliant SQL access to data stored in BigInsights, Real JDBC/ODBC drivers, and optimization based on Massively Parallel processing (MPP) architecture, from DB2 experience Spectrum Scale – GPFS FPO (file placement optimizer) scalable, high performance, and highly reliable, 20+ years experienced product, has many advantages over HDFS: • POSIX compliant • No single point of failure • Multi tenant • HA/DR solutions IBM BigInsights for Apache Hadoop v4 has been just released based on ODP initative Version 3.0 – Enterprise Edition !message : IBM Hadoop strategy : better analytics tooling that is easier to use + commitment to Hadoop open source (ODP initiative) In Memory Streams BigInsights 
Architecture Positioning Why Power? Contacts/info Big data and You
  • 17. # 16 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com How are leading companies transforming their data and analytics environment to take advantage of Big Data and provide faster, better insights at reduced costs within their existing Enterprise Data Warehouses ? 100010010101010 100010010101010 100010010101010 100001101010101100 100001101010101100 000111000010011 000111000010011 !message : The foundational schematic to bring analytics to all stages in the data lifecycle can be overlaid with specific products that provide the functions Streams BigInsights Architecture Positioning Why Power? Contacts/info Big data and YouNeed Customer Enablement ? Education ? Send us an email
  • 18. # 3 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com !message : Systems of Record Structured data from operational systems Transformational benefit / business outcomes come from integration of new data sources with traditional corporate data to find new insights Systems of Engagement Data that “connects” companies with their customers, partners and employees Systems of Insight Diverse data types that combine structured and unstructured data for business insight Streams BigInsights Architecture Positioning Why Power? Contacts/info In Memory Hadoop EDW Appliance # 17 Big data and YouNeed Architecture Workshop ? Sizing ? Send us an email
  • 19. # 18 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com Important to keep in mind Big Data (BigInsights, Cognos, SPSS, …) can run on IBM System z. Customers could take advantages of co-locating business data and OLAP data, managing high speed transactions and complex queries for real time operational analyticson a single integrated platform and take benefits of the performance, resiliency and quality of service of IBM Mainframe for critical businesses., as many banks/insurance customers !message : The infrastructure is a foundational piece to IBM’s perspective of delivering capabilities and offerings for BD&A Hadoop is Linux – Linux is Power Hadoop is cheap - Power is cheap Hadoop ecosystem – PowerLinux market acceptance Power advantages for Big Data Linux on Power – run the same commands as linux on x86 – versions release as the same date Linux on Power makes 17,6% of top 500 most linux powerful systems (with 5 in top 10) POWER8 increases performance, reliability and availability lead over Intel, alternative to intel OpenPower foundation brings Rapid innovation to Power Platform for open linux Little Endian support makes porting Linux on x86 applications even easier Power8 design point is for big data (more threads, more cache , more bandwidth, CAPI …) Intel design point is for multiple market (smart phone, tablet desktop PC, servers …) Streams BigInsights Architecture Positioning Why Power? Contacts/info Big data and YouFeel free to contact MOP PowerLinux center for more details
  • 20. # 20 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com IBM BigData RessourcesWw Competency Centers Big Data Analytics Links Web sites ibm.com/Hadoop Information Management Acceleration Zone PowerLinux Big Data IBM communities IBM Systems Big Data and Analytics BDSC practitioner wiki IBM Analytics Global Big Data& Analytics Clients References IBM Developper Works https://www.ibm.com/developerworks/analytics/ Please, Please Help us in improving this document – if any comments / ideas please feel free to send an email http://bigdatauniversity.com/ http://wikibon.org/wiki/v/Category:Big_Data http://en.wikipedia.org/wiki/Apache_Hadoop http://www.slideshare.net/search/slideshow? searchfrom=header&q=big+data [INFO] Based on 3 experienced years of big data projects , after many weeks of intensive work for compiling several presentations done to customers or conferences, synthetizing concepts, the objective of this educational paper is to clarify some of the concepts and solutions around Big Data in order to better understand the related challenges and opportunities. But There may be (so many) typing errors, mistakes, misleading words, missing concepts, so Please be kind  Streams Biginsights Architecture Positioning Why Power? Contacts/info Big data and YouIf we can not help you directly, we’ill point you to the right person
  • 21. > Strong history of leadership in open source & standards : IBM has always been a believer in standardization of interfaces to components of IT and application infrastructure (SQL, Eclipse, OpenPower …) > Supports our commitment to open source currency in all future releases > Accelerates IBM innovation within Hadoop & surrounding applications > Expecting Hortonworks, Pivotal distribution adoption on PowerLinux > The current ecosystem is challenged and slowed by fragmented and duplicated efforts. The ODP Core will take the guesswork out of the process and accelerate many use cases by running on a common platform. Freeing up enterprises and ecosystem vendors to focus on building business driven applications. # 21 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com !message : ODP is clearly a major and strategic choice in Open community to accelerate Hadoop adoption and grow BigInsights and PowerLinux ecosystem / ISV NEW AND/OR HOT !!! OPEN DATA PLATFORM Big data and You What is Open Data Platform (ODP) ? > It is an Open-source, non-profit entity, focused, committed in evolving the current state of the platform, and delivering a Foundation certified, packaged, and tested Reference Distribution Why Open Data Platform (ODP) ? Where to position ODP vs Apache ? > ODP supports the Apache (ASF) mission > ASF provides a governance model around individual projects without looking at ecosystem > ODP aims to provide a vendor-led consistent packaging model for core Apache components as an ecosystem Why IBM is involved in ODP ?
  • 22. # 22 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com !message : IBM fundamental cloud strategy : Complete cloud offering, mixed between control and simplicity. Big data and You NEW AND /OR HOT !!! Big Data/Analytics and Cloud Customer Data Center (On-Premises) Cloud Data Center (Off Premises) SIMPLICITY CONTROL PureData for analytics DB2 BLU Infosphere Biginsights Cloudant DashDB Softlayer Cloudant DashDB Distributed NoSQL “Data Layer”, Powering Web, mobile, & IoT since 2009 Available as a fully-managed DBaaS, managed by you on-premises or hybrid Transactional JSON “document” database Spreads data across data centers & devices Ideal for apps that require: > Massive, elastic scalability > High availability > Geo-location services > Full-text search > Occasionally connected users Data warehouse and analytics as a service on the cloud • Next Generation In-Memory • Columnar • SIMD Hardware Acceleration • Actionable Compression • Support for OLAP SQL extensions • Connect common 3rd party BI tools dashDB keeps data warehouse infrastructure out of your way, allowing you to take benefits of :
  • 23. # 23 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com !message : Spark is positioned as a fast and general engine for Big Data. It generalizes the MapReduce model and (could?)is poised to replace MapReduce Big data and You NEW AND/OR HOT !!! SPARK Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms. Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos.For distributed storage, Spark can interface with a wide variety, including Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in this scenario, Spark is running on a single machine with one executor per CPU core. Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects
  • 24. # 24 IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com !message : From application point of view, data lake challenge is to be an unique and unified data repositories, queryable like a black box Big data and You NEW AND /OR HOT !!! DATA LAKE ARCHITECTURE IDC in late 2014 stated “By 2017 unified data platform architecture will become the foundation of BDA strategy. The unification will occur across information management, analysis, and search technology.”  A Data reservoir is a data lake that provides data to an organization for a variety of analytics processing including: • Discovery and exploration of data • Simple ad hoc analytics • Complex analysis for business decisions • Reporting • Real-time analytics  It is possible to deploy analytics into the data reservoir to generate additional insight from the data loaded into the data reservoir.  A data reservoir manages shared repositories of information for analytical purposes.  Each Data Reservoir Repository is optimized for a particular type of processing. • Real-time analytics, deep analytics (such as data mining), exploratory analytics, OLAP, reporting, … Example – Creating a logical warehouse Information virtualization hides the complexities of where the data is located. Here different repositories are being used to host different workloads, but this complexity is hidden by the information virtualization layer.