1) The document discusses big data strategies and technologies, including Oracle's big data solutions. It describes the Oracle Big Data Appliance, an integrated hardware and software platform for running Apache Hadoop.
2) Key technologies that enable deeper analytics on big data are discussed, including advanced analytics, data mining, text mining and Oracle R. Use cases are provided for industries such as insurance, travel and gaming.
3) An example "smart mall" use case is described, in which customer profiles and purchase data are analyzed in real time to deliver personalized offers. The technology pattern for implementing such a use case with Oracle Real-Time Decisions and the Oracle big data platform is outlined.
4. Big Data
React to an Event; Proactively Change Outcomes
"Technology presents the opportunity to transform business"*
Mark Hurd, President, Oracle
* Oracle Profit Magazine, Volume 17, Number 1
5. Big Data’s Key Ingredient
"Improvement merely lets you hit the numbers. Creativity is what transforms."*
Ron Johnson, CEO, JCPenney
* Fortune Magazine, Vol. 165, No. 4
(Survey chart: "Big Data transforms our business" 5%; "Big Data improves our business" 20%; "What is Big Data?" 75%)
6. Big Data Extends the Breadth and Speed of Data
Information architectures today: decisions based on database data (transactions)
Big Data: decisions based on all your data – transactions plus video and images, documents, social data, and machine-generated data
7. Big Data Extends the Depth of Analytics
• Query and Reporting
• Statistics
• Data Mining
• Text Analytics
• Spatial Analytics
• Graph Analytics
8. Big Data Defined
Big Data: techniques and technologies that enable enterprises to effectively and economically analyze all of their data
12. Oracle Big Data Strategy
(Diagram of the Oracle big data stack)
• BI Tools and Data Discovery Tools
• CEP and RTD
• Semantic, Text, Graph, Spatial and Advanced Analytics
• Data Management
• Management Infrastructure
Approach: Build, Acquire, Adopt, Engineer
14. Big Data Appliance
Hardware:
• 288 CPU cores with 1152 GB RAM
• 648 TB of raw disk storage
• 40 Gb/s InfiniBand
Integrated Software:
• Oracle Linux
• Oracle Java VM
• Cloudera Distribution of Apache Hadoop (CDH)
• Cloudera Manager
• Open-source distribution of R
• Oracle NoSQL Database Community Edition
All integrated software (except NoSQL DB CE) is supported as part of Premier Support for Systems and Premier Support for Operating Systems.
15. Oracle Big Data Appliance
CDH software stack on the appliance:
• File System Mount: FUSE-DFS
• UI Framework / SDK: HUE, HUE SDK
• Workflow / Scheduling: APACHE OOZIE
• Metadata: APACHE HIVE
• Languages / Compilers: APACHE PIG, APACHE HIVE, APACHE MAHOUT
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Storage and Processing: HDFS, MAPREDUCE
• Coordination: APACHE ZOOKEEPER
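To make the HDFS layer in the stack above concrete, here is a minimal sketch that writes and reads a file in HDFS using the standard Hadoop FileSystem Java API. The NameNode URI, path, and file contents are placeholder values, not details from the slide; on a real appliance the connection settings would come from the cluster's configuration files.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI; normally picked up from core-site.xml on the cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://bda-node01:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt");

        // Write a small file into HDFS.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello from the appliance\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```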
16. Why Cloudera?
• Includes Open Source Apache Hadoop
– Fast evolution in critical features
– Proven at very large scale
• Managed Distribution
– Components certified to work together in regular updates
– Cloudera Manager provides Management GUI
• Most popular distribution in the market
17. Oracle and Cloudera
• All Cloudera software pre-installed and pre-configured on BDA
– Engineered with Cloudera
• All Cloudera assets included
– Single Oracle Product SKU for HW & SW
– Single Oracle Support SKU for HW & SW (life of the machine)
• Oracle is the single point of contact for the solution
18. Price comparison
Oracle Big Data Appliance
                          Year 1      Year 2     Year 3     Total
  BDA cost                $450,000
  Support cost            $54,000     $54,000    $54,000
  Installation            $14,150
  Total                   $518,150    $54,000    $54,000    $626,150

"Build-Your-Own" – HP hardware and Cloudera
                          Year 1      Year 2     Year 3     Total
  Servers and switches    $428,220
  Support cost            $136,233    $72,000    $72,000
  On-site installation & configuration not included
  Total                   $564,453    $72,000    $72,000    $708,453
Full details at https://blogs.oracle.com/datawarehousing/entry/price_comparison_for_big_data
19. Oracle NoSQL Database
A distributed, scalable key-value database
• Simple data model
  – Key-value pairs with a major+sub-key paradigm
  – Read/insert/update/delete operations
• Scalability
  – Dynamic data partitioning and distribution
  – Optimized data access via an intelligent driver
• High availability
  – One or more replicas
  – Disaster recovery through placement of replicas
  – Resilient to partition master failures
  – No single point of failure
• Transparent load balancing
  – Reads from master or replicas
  – Driver is network topology and latency aware
(Diagram: applications with NoSQL DB drivers accessing storage nodes in Data Center A and Data Center B)
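A minimal sketch of the key-value model described above, using the Oracle NoSQL Database Java driver (the oracle.kv package). The store name, helper host:port, and key components ("customer", "1234", "profile") are placeholder values chosen for illustration; the actual configuration depends on the deployment.

```java
import java.util.Arrays;

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class KvExample {
    public static void main(String[] args) {
        // Placeholder store name and helper host supplied by the NoSQL DB deployment.
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "bda-node01:5000"));

        // Major path identifies the record ("customer", id); minor path names a sub-key.
        Key key = Key.createKey(Arrays.asList("customer", "1234"),
                                Arrays.asList("profile"));

        // Insert or update the value for this key.
        store.put(key, Value.createValue("gold-tier".getBytes()));

        // Read it back; the driver may serve this from the master or a replica.
        ValueVersion vv = store.get(key);
        System.out.println(new String(vv.getValue().getValue()));

        store.close();
    }
}
```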
20. Big Data Connectors
Optimized integration of Hadoop with Oracle Database and Oracle Exadata
• Oracle Loader for Hadoop
• Oracle Direct Connector for Hadoop Distributed File System (HDFS)
• Oracle Data Integrator Application Adapter for Hadoop
• Oracle R Connector for Hadoop
• Does not require Big Data Appliance – can be licensed for Hadoop running on non-Oracle hardware
21. Oracle Loader for Hadoop
Use the cluster: Oracle Loader for Hadoop runs as the last stage in the MapReduce workflow
(Diagram: map tasks feeding shuffle/sort and reduce tasks, with the loader running in the reducers)
• Partitioned and non-partitioned tables
• Online and offline loads
22. Oracle Direct Connector for HDFS
Direct access from Oracle Database
• SQL access to HDFS
• External table view over HDFS files
• Data query or import
(Diagram: SQL queries against an external table in Oracle Database reaching HDFS over InfiniBand through the Direct Connector)
23. Oracle Data Integrator
Simplifying MapReduce
• Automatically generates MapReduce code
• Manages the process
• Loads into the data warehouse via Oracle Loader for Hadoop
24. What is Data Discovery?
Quickly explore all relevant data
• Simplified: no pre-defined model required; relationships may be undefined or unknown; rapid, iterative change
• Capabilities: advanced search, faceted navigation, analytics
• Data: structured, semi-structured, unstructured and messy data – beyond the data warehouse
25. Business Intelligence and Data Discovery
Complementary solutions, integrated business processes
• Business Intelligence: known and clearly defined questions (who, what, when?); modeled data that conforms to a single model; proven answers to known questions
• Data Discovery: uncertain or open-ended questions (why, how, what else?); un-modeled data with diverse and changing models and KPIs; fast answers to new questions
• The two reinforce each other: insights from discovery yield mature models, and new questions require new data and exploration
26. Oracle Endeca Information Discovery
A platform for data discovery applications across the enterprise
Endeca Information Discovery (EID) helps organizations quickly explore all relevant data
• Combine structured and unstructured data from disparate systems
• Rapidly assemble easy-to-use analysis applications
• Automatically organize information for search, discovery and analysis
Once your data is in Hadoop, you may want either to access it from Oracle Database by issuing SQL against HDFS files or to move it into Oracle tables. Let's start with the latter: moving the data into Oracle tables. Oracle Loader for Hadoop (OLH) is a high-performance loader for fast movement of data from any Hadoop cluster into Oracle Database tables. Like all other parts of the Big Data Connectors, it is available for any cluster based on Apache Hadoop, in addition to the Big Data Appliance. If you want to take the results and perform additional analysis using advanced BI and data warehousing technologies, or incorporate them into other applications, OLH is both fast and reduces the processing load on the database server. It runs as a MapReduce job and uses the Hadoop cluster's processing resources to sample, sort and pre-partition the data based on the target database metadata. It can take input automatically from delimited text files (CSV) or Hive tables, or you can write your own input format. OLH can either load the results directly into the database using the parallel direct path load interface or JDBC, or create Oracle-formatted Data Pump files. OLH has load balancing across the reducer nodes built in, which prevents performance from degrading due to unbalanced loads.
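The notes above describe OLH as a job that runs inside the MapReduce workflow, with the load happening in the reducers. The sketch below is not OLH itself; it is a generic Hadoop MapReduce job (word count) included only to illustrate the map, shuffle/sort, and reduce stages that such a loader plugs into as its last stage. All class names and paths are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every token in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: after shuffle/sort, sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```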
Oracle Direct Connector for HDFS makes it possible to access data on the Hadoop cluster in HDFS from Oracle Database using SQL. It provides a virtual table view of the HDFS files and allows parallel query access to the data through the standard Oracle Database external table mechanism. If you are using BDA and Exadata, the connectivity occurs over the InfiniBand network fabric, so database access to HDFS, in the very scientific words of the development manager, "flies". If you need to import the data in HDFS into Oracle, the Direct Connector does not require a file copy and does not use Linux FUSE; instead it uses the native Oracle loader interface.
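A minimal sketch of the access pattern the notes describe: once an external table has been defined over HDFS files by the Direct Connector, an application queries it with ordinary SQL. The JDBC URL, credentials, and the table name SALES_HDFS_EXT are placeholders; the external table definition itself is created by DBA tooling and is not shown here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryHdfsExternalTable {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL and credentials for the Oracle Database instance.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/orcl", "demo", "demo");
             Statement stmt = conn.createStatement();
             // SALES_HDFS_EXT is a hypothetical external table backed by HDFS files.
             ResultSet rs = stmt.executeQuery(
                 "SELECT product_id, SUM(amount) FROM sales_hdfs_ext GROUP BY product_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getBigDecimal(2));
            }
        }
    }
}
```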
If you already use Oracle Data Integrator (or are familiar with this kind of tool and want to use ODI), it can simplify the MapReduce process. As long as you can describe the transformation that you need to perform on the data, ODI can generate the MapReduce code for you and run that process. It can even invoke Oracle Loader for Hadoop at the end of the cycle. So even if you are not an expert in Java, parallel algorithms and the Hadoop framework, there is still a way to use it all and keep your code organized. Note: ODI generates SQL code which is then passed into Hive (a component of many Hadoop distributions), which in turn generates the actual Java MapReduce code. You need the Big Data Connectors, specifically the ODI Application Adapter for Hadoop, to make all this work.
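The note says ODI works by handing SQL to Hive, which compiles it into MapReduce jobs. The sketch below shows that underlying mechanism directly, using the Hive JDBC driver rather than ODI; the driver class and URL vary with the Hive version, and the HiveServer host and the weblogs table are placeholder names.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; placeholder class name and URL, version dependent.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://bda-node01:10000/default", "demo", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL into one or more MapReduce jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}
```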
Our view of the BI landscape is that there are fundamentally two dominant types of problems. On one hand there are questions where we can define up front both the process and the data required to answer them. What are sales forecasts by region? What is my performance relative to expectation? On the other hand are questions where either the process or the data cannot be defined ahead of time; these questions are open-ended by nature. What customers should I target? Why are my sales going down? It is also worth pointing out that these questions are far more transient than the other type, which follows from their open-ended nature: each question leads to new questions. The interaction model for the former is more like "looking it up"; it is a report or dashboard. When you don't know exactly what you need or how to ask for it, the necessary interaction model is exploration and discovery, a dialog with the data.

It also follows that, as a matter of practice, some data is modeled and other data is not. We take "modeled" to mean that there is a single, overarching semantic model. Of course, modeling costs time and money, so we generally only make the investment where the expected return is large enough to justify the effort. The cost of storing un-modeled data has continued to drop and, importantly, with the popularization of Hadoop the promise of deriving value from un-modeled data is rising rapidly. The result is an explosion in the capture of un-modeled data. Through this view of the BI landscape we can see how traditional Business Intelligence and Data Discovery fit in.

Traditional Business Intelligence is purpose-built and very strong for known questions and modeled data. Friction arises when organizations attempt to use these products for new and unpredictable questions, which require similarly new and unpredictable data models.

In the other space is the emerging market category of data discovery, where the goal is to provide everyday business users with fast answers to new questions so they can make better, more informed business decisions. Data discovery tools follow several key market trends. First, the growth in data volume, diversity, and complexity: organizations today are beginning to understand the value inherent in this information, are looking for tools that can unlock that value for competitive advantage, and have more and more users who need to access and understand it. Second, the consumerization of business software: when IT is unable to deliver, business users are increasingly willing to go outside of IT to meet their own needs, and with their choice of tools and expectations formed in the consumer world, their expectations for the user experience have never been higher.
How do we do it? Endeca Information Discovery provides a full-featured platform for creating discovery applications that provide access to all kinds of information. Drilling into the architecture, we accomplish this with three tiers.
Notes: This slide is a logical representation of the scope of a Big Data solution. It provides the basis for describing data flows in each stage of the Big Data process in the following slides. The scope of a Big Data solution includes taking actions and decisions on the results of analysis, hence the integration with applications. Real-time event detection can be part of a Big Data solution. This is an important point to draw out because IBM claims its Streams capability is a USP; see the book Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data.