Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises, but this promise also brings challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal: notebooks such as Jupyter and Apache Zeppelin, IDEs such as RStudio, languages like R, Python, and Scala, and frameworks like Apache Spark. Given all these choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will demo IBM Data Science Experience (DSX) with Hortonworks Data Platform (HDP), focusing on integration, security, and model deployment and management.
Speakers:
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Vikram Murali, Program Director, Data Science and Machine Learning, IBM
Machine Learning Workflow in Data Science Experience
Machine learning detects if models fall out of spec, and automatically triggers retraining.
Fully integrated model management means data scientists, app developers & operations can use the same environment.
[Workflow diagram: Automating Data Science Workloads]
• Ingest Data: live system data, historical, streaming
• Data Processing: creating samples & cleansing; data visualization; feature transform and engineering
• Model Training: model selection and evaluation
• Deployment & Management: pipelines, not only models; versioning; predict when given new data; monitoring and live evaluation; scalable deployment
• Feedback Loop: when models lose accuracy, retraining is triggered with new data
• Roles along the workflow: Data Engineers; Data Scientists + Researchers; ML Engineers + Production Engineers
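The feedback loop above is the automation hook: when a deployed model drops out of spec, retraining is triggered with new data. A minimal sketch of that idea in plain Python with scikit-learn; the threshold value and the train_fn hook are illustrative assumptions, not DSX APIs:

```python
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # illustrative out-of-spec boundary


def evaluate_and_maybe_retrain(model, X_recent, y_recent, train_fn):
    """Score the deployed model on recently labeled data and
    trigger retraining when accuracy drops below the threshold."""
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if accuracy < ACCURACY_THRESHOLD:
        # Feedback loop: fold the new data in and retrain.
        model = train_fn(X_recent, y_recent)
    return model, accuracy
```

In a platform setting, the monitoring side would run this on a schedule against live scoring traffic, and the retrained model would go back through versioned deployment rather than being swapped in place.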
Machine Learning Everywhere – An Open Platform
Data Science Experience:
• Add your favorite libraries
• Publish Open APIs for secure ML applications
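To illustrate the "Publish Open APIs for secure ML applications" point, a client application might score against a published model endpoint over HTTPS with a bearer token. This is a hypothetical sketch; the URL, token handling, and payload shape are placeholders, not the platform's actual scoring API:

```python
import requests

# Hypothetical deployment endpoint and token; the real values would
# come from the platform's deployment details and auth service.
SCORING_URL = "https://dsx.example.com/v1/deployments/churn-model/score"
TOKEN = "…"  # placeholder bearer token

payload = {"fields": ["age", "tenure", "plan"],
           "values": [[42, 18, "premium"]]}

resp = requests.post(SCORING_URL, json=payload,
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     timeout=30)
resp.raise_for_status()
print(resp.json())
```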
DSX scale-out in Kubernetes is simple
DSX Spark scale-out happens automatically as compute nodes are added (via Kubernetes DaemonSets).
Remote Spark can be independently scaled out as usual (say, in Hadoop/Yarn).
Individual workloads are isolated and scaled out in pods:
• Each individual DSX user (or entity, in general) is assigned a Kubernetes namespace, making metering simple.
• All containers (pods) for that user are spawned in that namespace, including tools such as Jupyter/Zeppelin (Python) and R/RStudio, as well as other non-Spark jobs.
• The namespace provides a total quota for that user, with resource requests and limits set in each pod deployment (see the sketch below).
"Shared" services are load-balanced (with HA support) across all user access using standard Kubernetes techniques, such as pod replicas and DNS routing via Kubernetes Services.
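A minimal sketch of the namespace-plus-quota pattern using the official Kubernetes Python client; the namespace name and quota figures are illustrative assumptions, not DSX defaults:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

user_ns = "dsx-user-alice"  # illustrative per-user namespace name

# One namespace per DSX user: isolates workloads and simplifies metering.
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=user_ns)))

# A ResourceQuota caps that user's total requests/limits across all pods.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="user-total-quota"),
    spec=client.V1ResourceQuotaSpec(hard={
        "requests.cpu": "4", "requests.memory": "16Gi",  # illustrative
        "limits.cpu": "8", "limits.memory": "32Gi",
        "pods": "20",
    }))
core.create_namespaced_resource_quota(namespace=user_ns, body=quota)
```

With the quota in place, every pod spawned for that user must declare requests and limits, so the scheduler can enforce the per-user total.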
Data Science Experience with HDP – Roadmap

Today: Side-by-Side Installation (DSX & HDP interoperability)
• DSX on the edge, integrated & optimized for HDP deployments
• DSX Jupyter, RStudio & Zeppelin and Machine Learning services enabled for HDP data sources
• Yarn-managed Spark leveraged by DSX via Livy: Spark jobs pushed to the HDP cluster (see the sketch after this roadmap)

Q4 2017: Single Cluster – DSX within the HDP Cluster
• Dedicated nodes for DSX in the HDP cluster, with Ambari-based installation/configuration
• Deploy & scale DSX with Yarn managing DSX as a top-level application
• Knox, Ranger & Atlas integration for authentication, authorization & governance

1st Half 2018: Fully Yarn-Managed DSX Workloads
• HDP embeds Kubernetes in Yarn, enabling launch and integration of Kubernetes pods as Yarn containers
• Yarn manages all workloads in a granular fashion across the entire HDP cluster
• Python & R (non-Spark) workloads also managed by Yarn
• GPU affinity, especially for Deep Learning jobs
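For the "via Livy" item in the Today column, a minimal sketch of pushing a PySpark job to Yarn-managed Spark on the HDP cluster through Livy's REST API (the host is a placeholder; 8998 is Livy's default port):

```python
import time
import requests

LIVY = "http://hdp-edge.example.com:8998"  # placeholder Livy endpoint

# Create an interactive PySpark session; Livy starts it on Yarn.
sess = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
url = f"{LIVY}/sessions/{sess['id']}"

# Wait until the Yarn application is up and the session is idle.
while requests.get(url).json()["state"] != "idle":
    time.sleep(2)

# Push a Spark statement to the HDP cluster and poll for its result.
stmt = requests.post(f"{url}/statements",
                     json={"code": "sc.parallelize(range(100)).sum()"}).json()
result = requests.get(f"{url}/statements/{stmt['id']}").json()
while result["state"] != "available":
    time.sleep(2)
    result = requests.get(f"{url}/statements/{stmt['id']}").json()
print(result["output"])
```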
Goal: Enterprise IaaS for Data Scientists
Efficient compute resource management for large-scale Analytics, Machine Learning, and Deep Learning workloads:
- Enable data scientists to procure resources from a shared compute "grid" for any kind of activity, from interactive notebooks & IDEs to training jobs, scheduled scripts, and apps.
- All compute is manifested as Docker containers/Kubernetes pods.
HDP/Yarn as the Resource Manager:
- Enable all workloads, whether MapReduce or Spark jobs or DSX/ML activities, to be uniformly handled by the HDP/Yarn scheduler.
- Manage queue priorities, workload balancing, and scale-out for the whole cluster, providing the best utilization of all resources (a minimal example follows below).
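As a minimal illustration of the queue-priorities point, a Spark-on-Yarn job can be pinned to a scheduler queue with standard Spark configuration; the queue name and sizing below are assumptions for a hypothetical capacity-scheduler setup:

```python
from pyspark.sql import SparkSession

# Submit to a dedicated capacity-scheduler queue so the Yarn scheduler
# can balance this job against MapReduce and other DSX/ML workloads.
spark = (SparkSession.builder
         .appName("dsx-training-job")
         .master("yarn")
         .config("spark.yarn.queue", "datascience")  # assumed queue name
         .config("spark.executor.instances", "4")    # illustrative sizing
         .config("spark.executor.memory", "8g")
         .getOrCreate())
```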
Yarn and Kubernetes: the best of both worlds!