Data Science in the Enterprise

© Cloudera, Inc. All rights reserved.
Amr Awadallah (@awadallah)
Founder, Chief Technical Officer, Cloudera

Typical Data Science Workflow
[Diagram] Phases: Data Engineering; Data Science (Exploratory); Production (Operational)
[Diagram] Stage labels: Data Wrangling; Visualization and Analysis; Model Training & Testing; Production Data Pipelines; Batch Scoring; Online Scoring; Serving; Processing; Acquisition
[Diagram] Governance: Data Governance
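The workflow slide names Model Training & Testing, Batch Scoring, and Online Scoring as separate stages. The following is a minimal sketch of how those stages differ, assuming a toy one-variable linear model and made-up data; the function names (train, score_one, score_batch, mean_abs_error) are hypothetical and are not part of any Cloudera product.

```python
from statistics import mean


def train(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def score_one(model, x):
    """Online scoring: one record in, one prediction out."""
    slope, intercept = model
    return slope * x + intercept


def score_batch(model, xs):
    """Batch scoring: apply the same model to a whole dataset in one pass."""
    return [score_one(model, x) for x in xs]


def mean_abs_error(model, rows):
    """Testing: average absolute error on held-out rows."""
    return mean(abs(score_one(model, x) - y) for x, y in rows)


# Hypothetical toy data that follows y = 2x + 1 exactly.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
train_rows, test_rows = data[:7], data[7:]

model = train(train_rows)                    # Model Training
test_error = mean_abs_error(model, test_rows)  # Model Testing
batch_scores = score_batch(model, [0.0, 1.0])  # Batch Scoring
online_score = score_one(model, 4.0)           # Online Scoring
```

The same trained model serves both scoring paths; in practice the batch path would typically run inside a scheduled data pipeline and the online path behind a serving endpoint.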

Types of Data Science
Exploratory (discover and quantify opportunities)
• Team: Data scientists and analysts
• Goal: Understand data, develop and improve models, share insights
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
• End State: Reports, dashboards, PDF, MS Office
Operational (deploy production systems)
• Team: Data engineers, developers, SREs
• Goal: Build and maintain applications, improve model performance, manage models in production
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
• End State: Online/production applications

Common Limitations
Access
Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.
Scale
Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.
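To illustrate what being relegated to sample data can look like in practice, here is a minimal sketch of reservoir sampling (Algorithm R), one common way to draw a fixed-size uniform sample from a dataset too large to hold in a notebook's memory. The slides do not name a sampling method; this technique and the function name reservoir_sample are assumptions chosen for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Only k items are held in memory at a time, so the input can be far
    larger than what the local environment could store.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical usage: sample 5 records from a stream of one million.
picked = reservoir_sample(range(1_000_000), 5, seed=0)
```

A sample like this is enough to explore the data, but a model built on it still has to be validated at full scale, which is the gap the slide describes.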
Developer Experience
Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Once a model is built, there is no easy path from model development to production.

Management of Dependencies
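A minimal sketch of why dependencies become hard to manage on a shared environment: when two projects pin different versions of the same library, only one version can be installed at a time unless each project is isolated. The project names, package pins, and helper functions below are hypothetical and only illustrate the problem.

```python
from collections import defaultdict


def parse_pins(requirement_lines):
    """Extract exact 'name==version' pins from requirements-style lines."""
    pins = {}
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # ignore unpinned or blank lines in this sketch
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(projects):
    """Return packages that different projects pin to different versions."""
    versions = defaultdict(dict)
    for project, lines in projects.items():
        for name, version in parse_pins(lines).items():
            versions[name][project] = version
    return {
        name: by_project
        for name, by_project in versions.items()
        if len(set(by_project.values())) > 1
    }


# Hypothetical projects sharing one environment.
projects = {
    "churn-model": ["numpy==1.11.0", "pandas==0.19.2"],
    "fraud-model": ["numpy==1.12.1", "pandas==0.19.2  # same pin"],
}
conflicts = find_conflicts(projects)
```

Here numpy is reported as a conflict while pandas is not, which is the kind of situation per-project dependency isolation is meant to resolve.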

Open Data Science in the Enterprise
IT: drive adoption while maintaining compliance
Data Scientist: explore, experiment, iterate

https://medium.com/@KevinSchmidtBiz/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc

Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing

How does CDSW help?
Visualize results
Change and Compile Source code
Retrain and redeploy
Extensible Engines
Configurable Sessions
Trivial to tweak parameters
Multiple Users
Roles/Governance
CDH

The Importance of an Open Ecosystem
Open Ecosystem vs. Black Box

Demo

Key Benefits
How is Cloudera Data Science different?
Works with fully secured clusters
One tool for multiple standard languages (Python, R, Scala)
Multi-tenant Architecture
Common Platform

A conference for and by practicing data scientists!
Save the Date: July 20th at the Chapel, San Francisco
Wrangle is a one-day, single-track community event that hosts the best and
brightest in the Bay Area talking about the principles, practice, and
application of data science across multiple data-rich industries. Join
Cloudera, Facebook, Netflix and more to discuss future trends, how they
can be predicted, and most importantly, how they can be anticipated.
wrangleconf.com
#wrangleconf | Powered by Cloudera

Thank You
Amr Awadallah (@awadallah)
