1
©2017 Talend Inc
Greg Meimers
Steve Biernbaum
Big Data
2
Demo
3
• Open your mobile phone’s browser & navigate to
http://snowflake.talend.live
• Enter the session code only and click Submit; do not continue
Setup
4
• Open your mobile phone’s browser & navigate to
http://devicemotion.xyz
• Enter the session code only and click Submit; do not continue
To participate:
5
• Enter your first name only (no spaces or special characters)
• Don’t click Submit until instructed
Setup
6
Collect, aggregate, categorize
sensor data in real-time…
…from your mobile phone
Today’s Goal
7
• JavaScript reads devicemotion events
• Stream micro-batches to REST service
• REST service sends data to Kafka
• Spark Streaming reads from Kafka
• Apply Machine Learning to classify activity
• Load into Data Warehouse
• Visualization data obtained from REST service
How Are We Collecting?
8
• It lets you publish and subscribe to
streams of records. In this respect it
is similar to a message queue or
enterprise messaging system.
• It lets you store streams of records in
a fault-tolerant way.
• It lets you process streams of records
as they occur.
Distributed Streaming Platform
Kafka Background
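The three capabilities above can be illustrated with a toy in-memory log. This is only a sketch of the idea, not Kafka's API: the class `ToyTopic` and its methods `publish` and `read_from` are hypothetical names invented for this example. It shows records being appended to an ordered log, with each subscriber tracking its own offset so that several readers can consume the same stream independently.

```python
from typing import Any, List, Tuple


class ToyTopic:
    """Hypothetical append-only log; illustrates the idea, not Kafka's real API."""

    def __init__(self) -> None:
        self._records: List[Any] = []

    def publish(self, record: Any) -> int:
        """Append a record and return its offset (position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> Tuple[List[Any], int]:
        """Return all records at or after `offset`, plus the next offset to read."""
        batch = self._records[offset:]
        return batch, offset + len(batch)


topic = ToyTopic()
topic.publish({"aX": 1.4, "aY": 0.9, "aZ": 3.1})
topic.publish({"aX": 0.0, "aY": 0.01, "aZ": -0.01})

# Two independent subscribers, each tracking its own offset.
first_batch, next_offset = topic.read_from(0)
late_batch, _ = topic.read_from(1)
```

Because reading does not remove records, a subscriber that starts late or restarts can replay the log from any offset it remembers. A real deployment adds partitioning, replication and durable storage, none of which this sketch models.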
9
• Fast and general engine for large-scale data processing
• Developed in response to processing limitations with MapReduce
• Up to 10x faster than MapReduce on disk
• Up to 100x faster than MapReduce in memory
• Has a stack of libraries including Spark Streaming & MLlib (machine learning)
• Runs everywhere: on Hadoop or standalone
Spark Background
10
• A university study of gait (walking) characteristics, based on smartphone sensors,
proposed that each individual has a unique walking signature
• A heat trace for three individuals reveals each one's unique signature
Biometric Gait Signature
1 http://www.mdpi.com/2073-8994/8/10/100
2 http://kyrandale.com/viz/d3-smartphone-walking.html
11
A Single Sensor
InvenSense MPU-6500 (Galaxy S6)
• Single chip (3 mm x 3 mm x 0.9 mm) that
integrates a 3-axis accelerometer
and a 3-axis gyroscope
• For comparison
18mm 3mm
12
Linear Acceleration
• Shows forces measured by the accelerometer, excluding
the force caused by gravity
• The x, y and z axes show the direction of the force
• As you hold a phone looking at the screen…
• x is relative to the left and right sides
• y is relative to the up and down sides
• z is relative to the front and back sides
• If the phone is still, the linear acceleration values
should all be close to 0
• If you move the phone, the values show in real time how
much force is applied to it, in the form of
acceleration
What Are We Collecting?
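One way to see why a still phone reads close to 0 is to collapse the three axes into a single magnitude. The sketch below assumes the Euclidean length of the (x, y, z) vector is a reasonable summary; the function name `acceleration_magnitude` is a hypothetical helper, and the two readings are copied from the training data shown later in this deck.

```python
import math


def acceleration_magnitude(ax: float, ay: float, az: float) -> float:
    """Euclidean length of the acceleration vector, in m/s^2."""
    return math.sqrt(ax * ax + ay * ay + az * az)


# Sample readings copied from the training data shown later in this deck.
resting = acceleration_magnitude(0.0, 0.01, -0.01)
running = acceleration_magnitude(-4.1, 8.07, -16.36)
```

The resting reading has a magnitude near zero while the running reading is well above 10 m/s², which suggests why even simple features can separate the activity classes.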
13
• The devicemotion event fires at a regular interval and reports the
acceleration the device is experiencing at that time
• The readings are sent as JSON payloads every 250 events
(~5 seconds):
JavaScript devicemotion Events
"motionData":[
  {
    "client_ip":"127.0.0.1",
    "timestamp":"1723452955",
    "aX":"1.4",
    "aY":"0.9",
    "aZ":"3.1",
    "user_name":"Name"
  },
  ...
]
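The micro-batching described above can be sketched as follows. This is an illustration under assumptions, not the demo's actual code: the helper `to_payloads` is hypothetical, the events are synthetic, and wrapping `motionData` in a top-level JSON object is assumed because the slide shows only the key and its array. Note that 250 events in about 5 seconds implies roughly 50 events per second.

```python
import json
from typing import Dict, List

BATCH_SIZE = 250  # events per payload, per the slide


def to_payloads(events: List[Dict[str, str]], batch_size: int = BATCH_SIZE) -> List[str]:
    """Group events into JSON payloads of at most `batch_size` events each."""
    payloads = []
    for start in range(0, len(events), batch_size):
        chunk = events[start:start + batch_size]
        # Assumption: the array is wrapped in a top-level object.
        payloads.append(json.dumps({"motionData": chunk}))
    return payloads


# 600 synthetic events shaped like the sample above (values are made up).
events = [
    {"client_ip": "127.0.0.1", "timestamp": str(1723452955 + i),
     "aX": "1.4", "aY": "0.9", "aZ": "3.1", "user_name": "Name"}
    for i in range(600)
]
payloads = to_payloads(events)
first = json.loads(payloads[0])
```

The sample payload carries the readings as strings, so a consumer would need to convert `aX`, `aY` and `aZ` with `float(...)` before doing arithmetic on them.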
14
Deduplication & Matching using Machine Learning to Scale to Big Data
Data Quality with Machine Learning
Training set
Single data set
with duplicates
Prediction of
potential
duplicates
Manual labeling: “is this a
duplicate?” yes/no
Run model
(Random Forests)
Train model
SAMPLE
ALL DATA
sampling
Continuous learning: the more data, the better the system learns
15
• Linear acceleration on x, y, z axes (m/s²)
• Data classified into 3 categories
• Resting
• Walking
• Running
• Approximately 450 events
Training Data
aX,aY,aZ,label
-4.1,8.07,-16.36,running
-2.34,9.69,-0.33,running
0.0,0.01,-0.01,resting
-2.38,-0.54,0.65,walking
-0.7,12.93,-4.91,running
-3.3,-0.89,5.27,walking
1.85,-1.37,-0.73,walking
0.01,0.0,0.0,resting
…
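A minimal sketch of loading data in this format, using only the eight rows visible on the slide (the full set of roughly 450 events is not reproduced here). The names `SAMPLE_CSV` and `load_rows` are hypothetical and chosen for illustration.

```python
import csv
import io
from collections import Counter

# The eight rows shown on the slide; the full set has roughly 450 events.
SAMPLE_CSV = """aX,aY,aZ,label
-4.1,8.07,-16.36,running
-2.34,9.69,-0.33,running
0.0,0.01,-0.01,resting
-2.38,-0.54,0.65,walking
-0.7,12.93,-4.91,running
-3.3,-0.89,5.27,walking
1.85,-1.37,-0.73,walking
0.01,0.0,0.0,resting
"""


def load_rows(text: str):
    """Parse CSV text into (features, label) pairs with float features."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        features = (float(rec["aX"]), float(rec["aY"]), float(rec["aZ"]))
        rows.append((features, rec["label"]))
    return rows


rows = load_rows(SAMPLE_CSV)
label_counts = Counter(label for _, label in rows)
```

Checking `label_counts` before training is a quick way to spot a class that is badly under-represented in the labeled data.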
16
• Train (encode) the model using the hand-labeled dataset from the previous slide
• Choose an appropriate algorithm for classification:
• Logistic Regression, Naïve Bayes, Decision Tree, Random Forest
• Validate the algorithm using K-Fold Cross Validation
Encoding and Validating a Model
aX,aY,aZ,label
-4.1,8.07,-16.36,running
-2.34,9.69,-0.33,running
0.0,0.01,-0.01,resting
-2.38,-0.54,0.65,walking
-0.7,12.93,-4.91,running
-3.3,-0.89,5.27,walking
1.85,-1.37,-0.73,walking
0.01,0.0,0.0,resting
…
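K-fold cross validation can be sketched in plain Python as shown here. To stay self-contained, the example swaps in a nearest-centroid classifier as a stand-in; it is not one of the algorithms listed on the slide, and it is not how the demo trains its model. The function names and the round-robin fold assignment are assumptions made for illustration, and only the eight rows visible on the slide are used.

```python
import math
from collections import defaultdict

# The eight rows shown on the slide: (aX, aY, aZ), label.
DATA = [
    ((-4.1, 8.07, -16.36), "running"),
    ((-2.34, 9.69, -0.33), "running"),
    ((0.0, 0.01, -0.01), "resting"),
    ((-2.38, -0.54, 0.65), "walking"),
    ((-0.7, 12.93, -4.91), "running"),
    ((-3.3, -0.89, 5.27), "walking"),
    ((1.85, -1.37, -0.73), "walking"),
    ((0.01, 0.0, 0.0), "resting"),
]


def train_centroids(rows):
    """Average the feature vectors of each label (the stand-in 'model')."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(axis) / len(vectors) for axis in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to `features`."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


def k_fold_accuracy(rows, k):
    """Hold out each of k folds in turn; return the fraction classified correctly."""
    correct = 0
    for fold in range(k):
        # Round-robin assignment: row i belongs to fold i % k.
        test = [row for i, row in enumerate(rows) if i % k == fold]
        train = [row for i, row in enumerate(rows) if i % k != fold]
        centroids = train_centroids(train)
        correct += sum(predict(centroids, f) == label for f, label in test)
    return correct / len(rows)


accuracy = k_fold_accuracy(DATA, k=4)
```

Every row is used for testing exactly once and never appears in the training set of its own fold, which is the property that makes the resulting accuracy an estimate of performance on unseen data. With only eight rows the number itself is not meaningful; the full labeled set would be needed for a useful estimate.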
17
5 Ways to Exploit Your Big Data
Spark
Streaming
Batch &
Real-Time
In Memory
Machine
Learning
1 click code
migration
Analyze before acting
Turn data into
decisions, prescriptions
& actions
Leverage the latest
technology
Remove latency
Exploit data as it arrives
18
SUPPLIERS
CUSTOMERS
CLOUD
SENSORS
PREMISE
19
A Modern Big Data and Cloud Integration Platform
Data Fabric
APPLICATION
INTEGRATION
CLOUD
INTEGRATION
METADATA
MANAGEMENT
DATA
PREPARATION
BIG DATA
INTEGRATION
MASTER DATA
MANAGEMENT
20
Check Authorization
Big Data Architecture
Get Software Updates &
Publish Artifacts
Store Metadata
Store Users, Rights, Roles,
Projects, Activity, Monitoring
Send & Request
Artifacts/Jobs
Job Server can be inside
or outside the cluster
Setup deployment
21
UNIFIED PLATFORM
BATCH STREAMING HADOOP SPARK MAPREDUCE
INGEST PROFILE CLEANSE PARSE COMPLEX DATA
MAPPING
DATA QUALITY METADATA MANAGEMENT DATA LINEAGE
DESIGN DEPLOY MANAGE
ON-PREMISES PUBLIC CLOUD PRIVATE CLOUD
DATA GOVERNANCE
CONTINUOUS DELIVERY
DEPLOYMENT
BIG DATA
INTEGRATION
Big Data
22
Talend Development Environment
• Talend Studio
o Eclipse-based Design Environment
o Drag and Drop UI
o Distributed Teamwork / Collaboration
o Rich palette of connectors: 800+
• N-Tier Architecture
o Client: Talend Studio
o Project Server: Talend Administration Center
o ETL Server: Talend Runtime
• Talend Administration Center
o Define Users and Projects (LDAP Enabled)
o Deploy
o Schedule
o Recover Job execution
o Monitor
23
Create High Quality Information
• Data Quality and Profiling
• Explore, profile and monitor data
• Parse, cleanse, standardize and reconcile data
• Match, enrich and certify data, then share it
widely and securely
• Map any data source to your business context
(customers, products, organizations, locations…)
• Data Masking
• Key Benefits
• More accurate information
• Regulatory compliance
24
Talend Data Preparation
The first unified integration platform for governed, self-service data preparation
• Self-service data access & cleansing
+ Enterprise scale through Talend Data Fabric
+ Collaboration and sharing across teams
+ IT governs data usage with role-based security
+ Turn ad-hoc data prep into fully managed DI
processes
+ Ready for Big Data
LIVE DATA-SET
…and more
25
The First Self-Service Data Quality Tool
Talend Data Stewardship App
Establish accountability and perfect data through teamwork
+ Engage everyone for data quality, not just data
stewards
+ Point & click approach for curation and
certification
+ Orchestrate data stewardship tasks as
campaigns
+ Audit and track data error resolution actions
26
Talend Data Preparation
Data cleaning and transformation for data analysts. Simple and powerful.
27
TIC Architecture: Connecting SaaS & Cloud Platforms
Templates
Integration
Flows
Cloud Engines SaaS App
On-premises apps & databases
Metadata in transit (HTTPS)
Customer data in transit
Firewall Firewall
Cloud Platforms
Multi-tenant
Web
Application
Talend Studio
28
TIC Architecture – Hybrid Integration
Templates
Integration
Flows
SaaS App
On-premises apps & databases
Metadata in transit (HTTPS)
Customer data in transit
Firewall Firewall
Status and Logs (HTTPS)
Remote Engines
Cloud Platforms
Multitenant
Web
Application
Talend Studio
29
Q&A

Big data - Talend presentation to STLHUG
