PowerBI & R
Łukasz Grala
Architect Data Platform & Advanced Analytics & BI Solutions
Data Platform MVP
Uniwersytet Ekonomiczny w Poznaniu
2016-03-17
@Łukasz Grala – lukasz@tidk.pl
• Data Platform, Business Intelligence, and Advanced Analytics solutions architect at TIDK
• Microsoft Certified Trainer and university lecturer
• Author of advanced training courses and workshops, and of numerous publications and webcasts
• Recipient of the Microsoft Data Platform MVP award since 2010
• PhD student at Politechnika Poznańska, Faculty of Computing (databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and abroad
• Holds numerous certifications (MCT, MCSE, MCSA, MCITP, …)
• Member of Polskie Towarzystwo Informatyczne (Polish Information Processing Society)
• Member and leader of the Polish SQL Server User Group (PLSSUG)
• Passionate about data analysis, storage, and processing; jazz lover
Type of Analytics
Gartner MQ (Magic Quadrant)
• BI and Analytics Platform
• Advanced Analytics
• Data Warehouse
New BI Solutions
[Diagram: two data pipelines]
• ETL: Original Data → Extract → Transform → Load → Transformed Data, using an ETL Tool (SSIS, etc.) to fill an EDW (SQL Server, Teradata, etc.) that feeds Data Marts and BI Tools
• Data lake: Original Data and Streaming data → Ingest (EL) → Scale-out Storage & Compute (HDFS, Blob Storage, etc.) in Data Lake(s) → Transform & Load → Dashboards and Apps
Microsoft R Products: Microsoft R Open
• Free and open source R distribution
• Enhanced and distributed by Revolution Analytics
Microsoft R Products: SQL Server R Services
• Built-in advanced analytics and stand-alone server capability
• Leverages the benefits of SQL Server 2016 Enterprise Edition
Microsoft R Server
• Microsoft R Server for Red Hat Linux
• Microsoft R Server for SUSE Linux
• Microsoft R Server for Teradata DB
• Microsoft R Server for Hadoop on Red Hat
Introducing SQL Server 2016 R Services
Enterprise speed and performance
Near-DB analytics
Parallel threading and processing
Model on-premises, store in cloud, or vice versa
Hybrid memory and disk scalability
Not bound by memory limits, enabling larger datasets
Included in SQL Server 2016
Reuse and optimize existing R code
Eliminate data movement across machines
Write once, deploy anywhere
Scalable in-database analytics
Data Scientist: interacts directly with data; creates models and experiments
Data Analyst/DBA: manages data and analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
[Diagram labels: Relational Data; Extensibility; R Integration; Analytic Library (Open Source R, Revolution PEMA); T-SQL Interface]
How is it Integrated?
• T-SQL calls a stored procedure
• Script is run in SQL through the extensibility model
• Result sets are sent through Web API to database or applications
Benefits
• Faster deployment of ML models
• Less data movement, faster insights
• Work with large datasets: mitigate R memory and scalability limitations
PowerBI
PowerBI Desktop
PowerBI
PowerBI Pro
Dashboard
Custom Visualization Gallery https://app.powerbi.com/visuals
Publish to Web
Anonymous Embedding
Storytelling with Sway
Anonymous Embedding
Multi-threaded performance
• Multithreaded library replaces standard BLAS/LAPACK algorithms
• Intel MKL on Windows/Linux; Accelerate on Mac
• High-performance algorithms
• Sequential → Parallel
• Uses as many threads as there are available cores
• No need to change any R code
• Included with RRO (Revolution R Open) binary distributions
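The move from sequential to parallel computation listed above can be illustrated without any BLAS library. The sketch below, written in Python rather than R, splits a matrix product into independent row blocks and hands each block to a worker thread. The names `multiply_rows` and `parallel_matmul` are made up for this illustration, and this is not how Intel MKL works internally: MKL relies on optimized native kernels, whereas pure-Python threads share one interpreter lock and will not show a real speedup. The sketch only shows why the work divides cleanly across cores.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def multiply_rows(rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in rows]


def parallel_matmul(a, b, workers=None):
    """Compute A x B by splitting A into row blocks, one block per task."""
    workers = workers or os.cpu_count() or 1
    block = max(1, -(-len(a) // workers))  # ceiling division
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block depends only on its own rows and on B, so the
        # tasks are independent and their results keep the row order.
        parts = pool.map(lambda rows: multiply_rows(rows, b), blocks)
    return [row for part in parts for row in part]
```

The caller's code does not change with the number of workers, which mirrors the slide's point that no R code needs to change when the math library is swapped.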
ScaleR - Performance comparison
Microsoft R Server is not limited by the size of available RAM. Open source R fails when a data set exceeds RAM; Microsoft R Server, in contrast, scales linearly well beyond RAM limits, and its parallel algorithms are much faster.
• US flight data for 20 years
• Linear regression on arrival delay
• Run on a 4-core laptop with 16 GB RAM and a 500 GB SSD
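One way an algorithm can avoid the RAM limit described above is to read the data in chunks and keep only running totals. The following Python sketch fits a one-variable linear regression that way. It is an illustration under assumptions, not ScaleR's actual implementation: the helper names `read_in_chunks` and `chunked_linear_fit` are hypothetical, and a Python list stands in for rows read from disk.

```python
def read_in_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows (stand-in for disk reads)."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def chunked_linear_fit(chunks):
    """Fit y = intercept + slope * x from an iterable of (x, y) chunks.

    Only five running totals are kept, so memory use does not grow
    with the number of rows.
    """
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope
```

Because each chunk contributes only to five sums, chunks could also be processed on separate cores or machines and the partial sums added afterwards.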
Microsoft R Server
Components: DistributedR, ScaleR, ConnectR, DevelopR
DistributedR: model development and model compute choice: “Write Once. Deploy Anywhere.”
Code Portability Across Platforms
• In the Cloud: Azure Marketplace
• Workstations & Servers: Linux, Windows
• EDW: Teradata
• Hadoop: Hortonworks, Cloudera, MapR
Roadmap
• + HDInsight
• + Hadoop Spark
• + R Tools for Visual Studio
• + Azure ML
• + SQL Server v16
CRAN, MRO, MRS Comparison
Columns: CRAN R | Microsoft R Open | Microsoft R Server
• Data size: In-memory | In-memory | In-memory or disk-based
• Speed of analysis: Single-threaded | Multi-threaded | Multi-threaded, parallel processing on 1:N servers
• Support: Community | Community | Community + commercial
• Analytic breadth & depth: 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
• Licence: Open source | Open source | Commercial license; supported release with indemnity
CRAN R compared to Microsoft R Open
• More efficient, multi-threaded math computation
• Benefits math-intensive processing
• No benefit to program logic and data transformation
• Matrix calculation – up to 27x faster
• Matrix functions – up to 16x faster
• Programming – no speedup
ScaleR Functions & Algorithms
Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
• PEMA-R API
• rxDataStep
• rxExec
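Median and quantiles are marked "(approx.)" in the list above. A plausible reason, stated here as an assumption rather than a description of ScaleR internals, is that exact quantiles need a full sort, which is expensive for data that does not fit in memory. The Python sketch below shows one common alternative: count values into fixed-width bins chunk by chunk, then read the quantile off the cumulative counts. The function name `approx_quantile` and its parameters are hypothetical.

```python
def approx_quantile(chunks, q, lo, hi, bins=1000):
    """Approximate the q-th quantile (0 <= q <= 1) of values in [lo, hi].

    Values are counted into equal-width bins chunk by chunk, so the
    full data set is never sorted or held in memory. The result is the
    midpoint of the bin where the cumulative count first reaches q * n,
    so the error is at most about one bin width.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    n = 0
    for chunk in chunks:
        for v in chunk:
            idx = min(int((v - lo) / width), bins - 1)  # clamp v == hi
            counts[idx] += 1
            n += 1
    if n == 0:
        raise ValueError("no data")
    target = q * n
    running = 0
    for idx, c in enumerate(counts):
        running += c
        if c > 0 and running >= target:
            return lo + (idx + 0.5) * width
    return hi
```

The caller must know a range `[lo, hi]` that covers the data; in practice that could come from a cheap first pass computing the minimum and maximum. More bins give a tighter answer at the cost of a larger count array.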
Microsoft and R
• Microsoft R Open / R Server
• SQL Server 2016
• Reporting Services & Mobile Reports
• PowerBI
• Azure Data Lake Storage & Analytics
• Azure Machine Learning
• Azure HDInsight
Questions?
• May 16-18, 2016
• Wrocław Centrum Konferencyjne
• 3 days, 6 workshops, 4 tracks, over 30 speakers, 50 sessions
• 600 attendees + sponsors + speakers + organizers
• Guests from the USA, England, Germany, Ukraine, Bulgaria, and Slovenia, among others
• Technical launch of SQL Server 2016
sqlday.pl @sqlday
Including the Big Data Analytics workshop by Łukasz Grala & Marcin Szeliga

20160317 - PAZUR - PowerBI & R


Editor's Notes

  1. Slide objective: Introduce the three value proposition pillars of SQL Server 2016 R Services.
     Talking points: SQL Server 2016 R Services combines the fast querying and In-Memory OLTP optimization of SQL Server 2016 with the data exploration, predictive modeling, scoring, and visualization of the R Services family of products.
     [CLICK] It delivers enterprise speed and performance for advanced analytics through near-database analytics and parallel threading and processing.
     [CLICK] It also delivers scalability and choice on a stable, commercial platform for advanced analytics: it runs on-premises, in the cloud, or in hybrid setups, and it is not limited by memory when working with large datasets.
     [CLICK] Finally, there is no additional cost because the offering is included in SQL Server 2016. In addition, the ability to reuse existing R code and eliminate data movement across machines provides significant value.