Building data pipelines for modern data warehouse with Apache® Spark™ and .NET in Azure (Microsoft Build 2019)
Traditional ETL from data sources into the warehouse was designed for the questions you know, and it is under pressure from:
1. Increasing data volumes
2. Non-relational data: new data sources and types
3. Cloud-born data
The data lake approach
• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis: schematize for the scenario, run scale-out analysis
Workloads served from the lake: interactive queries, batch queries, machine learning, data warehouse, real-time analytics (e.g., from devices).
Modern data warehousing
The modern data warehouse extends the scope of the data warehouse to serve big data that's prepared with techniques beyond relational ETL.
"We want to integrate all our data, including big data, with our data warehouse."

Advanced analytics
Advanced analytics is the process of applying machine learning and deep learning techniques to data for the purpose of creating predictive and prescriptive insights.
"We're trying to predict when our customers churn."

Real-time analytics
"We're trying to get insights from our devices in real-time."
INGEST | STORE | PREP & TRAIN | MODEL & SERVE
• Sources: logs, files and media (unstructured); business/custom apps (structured)
• Ingest, orchestration & data flow ETL: Azure Data Factory
• Store: Azure Data Lake Storage
• Prep & train: Azure Databricks (Python, Scala, Spark SQL, .NET for Apache Spark)
• Model & serve: Azure SQL Data Warehouse (loaded via PolyBase), Azure Analysis Services, Power BI
Azure also supports other big data services, such as Azure HDInsight and Azure Data Lake, so customers can tailor the above architecture to their unique needs.
• Apache Spark is a fast OSS analytics engine for big data and machine learning
• It improves efficiency through:
  • General computation graphs beyond map/reduce
  • In-memory computing primitives
• It lets developers scale out their user code and write in their language of choice:
  • Rich APIs in Java, Scala, Python, R, Spark SQL, etc.
  • Batch processing, streaming and an interactive shell
• Available on Azure via:
  • Azure Databricks
  • Azure HDInsight
  • IaaS/Kubernetes
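To ground this in code, here is a minimal word-count sketch written with the C# bindings this talk introduces (.NET for Apache Spark); the class name and the input.txt path are illustrative, not from the slides:

using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class WordCount
{
    static void Main(string[] args)
    {
        // Build (or reuse) the session for this Spark application.
        SparkSession spark = SparkSession
            .Builder()
            .AppName("word_count_sample")
            .GetOrCreate();

        // "input.txt" is a placeholder; each row holds one line of text in "value".
        DataFrame lines = spark.Read().Text("input.txt");

        // Split lines into words, then count and rank each distinct word.
        lines.Select(Explode(Split(lines["value"], " ")).Alias("word"))
             .GroupBy("word")
             .Count()
             .OrderBy(Col("count").Desc())
             .Show();

        spark.Stop();
    }
}

The same DataFrame code scales from a local run to any of the Azure deployment options above.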
• A lot of business logic usable for big data (millions of lines of code) is written in .NET!
• That logic is expensive and difficult to translate into Python/Scala/Java.
• .NET developers are locked out of big data processing due to the lack of .NET support in OSS big data solutions.
• In a recently conducted .NET developer survey (over 1,000 developers), more than 70% expressed interest in Apache Spark!
• They would like to tap into the OSS ecosystem for code libraries, support and hiring.
Goal: .NET for Apache Spark aims to give .NET developers a first-class experience when working with Apache Spark.
Non-goal: converting existing Scala/Python/Java Spark developers.
Microsoft is committed to:
• An interop layer for .NET (Scala-side)
• Potentially optimizing the Python and R interop layers
• Technical documentation, blogs and articles
• End-to-end scenarios
• Performance benchmarking (cluster)
• Production workloads
• Out-of-box support on Azure HDInsight, easy use with Azure Databricks
• C# (and F#) language extensions using .NET
• Performance benchmarking (interop)
• Portability aspects (e.g., cross-platform .NET Standard)
• Tooling (e.g., Jupyter, Visual Studio, Visual Studio Code)
Contributions to foundational OSS projects:
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737, ARROW-4543, ARROW-4435
• Pyrolite (pickling library): improve pickling/unpickling performance, add a strong name to Pyrolite

.NET for Apache Spark was open sourced at Spark+AI Summit 2019:
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark

Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
• Spark DataFrames with Spark SQL: works with Spark v2.3.x/v2.4.[0/1] and includes ~300 Spark SQL functions
• .NET Spark UDFs
• Batch & streaming: includes Spark Structured Streaming and all Spark-supported data sources (a sketch follows below)
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1+ and includes C#/F# support
• Machine learning: including access to ML.NET
• Speed & productivity: performance-optimized interop, as fast or faster than PySpark
Examples: https://github.com/dotnet/spark/examples
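As a sketch of the batch & streaming claim above: a .NET UDF can be applied to a Structured Streaming query just like a batch one. The socket source on localhost:9999 (start one with `nc -lk 9999`) and the UDF name `shout` are illustrative assumptions, not from the slides:

using System;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;
using static Microsoft.Spark.Sql.Functions;

class StreamingUdfSample
{
    static void Main()
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("streaming_udf_sample")
            .GetOrCreate();

        // Unbounded DataFrame: one row per line arriving on the socket.
        DataFrame lines = spark
            .ReadStream()
            .Format("socket")
            .Option("host", "localhost")
            .Option("port", "9999")
            .Load();

        // The same Udf<...> API used for batch queries works on streams.
        Func<Column, Column> shout = Udf<string, string>(s => s.ToUpper());

        // Print each transformed micro-batch to the console until stopped.
        StreamingQuery query = lines
            .Select(shout(lines["value"]).Alias("shouted"))
            .WriteStream()
            .OutputMode("append")
            .Format("console")
            .Start();

        query.AwaitTermination();
    }
}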
var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Read().Json("input.json");

// UDF that concatenates a person's name and age.
var concat = Udf<int?, string, string>((age, name) => name + age);

df.Filter(df["age"] > 21)
  .Select(concat(df["age"], df["name"]))
  .Show();
Scala:
val europe = region.filter($"r_name" === "EUROPE")
.join(nation, $"r_regionkey" === nation("n_regionkey"))
.join(supplier, $"n_nationkey" === supplier("s_nationkey"))
.join(partsupp,
supplier("s_suppkey") === partsupp("ps_suppkey"))
val brass = part.filter(part("p_size") === 15
&& part("p_type").endsWith("BRASS"))
.join(europe, europe("ps_partkey") === $"p_partkey")
val minCost = brass.groupBy(brass("ps_partkey"))
.agg(min("ps_supplycost").as("min"))
brass.join(minCost, brass("ps_partkey") ===
minCost("ps_partkey"))
.filter(brass("ps_supplycost") === minCost("min"))
.select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.sort($"s_acctbal".desc,
$"n_name", $"s_name", $"p_partkey")
.limit(100)
.show()
C#:
var europe = region.Filter(Col("r_name") == "EUROPE")
.Join(nation, Col("r_regionkey") == nation["n_regionkey"])
.Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
.Join(partsupp,
supplier["s_suppkey"] == partsupp["ps_suppkey"]);
var brass = part.Filter(part["p_size"] == 15
& part["p_type"].EndsWith("BRASS"))
.Join(europe, europe["ps_partkey"] == Col("p_partkey"));
var minCost = brass.GroupBy(brass["ps_partkey"])
.Agg(Min("ps_supplycost").As("min"));
brass.Join(minCost, brass["ps_partkey"] ==
minCost["ps_partkey"])
.Filter(brass["ps_supplycost"] == minCost["min"])
.Select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.Sort(Col("s_acctbal").Desc(),
Col("n_name"), Col("s_name"), Col("p_partkey"))
.Limit(100)
.Show();
Similar syntax: dangerously copy/paste friendly! Differences between Scala and C#:
• Column references: $"col_name" (Scala) vs. Col("col_name") (C#)
• Capitalization: filter/select (Scala) vs. Filter/Select (C#)
• Equality operators: === (Scala) vs. == (C#)
Execution flow: a .NET program issues DataFrame/Spark SQL operations through .NET for Apache Spark, building a Spark operation tree.
• Did you define a .NET UDF? No: regular execution path (no .NET runtime during execution).
• Yes: interop between Spark and .NET during execution.
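A small sketch of the two branches (the people.json file and the greet UDF are hypothetical, not from the slides):

using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class UdfExecutionPaths
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("udf_paths").GetOrCreate();
        DataFrame df = spark.Read().Json("people.json"); // hypothetical input

        // "No" branch: plain DataFrame operations compile to Spark SQL,
        // so executors never start a .NET runtime.
        df.Filter(df["age"] > 21).Show();

        // "Yes" branch: the lambda below runs in a .NET worker process, so
        // rows cross the Spark <-> .NET interop boundary during execution.
        Func<Column, Column> greet = Udf<string, string>(name => "hello " + name);
        df.Select(greet(df["name"]).Alias("greeting")).Show();

        spark.Stop();
    }
}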
Performance: warm-cluster runs using Pickling serialization (Arrow will be tested in the future).
Takeaway 1: where UDF performance does not matter, .NET is on par with Python.
Takeaway 2: where UDF performance is critical, .NET is ~2x faster than Python!
Cross-platform: Windows, Ubuntu, macOS
Cross-cloud: Azure & AWS Databricks, AWS EMR Spark, Azure HDI Spark
Extension to VS Code
• Author: Spark .NET project creation, dependency packaging, language service, sample code
• Run: reference management, Spark local run, Spark cluster run (e.g., HDInsight)
• Debug/fix
Capabilities:
• Taps into VS Code for C# programming
• Automates Maven and Spark dependencies for environment setup
• Facilitates first-project success through a project template and sample code
• Supports Spark local runs and cluster runs
• Integrates with Azure for HDInsight cluster navigation
• Azure Databricks integration planned
Go to https://github.com/dotnet/spark and let us know what is important to you!
• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, Type Provider)
• Out-of-box experiences (Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
-- Once per storage account (WASB, ADLS Gen1, ADLS Gen2)
CREATE EXTERNAL DATA SOURCE MyADLSGen2
WITH (TYPE = Hadoop,
      LOCATION = 'abfs://<filesys>@<account_name>.dfs.core.windows.net',
      CREDENTIAL = <Database scoped credential>);

-- Once per file format; supports Parquet (Snappy or Gzip), ORC, RC, CSV/TSV.
-- (Note: FORMAT_OPTIONS such as FIELD_TERMINATOR apply to delimited-text formats.)
CREATE EXTERNAL FILE FORMAT ParquetFile
WITH (FORMAT_TYPE = PARQUET,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));

CREATE EXTERNAL TABLE [dbo].[Customer_import] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL, -- column name lost on the slide; [CustomerKey] is a placeholder
    [Speed] float NOT NULL)
WITH (LOCATION = '/Dimensions/customer', -- folder path
      DATA_SOURCE = MyADLSGen2,
      FILE_FORMAT = ParquetFile);
-- Initial load via CTAS (create table as select)
CREATE TABLE [dbo].[Customer]
WITH (DISTRIBUTION = ROUND_ROBIN,
      CLUSTERED INDEX (customerid))
AS SELECT * FROM [dbo].[Customer_import];

-- Incremental load: append only the new rows
INSERT INTO [dbo].[Customer]
SELECT *
FROM [dbo].[Customer_import]
WHERE <predicate to determine new data>;
Useful links:
• GitHub: https://github.com/dotnet/spark
• Getting started: https://aka.ms/GoDotNetForSpark
• Website: https://dot.net/spark
• Docs: https://docs.microsoft.com/dotnet/spark
• Blog: https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/
• Slides: https://www.slideshare.net/MichaelRys
• Spark language interop proposal (SPARK-26257)
• ".NET for Spark" Spark project proposal (SPARK-27006)

Available out-of-box on Azure HDInsight Spark.
Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark

You & .NET
https://mybuild.microsoft.com
Thank you for attending Build 2019