Big Data Processing with .NET and Spark (SQLBits 2020)
1. Big Data Processing with .NET and Spark
Michael Rys
Principal Program Manager, Azure Data
@MikeDoesBigData
2. Agenda What is Apache Spark
Why .NET for Apache Spark
What is .NET for Apache Spark
Demos
How does it perform
Where does it run
Special Announcement & Call to Action
3. Apache Spark is an OSS fast analytics engine for big data and machine
learning
Improves efficiency through:
General computation graphs beyond map/reduce
In-memory computing primitives
Allows developers to scale out their user code & write in their language of
choice
Rich APIs in Java, Scala, Python, R, SparkSQL etc.
Batch processing, streaming and interactive shell
Available on Azure via:
• Azure Synapse
• Azure Databricks
• Azure HDInsight
• IaaS/Kubernetes
4. .NET Developers 💖 Apache Spark…
A lot of business logic usable for big data (millions of lines of code) is written in .NET!
It is expensive and difficult to translate into Python/Scala/Java!
.NET developers have been locked out of big data processing due to the lack of .NET support in OSS big data solutions.
In a recent .NET developer survey (> 1,000 developers), more than 70% expressed interest in Apache Spark!
They would like to tap into the OSS ecosystem for code libraries, support, and hiring.
5. Goal: .NET for Apache Spark is aimed at providing
.NET developers a first-class experience when
working with Apache Spark.
Non-Goal: Converting existing Scala/Python/Java
Spark developers.
6. We are developing it in the open!
Contributions to foundational OSS projects:
• Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284,
SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737,
ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887,
ARROW-5908, ARROW-6314, ARROW-6682
• Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to
Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance
.NET for Apache Spark is open source
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Frequent releases (about every 6 weeks), current release v0.12.1
• Integrates with .NET Interactive (https://github.com/dotnet/interactive) and
nteract/Jupyter
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
9. Customer Success: O365’s MSAI
Job:
Build ML/deep learning models on top of Substrate data to infuse intelligence into Office 365 products. Our data resides in Azure Data Lake Storage. We cook/featurize that data, which in turn gets fed into our ML models.
Why Spark.NET?
Given that our business logic (e.g., featurizers, tokenizers for normalizing text) is written in C#, Spark.NET is an ideal candidate for our workloads. We leverage Spark.NET to run those libraries at scale.
Experience:
Very promising, stable & highly
vibrant community that is helping us
iterate at the agility we want.
Looking forward to longer working
relationship and broader adoption
across Substrate Intelligence / MSAI.
Microsoft Search, Assistant & Intelligence Team: Towards Modern Workspaces in O365
Scale: ~ 50 TB
10. .NET provides full-spectrum Spark support
• Spark DataFrames with SparkSQL: works with Spark v2.3.x/v2.4.x and includes ~300 SparkSQL functions, Grouped Map, and Delta Lake
• .NET Spark UDFs: batch & streaming, including Spark Structured Streaming and all Spark-supported data sources
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1/v3.1 and includes C#/F# support
• Data Science: including access to ML.NET and interactive notebooks with a C# REPL
• Speed & productivity: performance-optimized interop, as fast or faster than PySpark, with support for HW vectorization
https://github.com/dotnet/spark/examples
11. Introduction to Spark Programming: DataFrame

UserId  State Salary
Terry   WA    XX
Rahul   WA    XX
Dan     WA    YY
Tyson   CA    ZZ
Ankit   WA    YY
Michael WA    YY
12. .NET for Apache Spark programmability

var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Read().Json("input.json");

var concat =
    Udf<int?, string, string>((age, name) => name + age);

df.Filter(df["age"] > 21)
  .Select(concat(df["age"], df["name"])).Show();
15. Submitting a Spark Application

spark-submit (Scala):
spark-submit `
  --class <user-app-main-class> `
  --master local `
  <path-to-user-jar>
  <argument(s)-to-your-app>

spark-submit (.NET):
spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  <path-to-your-app-exe> <argument(s)-to-your-app>

The microsoft-spark JAR is provided by the .NET for Apache Spark library; the app executable is provided by the user and has the business logic.
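For a concrete local run, the .NET submission above might look like the sketch below. The JAR, executable, and argument names are illustrative placeholders (not from the deck), and the script only composes and prints the command so it can be inspected before actually submitting:

```shell
# Compose a local-mode spark-submit invocation for a .NET for Spark app.
# All file names here are illustrative placeholders, not real artifacts.
SPARK_DOTNET_JAR="microsoft-spark-2.4.x-0.12.1.jar"   # shipped with the Microsoft.Spark package
APP_EXE="./mySparkApp"                                # published user executable with the business logic
APP_ARGS="input.json"                                 # argument(s) to your app

CMD="spark-submit --class org.apache.spark.deploy.DotnetRunner --master local $SPARK_DOTNET_JAR $APP_EXE $APP_ARGS"

# Print for inspection; run the command directly once the paths point at real files.
echo "$CMD"
```

Note that, unlike the Scala case, the user application is passed as an argument after the microsoft-spark JAR rather than via --class.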
16. Demo 2: Locally debugging a .NET for Spark App

spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  debug
17. Debugging User-defined Code
https://github.com/dotnet/spark/pull/294

Step 1: Write your app code
Step 2: set DOTNET_WORKER_DEBUG=1, then run spark-submit with the debug argument
Step 3: Switch to your app code, add a breakpoint at your business logic, press F5
Step 4: In the "Choose Just-In-Time Debugger" dialog, choose "New Instance of …" and select your app code CS file
Step 5: That's it! Have fun
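Steps 2 and 3 above can be sketched as a shell session. The JAR name is an illustrative placeholder, and the spark-submit line is printed rather than executed here:

```shell
# Step 2: make the .NET worker wait so a debugger can attach.
export DOTNET_WORKER_DEBUG=1

# Then run spark-submit with the "debug" argument in place of the app executable
# (JAR name is a placeholder; printed instead of executed in this sketch):
echo "spark-submit --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.12.1.jar debug"
```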
19. What is happening when you write .NET Spark code?

Your .NET program's DataFrame/SparkSQL calls go through .NET for Apache Spark and build a Spark operation tree. Did you define a .NET UDF?
• No: regular execution path (no .NET runtime during execution), same speed as with Scala Spark
• Yes: interop between Spark and .NET, faster than with PySpark
20. Performance: Worker-side Interop

On a Spark worker node, the Spark Executor runs in the JVM and Microsoft.Spark.Worker (hosting the .NET UDF library and the user Spark library) runs in the CLR. To run a task with a UDF:
1. The Executor receives a task with a UDF
2. It launches the worker executable
3. It serializes the UDFs & data over to the CLR
4. The worker executes the user-defined operations
5. The worker writes serialized result rows back

Challenge: how to serialize data between the JVM & the CLR?
• Pickling: row-oriented (default)
• Apache Arrow: column-oriented
22. Simplifying experience with Arrow

Previous experience (Apache Arrow RecordBatch):

private static RecordBatch CountCharacters(
    RecordBatch records,
    string groupColName,
    string summaryColName)
{
    int summaryColIndex = records.Schema.GetFieldIndex(summaryColName);
    StringArray stringValues = records.Column(summaryColIndex) as StringArray;
    int charCount = 0;
    for (int i = 0; i < stringValues.Length; ++i)
    {
        charCount += stringValues.GetString(i).Length;
    }
    int groupColIndex = records.Schema.GetFieldIndex(groupColName);
    Field groupCol = records.Schema.GetFieldByIndex(groupColIndex);
    return new RecordBatch(
        new Schema.Builder()
            .Field(groupCol)
            .Field(f => f.Name(summaryColName).DataType(Int32Type.Default))
            .Build(),
        new IArrowArray[]
        {
            records.Column(groupColIndex),
            new Int32Array.Builder().Append(charCount).Build()
        },
        records.Length);
}

New experience (FxDataFrame):

private static FxDataFrame CountCharacters(
    FxDataFrame df,
    string groupColName,
    string summaryColName)
{
    int charCount = 0;
    for (long i = 0; i < df.RowCount; ++i)
    {
        charCount += ((string)df[summaryColName][i]).Length;
    }
    return new FxDataFrame(new[] {
        new PrimitiveColumn<int>(groupColName,
            new[] { (int?)df[groupColName][0] }),
        new PrimitiveColumn<int>(summaryColName,
            new[] { charCount }) });
}
23. Performance: warm cluster runs with Pickling serialization (for Arrow improvements, see the next slide)

Takeaway 1: Where UDF performance does not matter, .NET is on par with Python.
Takeaway 2: Where UDF performance is critical, .NET is ~2x faster than Python!
24. Performance: warm cluster runs for C#, Pickling vs. Arrow serialization

Takeaway: Since Q1 is interop-bound, we see a 33% perf improvement with better serialization.
25. Performance: warm cluster runs with Arrow serialization, C# vs. Python

Takeaway: Since serialization inefficiencies have been removed, we are left with similar perf across languages. If you like .NET, you can stick with .NET.
26. Works everywhere!

Cross platform: Windows, Ubuntu, macOS
Cross cloud: Azure & AWS
• Installed out of the box: Azure Synapse, Azure HDInsight Spark
• Installation docs on GitHub: Azure Databricks, AWS EMR Spark
27. Using .NET for Spark in Azure Synapse: Batch Submission

• cd mySparkApp
• dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
• Zip the folder
• Upload the ZIP file to your cloud storage

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
28. Using .NET for Spark in Azure Synapse: Batch Submission

• cd mySparkApp
• dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
• Zip the folder
• Upload the ZIP file to your cloud storage

Submission notes:
• Language selects the semantics of the submission fields
• The ZIP file contains the Spark application, including UDF DLLs, and even the Spark or .NET runtime if a different version is needed
• Main program (Unix)
• Program parameters as needed
• Additional resource and library files that are not included in the ZIP (e.g., shared DLLs, config files)

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
29. Using .NET for Spark in Azure Synapse: Notebooks with .NET Interactive

• Language selects the type of notebook: Interactive C#
• The Spark context "spark" is built in
30. Using .NET for Spark in Azure Synapse: Notebooks with .NET Interactive, importing NuGet packages
31. Using .NET for Spark in Azure Databricks
• Not available out of the box but can be used in batch submission
• https://github.com/dotnet/spark/blob/master/deployment/README.md#databricks
Note: Traditional Databricks notebooks are proprietary and cannot integrate .NET. Please contact @Databricks if you want .NET supported out of the box.
32. VSCode extension for Spark .NET

Author:
• Spark .NET project creation
• Dependency packaging
• Language service
• Sample code

Run:
• Reference management
• Spark local run
• Spark cluster run (e.g. HDInsight)

Debug:
• Debug & fix

Extension to VSCode:
• Tap into VSCode for C# programming
• Automate Maven and Spark dependencies for environment setup
• Facilitate first project success through project templates and sample code
• Support Spark local run and cluster run
• Integrate with Azure for HDInsight cluster navigation
• Azure Databricks integration planned
33. ANNOUNCING: .NET for Apache Spark v1.0 is released!

First-class C# and F# bindings to Apache Spark, bringing the power of big data analytics to .NET developers.

• Apache Spark 2.4/3.0
• DataFrames, Structured Streaming, Delta Lake
• .NET Standard 2.0, C# and F#
• ML.NET
• Performance optimized with Apache Arrow and HW vectorization
• First-class integration in Azure Synapse: batch submission, interactive .NET notebooks

Learn more at http://dot.net/Spark
34. What's next?

• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter/nteract, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, Type Provider)
• Out-of-box experiences (Azure Synapse, Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)

Go to https://github.com/dotnet/spark and let us know what is important to you!
35. Call to action: Engage, use & guide us!

Related session:
• Big Data and Data Warehousing Together with Azure Synapse Analytics

Useful links:
• http://github.com/dotnet/spark
• https://www.nuget.org/packages/Microsoft.Spark
• https://aka.ms/GoDotNetForSpark
• https://docs.microsoft.com/dotnet/spark

Website:
• https://dot.net/spark (Request a Demo!)

Starter Videos, .NET for Apache Spark 101:
• Watch on YouTube
• Watch on Channel 9

Available out-of-box on Azure Synapse & Azure HDInsight Spark
Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark

You & @MikeDoesBigData #DotNetForSpark