We present a new, free, open-source framework aimed at making Spark accessible to millions of .NET developers. In this session we give a high-level overview of the .NET bindings for Spark effort, demonstrate key capabilities, show how you can use the bindings and get involved, and cover how to combine them with other .NET frameworks such as ML.NET to build end-to-end real-time analytics solutions. This will be a fun session with demos galore, so come join us as we get started on the .NET bindings for Spark journey!
6. https://github.com/dotnet/spark
.NET – A unified platform
#DotNetForSpark #UnifiedAnalytics #SparkAISummit
Diagram: workloads (Desktop, Web, Cloud, Mobile, Gaming, IoT, AI) on shared .NET Standard, libraries, and infrastructure layers
7. • C# is a simple, modern, object-oriented, and type-safe programming language
• Its roots in the C family of languages make C# immediately familiar to C, C++, Java, and JavaScript programmers
• F# is a cross-platform, open-source, functional programming language for .NET
• It also includes object-oriented and imperative programming
• Visual Basic is an approachable language with a simple syntax for building type-safe, object-oriented apps
8. .NET
Open Source & Cross-Platform
• .NET Core developers: 750K
• New .NET developers in the last year: +1M
14. .NET Developers 💖 Apache Spark but…
• Locked out from Big Data processing due to lack of .NET support in OSS Big Data solutions, but…
• … a lot of Big Data-usable business logic (millions of lines of code) is written in .NET!
• In a recently conducted .NET developer survey (> 1000 developers), more than 70% expressed interest in Apache Spark!
18. Why should Apache Spark 💖 .NET Developers?
More people who learn Apache Spark = Solve harder challenges together = Make the world a better place!
19. Restating Our Intent…
Goal: .NET for Apache Spark is aimed at providing .NET developers a first-class experience when working with Apache Spark.
Non-Goal: Converting existing Scala/Python/Java Spark developers.
29. … and developing in the open!
Contributions to foundational OSS projects:
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737, ARROW-4543, ARROW-4435
• Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to Pyrolite
.NET for Apache Spark was open sourced @ Spark + AI Summit 2019
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
Spark Project Improvement Proposals:
• Interop Support for Spark Language Extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
32. .NET provides full-spectrum Spark support
• Batch & Streaming (including Spark Structured Streaming and all Spark-supported data sources)
• Spark DataFrames (works with Spark v2.3.x/v2.4.[0/1] and includes ~300 SparkSQL functions)
• .NET Standard 2.0 (works with .NET Framework v4.6.1+ and .NET Core v2.1+ and includes C#/F# support)
33. .NET for Apache Spark Programmability
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;
var spark =
    SparkSession.Builder().GetOrCreate();
var dataframe =
    spark.Read().Json("input.json");
// Define the UDF before the query that uses it.
var concat =
    Udf<int?, string, string>((age, name) => name + age);
dataframe.Filter(dataframe["age"] > 21)
    .Select(concat(dataframe["age"], dataframe["name"])).Show();
34. Submitting a Spark Application
spark-submit (Scala):
spark-submit `
  --class <user-app-main-class> `
  --master local `
  <path-to-user-jar> `
  <argument(s)-to-your-app>
spark-submit (.NET):
spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  <path-to-your-app-exe> <argument(s)-to-your-app>
• <path-to-microsoft-spark-jar>: provided by the .NET for Apache Spark library
• <path-to-your-app-exe>: provided by the user & has the business logic
39. What is happening when you write .NET Spark code?
.NET Program → .NET for Apache Spark → SparkSQL DataFrame → Did you define a .NET UDF?
• No: Regular execution path (no .NET runtime during execution)
• Yes: Register .NET UDF & leverage PySpark execution semantics
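The two branches above can be illustrated with a minimal sketch, assuming the Microsoft.Spark NuGet package and a people.json input with age and name columns. The app name, class name, file name, and column names are illustrative assumptions, and the UDF signature simply follows the one used elsewhere in this deck. The first query uses only built-in column functions, so it should stay on the regular path. The second wraps similar logic in a .NET UDF, which is the case the slide describes as registered and run with PySpark execution semantics.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class UdfOrNot
{
    static void Main()
    {
        // Assumed app name; any name works.
        SparkSession spark = SparkSession.Builder().AppName("udf-or-not").GetOrCreate();

        // Assumed input file with "age" and "name" fields.
        DataFrame df = spark.Read().Json("people.json");

        // Branch "No": only built-in functions, no .NET UDF is defined.
        df.Select(Concat(df["name"], df["age"].Cast("string")))
          .Explain(true);

        // Branch "Yes": the same idea expressed as a .NET UDF.
        // JSON numbers are typically inferred as long, so cast to int for the UDF.
        Func<Column, Column, Column> nameAge =
            Udf<int?, string, string>((age, name) => name + age);
        df.Select(nameAge(df["age"].Cast("int"), df["name"]))
          .Explain(true);

        spark.Stop();
    }
}
```

Comparing the two plans printed by Explain(true) is one way to see which branch a query takes; the exact plan text depends on the Spark version.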
40. Submitting a Spark Application (recap)
spark-submit (Scala):
spark-submit `
  --class <user-app-main-class> `
  --master local `
  <path-to-user-jar> `
  <argument(s)-to-your-app>
spark-submit (.NET):
spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  <path-to-your-app-exe> <argument(s)-to-your-app>
• <path-to-microsoft-spark-jar>: provided by the .NET for Apache Spark library
• <path-to-your-app-exe>: provided by the user & has the business logic
50. Spark Driver-side Workflow
JVM side: spark-submit, DotnetRunner, DotnetBackend (Port XYZ), and the JVM objects (SparkSession, DataFrame, SQL, Streaming, …)
CLR side: the .NET app and the .NET proxies for JVM objects (SparkSession, DataFrame, SQL, Streaming, …)
(1) spark-submit launches DotnetRunner
(2) DotnetRunner launches DotnetBackend on Port XYZ
(3) DotnetRunner launches the .NET app with config (e.g., Port XYZ)
(4) The .NET app sends commands via JVMBridge
• CLR: create & manage proxy objects; .NET objects hold references to JVM objects
• JVM: create & manage JVM objects by mirroring .NET operations
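The proxy idea in the driver-side workflow above can be sketched in plain C#. This is a toy, in-process model under stated assumptions, not the library's implementation: FakeBackend, DataFrameProxy, and their members are hypothetical names, and the "JVM side" is just a dictionary, whereas the deck describes a separate backend reached over a port. It only shows the shape of the pattern: the .NET object holds a reference id, and each operation is forwarded and mirrored on the other side.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the JVM side: owns the "real" objects and hands out ids.
sealed class FakeBackend
{
    private readonly Dictionary<int, string> _objects = new Dictionary<int, string>();
    private int _nextId = 1;

    // Creates a root object and returns its reference id.
    public int Create(string description)
    {
        int id = _nextId++;
        _objects[id] = description;
        return id;
    }

    // Mirrors a method call on the target object and returns the id of the result.
    public int Invoke(int targetId, string method, params object[] args)
    {
        string result = $"{_objects[targetId]}.{method}({string.Join(", ", args)})";
        int id = _nextId++;
        _objects[id] = result;
        return id;
    }

    public string Describe(int id) => _objects[id];
}

// Hypothetical .NET-side proxy: holds only a reference id, never the data itself.
sealed class DataFrameProxy
{
    private readonly FakeBackend _backend;
    public int JvmId { get; }

    public DataFrameProxy(FakeBackend backend, int jvmId)
    {
        _backend = backend;
        JvmId = jvmId;
    }

    // Each operation is forwarded; the returned id is wrapped in a new proxy.
    public DataFrameProxy Filter(string condition) =>
        new DataFrameProxy(_backend, _backend.Invoke(JvmId, "filter", condition));

    public DataFrameProxy Select(params string[] columns) =>
        new DataFrameProxy(_backend, _backend.Invoke(JvmId, "select", columns));
}

static class Program
{
    static void Main()
    {
        var backend = new FakeBackend();
        var df = new DataFrameProxy(backend, backend.Create("df"));

        DataFrameProxy result = df.Filter("age > 21").Select("name", "age");

        Console.WriteLine(result.JvmId);                   // prints 3
        Console.WriteLine(backend.Describe(result.JvmId)); // prints df.filter(age > 21).select(name, age)
    }
}
```

Keeping only an id on the .NET side means no data crosses the boundary for these calls; only the commands do.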
51. What happens when you define a .NET UDF?
User code with UDF:
var df = spark.Read().Schema(…)
    .Json(@"people.json");
var addition =
    Udf<int?, string, string>(
        (age, name) => name + age);
df.Select(addition(df["age"],
        df["name"]))
    .Explain(true);
Registers UDF with Spark:
• Serialize .NET UDF
• Wrap as PythonFunction & set executable to Microsoft.Spark.Worker (instead of Python)
• Create a UserDefinedPythonFunction
• Piggyback on PySpark Physical Execution Operator
70. Next steps for benchmarking…
• Benchmark with Apache Arrow
• TPC-DS?
• Work with Community
• TPC-H Dataset Generation
• Improvise
.NET for Apache Arrow:
• Chris Hutchinson: Initial .NET Implementation
• Eric Erhardt: Performance Optimizations (ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737, ARROW-4543, ARROW-4435)
Follow the discussion here: https://github.com/dotnet/spark/issues/45
71. What’s next after next?
• Programming & idiomatic experiences in .NET (UDAF, UDT support)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences for .NET developers (e.g., Jupyter, VS Code, Visual Studio, others?)
74. Open Sourced .NET for Apache Spark
https://dot.net/spark | https://github.com/dotnet/spark
Gives you a first-class experience in scaling out your .NET code over large amounts of data using Apache Spark
76. Call to Action: Engage, Use & Guide Us!
Useful Links: http://github.com/dotnet/spark
Website: https://dot.net/spark
Available out of the box on Azure HDInsight Spark
For other clouds: https://aka.ms/InstallDotNetForSpark
77. Contribution Model
• Play with .NET Bindings
• Contribute PRs to close existing issues
• Submit a GitHub issue
• Verify fixes for bugs
• Submit a code fix for a bug
• Submit a new feature request
• Submit a unit test
• Code review pending PRs/bug fixes
• Tell others about the .NET Bindings
79. Have questions about this session?
I’ll be at the Microsoft Booth #200 from xx:xxam/pm to xx:xxam/pm.
Grab some SWAG!
For more details visit: https://databricks.com/sparkaisummit