Speaker: Umayah Abdennabi
Agenda
* Intro to Grammarly (Umayah Abdennabi, 5 mins)
* Meetup Updates and Announcements (Chris, 5 mins)
* Custom Functions in Spark SQL (30 mins)
Speaker: Umayah Abdennabi
Spark comes with a rich Expression library that can be extended to make custom expressions. We will look into custom expressions and why you would want to use them.
* TF 2.0 + Keras (30 mins)
Speaker: Francesco Mosconi
TensorFlow 2.0 was announced at the March TF Dev Summit, and it brings many changes and upgrades. The most significant change is the inclusion of Keras as the default model-building API. In this talk, we'll review the main changes introduced in TF 2.0 and highlight the differences between open source Keras and tf.keras.
* SQuAD Deep-Dive: Question & Answer with Context (45 mins)
Speaker: Brett Koonce (https://quarkworks.co)
SQuAD (Stanford Question Answering Dataset) is an NLP challenge based on answering questions by reading Wikipedia articles, designed to be a real-world machine learning benchmark. We will look at several different ways to tackle the SQuAD problem, building up to state-of-the-art approaches in terms of time, complexity, and accuracy.
https://rajpurkar.github.io/SQuAD-explorer/
https://dawn.cs.stanford.edu/benchmark/#squad
Food and drinks will be provided. The event will be held at Grammarly's office at One Embarcadero Center on the 9th floor. When you arrive at One Embarcadero, take the escalator to the second floor where you will find the lobby and elevators to the office suites. Come on up to the 9th floor (no need to check in at security), and ring the Grammarly doorbell.
14. Gnar
● Goal: to understand
○ Who are our users
○ How do they interact with the product
○ How do they sign up, engage, pay, and how long do they stay
● Allows us to make data-driven decisions
16. Gnar
segment “eventName”
where foo = “bar”
by browser
time from 2 months ago to today
The user writes a query in GQL using our web application
*GQL stands for Gnar Query Language, a SQL-like language built on top of Spark SQL
17. Gnar
The query is sent to our backend, which runs a Spark job
segment “eventName”
where foo = “bar”
by browser
time from 2 months ago to today
18. Gnar
Results will be sent
back to the user and
displayed
segment “eventName”
where foo = “bar”
by browser
time from 2 months ago to today
21. Gnar
● When users write queries they use something
called expressions to describe what they want to
do
○ The previous query had 2 expressions
segment “eventName”
where foo = “bar”
by browser
time from 2 months ago to today
22. Gnar
● When users write queries they use something
called expressions to describe what they want to
do
● Hundreds of queries are run every day, and all of
them use expressions
57. UDF
● You are using a closure which is opaque to Spark
SQL, preventing many optimizations
● Access to input types
○ Spark SQL data types
● Hard to have a stateful implementation
61. Constant Folding
hours * ( 60 * 60 * 1000)
● Commonly done to convert hours to milliseconds
● How do we reduce the time we spend computing this largely static operation?
69. Constant Folding
hours * 3.6e6
You can make your expression a candidate for constant folding by adding the following to your expression class:
override def foldable: Boolean = true
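To make the idea concrete, here is a minimal sketch of constant folding on a toy expression tree. The types Expr, Lit, Col, and Mul and the fold function are hypothetical stand-ins written for this illustration; they are not Spark's Catalyst classes, and the real optimizer is considerably more involved.

```scala
// Toy model only: these types are illustrative, not Spark's Catalyst API.
object ConstantFoldingSketch {
  type Row = Map[String, Long]

  sealed trait Expr {
    def foldable: Boolean        // true when the value does not depend on any row
    def eval(row: Row): Long
  }

  final case class Lit(value: Long) extends Expr {
    val foldable = true
    def eval(row: Row): Long = value
  }

  final case class Col(name: String) extends Expr {
    val foldable = false         // a column value changes from row to row
    def eval(row: Row): Long = row(name)
  }

  final case class Mul(left: Expr, right: Expr) extends Expr {
    // A product is constant only if both of its operands are constant.
    def foldable: Boolean = left.foldable && right.foldable
    def eval(row: Row): Long = left.eval(row) * right.eval(row)
  }

  // Replace every foldable subtree with a single literal, evaluated once.
  def fold(e: Expr): Expr = e match {
    case l: Lit                   => l
    case other if other.foldable  => Lit(other.eval(Map.empty))
    case Mul(l, r)                => Mul(fold(l), fold(r))
    case other                    => other
  }
}
```

In this sketch, folding hours * (60 * 60 * 1000) leaves a tree with one multiplication of the hours column by a single literal, so the constant part is computed once rather than per row.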
70. Constant Folding
One billion rows on a single-node machine:
spark.range(1000000000L).withColumn("m", <expr>).rdd.count
[Chart: benchmark results; bar labels: 1.6x slower, 1.68x slower, 1.96x slower, 1.97x slower]
87. Optimizations
● All these optimizations are rules which are
implemented with pattern matching
● If an expression matches the rule, it is applied
● UDFs aren’t expressions so you cannot apply
many of these optimizations
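The following sketch illustrates why a closure blocks rule-based rewrites. It uses a hypothetical toy tree (Expr, Lit, Col, Mul, Udf) and a single made-up rule, "multiplying by 1 is a no-op"; none of this is Spark's API. The rule can match the tree form of x * 1, but the same logic hidden inside a function value gives it nothing to match on.

```scala
// Toy model only: these types are illustrative, not Spark's Catalyst API.
object OpaqueUdfSketch {
  sealed trait Expr
  final case class Lit(value: Long) extends Expr
  final case class Col(name: String) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr
  // A UDF node holds an arbitrary closure. A rule can see that the node
  // exists, but it cannot pattern match on what the function body does.
  final case class Udf(f: Long => Long, child: Expr) extends Expr

  // One rewrite rule, implemented with pattern matching:
  // multiplying by the literal 1 is a no-op.
  def simplify(e: Expr): Expr = e match {
    case Mul(x, Lit(1L)) => simplify(x)
    case Mul(Lit(1L), x) => simplify(x)
    case Mul(l, r)       => Mul(simplify(l), simplify(r))
    case Udf(f, child)   => Udf(f, simplify(child)) // only the argument is visible
    case other           => other
  }
}
```

Here simplify rewrites Mul(Col("x"), Lit(1)) to Col("x"), while Udf(_ * 1, Col("x")) comes back unchanged even though it computes the same thing.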
88. Rule Example: Constant Folding
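// Simplified from the rule in Spark's Catalyst optimizer
object ConstantFolding extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
    // Literals are already folded, so leave them alone
    case l: Literal => l
    // Evaluate a foldable expression once and replace it with a literal
    case e if e.foldable =>
      Literal.create(e.eval(EmptyRow), e.dataType)
  }
}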
89. Rule Example: Boolean Simplification
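// Simplified from the rule in Spark's Catalyst optimizer
object BooleanSimplification extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
    // Identity elements: drop the literal and keep the other side
    case TrueLiteral And e => e
    case FalseLiteral Or e => e
    case e Or FalseLiteral => e
    // Absorbing elements: the literal decides the result
    case FalseLiteral And _ => FalseLiteral
    case TrueLiteral Or _ => TrueLiteral
    case Not(TrueLiteral) => FalseLiteral
    ...
  }
}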
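The same pattern-matching style can be tried outside Spark. The sketch below is a self-contained toy version of Boolean simplification; BoolExpr, Pred, and simplify are hypothetical names for this illustration, and the case list is a small subset chosen for brevity, not the full set of cases in Spark's rule.

```scala
// Toy model only: these types are illustrative, not Spark's Catalyst API.
object BooleanSimplificationSketch {
  sealed trait BoolExpr
  case object TrueLiteral extends BoolExpr
  case object FalseLiteral extends BoolExpr
  // Stands in for any predicate whose value depends on the row.
  final case class Pred(name: String) extends BoolExpr
  final case class And(left: BoolExpr, right: BoolExpr) extends BoolExpr
  final case class Or(left: BoolExpr, right: BoolExpr) extends BoolExpr
  final case class Not(child: BoolExpr) extends BoolExpr

  // Simplify children first, then apply the identity and absorption laws.
  def simplify(e: BoolExpr): BoolExpr = e match {
    case And(l, r) => (simplify(l), simplify(r)) match {
      case (FalseLiteral, _) | (_, FalseLiteral) => FalseLiteral
      case (TrueLiteral, x)                      => x
      case (x, TrueLiteral)                      => x
      case (x, y)                                => And(x, y)
    }
    case Or(l, r) => (simplify(l), simplify(r)) match {
      case (TrueLiteral, _) | (_, TrueLiteral) => TrueLiteral
      case (FalseLiteral, x)                   => x
      case (x, FalseLiteral)                   => x
      case (x, y)                              => Or(x, y)
    }
    case Not(c) => simplify(c) match {
      case TrueLiteral  => FalseLiteral
      case FalseLiteral => TrueLiteral
      case x            => Not(x)
    }
    case other => other
  }
}
```

Simplifying children before the parent lets a literal produced deep in the tree propagate upward in a single pass.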
95. Custom Expressions
● Don’t have the limitations of UDFs
○ Benefit fully from optimizations
○ Access to Spark data types
○ Easy to maintain state
○ You can specify code generation
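As a rough illustration of the last two points, the sketch below models an expression that carries its own state and an expression that supplies the source text for generated code. Everything here (Expr, Col, HoursToMillis, RowNumber, genCode) is a hypothetical toy written for this example; Spark's real Expression class has a different and larger interface.

```scala
// Toy model only: these types are illustrative, not Spark's Catalyst API.
object CustomExpressionSketch {
  type Row = Map[String, Long]

  trait Expr {
    def eval(row: Row): Long   // interpreted evaluation, one row at a time
    def genCode: String        // source text a code generator could inline
  }

  final case class Col(name: String) extends Expr {
    def eval(row: Row): Long = row(name)
    def genCode: String = s"""row.get("$name")"""
  }

  // The expression author decides exactly what code is generated.
  final case class HoursToMillis(child: Expr) extends Expr {
    def eval(row: Row): Long = child.eval(row) * 3600000L
    def genCode: String = s"(${child.genCode}) * 3600000L"
  }

  // State lives in an ordinary field of the expression instance.
  final class RowNumber extends Expr {
    private var seen = 0L
    def eval(row: Row): Long = { seen += 1; seen }
    def genCode: String = "++seen"
  }
}
```

In this toy, RowNumber returns 1, 2, 3, and so on across successive eval calls, which is awkward to express as a plain closure, and HoursToMillis controls the generated text instead of falling back to interpretation.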
117. Conclusion
● UDFs are great
○ Simple to write, compared to very involved
expressions, and they generally work well
● Custom Expressions are great
○ When performance matters
○ For complex operations that require a lower-level API
● We use both of them to solve our complex problems