6. Our team
● Data scientists
○ Coming up with the new magic
● Data engineers
○ Productionalizing the magic on large datasets
● Front-end developers
○ Consuming the results and making them presentable
to clients
7. Requirements
● Geographically distributed developers
● A variety of developer backgrounds on the team
● Better code quality
● Better testing mechanisms
● Easier team expansion
● Lower infrastructure maintenance overhead
● Use of the latest available libraries
9. Iteration 1
● Data scientists
○ They were well versed in Python or SQL
○ They did their analysis in Python Pandas DataFrame code
○ Analyses were tested only on small datasets
● Data engineers
○ Using Spark 0.9
○ They ported the Python code to the Scala RDD API to
scale the analyses to big data
○ Custom framework with the ability to write into and read from
multiple sources (file, Hive table, S3, JDBC)
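A hypothetical sketch of the kind of port described above (paths, column layout, and names are illustrative, not from the talk): a Pandas group-by average re-expressed with the Scala RDD API of that era.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pandas original (conceptually): df.groupby("symbol")["price"].mean()
object AvgPricePort {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avg-price"))

    // Each input line: "symbol,date,price"
    val prices = sc.textFile("s3n://bucket/stocks.csv")
      .map(_.split(","))
      .map(cols => (cols(0), cols(2).toDouble))

    // Sum and count per key, then divide: the RDD equivalent of a mean
    val avgBySymbol = prices
      .mapValues(p => (p, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    avgBySymbol.saveAsTextFile("s3n://bucket/avg-price-out")
    sc.stop()
  }
}
```

The extra sum-and-count bookkeeping is exactly the kind of construct mismatch between the two languages that made such ports error-prone.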
11. Challenges
● Framework challenges
○ Porting code from one language to another led to
a lot of inaccuracies
○ Differences in language constructs and APIs forced
changes in the code design
● Architectural challenges
○ Clusters used by the team were manually created and
maintained
○ Intermediate data was saved in a text-based CSV
format
13. Iteration 2
● Upgrade to Spark 1.3
● Data scientists
○ The DataFrame API was introduced, a better-known
interface for data scientists
○ The SQL API made it easier for data scientists to perform
simple operations
○ Zeppelin let data scientists prototype the analytical
algorithms
● Data engineers
○ Moved the intermediate format from CSV to Parquet
○ Amazon EMR-based Hadoop cluster with Spark on it
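A rough sketch (illustrative names and paths; API shown in its later Spark 1.x form) of why the DataFrame and SQL APIs were friendlier for data scientists, and of the CSV-to-Parquet switch:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Parquet intermediate data instead of text-based CSV
val stocks = sqlContext.read.parquet("s3n://bucket/stocks.parquet")

// DataFrame API: close to the Pandas idiom data scientists already knew
val avgBySymbol = stocks.groupBy("symbol").avg("price")

// SQL API: the same analysis for SQL-leaning data scientists
stocks.registerTempTable("stocks")
val avgBySymbolSql = sqlContext.sql(
  "SELECT symbol, AVG(price) AS avg_price FROM stocks GROUP BY symbol")

avgBySymbol.write.parquet("s3n://bucket/avg_price.parquet")
```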
14. Architecture
[Diagram: stocks data flows through an ETL step into HDFS; the data
engineering cluster runs data preprocessing and data analytics, writing
to NoSQL; the data science cluster runs data analytics (PySpark) through
Zeppelin, feeding a dashboard]
15. Challenges
● Quality challenges
○ Productionalizing multiple analyses required
expanding the Data engineering team
○ Team expansion introduced code quality issues and
bugs in the code
○ Unit tests for individual functionalities were not
present
○ There was no review process for changes in the
code
17. Iteration 3
● Creation of unit test cases for all the analyses
● A more readable test suite for the code using
ScalaTest (http://www.scalatest.org/)
● Test cases for unit testing small functionalities, plus
flow tests exercising the full ETL flow on sampled data
● A review process for code changes through
GitHub PRs
● A daily Jenkins build to test the flows and
functionalities
18. ScalaTest
import scala.collection.mutable.Stack
import org.scalatest._

class ExampleSpec extends FlatSpec with Matchers {

  "A Stack" should "pop values in last-in-first-out order" in {
    val stack = new Stack[Int]
    stack.push(1)
    stack.push(2)
    stack.pop() should be (2)
    stack.pop() should be (1)
  }

  it should "throw NoSuchElementException if an empty stack is popped" in {
    val emptyStack = new Stack[Int]
    a [NoSuchElementException] should be thrownBy {
      emptyStack.pop()
    }
  }
}
20. Challenges
● Architectural challenges
○ Cluster resources were a bottleneck for the teams
○ Amazon EMR clusters were not throwaway
clusters, as data was stored in HDFS
○ Upgrading the Spark version on a cluster was
difficult
○ Infrastructure to run scheduled jobs was missing,
as Jenkins was not the best way to schedule jobs
○ Stability issues with Zeppelin
22. Iteration 4
● Moved the data storage from HDFS to S3
● Moved to the Databricks cloud environment
(https://databricks.com/product/databricks)
● Databricks cloud provides a notebook-based interface
for writing Spark code in Scala, Java, Python and R
● Encouraged data scientists to use the Scala API
● Travis for deployment and testing
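A minimal sketch of what the Travis setup for a Scala/sbt project could look like (versions and commands are assumptions, not taken from the talk):

```yaml
# Hypothetical .travis.yml: run the ScalaTest suite on every push and PR
language: scala
scala:
  - 2.10.6
jdk:
  - oraclejdk8
script:
  - sbt test
```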
26. Improvements
● Data engineers
○ The cluster bottleneck was solved by creating multiple
throwaway clusters when needed
○ No need to stick to one cluster for a long time, as
primary data storage was S3
○ Terminating clusters when not in use reduced cost
○ Multiple clusters with different versions of Spark
let users try out the latest features in Spark
○ Reduced cluster maintenance and tuning overhead
27. Improvements
● Data engineers
○ Shorter turnaround time in understanding bottlenecks in
the workflows
○ Databricks cloud Jobs can be used for scheduling
workflows and daily runs
○ Travis enabled strict and immediate code testing
● Data scientists
○ Data scientists can easily share notebooks and
analysis results with the team
○ Ability to write in multiple languages
29. Challenges
● Framework challenges
○ The schema is static and doesn’t change frequently
○ The DataFrame API has no compile-time schema check
○ The pipeline fails in the middle of processing if there
is any change in the data
○ The current window analysis uses Scala constructs to
load a specific set of data into memory and run ML on
top of it
○ Domain-object-based functions are currently called from
inside UDFs
31. Iteration 5 (Future iteration)
● Data engineers
○ Port analyses from the DataFrame API to the Dataset API
(in Spark 2.0)
○ With the Dataset API, we get a compile-time schema check
○ Reuse the existing domain-object-based functions directly
● Data scientists
○ Move from Scala window-based analysis to
SparkSQL window analytics
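A sketch of the planned move using Spark 2.0-era APIs (the case class, paths, and column names are illustrative): a Dataset typed by a case class gives a compile-time schema check, domain functions can be applied directly to typed rows instead of through UDFs, and SparkSQL window functions replace the hand-rolled Scala windowing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Static schema: a wrong field name or type is a compile error,
// not a mid-pipeline runtime failure
case class Trade(symbol: String, date: String, price: Double)

val spark = SparkSession.builder.appName("iteration5").getOrCreate()
import spark.implicits._

val trades = spark.read.parquet("s3n://bucket/trades.parquet").as[Trade]

// Existing domain functions apply directly to typed rows, no UDF wrapper
val liquid = trades.filter(t => t.price > 100.0)

// SparkSQL window analytics instead of loading windows into memory:
// a 7-row moving average of price per symbol
val w = Window.partitionBy("symbol").orderBy("date").rowsBetween(-6, 0)
val withMovingAvg = trades.toDF().withColumn("avg_price_7", avg("price").over(w))
```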
32. Lookback
● Spark version
○ 0.9 -> 1.6.0
● API
○ RDD -> DataFrame -> Dataset
● Deployment
○ EC2 -> EMR -> DB cloud
● Scheduling
○ Jenkins -> DB cloud Jobs
● Language
○ Scala
33. Lookback
● Data format
○ Text -> Parquet
● Storage
○ HDFS -> S3
● Deployment
○ Jenkins -> Travis