GoPro's data platform evolved from 2016 to 2018 to meet growing analytics needs. In 2016, it focused on batch/streaming ingestion using Spark. In 2017, it transformed to use dynamic elastic clusters, centralized the Hive metastore, and replaced HDFS with S3. In 2018, it added data democratization features like delivering analytics metrics to Slack, added the Druid OLAP database for visualization, and began building out a machine learning infrastructure.
3. ABOUT SPEAKER: CHESTER CHEN
• Head of Data Science & Engineering (DSE) at GoPro
• Prev. Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
4. AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Analytics Metrics Delivery via Slack
• ML Feature Visualization with Google Facets and Spark
7. EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• Feature adoption, user engagement, user segmentation, churn analysis, funnel analysis, conversion rate, etc.
• Web/E-Commerce Analytics
• Camera Analytics
• Scene change detection, feature usage, etc.
• Mobile Analytics
• Camera connections, story sharing etc.
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
• Media classifications, recommendations, storage analysis.
13. BATCH JOBS
[Diagram: in production, a Job Gateway submits scheduled jobs, each defined by a Job.conf, to a new Spark cluster per job; dev machines submit dev jobs, also defined by a Job.conf, to a new or existing Spark cluster.]
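For context, a purely illustrative Job.conf sketch in HOCON (the Typesafe Config format used elsewhere in this deck); the key names and values below are hypothetical, not GoPro's actual job schema:

camera-clean-events-writer {            // hypothetical job name
  job-type     = "JSONTableWriter"      // hypothetical: one of the table writer types shown later
  input-path   = "s3a://databucket/camera/work/production/clean-events/final/*"
  output-table = "work.camera_clean_data"   // hypothetical target Hive table
  schedule     = "daily"
}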
16. AIRFLOW SETUP
[Diagram: Airflow deployment with web servers A and B behind a load balancer, the scheduler, the Airflow metastore, a message queue, and a pool of workers; Airflow DAGs are pushed to S3 and synced to the nodes.]
17. TAKEAWAYS
• Key Changes
  • Centralized Hive metastore
  • Separated compute and storage needs
  • Leverage S3 as storage
  • Horizontal scaling with cluster elasticity
  • Less time spent managing infrastructure
• Key Benefits
  • Cost
    • Reduced redundant storage and compute cost
    • Use of smaller instance types
    • 60% AWS cost savings compared to one year ago
  • Operation
    • Reduced complexity of DevOps support
  • Analytics tools
    • SQL only ⇒ notebooks with SQL, Python, and Scala
19. EVOLUTION OF DATA PLATFORM
• Before 2016: Single Hadoop cluster ⇒ three Hadoop clusters; Hive/Impala SQL with Tableau reports; fixed-size Hadoop clusters
• 2016 (Batch/Streaming Ingestion Framework): hard-coded Hive SQL ⇒ Spark jobs and Spark SQL on a fixed Hadoop cluster
• 2017 (Platform Architecture Transformation): fixed Hadoop clusters ⇒ dynamic elastic clusters; centralized Hive metastore; HDFS replaced with S3
• 2018 (Data Democratization, ML Infrastructure): Slack metrics delivery; Druid OLAP database and visualization with Superset integration; Data Management Platform; initial attempts at ML infrastructure
20. BATCH INGESTION
[Diagram: GoPro product data and several third-party data sources are downloaded in batch via REST APIs, SFTP, and S3 sync (input file formats: CSV and JSON) and processed on a Spark cluster, with a new cluster per job.]
21. TABLE WRITER JOBS
• Jobs are identified by JobType, JobName, and JobConfig
• Majority of the Spark ETL Jobs are Table Writers
• Load data into DataFrame
• DataFrame to DataFrame transformation
• Output DataFrame to Hive Table
• The majority of table writer jobs can be decomposed into one of the following sub-jobs (a minimal sketch of this load/transform/write shape follows)
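As a rough illustration of this shape, here is a minimal Scala sketch of a table writer, assuming the SparkJob signature that appears in the code runner later in the deck; the trait and method names (load, transform, writeToHive) are illustrative, not GoPro's actual API.

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame

// SparkJob signature inferred from the SparkJobCodeRunner shown on slide 31
trait SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit
}

// Hypothetical table-writer shape: load -> transform -> write to Hive
trait HiveTableWriterSketch extends SparkJob {
  def load(sc: SparkContext, config: Config): DataFrame   // e.g. CSV, JSON, JDBC, HBase input
  def transform(df: DataFrame): DataFrame = df             // DataFrame-to-DataFrame transformation
  def writeToHive(df: DataFrame, config: Config): Unit     // output DataFrame to a Hive table

  override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit =
    writeToHive(transform(load(sc, config)), config)
}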
22. TABLE WRITER JOBS
[Class diagram: SparkJob is the base; HiveTableWriter extends it, with JDBCToHiveTableWriter, FileToHiveTableWriter, HBaseToHiveTableWriter, TableToHiveTableWriter, aggregate jobs, and a CoreTableWriter mixin beneath it. FileToHiveTableWriter is specialized into AbstractCSVHiveTableWriter (⇒ CSVTableWriter and customized CSV jobs) and AbstractJSONHiveTableWriter (⇒ JSONTableWriter and customized JSON jobs); HBaseToHiveTableWriter ⇒ HBaseSnapshotJob; TableToHiveTableWriter ⇒ TableSnapshotJob.]
• All jobs share the same configuration loading, job state, and error reporting.
• All table writers have dynamic DDL capability; once the inputs become DataFrames, they all behave the same way.
• CSV and JSON need different loaders.
• A different loader is needed to load HBase records into a DataFrame.
29. DATA TRANSFORMATION
• Hive SQL (HQL) over JDBC via Beeline
  • Suitable for non-Java/Scala/Python programmers
• Spark job
  • Requires Spark and Scala knowledge; need to set up the job, configurations, etc.
• Dynamic Scala scripts
  • Scala as a script: compile Scala at runtime, mixed with Spark SQL
30. SCALA SCRIPTS
• Define a special SparkJob: the Spark Job Code Runner
  • Load Scala script files from a specified location (defined by config)
  • Dynamically compile the Scala code into classes
  • For each compiled class, run the Spark jobs defined in the scripts
• Twitter Eval util: dynamically evaluates Scala strings and files
  • <groupId>com.twitter</groupId> <artifactId>util-eval_2.11</artifactId> <version>6.24.0</version>
31. SCALA SCRIPTS
object SparkJobCodeRunner extends SparkJob {
  private val LOG = LoggerFactory.getLogger(getClass)
  import collection.JavaConverters._

  override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    // Script file locations come from the job configuration
    val jobFileNames: List[String] = //...
    jobFileNames.foreach { fileName =>
      // Compile the script at runtime; the script's last expression is the returned value
      val clazzes: Option[Any] = evalFromFileName[Any](fileName)
      clazzes.foreach {
        case job: SparkJob => job.run(sc, jobType, jobName, config)
        case _             => LOG.info("not match")
      }
    }
  }
}
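The evalFromFileName helper is elided on the slide; below is a minimal sketch of how it could be built on Twitter's Eval utility, under the assumption that each script's final expression (e.g. `new CameraAggCaptureMainJob` on slide 33) is the evaluated value. The helper name and error handling are illustrative, not the exact GoPro implementation.

import java.io.File
import com.twitter.util.Eval
import scala.util.Try

def evalFromFileName[T](fileName: String): Option[T] = {
  val eval = new Eval()   // compiles Scala source at runtime
  // The evaluated value is whatever the script's last expression produces,
  // so a SparkJob instance comes back for the job scripts in this deck.
  Try(eval.apply[T](new File(fileName))).toOption
}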
33. SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = … // define DataFrame schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    sqlContext.sql("""set hive.exec.dynamic.partition.mode=nonstrict
                      set hive.enforce.bucketing=false
                      set hive.auto.convert.join=false
                      set hive.merge.mapredfiles=true""")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
                      select row_number() over (partition by metadata_file_name order by log_ts), …. """)
    // rest of code
  }
}
// The script's final expression: the instance picked up and run by the code runner
new CameraAggCaptureMainJob
35. EVOLUTION OF DATA PLATFORM
• Before 2016: Single Hadoop cluster ⇒ three Hadoop clusters; Hive/Impala SQL with Tableau reports; fixed-size Hadoop clusters
• 2016 (Batch/Streaming Ingestion Framework): hard-coded Hive SQL ⇒ Spark jobs and Spark SQL on a fixed Hadoop cluster
• 2017 (Platform Architecture Transformation): fixed Hadoop clusters ⇒ dynamic elastic clusters; centralized Hive metastore; HDFS replaced with S3
• 2018 (Data Democratization, ML Infrastructure): Slack metrics delivery; Druid OLAP database and visualization with Superset integration; Data Management Platform; initial attempts at ML infrastructure
36. DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• Data Metrics Delivery
• Delivery to Slack: make metrics more accessible to a broader audience
• Data Slice & Dice
• Leverage Real-Time OLAP database (Druid) (ongoing project)
• Analytics Visualization (ongoing project)
• Leverage Superset and Data Management Application
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• ML Feature Visualization
37. [Diagram: metrics batch ingestion runs on a new or existing Spark cluster and streaming ingestion on a long-running Spark cluster; both output metrics to BedRock, the data visualization and management application (work in progress).]
40. SLACK METRICS DELIVERY
• Why Slack?
  • Push vs. pull: easy access
  • Avoids another login to view metrics: when connected to Slack, you are already logged in
• Move key metrics away from the Tableau dashboard and put metrics generation into the software engineering process
  • SQL code is under version control
  • The publishing job is scheduled and its performance is monitored
• Discussion, questions, and comments on a specific metric can happen directly in the channel with the people involved
41. SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
42. SLACK METRICS CONCEPTS
• Slack Job ⇒ Channels (private channels) ⇒ Metrics Groups ⇒ Metrics 1 … Metrics N
• Each metric consists of:
  • Main query
  • Compare query (optional)
  • Chart query (optional)
  • Persistence (optional): Hive + S3
  • Additional deliveries (optional): Kafka, other cache stores (HTTP POST)
43. SLACK KPI DELIVERY ARCHITECTURE
[Diagram: the Slack Spark job saves metrics to a Hive table, posts Plot.ly metrics JSON to a REST API server that generates the graph, saves/gets the image, and returns the image URL; the job then posts the Slack message JSON to Slack via webhooks (HTTP POST).]
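For the final hop in this flow, a minimal Scala sketch of posting a message to a Slack incoming webhook; the payload fields and values below are placeholders, not GoPro's actual delivery code.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

def postToSlack(webhookUrl: String, messageJson: String): Int = {
  val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  val out = conn.getOutputStream
  try out.write(messageJson.getBytes(StandardCharsets.UTF_8)) finally out.close()
  conn.getResponseCode   // Slack webhooks return 200 on success
}

// Placeholder usage: a simple two-column metrics message (values are dummies)
postToSlack(
  "https://hooks.slack.com/services/XXXX",
  """{"text": "*Weekly GoPro Plus metrics*",
     "attachments": [{"fields": [
       {"title": "New subscribers", "value": "1,234", "short": true},
       {"title": "Week-over-week", "value": "+5.6%", "short": true}]]}""")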
44. CONFIGURATION-DRIVEN
slack-plus-push-weekly {   // job name
  persist-metrics = "true"
  channels {
    dse-metrics {
      post-urls {
        plus-metrics     = "https://hooks.slack.com/services/XXXX"
        dse-metrics-test = "https://hooks.slack.com/services/XXXX"
      }
      plus-metrics {   // metrics group
        // metrics in the same group are delivered together in one message
        // metrics in different groups are delivered as separate messages
        // overwrite the template above with a specific name
      }
    }
  }
} // slack-plus-push-weekly
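To make the concepts from slide 42 concrete, a hypothetical sketch of a single metric inside the metrics group above; the key names and SQL are illustrative only, not GoPro's actual configuration schema.

plus-metrics {                      // metrics group (as above)
  weekly-active-subscribers {       // one metric; all keys below are hypothetical
    main-query    = "select count(distinct user_id) from plus.subscriptions where ..."
    compare-query = "select count(distinct user_id) from plus.subscriptions where ..."   // prior period, optional
    chart-query   = "select week, count(distinct user_id) from plus.subscriptions group by week"   // optional
    persistence   { hive-table = "metrics.plus_weekly" }   // optional: Hive + S3
  }
}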
46. SLACK DELIVERY BENEFITS
• Pros
  • Quick and easy access via Slack
  • Can quickly deliver to engineering managers, executives, business owners, and product managers
  • 100+ members have subscribed to the different channels since we launched the service
• Cons
  • Limited by Slack UI real estate: we can only display key metrics in a two-column format, so it is only suitable for high-level summary metrics
48. EVOLUTION OF DATA PLATFORM
• Before 2016: Single Hadoop cluster ⇒ three Hadoop clusters; Hive/Impala SQL with Tableau reports; fixed-size Hadoop clusters
• 2016 (Batch/Streaming Ingestion Framework): hard-coded Hive SQL ⇒ Spark jobs and Spark SQL on a fixed Hadoop cluster
• 2017 (Platform Architecture Transformation): fixed Hadoop clusters ⇒ dynamic elastic clusters; centralized Hive metastore; HDFS replaced with S3
• 2018 (Data Democratization, ML Infrastructure): Slack metrics delivery; Druid OLAP database and visualization with Superset integration; Data Management Platform; initial attempts at ML infrastructure
49. FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview ?
50. FACETS OVERVIEW INTRODUCTION
• From the Facets home page: https://pair-code.github.io/facets/
  • "Facets Overview" takes input feature data from any number of datasets, analyzes them feature by feature, and visualizes the analysis.
  • Overview can help uncover issues with datasets, including the following:
    • Unexpected feature values
    • Missing feature values for a large number of examples
    • Training/serving skew
    • Training/test/validation set skew
  • Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets.
  • Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red.
  • Features can be sorted by values of interest such as the number of missing values or the skew between the different datasets.
52. FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of:
  • Feature Statistics protocol buffer definition
  • Feature statistics generation
  • Visualization
• Visualization
  • The visualizations are implemented as Polymer web components, backed by TypeScript code
  • They can be embedded into Jupyter notebooks or webpages
• Feature statistics generation
  • There are two implementations of the stats generation: Python and JavaScript
  • Python: uses NumPy and pandas to generate the stats
  • JavaScript: generates the stats in JavaScript
  • Both implementations run the stats generation in the browser
54. FEATURE OVERVIEW SPARK
• Initial exploration attempt
  • Is it possible to handle larger datasets while keeping the generated stats small?
  • Can we generate the stats using Spark's distributed computing capability instead of just one node?
  • Can we generate the stats in Spark and then use them from Python and/or JavaScript?
56. PREPARE SPARK DATA FRAME
case class NamedDataFrame(name: String, data: DataFrame)

// Census "adult" dataset: define column names, then load the train and test CSV files
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)

// Name each DataFrame so the generated stats can be compared across datasets
val dataframes = List(NamedDataFrame(name = "train", train),
                      NamedDataFrame(name = "test", test))
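The loadCSVFile helper is not shown on the slide; below is a minimal sketch under the assumptions that the adult CSV files have no header row and that a SparkSession is in scope (both assumptions, not the exact GoPro code).

import org.apache.spark.sql.{DataFrame, SparkSession}

def loadCSVFile(path: String)(implicit spark: SparkSession): DataFrame =
  spark.read
    .option("header", "false")      // assumption: no header row; columns are renamed via toDF above
    .option("inferSchema", "true")  // infer numeric vs. string column types
    .csv(path)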
57. SPARK FACETS STATS GENERATOR
// Generate the Facets feature statistics protobuf from the named DataFrames
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)

// Persist the proto so it can be loaded by the Facets visualization
persistProto(proto)
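persistProto is not shown either; a minimal sketch assuming the proto is a ScalaPB-generated DatasetFeatureStatisticsList, written both as raw protobuf bytes and as the Base64-encoded string mentioned in the findings below (the file names and encoding choice are illustrative).

import java.nio.file.{Files, Paths}
import java.util.Base64

def persistProto(proto: DatasetFeatureStatisticsList): Unit = {
  val bytes = proto.toByteArray                           // serialize the protobuf message
  Files.write(Paths.get("feature_stats.pb"), bytes)       // binary form (loadable from Python)
  val b64 = Base64.getEncoder.encodeToString(bytes)       // Base64 form (loadable from Python and JavaScript)
  Files.write(Paths.get("feature_stats.pb.txt"), b64.getBytes("UTF-8"))
}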
61. INITIAL FINDINGS
• Implementation
  • The first-pass implementation is not efficient
    • We go through each feature in multiple passes; as the number of features grows, performance suffers, which limits how many features can be used
  • The size of the dataset used to generate the stats also determines the size of the generated protobuf file
    • I have not dug deeper into what contributes to the change in size
    • The combination of data size and feature count can produce a file too large to fit in the browser
  • With Spark DataFrames, we cannot support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by Python or JavaScript
  • The protobuf binary file can also be loaded by Python
  • But for some reason it cannot be loaded by JavaScript
62. WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what is the right data size to generate stats small enough to load into a browser or notebook?
  • For example, one experiment: 300 features ⇒ 200 MB stats file
• How do we efficiently partition the features so that they remain viewable?
• Data is changing: how can we incrementally update the stats on a regular basis?
• How do we integrate this into production?
64. FINAL THOUGHTS
• We are still in the early stages of the data platform evolution
• We will continue to share our experience with you along the way
• Questions?
Thank You
Chester Chen, Ph.D.
Data Science & Engineering
GoPro