Analytics metrics delivery and
ML Feature visualization
Evolution of Data Platform at GoPro
ABOUT SPEAKER: CHESTER CHEN
• Head of Data Science & Engineering (DSE) at GoPro
• Prev. Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Analytics Metrics Delivery via Slack
• ML Feature Visualization with Google Facets and Spark
GROWING DATA NEED FROM GOPRO ECOSYSTEM
DATA
Analytics
Platform
Consumer Devices GoPro Apps
E-Commerce Social Media/OTT
3rd party data
Product Insight
User segmentation
CRM/Marketing
/Personalization
EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• Feature adoption, user engagement, user segmentation, churn analysis, funnel analysis,
conversion rate, etc.
• Web/E-Commerce Analytics
• Camera Analytics
• Scene change detection, feature usage, etc.
• Mobile Analytics
• Camera connections, story sharing, etc.
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
• Media classifications, recommendations, storage analysis.
Evolution of Data Platform
EVOLUTION OF DATA PLATFORM
2016
Batch/Streaming
Ingestion Framework
Hard-coded Hive SQL ⇒
Spark Jobs and Spark SQL in
Fixed Hadoop Cluster
2017
Platform Architecture
Transformation
Fixed Hadoop Clusters ⇒
Dynamic Elastic Cluster,
Centralize Hive Metastore,
Replace HDFS with S3
2018
Data Democratization,
ML Infrastructure
Slack Metrics delivery, add
Druid OLAP database &
Visualization with superset
integration, Data Management
Platform
Initial attempts on ML
Infrastructure
Before 2016
Single Hadoop Cluster ⇒
Three Hadoop Clusters
Hive/Impala SQL with Tableau
Reports
Fixed-size Hadoop
Clusters
DATA PLATFORM ARCHITECTURE TRANSFORMATION
Batch Ingestion
Framework
Batch Ingestion
Pre-processing
Streaming ingestion
Batch Ingestion
Cloud-Based Elastic Clusters
PLOT.LY SERVER
TABLEAU SERVER
EXTERNAL SERVICE
Notebook
Rest API,
FTP
S3 sync,etc
Dynamic DDL
State Sync
Parquet
STREAMING PIPELINES
Spark Cluster
Long Running Cluster
BATCH JOBS
Job Gateway
Spark Cluster
Scheduled Jobs
New cluster per Job
Dev
Machines
Spark Cluster
Dev Jobs
New or existing cluster
Production
Job.conf
Dev
Job.conf
INTERACTIVE/NOTEBOOKS
Spark Cluster
Long Running Clusters
Notebooks Scripts
(SQL, Python, Scala)
Scheduled Notebook Jobs
auto-scale
mixed on-demand &
spot Instances
AIRFLOW SETUP
[Diagram components: Web Server LB in front of Web Server A and Web Server B; Scheduler; Airflow Metastore; Message Queue; multiple Workers. Diagram labels: Airflow DAGs, sync, Push DAGs to S3]
TAKEAWAYS
• Key Changes
• Centralize Hive metastore
• Separate compute and storage needs
• Leverage S3 as storage
• Horizontal scale with cluster elasticity
• Less time spent managing infrastructure
• Key Benefits
• Cost
• Reduce redundant storage and compute cost
• Use smaller instance types
• 60% AWS cost savings compared to one year ago
• Operation
• Reduce the complexity of DevOps support
• Analytics tools
• SQL only ⇒ Notebook with (SQL, Python, Scala)
CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK
Hive SQL ⇒ Spark
BATCH INGESTION
GoPro Product data
3rd Parties Data
3rd Parties Data
3rd Parties Data
Rest APIs
sftp
s3 sync
s3 sync
Batch Data Downloads Input File Formats: CSV, JSON
Spark Cluster
New cluster per Job
TABLE WRITER JOBS
• Jobs are identified by JobType, JobName, JobConfig
• The majority of the Spark ETL jobs are Table Writers
• Load data into a DataFrame
• DataFrame-to-DataFrame transformation
• Output the DataFrame to a Hive table
• Most table writer jobs can be decomposed into one of the
following sub-jobs
TABLE WRITER JOBS
SparkJob
HiveTableWriter
JDBCToHiveTableWriter
AbstractCSVHiveTableWriter AbstractJSONHiveTableWriter
CSVTableWriter JSONTableWriter
FileToHiveTableWriter
HBaseToHiveTableWriter TableToHiveTableWriter
HBaseSnapshotJob
TableSnapshotJob
CoreTableWriter
Customized JSON Job
Customized CSV Job
mixin
All jobs share the same configuration loading,
job state and error reporting
All table writers have Dynamic DDL
capabilities; once the data becomes a DataFrame,
they all behave the same way
CSV and JSON have
different loaders
HBase needs a different loader to
load HBase records into a
DataFrame
Aggregate Jobs
HIVE TABLE WRITER JOB
trait HiveTableWriter extends CoreHiveTableWriter with SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config)
  def load(sqlContext: SQLContext, ioInfos: Seq[(String, Seq[InputOutputInfo])]): Seq[(InputOutputInfo, DataFrame)]
  def initProcess(sqlContext: SQLContext, jobTypeConfig: Config, jobConfig: Config)
  def preProcess(hadoopConf: Configuration, ioInfos: Seq[InputOutputInfo]): Seq[InputOutputInfo]
  def process(jobName: String, sqlContext: SQLContext, ioInfos: Seq[InputOutputInfo], jobTypeConfig: Config, jobConfig: Config)
  def postProcess(….)
  def getInputOutputInfos(sc: SparkContext, jobName: String, jobTypeConfig: Config, jobConfig: Config) : Seq[InputOutputInfo]
  def groupIoInfos(ioInfos: Seq[InputOutputInfo]): Seq[(String, Seq[InputOutputInfo])]
}
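The load, transform, and write steps that every table writer shares can be illustrated with a small template-method sketch. This is plain Scala with in-memory stand-ins, not the GoPro implementation: `Row`, `TableWriter`, `InMemoryCsvWriter`, and the `tables` map are hypothetical names that replace Spark DataFrames and Hive tables so the example runs without a cluster.

```scala
// Plain-Scala stand-in for the load -> transform -> write flow of a table writer.
// Row and the in-memory "tables" map are illustrative substitutes for Spark
// DataFrames and Hive tables; none of these names come from the GoPro code base.
object TableWriterSketch {
  type Row = Map[String, String]

  trait TableWriter {
    def load(): Seq[Row]                            // e.g. parse CSV or JSON input
    def transform(rows: Seq[Row]): Seq[Row]         // DataFrame-to-DataFrame step
    def write(table: String, rows: Seq[Row]): Unit  // output to the target table

    // Template method: every writer runs the same three steps in order.
    final def run(table: String): Unit = write(table, transform(load()))
  }

  // Hypothetical CSV writer that keeps its output in memory.
  class InMemoryCsvWriter(lines: Seq[String]) extends TableWriter {
    val tables = scala.collection.mutable.Map.empty[String, Seq[Row]]

    // First line is the header; remaining lines are records.
    def load(): Seq[Row] = {
      val header = lines.head.split(",").map(_.trim)
      lines.tail.map(l => header.zip(l.split(",").map(_.trim)).toMap)
    }

    // Example transformation: lower-case all column names.
    def transform(rows: Seq[Row]): Seq[Row] =
      rows.map(r => r.map { case (k, v) => k.toLowerCase -> v })

    def write(table: String, rows: Seq[Row]): Unit = tables(table) = rows
  }
}
```

Because only `load` differs between input formats in this sketch, a JSON or HBase variant would override that one method and reuse the rest, which mirrors the point that all writers behave the same once the data is a DataFrame.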
ETL JOB CONFIGURATION
gopro.dse.config.etl {
mobile-job {
conf {}
process {}
input {}
output {}
post.process {}
}
}
include classpath("conf/production/etl_mobile_quik.conf")
include classpath("conf/production/etl_mobile_capture.conf")
include classpath("conf/production/etl_mobile_product_events.conf")
Job-level conf overrides JobType conf
Job specifics include
JobType
JobName
Input & output specification
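The override rule can be sketched with plain maps. This is a simplification under stated assumptions: the real framework uses nested HOCON blocks, while `Conf`, `effectiveConf`, and the keys in the usage note are hypothetical and flatten the config into key/value pairs.

```scala
object ConfOverrideSketch {
  // Flat key/value view of a config; a simplification of the nested HOCON
  // blocks shown in the slides.
  type Conf = Map[String, String]

  // Job-level entries win; anything the job does not set falls back to the
  // job-type defaults.
  def effectiveConf(jobTypeConf: Conf, jobConf: Conf): Conf = jobTypeConf ++ jobConf
}
```

For example, merging job-type defaults `Map("output.file.format" -> "parquet", "output.save.mode" -> "append")` with a job-level `Map("output.save.mode" -> "overwrite")` keeps `parquet` and switches the save mode to `overwrite`. With the Typesafe Config library, the equivalent would likely be `jobConf.withFallback(jobTypeConf)`.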
ETL JOB CONFIGURATION
xyz {
  process {}
  input {
    delimiter = ","
    inputDirPattern = "s3a://teambucket/xyz/raw/production"
    file.ext = "csv"
    file.format = "csv"
    date.format = "yyyy-MM-dd hh:mm:ss"
    table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName"
  }
  output {
    database = "mobile"
    file.format = "parquet"
    date.format = "yyyy-MM-dd hh:mm:ss"
    partitions = 2
    file.compression.codec.key = "spark.sql.parquet.compression.codec"
    file.compression.codec.value = "gzip"
    save.mode = "append"
    transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer]
  }
  post.process {
    deleteSource = true
  }
}
Save Mode
JobName
Input specification
output specification
Files need to go to the proper tables
TABLE NAME GENERATION
• Table Name Extractor
• From File Name
• From Directory Name
• Custom Plugin
EXTRACT TABLE NAMES
• From File Name
• /databucket/3rdparty/ABC/campaign-20180212.csv
• /databucket/3rdparty/ABC/campaign-20180213.csv
• /databucket/3rdparty/ABC/campaign-20180214.csv
• Table Name, File Date
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
• From Directory Name
• /databucket/3rdparty/ABC/campaign/file-20180212.csv
• /databucket/3rdparty/ABC/campaign/file-20180213.csv
• /databucket/3rdparty/ABC/campaign/file-20180214.csv
• Table Name, File Date
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
• From ID Mapping
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/10.log.gz
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/11.log.gz
• /databucket/ABC/2018/02/17/ae6905b068c7beb08d681a5/12.log.gz
• /databucket/ABC/2018/02/18/ae6905b068c7beb08d681a5/13.log.gz
• Configuration
• b2a932aeddbf0f11bae9573 ⇒ mobile_ios
• ae6905b068c7beb08d681a5 ⇒ mobile_android
• Table Name, File Date
• (mobile_ios, 2017-01-11)
• (mobile_android, 2018-02-17)
• (mobile_android, 2018-02-18)
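The three extraction strategies can be sketched with regular expressions over the example paths. This is a minimal illustration, not the framework's actual extractor plugins: the object name, function names, and regex patterns are assumptions that only cover the path layouts shown on this slide.

```scala
object TableNameExtractorSketch {
  // (table name, file date as yyyy-MM-dd)
  type TableAndDate = (String, String)

  // .../<table>-<yyyyMMdd>.csv
  private val FileName = """.*/([A-Za-z_]+)-(\d{4})(\d{2})(\d{2})\.csv""".r
  // .../<table>/<anything>-<yyyyMMdd>.csv
  private val DirName = """.*/([A-Za-z_]+)/[^/]+-(\d{4})(\d{2})(\d{2})\.csv""".r
  // .../<yyyy>/<MM>/<dd>/<hex id>/<file>
  private val IdPath = """.*/(\d{4})/(\d{2})/(\d{2})/([0-9a-f]+)/[^/]+""".r

  def fromFileName(path: String): Option[TableAndDate] = path match {
    case FileName(table, y, m, d) => Some((table, s"$y-$m-$d"))
    case _                        => None
  }

  def fromDirectoryName(path: String): Option[TableAndDate] = path match {
    case DirName(table, y, m, d) => Some((table, s"$y-$m-$d"))
    case _                       => None
  }

  // The ID found in the path is looked up in a configured ID-to-table mapping.
  def fromIdMapping(idToTable: Map[String, String])(path: String): Option[TableAndDate] =
    path match {
      case IdPath(y, m, d, id) => idToTable.get(id).map(t => (t, s"$y-$m-$d"))
      case _                   => None
    }
}
```

Each function returns `None` for a path it does not recognize, so a caller could report unmatched files instead of writing them to the wrong table.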
Data Transformation
ETL With SQL & Scala
DATA TRANSFORMATION
• Hive SQL over JDBC via Beeline
• Suitable for non-Java/Scala/Python programmers
• Spark Job
• Requires Spark and Scala knowledge; need to set up jobs, configurations, etc.
• Dynamic Scala Scripts
• Scala as script: compile Scala at runtime, mixed with Spark SQL
SCALA SCRIPTS
• Define a special SparkJob: Spark Job Code Runner
• Load Scala script files from a specified location (defined by config)
• Dynamically compile the Scala code into classes
• For the compiled classes: run the Spark jobs defined in the scripts
• Twitter Eval util: dynamically evaluates Scala strings and files.
• <groupId>com.twitter</groupId>
<artifactId>util-eval_2.11</artifactId>
<version>6.24.0</version>
SCALA SCRIPTS
object SparkJobCodeRunner extends SparkJob {
  private val LOG = LoggerFactory.getLogger(getClass)
  import collection.JavaConverters._

  override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val jobFileNames: List[String] = //...
    jobFileNames.foreach { x =>
      val clazzes: Option[Any] = evalFromFileName[Any](x)
      clazzes.foreach { c =>
        c match {
          case job: SparkJob => job.run(sc, jobType, jobName, config)
          case _ => LOG.info("not match")
        }
      }
    }
  }
}
SCALA SCRIPTS
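import com.twitter.util.Eval
import org.apache.commons.io.IOUtils
import org.apache.hadoop.fs.Path
import scala.util.Try

def evalFromFile[T](path: Path)(implicit header: String = ""): Option[T] = {
  val fs = //get Hadoop File System …
  eval[T](IOUtils.toString(fs.open(path), "UTF-8"))(header)
}

// Returns None if the code fails to compile or evaluate.
def eval[T](code: String)(implicit header: String = ""): Option[T] =
  Try(new Eval().apply[T](header + "\n" + code)).toOption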
SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = … //define DataFrame Schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    sqlContext.sql(""" set hive.exec.dynamic.partition.mode=nonstrict
      set hive.enforce.bucketing=false
      set hive.auto.convert.join=false
      set hive.merge.mapredfiles=true""")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
      select row_number() over (partition by metadata_file_name order by log_ts) , …. """)
    //rest of code
  }
}
// The last expression of the script is the job instance returned to the code runner
new CameraAggCaptureMainJob
Data Democratization,
Visualization and Data
Management
DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• Data Metrics Delivery
• Delivery to Slack : make metrics more accessible to broader audience
• Data Slice & Dice
• Leverage Real-Time OLAP database (Druid) (ongoing project)
• Analytics Visualization (ongoing project)
• Leverage Superset and Data Management Application
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• ML Feature Visualization
Spark Cluster
New or existing cluster
Spark Cluster
Long Running Cluster
Metrics Batch Ingestion
Streaming Ingestion
Output Metrics
BedRock
DATA
VISUALIZATION
&
MANAGEMENT
Work in Progress
Delivery Metrics via Slack
SLACK METRICS DELIVERY
• Why Slack?
• Push vs. Pull -- Easy Access
• Avoid another login to view metrics
• When Slack is connected, you are already logged in
• Move key metrics away from Tableau dashboards and put
metrics generation into the software engineering process
• SQL code is under source control
• Publishing jobs are scheduled and performance is monitored
• Discussion/questions/comments on specific metrics can
happen directly in the channel with the people involved.
SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
SLACK METRICS CONCEPTS
• Slack Job ⇒
• Channels (private channels) ⇒
• Metrics Groups ⇒
• Metrics1
• …
• MetricsN
• Main Query
• Compare Query (Optional)
• Chart Query (Optional)
• Persistence (Optional)
• Hive + S3
• Additional deliveries (Optional)
• Kafka
• Other cache stores (HTTP POST)
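The hierarchy above maps naturally onto a few case classes. The sketch below is an assumption about how such a model might look, not the framework's real types: all class and field names are hypothetical, and `percentChange` is one plausible way to relate a main-query value to its compare-query value.

```scala
object SlackMetricsModelSketch {
  // Hierarchy from the slide: job -> channels -> metrics groups -> metrics.
  final case class Metric(
      name: String,
      mainQuery: String,
      compareQuery: Option[String] = None, // optional
      chartQuery: Option[String] = None)   // optional
  final case class MetricsGroup(name: String, metrics: Seq[Metric])
  final case class Channel(name: String, groups: Seq[MetricsGroup])
  final case class SlackJob(name: String, channels: Seq[Channel], persistMetrics: Boolean)

  // Percent change between the current value (main query) and the previous
  // period's value (compare query); None when there is no usable baseline.
  def percentChange(current: Double, previous: Double): Option[Double] =
    if (previous == 0.0) None else Some((current - previous) / previous * 100.0)
}
```

Modeling the optional queries as `Option` keeps a metric with only a main query valid, which matches the "(Optional)" markers in the list.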
SLACK KPI DELIVERY ARCHITECTURE
Slack message json
HTTP POST Rest API Server
Rest API Server
generate graph
Metrics JSON
Return Image
HTTP POST
Save/Get Image
Plot.ly json
Save Metrics to Hive Table
Slack Spark Job
Get Image URL
Webhooks
CONFIGURATION-DRIVEN
slack-plus-push-weekly { //job name
  persist-metrics = "true"
  channels {
    dse-metrics {
      post-urls {
        plus-metrics = "https://hooks.slack.com/services/XXXX"
        dse-metrics-test = "https://hooks.slack.com/services/XXXX"
      }
      plus-metrics { //metrics group
        //metrics in the same group are delivered together in one message
        //metrics in different groups are delivered as separate messages
        //overwrite above template with specific name
      }
    }
  }
} //slack-plus-push-weekly
SLACK METRICS CONFIGURATION
slack-mobile-push-weekly.channels.mobile-metrics.capture-metrics { //Job, Channel, KPI Group
  //…
  weekly-capture-users-by-platform { //metrics name
    slack-display.attachment.title = "GoPro Mobile App -- Users by Platform"
    metric-period = "weekly"
    slack-display.chartstyle { … }
    query = """ … """
    compare.query = """ … """
    chart.query = """ … """
  }
  //rest of configuration
}
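A metric configured this way eventually becomes a JSON body posted to a Slack incoming webhook. The sketch below only builds that body and is an illustration under assumptions: `payload` and `esc` are hypothetical helpers, the metric values are passed in as already-formatted strings, and the layout uses Slack's attachment `fields` with `short` set so values appear side by side.

```scala
object SlackPayloadSketch {
  // Minimal JSON string escaping for backslashes and quotes only.
  private def esc(s: String): String = s.replace("\\", "\\\\").replace("\"", "\\\"")

  // Builds a Slack incoming-webhook body with one attachment. Each metric is a
  // "short" field, which Slack lays out side by side in two columns.
  def payload(title: String, metrics: Seq[(String, String)]): String = {
    val fields = metrics.map { case (name, value) =>
      s"""{"title":"${esc(name)}","value":"${esc(value)}","short":true}"""
    }.mkString(",")
    s"""{"attachments":[{"title":"${esc(title)}","fields":[$fields]}]}"""
  }
}
```

The resulting string would be sent with an HTTP POST to one of the `post-urls` webhook addresses; a production version would more likely use a JSON library than hand-built strings.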
SLACK DELIVERY BENEFITS
• Pros
• Quick and easy access via Slack
• Can quickly deliver to engineering managers, executives, business owners and product
managers
• 100+ members have subscribed to different channels since we launched the service
• Cons
• Limited by Slack UI real estate: can only display key metrics in a two-column format,
only suitable for high-level summary metrics
Machine Learning Feature
Visualization with Facets + Spark
FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview ?
FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• Facets Overview "takes input feature data from any number of datasets, analyzes them feature by
feature and visualizes the analysis."
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.
FACETS OVERVIEW SAMPLE
FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of
• Feature Statistics Protocol Buffer definition
• Feature Statistics Generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed
by Typescript code
• It can be embedded into Jupyter notebooks or webpages.
• Feature Statistics Generation
• There are two implementations of the stats generation: Python and JavaScript
• Python: uses numpy and pandas to generate stats
• JavaScript: uses JavaScript to generate stats
• Both implementations run the stats generation in the browser
FACETS OVERVIEW
FEATURE OVERVIEW SPARK
• Initial exploration attempt
• Is it possible to process larger datasets while keeping the stats size small?
• Can we generate the stats using Spark's distributed computing capability
instead of just one node?
• Can we generate the stats in Spark and then use them from Python
and/or JavaScript?
FACETS OVERVIEW + SPARK
ScalaPB
PREPARE SPARK DATA FRAME
case class NamedDataFrame(name:String, data: DataFrame)
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
NamedDataFrame(name = "test", test))
SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
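The slides do not show `persistProto`. One plausible building block is Base64-encoding the serialized message so it can be handed to Python or JavaScript as a string. The sketch below assumes the protobuf has already been serialized to bytes (ScalaPB messages expose `toByteArray` for this); `ProtoBase64Sketch`, `toBase64`, and `fromBase64` are hypothetical names.

```scala
import java.util.Base64

object ProtoBase64Sketch {
  // Encodes serialized protobuf bytes as a Base64 string that a notebook or
  // web page can embed and decode on its side.
  def toBase64(protoBytes: Array[Byte]): String =
    Base64.getEncoder.encodeToString(protoBytes)

  // Inverse operation, useful for checking the round trip.
  def fromBase64(encoded: String): Array[Byte] =
    Base64.getDecoder.decode(encoded)
}
```

With a ScalaPB message this would be called as `ProtoBase64Sketch.toBase64(proto.toByteArray)`, and the resulting string could be written to a file or cell output.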
SPARK FACETS STATS GENERATOR
def protoFromDataFrames(dataFrames: List[NamedDataFrame],
features : Set[String] = Set.empty[String],
histgmCatLevelsCount:Option[Int]=None): DatasetFeatureStatisticsList
FACET OVERVIEW SPARK
DEMO
INITIAL FINDINGS
• Implementation
• The first-pass implementation is not efficient
• We have to go through each feature in multiple passes; as the number of features increases,
performance suffers, which limits the number of features that can be used
• The size of the dataset used to generate stats also determines the size of the generated protobuf file
• I haven't dived deeper into what contributes to the change in size
• The combination of data size and feature count can produce a large file that won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by Python or JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript.
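One possible direction for the multiple-pass problem is to update the summaries of all features during a single traversal of the data. The sketch below shows the idea on plain Scala collections rather than Spark, and it is not the talk's implementation: `Summary`, `Empty`, and `summarize` are hypothetical names, and only count, missing count, sum, min, and max are tracked.

```scala
object SinglePassStatsSketch {
  // Running summary for one numeric feature.
  final case class Summary(count: Long, missing: Long, sum: Double, min: Double, max: Double) {
    def add(value: Option[Double]): Summary = value match {
      case Some(v) =>
        copy(count = count + 1, sum = sum + v, min = math.min(min, v), max = math.max(max, v))
      case None => copy(missing = missing + 1)
    }
    def mean: Option[Double] = if (count == 0) None else Some(sum / count)
  }

  val Empty: Summary =
    Summary(0L, 0L, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)

  // One traversal of the rows updates the summaries of all features together,
  // instead of re-scanning the data once per feature.
  def summarize(features: Seq[String], rows: Iterable[Map[String, Double]]): Map[String, Summary] =
    rows.foldLeft(features.map(_ -> Empty).toMap) { (acc, row) =>
      acc.map { case (f, s) => f -> s.add(row.get(f)) }
    }
}
```

In Spark, the analogous approach would be to compute many aggregate expressions in one aggregation over the DataFrame, so the cost grows with the data scanned once rather than with the number of features times the data.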
WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what is the right input size to
generate stats that can still be loaded into a browser or notebook?
• For example, one experiment: 300 features ⇒ 200 MB size
• How do we efficiently partition the features so that they remain viewable?
• Data is changing: how can we incrementally update the stats on a regular
basis?
• How do we integrate this into production?
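For the incremental-update question, one option is to keep only statistics that can be merged, so a new batch is summarized on its own and then combined with the stored result. The sketch below illustrates that idea and is not a proposal from the talk: `Stats`, `of`, and `update` are hypothetical names, and histograms or quantiles, which Facets also displays, would need a different mergeable structure.

```scala
object IncrementalStatsSketch {
  final case class Stats(count: Long, sum: Double, min: Double, max: Double) {
    // Combine with stats computed on another batch of data.
    def merge(other: Stats): Stats =
      Stats(count + other.count, sum + other.sum,
        math.min(min, other.min), math.max(max, other.max))
    def mean: Double = if (count == 0) Double.NaN else sum / count
  }

  // Summarize one batch of values for a single feature.
  def of(values: Seq[Double]): Stats =
    Stats(
      values.size.toLong,
      values.sum,
      if (values.isEmpty) Double.PositiveInfinity else values.min,
      if (values.isEmpty) Double.NegativeInfinity else values.max)

  // Update stored per-feature stats with those of a newly arrived batch;
  // features seen for the first time are simply added.
  def update(stored: Map[String, Stats], batch: Map[String, Stats]): Map[String, Stats] =
    batch.foldLeft(stored) { case (acc, (feature, s)) =>
      acc.updated(feature, acc.get(feature).map(_.merge(s)).getOrElse(s))
    }
}
```

Because `merge` only needs the two summaries, the historical data never has to be rescanned when a new day or week of data arrives.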
FINAL THOUGHTS
• We are still in the early stages of Data Platform evolution.
• We will continue to share our experience with you along the way.
• Questions?
Thank You
Chester Chen, Ph.D.
Data Science & Engineering
GoPro

Dynamic DDL: Adding structure to streaming IoT data on the fly
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftLoading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
 
Cowboy dating with big data, Борис Трофімов
Cowboy dating with big data, Борис ТрофімовCowboy dating with big data, Борис Трофімов
Cowboy dating with big data, Борис Трофімов
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
SAP Business Objects Trianing
SAP Business Objects TrianingSAP Business Objects Trianing
SAP Business Objects Trianing
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 
Loading Data into Redshift with Lab
Loading Data into Redshift with LabLoading Data into Redshift with Lab
Loading Data into Redshift with Lab
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 


Analytics Metrics Delivery & ML Feature Visualization

  • 2. Analytics Metrics Delivery and ML Feature Visualization: Evolution of Data Platform at GoPro
  • 3. ABOUT SPEAKER: CHESTER CHEN • Head of Data Science & Engineering (DSE) at GoPro • Prev. Director of Engineering, Alpine Data Labs • Founder and Organizer of SF Big Analytics meetup
  • 4. AGENDA • Business Use Cases • Evolution of GoPro Data Platform • Analytics Metrics Delivery via Slack • ML Feature Visualization with Google Facets and Spark
  • 5. GROWING DATA NEED FROM GOPRO ECOSYSTEM
  • 6. DATA ANALYTICS PLATFORM • Data sources: Consumer Devices, GoPro Apps, E-Commerce, Social Media/OTT, 3rd-party data • Uses: Product Insight, User Segmentation, CRM/Marketing/Personalization
  • 7. EXAMPLES OF ANALYTICS USE CASES • Product Analytics • Feature adoption, user engagement, user segmentation, churn analysis, funnel analysis, conversion rate, etc. • Web/E-Commerce Analytics • Camera Analytics • Scene change detection, feature usage, etc. • Mobile Analytics • Camera connections, story sharing, etc. • GoPro Plus Analytics • CRM Analytics • Digital Marketing Analytics • Social Media Analytics • Cloud Media Analysis • Media classification, recommendations, storage analysis.
  • 8. Evolution of Data Platform
  • 9. EVOLUTION OF DATA PLATFORM • Before 2016: Fixed-size Hadoop clusters. Single Hadoop cluster ⇒ three Hadoop clusters; Hive/Impala SQL with Tableau reports • 2016: Batch/streaming ingestion framework. Hard-coded Hive SQL ⇒ Spark jobs and Spark SQL in a fixed Hadoop cluster • 2017: Platform architecture transformation. Fixed Hadoop clusters ⇒ dynamic elastic clusters, centralized Hive Metastore, HDFS replaced with S3 • 2018: Data democratization and ML infrastructure. Slack metrics delivery; Druid OLAP database and visualization with Superset integration; Data Management Platform; initial attempts at ML infrastructure
  • 11. DATA PLATFORM ARCHITECTURE TRANSFORMATION [Diagram labels: Streaming Ingestion; Batch Ingestion (REST API, FTP, S3 sync, etc.); Pre-processing; Batch Ingestion Framework; Dynamic DDL; State Sync; Parquet; Cloud-Based Elastic Clusters; Notebook; Plot.ly Server; Tableau Server; External Service]
  • 13. BATCH JOBS • Production: Scheduled Jobs, Job Gateway, Spark Cluster (new cluster per job), Production Job.conf • Development: Dev Machines, Dev Jobs, Spark Cluster (new or existing cluster), Dev Job.conf
  • 14. INTERACTIVE/NOTEBOOKS • Spark Cluster: long-running clusters • Notebooks • Scripts (SQL, Python, Scala) • Scheduled notebook jobs • Auto-scale • Mixed on-demand & spot instances
  • 16. AIRFLOW SETUP [Diagram labels: Web Server LB; Web Server A; Web Server B; Scheduler; Airflow Metastore; Message Queue; Workers; Airflow DAGs sync; Push DAGs to S3]
  • 17. TAKEAWAYS • Key Changes • Centralize the Hive Metastore • Separate compute and storage needs • Leverage S3 as storage • Scale horizontally with cluster elasticity • Spend less time managing infrastructure • Key Benefits • Cost • Reduce redundant storage and compute cost • Use smaller instance types • 60% AWS cost savings compared to one year earlier • Operation • Reduce the complexity of DevOps support • Analytics tools • SQL only ⇒ notebooks with SQL, Python, Scala
  • 18. CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK HIVE SQL ⇒ Spark
  • 20. BATCH INGESTION • Sources: GoPro product data, 3rd-party data • Batch data downloads via REST APIs, SFTP, S3 sync • Input file formats: CSV, JSON • Spark Cluster: new cluster per job
  • 21. TABLE WRITER JOBS • Jobs are identified by JobType, JobName, JobConfig • The majority of the Spark ETL jobs are table writers • Load data into a DataFrame • DataFrame-to-DataFrame transformation • Output the DataFrame to a Hive table • The majority of table writer jobs can be decomposed into one of the following sub-jobs
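The three table-writer steps listed on slide 21 (load, transform, output) can be sketched as a small template-method trait. This is only an illustration under loud assumptions: `Frame` is a hypothetical in-memory stand-in for a Spark DataFrame, `tables` is a fake warehouse replacing Hive, and the names `TableWriterSketch`, `TableWriter`, and `CsvWriter` are invented here and are not the deck's classes.

```scala
object TableWriterSketch {
  type Row = Map[String, String]
  type Frame = Seq[Row] // ASSUMPTION: in-memory substitute for a Spark DataFrame

  trait TableWriter {
    def load(): Frame                           // step 1: load data into a frame
    def transform(in: Frame): Frame = in        // step 2: frame-to-frame transformation
    def write(table: String, out: Frame): Unit  // step 3: output the frame to a table

    // The pipeline itself is fixed; concrete writers only fill in the steps.
    final def run(table: String): Unit = write(table, transform(load()))
  }

  // ASSUMPTION: a mutable map standing in for the Hive warehouse.
  val tables = scala.collection.mutable.Map.empty[String, Frame]

  // Hypothetical CSV writer: the first line is the header, the rest are records.
  class CsvWriter(lines: Seq[String]) extends TableWriter {
    def load(): Frame = {
      val header = lines.head.split(",").map(_.trim)
      lines.tail.map(l => header.zip(l.split(",").map(_.trim)).toMap)
    }

    // Example transformation: normalize column names to lower case.
    override def transform(in: Frame): Frame =
      in.map(_.map { case (k, v) => k.toLowerCase -> v })

    // Append semantics, similar in spirit to a save mode of "append".
    def write(table: String, out: Frame): Unit =
      tables(table) = tables.getOrElse(table, Seq.empty[Row]) ++ out
  }
}
```

A writer for another source would override only `load` (and optionally `transform`), which is the property the class hierarchy on the next slide relies on.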
  • 22. TABLE WRITER JOBS [Class hierarchy diagram labels: SparkJob; HiveTableWriter; CoreTableWriter (mixin); JDBCToHiveTableWriter; FileToHiveTableWriter; AbstractCSVHiveTableWriter; AbstractJSONHiveTableWriter; CSVTableWriter; JSONTableWriter; Customized CSV Job; Customized JSON Job; HBaseToHiveTableWriter; HBaseSnapshotJob; TableToHiveTableWriter; TableSnapshotJob; Aggregate Jobs] • All jobs share the same configuration loading, job state, and error reporting • All table writers have Dynamic DDL capabilities; once the data becomes a DataFrame, they all behave the same • CSV and JSON have different loaders • A different loader is needed to load HBase records into a DataFrame
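The deck does not show how the Dynamic DDL capability works internally. One plausible reading, sketched below purely as an assumption, is that a writer compares the incoming frame's columns with the target table's columns and issues a Hive `ALTER TABLE ... ADD COLUMNS` statement for anything new. `DynamicDdlSketch`, `Schema`, and `addColumnsDdl` are hypothetical names, and type changes or dropped columns are ignored.

```scala
object DynamicDdlSketch {
  // Ordered (columnName, hiveType) pairs. ASSUMPTION: a simplified schema model.
  type Schema = Seq[(String, String)]

  // Returns the DDL needed to make `table` accept `incoming`, or None if
  // every incoming column already exists (column names compared case-insensitively).
  def addColumnsDdl(table: String, existing: Schema, incoming: Schema): Option[String] = {
    val known = existing.map(_._1.toLowerCase).toSet
    val missing = incoming.filterNot { case (name, _) => known(name.toLowerCase) }
    if (missing.isEmpty) None
    else {
      val cols = missing.map { case (name, tpe) => s"$name $tpe" }.mkString(", ")
      Some(s"ALTER TABLE $table ADD COLUMNS ($cols)")
    }
  }
}
```

Because the check runs on the frame's schema rather than on the source format, it would apply equally to CSV, JSON, JDBC, or HBase inputs once they have been loaded.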
  • 23. HIVE TABLE WRITER JOB
        trait HiveTableWriter extends CoreHiveTableWriter with SparkJob {
          def run(sc: SparkContext, jobType: String, jobName: String, config: Config)
          def load(sqlContext: SQLContext, ioInfos: Seq[(String, Seq[InputOutputInfo])]): Seq[(InputOutputInfo, DataFrame)]
          def initProcess(sqlContext: SQLContext, jobTypeConfig: Config, jobConfig: Config)
          def preProcess(hadoopConf: Configuration, ioInfos: Seq[InputOutputInfo]): Seq[InputOutputInfo]
          def process(jobName: String, sqlContext: SQLContext, ioInfos: Seq[InputOutputInfo], jobTypeConfig: Config, jobConfig: Config)
          def postProcess(….)
          def getInputOutputInfos(sc: SparkContext, jobName: String, jobTypeConfig: Config, jobConfig: Config): Seq[InputOutputInfo]
          def groupIoInfos(ioInfos: Seq[InputOutputInfo]): Seq[(String, Seq[InputOutputInfo])]
        }
  • 24. ETL JOB CONFIGURATION
        gopro.dse.config.etl {
          mobile-job {
            conf {}
            process {}
            input {}
            output {}
            post.process {}
          }
        }
        include classpath("conf/production/etl_mobile_quik.conf")
        include classpath("conf/production/etl_mobile_capture.conf")
        include classpath("conf/production/etl_mobile_product_events.conf")
        Callouts: job-level conf overrides JobType conf; job specifics include JobType, JobName, and input & output specification
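The callout "job-level conf overrides JobType conf" describes a layered lookup. A minimal sketch of that rule follows, assuming flat key/value maps in place of parsed HOCON sections; `ConfOverrideSketch` and `effective` are hypothetical names, and the keys in the usage note are only examples.

```scala
object ConfOverrideSketch {
  // ASSUMPTION: flat key -> value settings standing in for a parsed config section.
  type Conf = Map[String, String]

  // Keys set at the job level win; anything the job omits falls back
  // to the value defined for its job type.
  def effective(jobTypeConf: Conf, jobConf: Conf): Conf = jobTypeConf ++ jobConf
}
```

For example, a job type that sets `file.format = parquet` and `save.mode = append` combined with a job that sets only `save.mode = overwrite` yields `parquet` and `overwrite`. If the `Config` type in the deck is Typesafe Config (an assumption), the equivalent call would be `jobConfig.withFallback(jobTypeConfig)`.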
  • 25. ETL JOB CONFIGURATION
        xyz {
          process {}
          input {
            delimiter = ","
            inputDirPattern = "s3a://teambucket/xyz/raw/production"
            file.ext = "csv"
            file.format = "csv"
            date.format = "yyyy-MM-dd hh:mm:ss"
            table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName"
          }
          output {
            database = "mobile"
            file.format = "parquet"
            date.format = "yyyy-MM-dd hh:mm:ss"
            partitions = 2
            file.compression.codec.key = "spark.sql.parquet.compression.codec"
            file.compression.codec.value = "gzip"
            save.mode = "append"
            transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer]
          }
          post.process {
            deleteSource = true
          }
        }
        Callouts: JobName; input specification; output specification; save mode
  • 26. TABLE NAME GENERATION • Files need to go to the proper tables • Table Name Extractor • From File Name • From Directory Name • Custom Plugin
  • 27. EXTRACT TABLE NAMES • From File Name: /databucket/3rdparty/ABC/campaign-20180212.csv; /databucket/3rdparty/ABC/campaign-20180213.csv; /databucket/3rdparty/ABC/campaign-20180214.csv ⇒ (Table Name, File Date): (campaign, 2018-02-12); (campaign, 2018-02-13); (campaign, 2018-02-14) • From Directory Name: /databucket/3rdparty/ABC/campaign/file-20180212.csv; /databucket/3rdparty/ABC/campaign/file-20180213.csv; /databucket/3rdparty/ABC/campaign/file-20180214.csv ⇒ (Table Name, File Date): (campaign, 2018-02-12); (campaign, 2018-02-13); (campaign, 2018-02-14) • From ID Mapping: /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/10.log.gz; /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/11.log.gz; /databucket/ABC/2018/02/17/ae6905b068c7beb08d681a5/12.log.gz; /databucket/ABC/2018/02/18/ae6905b068c7beb08d681a5/13.log.gz • Configuration: b2a932aeddbf0f11bae9573 ⇒ mobile_ios; ae6905b068c7beb08d681a5 ⇒ mobile_android • Table extraction ⇒ (Table Name, File Date): (mobile_ios, 2017-01-11); (mobile_android, 2018-02-17); (mobile_android, 2018-02-18)
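The three extraction strategies on slide 27 can be approximated with regular expressions over the file path. The sketch below is an assumption-laden illustration, not the framework's `FromFileName` implementation: the object and method names are hypothetical, and the path layouts are inferred only from the example paths on the slide.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object TableNameExtractors {
  private val Compact = DateTimeFormatter.ofPattern("yyyyMMdd")

  // ".../campaign-20180212.csv": table name and date both come from the file name.
  private val FileName = """.*/([^/]+)-(\d{8})\.[^/.]+$""".r
  // ".../campaign/file-20180212.csv": table name comes from the parent directory.
  private val DirName = """.*/([^/]+)/[^/]*-(\d{8})\.[^/.]+$""".r
  // ".../2017/01/11/<id>/10.log.gz": date from the path, table from an ID lookup.
  private val IdPath = """.*/(\d{4})/(\d{2})/(\d{2})/([0-9a-f]+)/[^/]+$""".r

  def fromFileName(path: String): Option[(String, LocalDate)] = path match {
    case FileName(table, date) => Some((table, LocalDate.parse(date, Compact)))
    case _                     => None
  }

  def fromDirectoryName(path: String): Option[(String, LocalDate)] = path match {
    case DirName(table, date) => Some((table, LocalDate.parse(date, Compact)))
    case _                    => None
  }

  // `ids` plays the role of the configured ID-to-table mapping.
  def fromIdMapping(path: String, ids: Map[String, String]): Option[(String, LocalDate)] =
    path match {
      case IdPath(y, m, d, id) =>
        ids.get(id).map(table => (table, LocalDate.of(y.toInt, m.toInt, d.toInt)))
      case _ => None
    }
}
```

Each extractor returns `None` for a path it does not recognize, so a caller could try several strategies in turn or route unmatched files to a custom plugin.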
  • 29. DATA TRANSFORMATION • HSQL over JDBC via beeline • Suitable for non-java/scala/python-programmers • Spark Job • Requires Spark and Scala knowledge, need to setup job, configurations etc. • Dynamic Scala Scripts • Scala as script, compile Scala at Runtime, mixed with Spark SQL
  • 30. SCALA SCRIPTS • Define a special SparkJob: Spark Job Code Runner • Load Scala script files from a specified location (defined by config) • Dynamically compile the Scala code into classes • Run the Spark jobs defined by the compiled classes • Twitter Eval util: dynamically evaluates Scala strings and files • <groupId>com.twitter</groupId> <artifactId>util-eval_2.11</artifactId> <version>6.24.0</version>
  • 31. SCALA SCRIPTS
      object SparkJobCodeRunner extends SparkJob {
        private val LOG = LoggerFactory.getLogger(getClass)
        import collection.JavaConverters._

        override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
          val jobFileNames: List[String] = //...
          jobFileNames.foreach { fileName =>
            // Compile and evaluate the script file; the result should be a SparkJob instance
            val compiled: Option[Any] = evalFromFileName[Any](fileName)
            compiled.foreach {
              case job: SparkJob => job.run(sc, jobType, jobName, config)
              case _             => LOG.info("not match")
            }
          }
        }
      }
  • 32. SCALA SCRIPTS
      import com.twitter.util.Eval
      import scala.util.Try

      def evalFromFile[T](path: Path)(implicit header: String = ""): Option[T] = {
        val fs = //get Hadoop File System …
        eval(IOUtils.toString(fs.open(path), "UTF-8"))(header)
      }

      // Prepend the optional header, then compile and evaluate; None if evaluation fails
      def eval[T](code: String)(implicit header: String = ""): Option[T] =
        Try(Eval[T](header + "\n" + code)).toOption
  • 33. SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
      class CameraAggCaptureMainJob extends SparkJob {
        def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
          val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
          val cameraCleanDataSchema = … //define DataFrame Schema
          val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
            .json("s3a://databucket/camera/work/production/clean-events/final/*")
          cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
          sqlContext.sql(
            """ set hive.exec.dynamic.partition.mode=nonstrict
                set hive.enforce.bucketing=false
                set hive.auto.convert.join=false
                set hive.merge.mapredfiles=true""")
          sqlContext.sql(
            """insert overwrite table work.camera_setting_shutter_dse_on
               select row_number() over (partition by metadata_file_name order by log_ts) , …. """)
          //rest of code
        }
      }
      // Last expression of the script: the job instance handed back to the code runner
      new CameraAggCaptureMainJob
  • 36. DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS • Data Metrics Delivery • Delivery to Slack: make metrics more accessible to a broader audience • Data Slice & Dice • Leverage a real-time OLAP database (Druid) (ongoing project) • Analytics Visualization (ongoing project) • Leverage Superset and Data Management Application • BedRock: Self-Service & Data Management (ongoing project) • Pipeline Monitoring • Product Analytics Visualization • Self-service Ingestion • ML Feature Visualization
  • 37. DATA VISUALIZATION & MANAGEMENT (work in progress) [Diagram labels: Batch Ingestion, Streaming Ingestion, Spark Cluster (new or existing cluster), Spark Cluster (long-running cluster), Metrics, Output Metrics, BedRock]
  • 39. SLACK METRICS DELIVERY [Screenshot of a Slack metrics message; content redacted]
  • 40. SLACK METRICS DELIVERY • Why Slack? • Push vs. pull -- easy access • Avoid another login when viewing metrics • When Slack is connected, you are already logged in • Move key metrics away from Tableau dashboards and put metrics generation into the software engineering process • SQL code is under software control • The publishing job is scheduled and its performance is monitored • Discussion, questions, and comments on specific metrics can happen directly in the channel with the people involved.
  • 41. SLACK DELIVERY FRAMEWORK • Slack Metrics Delivery Framework • Configuration Driven • Multiple private Channels : Mobile/Cloud/Subscription/Web etc. • Daily/Weekly/Monthly Delivery and comparison • New metrics can be added easily with new SQL and configurations
  • 42. SLACK METRICS CONCEPTS • Slack Job ⇒ • Channels (private channels) ⇒ • Metrics Groups ⇒ • Metrics1 • … • MetricsN • Main Query • Compare Query (optional) • Chart Query (optional) • Persistence (optional) • Hive + S3 • Additional deliveries (optional) • Kafka • Other cache stores (HTTP POST)
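The hierarchy on this slide (job ⇒ channels ⇒ metrics groups ⇒ metrics, each metric with a main query and optional compare and chart queries) could be modeled roughly as below. All type and field names are assumptions made for illustration; the deck does not show the framework's real classes.

```scala
object SlackMetricsModel {
  // One metric: a required main query plus optional comparison and chart queries.
  final case class Metric(
      name: String,
      mainQuery: String,
      compareQuery: Option[String] = None,
      chartQuery: Option[String] = None
  )

  // Metrics in the same group are delivered together in one Slack message.
  final case class MetricsGroup(name: String, metrics: List[Metric])

  // A (private) channel receives one message per metrics group.
  final case class Channel(name: String, groups: List[MetricsGroup])

  // A scheduled job, optionally persisting the computed metrics (for example to Hive + S3).
  final case class SlackJob(name: String, persistMetrics: Boolean, channels: List[Channel])

  // Number of Slack messages each channel would receive for one run of the job.
  def messagesPerChannel(job: SlackJob): Map[String, Int] =
    job.channels.map(c => c.name -> c.groups.size).toMap
}
```

Keeping the optional queries as `Option` values mirrors the "(optional)" markers on the slide: a metric with only a main query is still valid.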
  • 43. SLACK KPI DELIVERY ARCHITECTURE [Diagram labels: Spark Job, Metrics JSON, HTTP POST, REST API Server, Generate Graph, Plot.ly JSON, Save/Get Image, Return Image, Get Image URL, Save Metrics to Hive Table, Slack Message JSON, HTTP POST, Webhooks, Slack]
  • 44. CONFIGURATION-DRIVEN
      slack-plus-push-weekly { //job name
        persist-metrics="true"
        channels {
          dse-metrics {
            post-urls {
              plus-metrics = "https://hooks.slack.com/services/XXXX"
              dse-metrics-test = "https://hooks.slack.com/services/XXXX"
            }
            plus-metrics { //metrics group
              //metrics in the same group will be delivered together in one message
              //metrics in different groups will be delivered as separate messages
              //overwrite above template with specific name
            }
          }
        }
      } //slack-plus-push-weekly
  • 45. SLACK METRICS CONFIGURATION
      slack-mobile-push-weekly.channels.mobile-metrics.capture-metrics { //Job, Channel, KPI Group
        //…
        weekly-capture-users-by-platform { //metrics name
          slack-display.attachment.title = "GoPro Mobile App -- Users by Platform"
          metric-period = "weekly"
          slack-display.chartstyle { … }
          query = """ … """
          compare.query = """ … """
          chart.query = """ … """
        }
        //rest of configuration
      }
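To make the delivery step concrete, here is a minimal sketch of how a metrics group might be turned into a Slack incoming-webhook message, with two-column fields and a period-over-period comparison. It assumes Slack's legacy attachment fields (`title`, `fields` with `short`, `image_url`); the object, case class, and function names are hypothetical, the hand-rolled JSON escaping is deliberately minimal, and none of this is the actual GoPro implementation.

```scala
import java.net.{HttpURLConnection, URI}
import java.nio.charset.StandardCharsets
import java.util.Locale

object SlackPayloadSketch {
  // Hypothetical result row: current value plus the optional value from the compare query.
  final case class MetricValue(label: String, current: Double, previous: Option[Double])

  // Minimal JSON string escaping (backslash, quote, newline only).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\""

  // Builds a JSON object from key -> already-serialized-value pairs.
  private def obj(pairs: (String, String)*): String =
    pairs.map { case (k, v) => quote(k) + ":" + v }.mkString("{", ",", "}")

  // Signed percentage change against the previous period, e.g. "+10.0%".
  def percentChange(current: Double, previous: Double): String =
    "%+.1f%%".formatLocal(Locale.ROOT, (current - previous) * 100.0 / previous)

  // One short field; Slack lays short fields out side by side in two columns.
  def field(m: MetricValue): String = {
    val change = m.previous.filter(_ != 0.0)
      .map(p => " (" + percentChange(m.current, p) + ")").getOrElse("")
    val value = "%.0f".formatLocal(Locale.ROOT, m.current) + change
    obj("title" -> quote(m.label), "value" -> quote(value), "short" -> "true")
  }

  // One attachment per metrics group, optionally pointing at a rendered chart image.
  def payload(title: String, metrics: Seq[MetricValue], imageUrl: Option[String]): String = {
    val base = Seq("title" -> quote(title), "fields" -> metrics.map(field).mkString("[", ",", "]"))
    val attachment = obj(base ++ imageUrl.map(u => "image_url" -> quote(u)): _*)
    obj("attachments" -> ("[" + attachment + "]"))
  }

  // Sends the payload to a webhook URL and returns the HTTP status code.
  def post(webhookUrl: String, json: String): Int = {
    val conn = URI.create(webhookUrl).toURL.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    try out.write(json.getBytes(StandardCharsets.UTF_8)) finally out.close()
    conn.getResponseCode
  }
}
```

A production version would more likely use a JSON library than string concatenation; the manual builder is used here only to keep the sketch dependency-free.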
  • 46. SLACK DELIVERY BENEFITS • Pros • Quick and easy access via Slack • Can quickly deliver to engineering managers, executives, business owners, and product managers • 100+ members have subscribed to different channels since we launched the service • Cons • Limited by Slack UI real estate: can only display key metrics in a two-column format, so it is only suitable for high-level summary metrics
  • 49. FEATURE VISUALIZATION • Explore Feature Visualization via Google Facets • Part 1 : Overview • Part 2: Dive • What is Facets Overview ?
  • 50. FACETS OVERVIEW INTRODUCTION • From Facets Home Page • https://pair-code.github.io/facets/ • Facets Overview "takes input feature data from any number of datasets, analyzes them feature by feature and visualizes the analysis." • Overview can help uncover issues with datasets, including the following: • Unexpected feature values • Missing feature values for a large number of examples • Training/serving skew • Training/test/validation set skew • Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets. • Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red. • Features can be sorted by values of interest such as the number of missing values or the skew between the different datasets.
  • 52. FACETS OVERVIEW IMPLEMENTATIONS • The Facets Overview implementation consists of • Feature Statistics Protocol Buffer definition • Feature Statistics Generation • Visualization • Visualization • The visualizations are implemented as Polymer web components, backed by TypeScript code • They can be embedded into Jupyter notebooks or webpages. • Feature Statistics Generation • There are two implementations for the stats generation: Python and JavaScript • Python: uses numpy and pandas to generate stats • JavaScript: uses JavaScript to generate stats • Both implementations run stats generation in the browser
  • 54. FEATURE OVERVIEW SPARK • Initial exploration attempt • Is it possible to generate stats for larger datasets while keeping the stats size small? • Can we generate stats by leveraging the distributed computing capability of Spark instead of just using one node? • Can we generate the stats in Spark and then have them used by Python and/or JavaScript?
  • 55. FACETS OVERVIEW + SPARK ScalaPB
  • 56. PREPARE SPARK DATA FRAME
      case class NamedDataFrame(name: String, data: DataFrame)
      val features = Array("Age", "Workclass", ….)
      val trainData: DataFrame = loadCSVFile("./adult.data.csv")
      val testData = loadCSVFile("./adult.test.txt")
      val train = trainData.toDF(features: _*)
      val test = testData.toDF(features: _*)
      val dataframes = List(NamedDataFrame(name = "train", data = train),
                            NamedDataFrame(name = "test", data = test))
  • 57. SPARK FACETS STATS GENERATOR val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList()) val proto = generator.protoFromDataFrames(dataframes) persistProto(proto)
  • 58. SPARK FACETS STATS GENERATOR def protoFromDataFrames(dataFrames: List[NamedDataFrame], features : Set[String] = Set.empty[String], histgmCatLevelsCount:Option[Int]=None): DatasetFeatureStatisticsList
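As an illustration of the kind of per-feature summary such a generator has to produce, the sketch below computes a few numeric statistics (count, missing, zeros, min, max, mean) in a single pass over plain Scala collections. It is not the `FeatureStatsGenerator` from the slides: the names are hypothetical, and the choice of statistics is an assumption about what a Facets-style overview needs. The `add`/`merge` pair has the shape that Spark's `RDD.aggregate(zeroValue)(seqOp, combOp)` expects, which is one way the same idea could be distributed.

```scala
object NumericStatsSketch {
  // Running summary for one numeric feature.
  final case class NumericStats(
      count: Long, missing: Long, zeros: Long, min: Double, max: Double, sum: Double
  ) {
    // Mean over non-missing values; NaN when there are none.
    def mean: Double = if (count == 0) Double.NaN else sum / count

    // Combines two partial summaries (for example from two partitions).
    def merge(o: NumericStats): NumericStats =
      NumericStats(count + o.count, missing + o.missing, zeros + o.zeros,
        math.min(min, o.min), math.max(max, o.max), sum + o.sum)
  }

  val empty: NumericStats =
    NumericStats(0L, 0L, 0L, Double.PositiveInfinity, Double.NegativeInfinity, 0.0)

  // Folds one value (None = missing) into the running summary.
  def add(s: NumericStats, v: Option[Double]): NumericStats = v match {
    case None => s.copy(missing = s.missing + 1)
    case Some(x) =>
      s.copy(
        count = s.count + 1,
        zeros = s.zeros + (if (x == 0.0) 1 else 0),
        min = math.min(s.min, x),
        max = math.max(s.max, x),
        sum = s.sum + x
      )
  }

  // Single pass over the feature values.
  def compute(values: Iterable[Option[Double]]): NumericStats =
    values.foldLeft(empty)(add)
}
```

Because `merge` is associative, partial summaries can be combined in any order, so every feature can in principle be summarized in one scan of the data rather than one scan per statistic.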
  • 61. INITIAL FINDINGS • Implementation • The first-pass implementation is not efficient • We have to go through each feature in multiple passes; as the number of features increases, performance suffers, which limits the number of features that can be used • The size of the dataset used to generate stats also determines the size of the generated protobuf file • I haven't dug deeper into what contributes to the change in size • The combination of data size and number of features can produce a large file, which won't fit in the browser • With Spark DataFrames, we can't support TensorFlow Records • The Base64-encoded protobuf string can be loaded by Python or JavaScript • The protobuf binary file can also be loaded by Python • But it somehow cannot be loaded by JavaScript.
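The Base64 hand-off mentioned above can be sketched with the JDK's `java.util.Base64`. The byte array stands in for the serialized statistics proto (with ScalaPB that would typically come from the message's `toByteArray`); the object and function names are hypothetical, and the size check is only an illustrative guard, not a limit taken from the deck.

```scala
import java.util.Base64

object ProtoEncoding {
  // Encodes serialized proto bytes as a Base64 string for a notebook or web page.
  def toBase64(bytes: Array[Byte]): String =
    Base64.getEncoder.encodeToString(bytes)

  // Decodes the string back into the original bytes.
  def fromBase64(encoded: String): Array[Byte] =
    Base64.getDecoder.decode(encoded)

  // Base64 output length: 4 characters for every 3 input bytes, rounded up.
  def encodedLength(byteCount: Long): Long =
    4L * ((byteCount + 2L) / 3L)

  // Illustrative guard: would the encoded stats stay under a caller-chosen budget?
  def fitsBudget(byteCount: Long, maxEncodedChars: Long): Boolean =
    encodedLength(byteCount) <= maxEncodedChars
}
```

Base64 inflates the payload by roughly one third, which matters when the concern is whether the generated stats fit in a browser.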
  • 62. WHAT’S NEXT? • Improve implementation performance • When we have a lot of data and features, what is the right input size to generate stats that can still be loaded into a browser or notebook? • For example, in one experiment: 300 features ⇒ 200 MB • How do we efficiently partition the features so that they remain viewable? • Data is changing: how can we incrementally update the stats on a regular basis? • How do we integrate this into production?
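One naive answer to the partitioning question is to split the feature list into fixed-size groups and generate one stats file per group. The sketch below shows only that splitting step; the group size, the names, and the idea of one file per group are assumptions for illustration, not an approach the deck commits to.

```scala
object FeaturePartitioner {
  // Splits the feature names into consecutive groups of at most `groupSize` features.
  def partition(features: List[String], groupSize: Int): List[List[String]] = {
    require(groupSize > 0, "groupSize must be positive")
    features.grouped(groupSize).toList
  }
}
```

With 300 features and a hypothetical group size of 50, this yields six groups, each of which could be passed as the `features` set of a separate stats-generation call.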
  • 63. FINAL THOUGHTS
  • 64. FINAL THOUGHTS • We are still in the early stage of the Data Platform evolution. • We will continue to share our experience with you along the way. • Questions? Thank you. Chester Chen, Ph.D. Data Science & Engineering, GoPro