SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
1
OCTOBER 2015
Putting the Spark into Functional
Fashion Tech Analytics
Gareth Rogers Data Engineer November 2018
2
● Who are Metail and what we do
● Our data pipeline
● What is Apache Spark and our experiences
Introduction
3
Metail
Making clothing fit for all
4
● Create your MeModel with just a
few clicks
● See how the clothes look on you
● Primarily clickstream analysis
● Understanding the user journey
http://trymetail.com
The Metail Experience
5
Composed Photography
With Metail
Shoot model once
Choose poses
Style & restyle
Compositing
Shoot clothes
at source
● Understanding process flow
● Discovering bottlenecks
● Optimising the workflow
● Understanding costs
6
● Our pipeline is heavily influenced by functional programmings
paradigms
● Immutable data structures
● Declarative
● Pure functions -- effects only dependent on input state
● Minimise side effects
Functional Pipeline
7
Metail’s Data Pipeline
• Batch pipeline modelled on a lambda architecture
8
• Batch pipeline modelled on a lambda architecture
● Immutable datasets
● Batch layer append only
● Rebuild views rather than edit
● Serving layer for visualization
● Speed layer samples input data
○ Kappa architecture
Metail’s Data Pipeline - The lambda architecture
9
• Managed by applications written in Clojure
Metail’s Data Pipeline - Driving the Pipeline
10
• Clojure is predominately a functional programming language
• Aims to be approachable
• Allow interactive development
• A lisp programming language
• Everything is dynamically typed
• Runs on the JVM or compiled to JavaScript
Clojure
11
• Running on the JVM
– access to all the Java ecosystem and learnings
– Java interop well supported
– not Java though
Metail’s Data Pipeline - Driving the Pipeline
12
Metail’s Data Pipeline - Driving the Pipeline
Snowplow Analytics (shameful plug, we
use their platform and I like the founders)
13
Metail’s Data Pipeline - Driving the Pipeline
14
• Transformed by Clojure, Spark in Clojure or SQL
• Clojure used for datasets with well defined size
– These easily run on a single JVM
– Dataset size always within an order of magnitude
• Spark
– Dataset sizes can vary over a few orders of magnitude
• SQL is typically used in the serving layer and for dashboarding
and BI tools
Metail’s Data Pipeline - Transforming the Data
15
Metail’s Data Pipeline
16
Metail’s Data Pipeline
To analytics
dashboards
https://looker.com/
17
• Spark is a general purpose distributed data processing engine
– Can scale from processing a single line to terabytes of data
• Functional paradigm
– Declarative
– Functions are first class
– Immutable datasets
• Written in Scala a JVM based language
– Just like Clojure
– Has a Java API
– Clojure has great Java interop
Apache Spark
18
Apache Spark
https://databricks.com/spark/about
19
Cluster: YARN/Meso/Kubernetes
Apache Spark
Driver
●
●
●
Worker 0
RDD
Partition
Code
Worker 1
RDD
Partition
Code
Worker n
RDD
Partition
Code
Live code or
compiled code
Cluster
Manager
Local machine
Master
20
Cluster: YARN/Meso/Kubernetes
Apache Spark
Driver
●
●
●
Worker 0
RDD
Partition
Code
Worker 1
RDD
Partition
Code
Worker n
RDD
Partition
Code
Live code or
compiled code
Cluster
Manager
Local machine
Master
• Consists of a driver and multiple workers
• Declare a Spark session and build a job graph
• Driver coordinates with a cluster to execute the graph on the workers
21
Cluster: YARN/Meso/Kubernetes
Apache Spark
Driver
●
●
●
Worker 0
RDD
Partition
Code
Worker 1
RDD
Partition
Code
Worker n
RDD
Partition
Code
Live code or
compiled code
Cluster
Manager
Local machine
Master
• Operates on in-memory datasets
– where possible, it will spill to disk
• Based on Resilient Distributed Datasets (RDDs)
• Each RDD represents on dataset
• Split into multiple partitions distributed through
the cluster
22
• Operates on a Directed
Acyclic Graph (DAG)
– Link together data
processing steps
– May be dependent on
multiple parent steps
– Transforms cannot
return to an older
partition
Apache Spark
23
• Using the Sparkling Clojure library
– This wraps the Spark Java API
– Handles Clojure data structure and function serialisation
• For example counting user-sessions
– This is simple but doesn’t scale, returns everything to driver
memory
Metail and Spark
24
• Two variants that return PairRRDs
• combineByKey is more general than aggregateByKey
• Note I did the Java interop as Sparkling doesn’t cover this API
– Easy when you can use their serialization library
Metail and Spark
25
• Mostly using the core API
• RDDs holding rows of Clojure maps
• Would like to migrate to the Dataset API
Metail and Spark
26
• Runs on AWS Elastic MapReduce
– Tune your cluster
• This is an example for my limited VM, a cluster would use
bigger values!
Metail and Spark
27
• Very scalable
– but the distributed environment adds overhead
• Sparkling does a lot of the hard work
– Clojure not a supported language
• Good documentation
– Sometimes it’s hard to figure out which way to do
something
– Lots of deprecated methods
• Declarative language + Clojure interop makes stacktraces hard
to interpret
• Dataset API is heavily optimised
– Would remove a lot of the Clojure interop
Metail and Spark - Pros and Cons
28
• Metail is making clothing fit for all
• We’re incorporating metrics derived from our collected data
• We have several pipelines collecting, transforming and
visualising our data
• When dealing with datasets functional programming offers
many advantages
• Give them a go!
– https://github.com/gareth625/lhcb-opendata
Summary
29
• Learning Clojure: https://www.braveclojure.com/
• A random web based Clojure REPL: https://repl.it/repls/
• Basis of Metail Experience pipeline https://snowplowanalytics.com
• Dashboarding and SQL warehouse management: https://looker.com
• Tuning your Spark cluster
http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
Resources
30
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIdatamantra
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Building a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with AzkabanBuilding a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with AzkabanDataWorks Summit
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewVaclav Kosar
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
Ruby to Scala in 9 weeks
Ruby to Scala in 9 weeksRuby to Scala in 9 weeks
Ruby to Scala in 9 weeksjutley
 
Alpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka StreamsAlpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka StreamsKnoldus Inc.
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介LINE Corporation
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceDatabricks
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Zalando Technology
 
Metrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability TeamMetrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability TeamLINE Corporation
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Lviv Startup Club
 
Stream Processing with Kafka and Samza
Stream Processing with Kafka and SamzaStream Processing with Kafka and Samza
Stream Processing with Kafka and SamzaDiego Pacheco
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowDatabricks
 

Was ist angesagt? (20)

Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Building a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with AzkabanBuilding a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with Azkaban
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture Overview
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Ruby to Scala in 9 weeks
Ruby to Scala in 9 weeksRuby to Scala in 9 weeks
Ruby to Scala in 9 weeks
 
Apache Flink
Apache FlinkApache Flink
Apache Flink
 
Alpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka StreamsAlpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka Streams
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 
Metrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability TeamMetrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability Team
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"
 
Stream Processing with Kafka and Samza
Stream Processing with Kafka and SamzaStream Processing with Kafka and Samza
Stream Processing with Kafka and Samza
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
 

Ähnlich wie Putting the Spark into Functional Fashion Tech Analytics

Metail and Elastic MapReduce
Metail and Elastic MapReduceMetail and Elastic MapReduce
Metail and Elastic MapReduceGareth Rogers
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauSam Palani
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksSarah Dutkiewicz
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0datamantra
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1Ivan Ma
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentItai Yaffe
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudMark Rittman
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDatabricks
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginnersTobias Koprowski
 

Ähnlich wie Putting the Spark into Functional Fashion Tech Analytics (20)

Metail and Elastic MapReduce
Metail and Elastic MapReduceMetail and Elastic MapReduce
Metail and Elastic MapReduce
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Spark
SparkSpark
Spark
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure Databricks
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle Cloud
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
 

Kürzlich hochgeladen

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 

Kürzlich hochgeladen (20)

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 

Putting the Spark into Functional Fashion Tech Analytics

  • 1. 1 OCTOBER 2015 Putting the Spark into Functional Fashion Tech Analytics Gareth Rogers Data Engineer November 2018
  • 2. 2 ● Who are Metail and what we do ● Our data pipeline ● What is Apache Spark and our experiences Introduction
  • 4. 4 ● Create your MeModel with just a few clicks ● See how the clothes look on you ● Primarily clickstream analysis ● Understanding the user journey http://trymetail.com The Metail Experience
  • 5. 5 Composed Photography With Metail Shoot model once Choose poses Style & restyle Compositing Shoot clothes at source ● Understanding process flow ● Discovering bottlenecks ● Optimising the workflow ● Understanding costs
  • 6. 6 ● Our pipeline is heavily influenced by functional programmings paradigms ● Immutable data structures ● Declarative ● Pure functions -- effects only dependent on input state ● Minimise side effects Functional Pipeline
  • 7. 7 Metail’s Data Pipeline • Batch pipeline modelled on a lambda architecture
  • 8. 8 • Batch pipeline modelled on a lambda architecture ● Immutable datasets ● Batch layer append only ● Rebuild views rather than edit ● Serving layer for visualization ● Speed layer samples input data ○ Kappa architecture Metail’s Data Pipeline - The lambda architecture
  • 9. 9 • Managed by applications written in Clojure Metail’s Data Pipeline - Driving the Pipeline
  • 10. 10 • Clojure is predominately a functional programming language • Aims to be approachable • Allow interactive development • A lisp programming language • Everything is dynamically typed • Runs on the JVM or compiled to JavaScript Clojure
  • 11. 11 • Running on the JVM – access to all the Java ecosystem and learnings – Java interop well supported – not Java though Metail’s Data Pipeline - Driving the Pipeline
  • 12. 12 Metail’s Data Pipeline - Driving the Pipeline Snowplow Analytics (shameful plug, we use their platform and I like the founders)
  • 13. 13 Metail’s Data Pipeline - Driving the Pipeline
  • 14. 14 • Transformed by Clojure, Spark in Clojure or SQL • Clojure used for datasets with well defined size – These easily run on a single JVM – Dataset size always within an order of magnitude • Spark – Dataset sizes can vary over a few orders of magnitude • SQL is typically used in the serving layer and for dashboarding and BI tools Metail’s Data Pipeline - Transforming the Data
  • 16. 16 Metail’s Data Pipeline To analytics dashboards https://looker.com/
  • 17. 17 • Spark is a general purpose distributed data processing engine – Can scale from processing a single line to terabytes of data • Functional paradigm – Declarative – Functions are first class – Immutable datasets • Written in Scala a JVM based language – Just like Clojure – Has a Java API – Clojure has great Java interop Apache Spark
  • 19. 19 Cluster: YARN/Meso/Kubernetes Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Master
  • 20. 20 Cluster: YARN/Meso/Kubernetes Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Master • Consists of a driver and multiple workers • Declare a Spark session and build a job graph • Driver coordinates with a cluster to execute the graph on the workers
  • 21. 21 Cluster: YARN/Meso/Kubernetes Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Master • Operates on in-memory datasets – where possible, it will spill to disk • Based on Resilient Distributed Datasets (RDDs) • Each RDD represents on dataset • Split into multiple partitions distributed through the cluster
  • 22. 22 • Operates on a Directed Acyclic Graph (DAG) – Link together data processing steps – May be dependent on multiple parent steps – Transforms cannot return to an older partition Apache Spark
  • 23. 23 • Using the Sparkling Clojure library – This wraps the Spark Java API – Handles Clojure data structure and function serialisation • For example counting user-sessions – This is simple but doesn’t scale, returns everything to driver memory Metail and Spark
  • 24. 24 • Two variants that return PairRRDs • combineByKey is more general than aggregateByKey • Note I did the Java interop as Sparkling doesn’t cover this API – Easy when you can use their serialization library Metail and Spark
  • 25. 25 • Mostly using the core API • RDDs holding rows of Clojure maps • Would like to migrate to the Dataset API Metail and Spark
  • 26. 26 • Runs on AWS Elastic MapReduce – Tune your cluster • This is an example for my limited VM, a cluster would use bigger values! Metail and Spark
  • 27. 27 • Very scalable – but the distributed environment adds overhead • Sparkling does a lot of the hard work – Clojure not a supported language • Good documentation – Sometimes it’s hard to figure out which way to do something – Lots of deprecated methods • Declarative language + Clojure interop makes stacktraces hard to interpret • Dataset API is heavily optimised – Would remove a lot of the Clojure interop Metail and Spark - Pros and Cons
  • 28. 28 • Metail is making clothing fit for all • We’re incorporating metrics derived from our collected data • We have several pipelines collecting, transforming and visualising our data • When dealing with datasets functional programming offers many advantages • Give them a go! – https://github.com/gareth625/lhcb-opendata Summary
  • 29. 29 • Learning Clojure: https://www.braveclojure.com/ • A random web based Clojure REPL: https://repl.it/repls/ • Basis of Metail Experience pipeline https://snowplowanalytics.com • Dashboarding and SQL warehouse management: https://looker.com • Tuning your Spark cluster http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/ Resources