SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Downloaden Sie, um offline zu lesen
Conquering the Lambda architecture in LinkedIn metrics
platform with Apache Calcite and Apache Samza
​Khai Tran
​Staff Software Engineer
Agenda
● Overview of LinkedIn metrics platform
● Moving from offline to nearline
● Under the hood of the nearline architecture
● Nearline production usecase
● Conclusion
Overview of LinkedIn metrics platform
Metrics @ LinkedIn
● Metrics = Measurements over tracking data
● Crucial for decision making:
○ Experimentation - test everything
○ Reporting - monitor and alert
○ In production, site-facing applications
We provide:
● A trusted repository of metrics
● A self-serve platform for
sustainable lifecycle of metrics
In
production
Experimentation
Reporting
Primary Data
Unified
Metrics
Platform
LinkedIn unified metrics platform (UMP)
Growth of UMP Metrics
2016 20172015
6800
4680
1100
Current: 8000+ metrics
# code
LOAD …
# data
# transformation
# code
STORE …
# config
Metrics:
- A = SUM(A’)
- B = Unique(id)
Downstream:
- XLNT
- Raptor
UMP
User Code
Platform
Generated
Code
To
App
To
App
DefineDeclare
Onboard
Data
Metadata
Onboarding process
User
Moving from offline to nearline
Offline computation flows
Hourly job latency: 3-6 hours -> want realtime/nearline
......
Metric union
User code
User code
Cubing, Rollup
Dimension
augmentation
HDFS tables
Dali views
Pinot,
Presto
Azkaban execution
Espresso,
Oracle,
MySQL
...
What we want for nearline flows
Metric unionUser code
User code
Samza job
Dimension
augmentation
Pinot
Latency is not the only requirement
Easy to onboard ● Minimum effort to convert existing offline into nearline
● Easy to write user code for new nearline flows
Easy to maintain ● Just one version of user code - single source of truth
● Run as a service
Latency ● ~5 - 30 mins
Samza jobs
Putting things together
Pinot
Batch
jobs
UMP realtime platform
UMP offline platform
HDFS
Raptor
Lambda architecture with a single codebase
code configMetrics
definition
Current support
User code in Pig ● LOAD, STORE
● FILTER, SAMPLE, SPLIT, UNION
● Simple FOREACH
● JOIN - all semantics
● GROUP/COGROUP, DISTINCT
● Record/Array FLATTEN
● Java UDFs, Python UDFs
● Pig Nested FOREACH and sort/limit (in Windows)
● Hive
Not yet
Under the hood of the nearline architecture
Pig to Samza through SQL processing
Open source framework for building dynamic
data management systems. Including:
➢ SQL Parser
➢ Relational algebra APIs
➢ Query planning engine
We built UMP nearline with:
➢ Pig’s Grunt parser
➢ Calcite relational algebra
➢ Calcite query planning engine
Architecture
...
Metric union
User code
User code
Dimension
augmentation
Calcite relational
algebra as an IR
convert generate
Samza code
optimize
Samza
physical plan
Samza
configuration
Pig to Calcite Calcite to Samza
Pig to Calcite
# code
LOAD …
LOAD ...
COGROUP ...
STORE …
GruntParser
CO-
GROUP
LOAD LOAD
PigRelConverter
FULL
OUTERJ
OIN
AGGRE
GATE
AGGRE
GATE
TABLE
SCAN
TABLE
SCAN
PRO-
JECT
User scripts Pig Logical Plan Calcite relational algebra
Example
Example
Example
INNER
JOIN
FILTER FILTER
PROJECT PROJECT
PROJECT
TABLE
SCAN
TABLE
SCAN
Calcite logical plan
Planning/Optimization
➢ Calcite logical plans:
○ Relational algebra: What to do
➢ Samza physical plans:
○ Samza physical node: How to do it
➢ Calcite Samza planner:
○ Calcite logical plan -> optimized Samza physical plan
Example
Stream-
Stream Self
Join
Samza
Project
Samza
Project
Samza
Filter
Samza
Filter
Samza
Project
Input
Stream
INNER
JOIN
FILTER FILTER
PROJECT PROJECT
PROJECT
TABLE
SCAN
TABLE
SCAN
Calcite Samza
planner
Calcite logical plan Samza physical plan
Code-gen
From Samza physical plans:
➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs .
Mapping:
➢ Samza physical nodes -> corresponding stream APIs:
○ Samza project -> stream.map()
○ Samza filter -> stream.filter()
○ ...
➢ Relational expressions -> lamba functions:
○ Filter expressions -> filter() functions
○ Project expressions -> map() functions
○ ...
Example
Schema and UDF declarations
Operator mapping
Filter functions
Map functions
Produce to Kafka
...
Config-gen
Stream
Stream
Join
Samza
Project
Samza
Project
Samza
Filter
Samza
Filter
Samza
Project
Input
Stream
# dataset.conf
app-src
app-def
Nearline production use case - Storylines
Top stories picked
up by editors
Feedback to editor - powered by UMP realtime
Conclusion
Samza jobs
From improved Lambda architecture...
Pinot
Batch
jobs
UMP realtime platform
UMP offline platform
HDFS
Raptor
Lambda architecture with a single codebase
code configMetrics
definition
… to our bigger picture
Pig Latin
Calcite
relational
algebra
HiveQL
SparkSQL/
RDD
Presto SQL
Portable
UDFs
AORA (Author Once, Run Anywhere) architecture
Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

Weitere ähnliche Inhalte

Was ist angesagt?

Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practicesconfluent
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinDatabricks
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Mother of Language`s Langchain
Mother of Language`s LangchainMother of Language`s Langchain
Mother of Language`s LangchainJun-hang Lee
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningKai Wähner
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Graylog Engineering - Design Your Architecture
Graylog Engineering - Design Your ArchitectureGraylog Engineering - Design Your Architecture
Graylog Engineering - Design Your ArchitectureGraylog
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDatabricks
 
Mlflow with databricks
Mlflow with databricksMlflow with databricks
Mlflow with databricksLiangjun Jiang
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Diveconfluent
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsBrendan Gregg
 

Was ist angesagt? (20)

Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Mother of Language`s Langchain
Mother of Language`s LangchainMother of Language`s Langchain
Mother of Language`s Langchain
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep Learning
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Graylog Engineering - Design Your Architecture
Graylog Engineering - Design Your ArchitectureGraylog Engineering - Design Your Architecture
Graylog Engineering - Design Your Architecture
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
Mlflow with databricks
Mlflow with databricksMlflow with databricks
Mlflow with databricks
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
 

Ähnlich wie Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in SparkDigital Vidya
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Khai Tran
 
JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...
JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...
JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...Joseph Kuo
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Databricks
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Bowen Li
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsAnant Corporation
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelGarindra Prahandono
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019
Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019 Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019
Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019 Chun-Yu Tseng
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
Angular (v2 and up) - Morning to understand - Linagora
Angular (v2 and up) - Morning to understand - LinagoraAngular (v2 and up) - Morning to understand - Linagora
Angular (v2 and up) - Morning to understand - LinagoraLINAGORA
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...Databricks
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architectureStepan Pushkarev
 
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14p6academy
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scaleHenry Saputra
 

Ähnlich wie Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza (20)

Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
 
JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...
JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...
JCConf 2017 - Next Generation of Cloud Computing: Edge Computing and Apache E...
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019
Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019 Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019
Build and Host Real-world Machine Learning Services from Scratch @ pycontw2019
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Angular (v2 and up) - Morning to understand - Linagora
Angular (v2 and up) - Morning to understand - LinagoraAngular (v2 and up) - Morning to understand - Linagora
Angular (v2 and up) - Morning to understand - Linagora
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
 

Kürzlich hochgeladen

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Kürzlich hochgeladen (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

  • 1. Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza ​Khai Tran ​Staff Software Engineer
  • 2. Agenda ● Overview of LinkedIn metrics platform ● Moving from offline to nearline ● Under the hood of the nearline architecture ● Nearline production usecase ● Conclusion
  • 3. Overview of LinkedIn metrics platform
  • 4. Metrics @ LinkedIn ● Metrics = Measurements over tracking data ● Crucial for decision making: ○ Experimentation - test everything ○ Reporting - monitor and alert ○ In production, site-facing applications
  • 5. We provide: ● A trusted repository of metrics ● A self-serve platform for sustainable lifecycle of metrics In production Experimentation Reporting Primary Data Unified Metrics Platform LinkedIn unified metrics platform (UMP)
  • 6. Growth of UMP Metrics 2016 20172015 6800 4680 1100 Current: 8000+ metrics
  • 7. # code LOAD … # data # transformation # code STORE … # config Metrics: - A = SUM(A’) - B = Unique(id) Downstream: - XLNT - Raptor UMP User Code Platform Generated Code To App To App DefineDeclare Onboard Data Metadata Onboarding process User
  • 8. Moving from offline to nearline
  • 9. Offline computation flows Hourly job latency: 3-6 hours -> want realtime/nearline ...... Metric union User code User code Cubing, Rollup Dimension augmentation HDFS tables Dali views Pinot, Presto Azkaban execution Espresso, Oracle, MySQL
  • 10. ... What we want for nearline flows Metric unionUser code User code Samza job Dimension augmentation Pinot
  • 11. Latency is not the only requirement Easy to onboard ● Minimum effort to convert existing offline into nearline ● Easy to write user code for new nearline flows Easy to maintain ● Just one version of user code - single source of truth ● Run as a service Latency ● ~5 - 30 mins
  • 12. Samza jobs Putting things together Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
  • 13. Current support User code in Pig ● LOAD, STORE ● FILTER, SAMPLE, SPLIT, UNION ● Simple FOREACH ● JOIN - all semantics ● GROUP/COGROUP, DISTINCT ● Record/Array FLATTEN ● Java UDFs, Python UDFs ● Pig Nested FOREACH and sort/limit (in Windows) ● Hive Not yet
  • 14. Under the hood of the nearline architecture
  • 15. Pig to Samza through SQL processing Open source framework for building dynamic data management systems. Including: ➢ SQL Parser ➢ Relational algebra APIs ➢ Query planning engine We built UMP nearline with: ➢ Pig’s Grunt parser ➢ Calcite relational algebra ➢ Calcite query planning engine
  • 16. Architecture ... Metric union User code User code Dimension augmentation Calcite relational algebra as an IR convert generate Samza code optimize Samza physical plan Samza configuration Pig to Calcite Calcite to Samza
  • 17. Pig to Calcite # code LOAD … LOAD ... COGROUP ... STORE … GruntParser CO- GROUP LOAD LOAD PigRelConverter FULL OUTERJ OIN AGGRE GATE AGGRE GATE TABLE SCAN TABLE SCAN PRO- JECT User scripts Pig Logical Plan Calcite relational algebra
  • 21. Planning/Optimization ➢ Calcite logical plans: ○ Relational algebra: What to do ➢ Samza physical plans: ○ Samza physical node: How to do it ➢ Calcite Samza planner: ○ Calcite logical plan -> optimized Samza physical plan
  • 22. Example Stream- Stream Self Join Samza Project Samza Project Samza Filter Samza Filter Samza Project Input Stream INNER JOIN FILTER FILTER PROJECT PROJECT PROJECT TABLE SCAN TABLE SCAN Calcite Samza planner Calcite logical plan Samza physical plan
  • 23. Code-gen From Samza physical plans: ➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs . Mapping: ➢ Samza physical nodes -> corresponding stream APIs: ○ Samza project -> stream.map() ○ Samza filter -> stream.filter() ○ ... ➢ Relational expressions -> lamba functions: ○ Filter expressions -> filter() functions ○ Project expressions -> map() functions ○ ...
  • 24. Example Schema and UDF declarations Operator mapping Filter functions Map functions Produce to Kafka ...
  • 26. Nearline production use case - Storylines
  • 27. Top stories picked up by editors
  • 28. Feedback to editor - powered by UMP realtime
  • 30. Samza jobs From improved Lambda architecture... Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
  • 31. … to our bigger picture Pig Latin Calcite relational algebra HiveQL SparkSQL/ RDD Presto SQL Portable UDFs AORA (Author Once, Run Anywhere) architecture