Building Data Products at LinkedIn with DataFu
©2013 LinkedIn Corporation. All Rights Reserved.
Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/
Tools of the trade
What tools do we use?
Languages:
 Java (MapReduce)
 Pig
 R
 Hive
 Crunch
Systems:
 Voldemort
 Kafka
 Azkaban
Pig: Usually the language of choice
 High-level data flow language that produces MapReduce jobs
 Used extensively at LinkedIn for building data products.
 Why?
– Concise (compared to Java)
– Expressive
– Mature
– Easy to use and understand
– More approachable than Java for some
– Easy to learn if you know SQL
– Easy to learn even if you don't know SQL
– Extensible through UDFs
– Reports task statistics
Pig: Extensibility
 Several types of UDFs you can write:
– Eval
– Algebraic
– Accumulator
 We do this a lot.
 Over time we accumulated a lot of useful UDFs
 Decided to open source them as DataFu library
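Eval is the workhorse type: it is called once per input tuple and returns a value. Besides Java, Pig can also run UDFs written in Python via Jython; a minimal Eval-style sketch (the function name and the registration snippet in the comments are illustrative, not from this deck):

```python
# A minimal Eval-style UDF sketched in Python. In a real Pig script you would
# save this in a .py file, declare the return schema with an @outputSchema
# decorator, and register it, e.g.:
#   REGISTER 'myudfs.py' USING jython AS myudfs;
#   data = FOREACH data GENERATE myudfs.trim_lower(name);

def trim_lower(s):
    """Eval UDF: called once per input value; returns one output value."""
    if s is None:          # Pig passes null through as None
        return None
    return s.strip().lower()
```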
DataFu
Collection of UDFs for Pig
DataFu: History
 Several teams were developing UDFs
 But:
– Not centralized in one library
– Not shared
– No automated tests
 Solution:
– Packaged UDFs in DataFu library
– Automated unit tests, mostly through PigUnit
 Started out as internal project.
 Open sourced September, 2011.
DataFu Examples
Collection of UDFs for Pig
DataFu: Assert UDF
 About as simple as a UDF gets: it throws an exception when its first argument is zero.
 A convenient way to validate assumptions about data.
 What if member IDs should never be negative? Assert on this condition:
data = FILTER data BY ASSERT((memberId >= 0 ? 1 : 0), 'member ID was negative, doh!');
 Implementation:
public Boolean exec(Tuple tuple) throws IOException {
  if ((Integer) tuple.get(0) == 0) {
    if (tuple.size() > 1)
      throw new IOException("Assertion violated: " + tuple.get(1).toString());
    else
      throw new IOException("Assertion violated.");
  }
  else return true;
}
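For illustration, the same contract can be sketched outside Pig. This hypothetical Python helper mirrors the UDF: a condition of zero raises with the supplied message, anything else passes the row through:

```python
def pig_assert(condition, message="Assertion violated."):
    """Mimics DataFu's ASSERT: condition is 0 or 1; raises on 0, else True."""
    if condition == 0:
        raise IOError("Assertion violated: " + message)
    return True

# Rows survive the filter only while every assertion holds
rows = [{"memberId": 1}, {"memberId": 7}]
valid = [r for r in rows
         if pig_assert(1 if r["memberId"] >= 0 else 0,
                       "member ID was negative")]
```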
DataFu: Coalesce UDF
 Using ternary operators is fairly common in Pig.
 Replace null values with zero:
data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;
 Return first non-null value among several fields:
data = FOREACH data GENERATE (val1 IS NOT NULL ? val1 :
                               (val2 IS NOT NULL ? val2 :
                                 (val3 IS NOT NULL ? val3 :
                                   NULL))) as result;
 Unfortunately, out of the box there's no better way to do this in Pig.
DataFu: Coalesce UDF
 Simplify the code using the Coalesce UDF from DataFu
– Behaves the same as COALESCE in SQL
 Replace any null value with 0:
data = FOREACH data GENERATE Coalesce(val,0) as result;
 Return first non-null value:
data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;
 Implementation:
public Object exec(Tuple input) throws IOException {
  if (input == null || input.size() == 0) return null;
  for (Object o : input) {
    if (o != null) return o;
  }
  return null;
}
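The semantics are easy to pin down in a few lines of Python (a sketch of SQL COALESCE behavior, not DataFu's actual code):

```python
def coalesce(*vals):
    """Return the first non-null (non-None) argument, like SQL COALESCE."""
    for v in vals:
        if v is not None:
            return v
    return None
```

So `coalesce(val, 0)` replaces a null `val` with 0, and `coalesce(val1, val2, val3)` picks the first field that is present.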
DataFu: In UDF
 Suppose we want to filter some data based on a field equalling one
of many values.
 Can chain together conditional checks using OR:
data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
dump data;
-- (roses,red)
-- (violets,blue)
-- (sugar,sweet)
data = FILTER data BY adj == 'red' OR adj == 'blue';
dump data;
-- (roses,red)
-- (violets,blue)
 As the number of items grows this really becomes a pain.
DataFu: In UDF
 Much simpler using the In UDF:
data = FILTER data BY In(adj,'red','blue');
 Implementation:
public Boolean exec(Tuple input) throws IOException {
  Object o = input.get(0);
  Boolean match = false;
  if (o != null) {
    for (int i = 1; i < input.size() && !match; i++) {
      match = match || o.equals(input.get(i));
    }
  }
  return match;
}
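Again the behavior is simple to state precisely; a hypothetical Python equivalent (note that a null first argument never matches, just as in the Java above):

```python
def pig_in(value, *candidates):
    """Mimics DataFu's In: true if value equals any candidate; null never matches."""
    if value is None:
        return False
    return any(value == c for c in candidates)

# Same data as the slide's example
rows = [("roses", "red"), ("violets", "blue"), ("sugar", "sweet")]
filtered = [r for r in rows if pig_in(r[1], "red", "blue")]
```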
DataFu: CountEach UDF
 Suppose we have a system that recommends items to users.
 We've tracked what items have been recommended:
items = FOREACH items GENERATE memberId, itemId;
 Let's count how many times each item has been shown to a user.
 Desired output schema:
{memberId: int, items: {(itemId: long, cnt: long)}}
DataFu: CountEach UDF
 Typically, we would first count (member, item) pairs:
items = GROUP items BY (memberId, itemId);
items = FOREACH items GENERATE
  group.memberId as memberId,
  group.itemId as itemId,
  COUNT(items) as cnt;
 Then we would group again on member:
items = GROUP items BY memberId;
items = FOREACH items GENERATE
  group as memberId,
  items.(itemId, cnt) as items;
 But, this requires two MapReduce jobs!
DataFu: CountEach UDF
 Using the CountEach UDF, we can accomplish the same thing with
one MR job and much less code:
items = FOREACH (GROUP items BY memberId) GENERATE
  group as memberId,
  CountEach(items.(itemId)) as items;
 Not only is it more concise, but it has better performance:
– Wall clock time: 50% reduction
– Total task time: 33% reduction
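What CountEach does in a single pass can be sketched in Python with a per-member counter (hypothetical data; the output shape matches the desired schema above):

```python
from collections import Counter, defaultdict

# Hypothetical (memberId, itemId) impressions
items = [(1, 10), (1, 10), (1, 20), (2, 10)]

# One grouping pass, counting items as we go -- the same shape CountEach
# produces: {memberId, {(itemId, cnt)}}
per_member = defaultdict(Counter)
for member_id, item_id in items:
    per_member[member_id][item_id] += 1

result = {m: sorted(c.items()) for m, c in per_member.items()}
```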
DataFu: Session Statistics
 Session: A period of sustained user activity
 Suppose we have a stream of user clicks:
pv = LOAD 'pageviews.csv' USING PigStorage(',')
AS (memberId:int, time:long, url:chararray);
 What session length statistics are we interested in?
– Median
– Variance
– Percentiles (90th, 95th)
 How will we define a session?
– In this example: No gaps in activity greater than 10 minutes
DataFu: Session Statistics
 Define our UDFs:
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');
DEFINE VAR datafu.pig.stats.VAR();
DataFu: Session Statistics
 Sessionize the data, appending a session ID to each tuple
pv = FOREACH pv GENERATE time, memberId;
pv_sessionized = FOREACH (GROUP pv BY memberId) {
  ordered = ORDER pv BY time;
  GENERATE FLATTEN(Sessionize(ordered))
    AS (time, memberId, sessionId);
};
pv_sessionized = FOREACH pv_sessionized GENERATE
  sessionId, memberId, time;
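The sessionization rule itself — start a new session whenever the gap since the previous event exceeds the threshold — can be sketched in Python (a hypothetical helper, not DataFu's implementation):

```python
from datetime import timedelta

def sessionize(times_ms, gap=timedelta(minutes=10)):
    """Assign a session ID to each timestamp (milliseconds, sorted ascending).

    A new session starts whenever the gap since the previous event
    exceeds `gap` -- the same rule as Sessionize('10m')."""
    gap_ms = gap.total_seconds() * 1000
    session_ids, current = [], 0
    for i, t in enumerate(times_ms):
        if i > 0 and t - times_ms[i - 1] > gap_ms:
            current += 1          # gap too large: open a new session
        session_ids.append(current)
    return session_ids
```

For example, events at 0 ms and 60,000 ms share a session, while an event ~32 minutes later starts a new one.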
DataFu: Session Statistics
 Compute session length in minutes:
session_times =
  FOREACH (GROUP pv_sessionized BY (sessionId, memberId))
  GENERATE group.sessionId as sessionId,
    group.memberId as memberId,
    (MAX(pv_sessionized.time) -
     MIN(pv_sessionized.time))
      / 1000.0 / 60.0 as session_length;
 Compute session length statistics:
session_stats = FOREACH (GROUP session_times ALL) {
  GENERATE
    AVG(session_times.session_length) as avg_session,
    SQRT(VAR(session_times.session_length)) as std_dev_session,
    Median(session_times.session_length) as median_session,
    Quantile(session_times.session_length) as quantiles_session;
};
DUMP session_stats
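The statistics side can be checked locally too; this Python sketch computes the same summary over a small hypothetical sample (exact quantile values will differ from StreamingQuantile's approximations):

```python
import math
import statistics

# Hypothetical session lengths in minutes
session_lengths = [1.5, 2.0, 3.5, 10.0, 42.0]

avg = statistics.mean(session_lengths)
std_dev = math.sqrt(statistics.pvariance(session_lengths))
median = statistics.median(session_lengths)
# statistics.quantiles with n=20 returns 19 cut points;
# indices 17 and 18 are the 90th and 95th percentiles
q90, q95 = statistics.quantiles(session_lengths, n=20)[17:19]
```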
DataFu: Session Statistics
 Who are the most engaged users?
 Report users who had sessions in the upper 95th percentile:
long_sessions =
  FILTER session_times BY
    session_length >
      session_stats.quantiles_session.quantile_0_95;
very_engaged_users =
  DISTINCT (FOREACH long_sessions GENERATE memberId);
DUMP very_engaged_users
DataFu: Left join multiple relations
 Suppose we have three data sets:
input1 = LOAD 'input1' USING PigStorage(',') AS (key:INT, val:INT);
input2 = LOAD 'input2' USING PigStorage(',') AS (key:INT, val:INT);
input3 = LOAD 'input3' USING PigStorage(',') AS (key:INT, val:INT);
 We want to left join input1 with input2 and input3.
 Unfortunately, in Pig you can only perform outer joins on two
relations.
 This doesn't work:
joined = JOIN input1 BY key LEFT,
              input2 BY key,
              input3 BY key;
DataFu: Left join multiple relations
 Instead you have to left join twice:
data1 = JOIN input1 BY key LEFT, input2 BY key;
data2 = JOIN data1 BY input1::key LEFT, input3 BY key;
 This is inefficient, as it requires two MapReduce jobs!
 Left joins are very common
 Take a recommendation system for example:
– Typically you build a candidate set, then join in features.
– As number of features increases, so can number of joins.
DataFu: Left join multiple relations
 But, there's always COGROUP:
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
  FLATTEN(input1), -- left join on this
  FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
    as (input2::key, input2::val),
  FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
    as (input3::key, input3::val);
 COGROUP is the same operation as GROUP
– Convention: use COGROUP when grouping multiple relations, for readability.
 This is ugly and hard to follow, but it does work.
 The code wouldn't be so bad if it weren't for the nasty ternary
expression.
 Perfect opportunity for writing a UDF.
DataFu: Left join multiple relations
 We wrote EmptyBagToNullFields to replace this ternary logic.
 Much cleaner:
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
  FLATTEN(input1), -- left join on this
  FLATTEN(EmptyBagToNullFields(input2)),
  FLATTEN(EmptyBagToNullFields(input3));
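The whole COGROUP trick — one grouping pass, with empty bags turned into null fields — can be sketched in Python (hypothetical helper and data):

```python
from collections import defaultdict

def cogroup_left_join(left, *rights):
    """Left join `left` (list of (key, val)) against any number of
    right-hand relations in one grouping pass.

    Missing matches become None, like EmptyBagToNullFields; multiple
    matches fan out into multiple rows, like FLATTEN."""
    index = [defaultdict(list) for _ in rights]
    for idx, rel in zip(index, rights):
        for key, val in rel:
            idx[key].append(val)
    out = []
    for key, val in left:
        # Empty match bag -> a single null placeholder
        groups = [idx[key] if idx[key] else [None] for idx in index]
        rows = [(key, val)]
        for g in groups:                      # cross product of matches
            rows = [r + (v,) for r in rows for v in g]
        out.extend(rows)
    return out
```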
Learning More
data.linkedin.com
 

Building Data Products at LinkedIn with DataFu

  • 1. Building Data Products at LinkedIn with DataFu ©2013 LinkedIn Corporation. All Rights Reserved.
  • 2. Matthew Hayes Staff Software Engineer www.linkedin.com/in/matthewterencehayes/ ©2013 LinkedIn Corporation. All Rights Reserved.
  • 3. Tools of the trade ©2013 LinkedIn Corporation. All Rights Reserved.
  • 4. What tools do we use? Languages:  Java (MapReduce)  Pig  R  Hive  Crunch Systems:  Voldemort  Kafka  Azkaban ©2013 LinkedIn Corporation. All Rights Reserved.
  • 5. Pig: Usually the language of choice  High-level data flow language that produces MapReduce jobs  Used extensively at LinkedIn for building data products.  Why? – Concise (compared to Java) – Expressive – Mature – Easy to use and understand – More approachable than Java for some – Easy to learn if you know SQL – Easy to learn even if you don't know SQL – Extensible through UDFs – Reports task statistics ©2013 LinkedIn Corporation. All Rights Reserved.
  • 6. Pig: Extensibility  Several types of UDFs you can write: – Eval – Algebraic – Accumulator  We do this a lot.  Over time we accumulated a lot of useful UDFs  Decided to open source them as DataFu library ©2013 LinkedIn Corporation. All Rights Reserved.
  • 7. DataFu Collection of UDFs for Pig ©2013 LinkedIn Corporation. All Rights Reserved.
  • 8. DataFu: History  Several teams were developing UDFs  But: – Not centralized in one library – Not shared – No automated tests  Solution: – Packaged UDFs in DataFu library – Automated unit tests, mostly through PigUnit  Started out as internal project.  Open sourced September, 2011. ©2013 LinkedIn Corporation. All Rights Reserved.
  • 9. DataFu Examples Collection of UDFs for Pig ©2013 LinkedIn Corporation. All Rights Reserved.
  • 10. DataFu: Assert UDF  About as simple as a UDF gets. Blows up when it encounters zero.  A convenient way to validate assumptions about data.  What if member IDs can't and shouldn't be zero? Assert on this condition: ©2013 LinkedIn Corporation. All Rights Reserved.

        data = filter data by ASSERT((memberId >= 0 ? 1 : 0), 'member ID was negative, doh!');

     Implementation:

        public Boolean exec(Tuple tuple) throws IOException {
          if ((Integer) tuple.get(0) == 0) {
            if (tuple.size() > 1)
              throw new IOException("Assertion violated: " + tuple.get(1).toString());
            else
              throw new IOException("Assertion violated.");
          }
          else
            return true;
        }
  • 11. DataFu: Coalesce UDF  Using ternary operators is fairly common in Pig.  Replace null values with zero: ©2013 LinkedIn Corporation. All Rights Reserved.

        data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;

     Return first non-null value among several fields:

        data = FOREACH data GENERATE
          (val1 IS NOT NULL ? val1 :
            (val2 IS NOT NULL ? val2 :
              (val3 IS NOT NULL ? val3 : NULL))) as result;

     Unfortunately, out of the box there's no better way to do this in Pig.
  • 12. DataFu: Coalesce UDF  Simplify the code using the Coalesce UDF from DataFu – Behaves the same as COALESCE in SQL  Replace any null value with 0: ©2013 LinkedIn Corporation. All Rights Reserved.

        data = FOREACH data GENERATE Coalesce(val,0) as result;

     Return first non-null value:

        data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;

     Implementation:

        public Object exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
            return null;
          for (Object o : input) {
            if (o != null)
              return o;
          }
          return null;
        }
  • 13. DataFu: In UDF  Suppose we want to filter some data based on a field equalling one of many values.  Can chain together conditional checks using OR: ©2013 LinkedIn Corporation. All Rights Reserved.

        data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
        dump data;
        -- (roses,red)
        -- (violets,blue)
        -- (sugar,sweet)

        data = FILTER data BY adj == 'red' OR adj == 'blue';
        dump data;
        -- (roses,red)
        -- (violets,blue)

     As the number of items grows this really becomes a pain.
  • 14. DataFu: In UDF  Much simpler using the In UDF: ©2013 LinkedIn Corporation. All Rights Reserved.

        data = FILTER data BY In(adj,'red','blue');

     Implementation:

        public Boolean exec(Tuple input) throws IOException {
          Object o = input.get(0);
          Boolean match = false;
          if (o != null) {
            for (int i=1; i<input.size() && !match; i++) {
              match = match || o.equals(input.get(i));
            }
          }
          return match;
        }
  • 15. DataFu: CountEach UDF ©2013 LinkedIn Corporation. All Rights Reserved.  Suppose we have a system that recommends items to users.  We've tracked what items have been recommended:

        items = FOREACH items GENERATE memberId, itemId;

    • Let's count how many times each item has been shown to a user.
    • Desired output schema: {memberId: int, items: {(itemId: long, cnt: long)}}
  • 16. DataFu: CountEach UDF ©2013 LinkedIn Corporation. All Rights Reserved.  Typically, we would first count (member,item) pairs:

        items = GROUP items BY (memberId,itemId);
        items = FOREACH items GENERATE
          group.memberId as memberId,
          group.itemId as itemId,
          COUNT(items) as cnt;

     Then we would group again on member:

        items = GROUP items BY memberId;
        items = FOREACH items GENERATE
          group as memberId,
          items.(itemId,cnt) as items;

    • But, this requires two MapReduce jobs!
  • 17. DataFu: CountEach UDF ©2013 LinkedIn Corporation. All Rights Reserved.  Using the CountEach UDF, we can accomplish the same thing with one MR job and much less code:

        items = FOREACH (GROUP items BY memberId) GENERATE
          group as memberId,
          CountEach(items.(itemId)) as items;

    • Not only is it more concise, but it has better performance:
    – Wall clock time: 50% reduction
    – Total task time: 33% reduction
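The slides show how CountEach is called but not what it computes internally. As a rough illustration of the semantics only (not the actual DataFu Java implementation), a Python sketch of the per-member counting — a bag of (itemId,) tuples in, a bag of (itemId, count) tuples out, in a single pass — might look like:

```python
from collections import Counter

def count_each(bag):
    """Sketch of CountEach semantics: given one member's bag of (itemId,)
    tuples, return a bag of (itemId, count) tuples in a single pass,
    avoiding the second GROUP BY of the two-job approach."""
    counts = Counter(item for (item,) in bag)
    return [(item, cnt) for item, cnt in counts.items()]
```

This is why one MapReduce job suffices: the counting happens inside the reducer for each member's group, rather than in a separate grouping pass.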
  • 18. DataFu: Session Statistics  Session: A period of sustained user activity  Suppose we have a stream of user clicks: ©2013 LinkedIn Corporation. All Rights Reserved.

        pv = LOAD 'pageviews.csv' USING PigStorage(',')
          AS (memberId:int, time:long, url:chararray);

     What session length statistics are we interested in?
    – Median
    – Variance
    – Percentiles (90th, 95th)
     How will we define a session?
    – In this example: No gaps in activity greater than 10 minutes
  • 19. DataFu: Session Statistics  Define our UDFs: ©2013 LinkedIn Corporation. All Rights Reserved.

        DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
        DEFINE Median datafu.pig.stats.StreamingMedian();
        DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');
        DEFINE VAR datafu.pig.stats.VAR();
  • 20. DataFu: Session Statistics  Sessionize the data, appending a session ID to each tuple: ©2013 LinkedIn Corporation. All Rights Reserved.

        pv = FOREACH pv GENERATE time, memberId;

        pv_sessionized = FOREACH (GROUP pv BY memberId) {
          ordered = ORDER pv BY time;
          GENERATE FLATTEN(Sessionize(ordered)) AS (time, memberId, sessionId);
        };

        pv_sessionized = FOREACH pv_sessionized GENERATE sessionId, memberId, time;
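For illustration, the gap-based logic Sessionize applies can be sketched in Python. This is a hypothetical stand-in, not the DataFu code: the real UDF operates on Pig bags and emits opaque session identifiers, whereas this sketch uses integer IDs over one member's sorted millisecond timestamps.

```python
def sessionize(times, gap_ms=10 * 60 * 1000):
    """Sketch of Sessionize('10m') semantics: given one member's page-view
    timestamps sorted ascending (in milliseconds), start a new session
    whenever the gap between consecutive views exceeds 10 minutes."""
    session_id = 0
    result = []
    for i, t in enumerate(times):
        if i > 0 and t - times[i - 1] > gap_ms:
            session_id += 1
        result.append((t, session_id))
    return result
```

This also shows why the Pig script must ORDER the bag by time before calling Sessionize: gap detection only makes sense over consecutive timestamps.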
  • 21. DataFu: Session Statistics  Compute session length in minutes: ©2013 LinkedIn Corporation. All Rights Reserved.

        session_times = FOREACH (GROUP pv_sessionized BY (sessionId,memberId))
          GENERATE group.sessionId as sessionId,
                   group.memberId as memberId,
                   (MAX(pv_sessionized.time) - MIN(pv_sessionized.time))
                     / 1000.0 / 60.0 as session_length;

     Compute session length statistics:

        session_stats = FOREACH (GROUP session_times ALL) {
          ordered = ORDER session_times BY session_length;
          GENERATE AVG(ordered.session_length) as avg_session,
                   SQRT(VAR(ordered.session_length)) as std_dev_session,
                   Median(ordered.session_length) as median_session,
                   Quantile(ordered.session_length) as quantiles_session;
        };

        DUMP session_stats
  • 22. DataFu: Session Statistics  Who are the most engaged users?  Report users who had sessions in the upper 95th percentile: ©2013 LinkedIn Corporation. All Rights Reserved.

        long_sessions = FILTER session_times BY
          session_length > session_stats.quantiles_session.quantile_0_95;

        very_engaged_users = DISTINCT (FOREACH long_sessions GENERATE memberId);

        DUMP very_engaged_users
  • 23. DataFu: Left join multiple relations  Suppose we have three data sets: ©2013 LinkedIn Corporation. All Rights Reserved.

        input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);
        input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);
        input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);

     We want to left join input1 with input2 and input3.
     Unfortunately, in Pig you can only perform outer joins on two relations.
     This doesn't work:

        joined = JOIN input1 BY key LEFT, input2 BY key, input3 BY key;
  • 24. DataFu: Left join multiple relations  Instead you have to left join twice: ©2013 LinkedIn Corporation. All Rights Reserved.

        data1 = JOIN input1 BY key LEFT, input2 BY key;
        data2 = JOIN data1 BY input1::key LEFT, input3 BY key;

     This is inefficient, as it requires two MapReduce jobs!
     Left joins are very common.
     Take a recommendation system for example:
    – Typically you build a candidate set, then join in features.
    – As the number of features increases, so can the number of joins.
  • 25. DataFu: Left join multiple relations  But, there's always COGROUP: ©2013 LinkedIn Corporation. All Rights Reserved.

        data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
        data2 = FOREACH data1 GENERATE
          FLATTEN(input1), -- left join on this
          FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
            as (input2::key,input2::val),
          FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
            as (input3::key,input3::val);

     COGROUP is the same as GROUP
    – Convention: Use COGROUP instead of GROUP for readability.
     This is ugly and hard to follow, but it does work.
     The code wouldn't be so bad if it weren't for the nasty ternary expression.
     Perfect opportunity for writing a UDF.
  • 26. DataFu: Left join multiple relations  We wrote EmptyBagToNullFields to replace this ternary logic.  Much cleaner: ©2013 LinkedIn Corporation. All Rights Reserved.

        data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
        data2 = FOREACH data1 GENERATE
          FLATTEN(input1), -- left join on this
          FLATTEN(EmptyBagToNullFields(input2)),
          FLATTEN(EmptyBagToNullFields(input3));
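To make the trick concrete, here is a Python sketch of what EmptyBagToNullFields does to each cogrouped bag. The explicit num_fields parameter is an assumption for illustration; the real UDF infers the field count from the Pig schema.

```python
def empty_bag_to_null_fields(bag, num_fields):
    """Sketch of EmptyBagToNullFields semantics: if the cogrouped bag is
    empty, yield a single all-null tuple so that FLATTEN still emits the
    left-hand row (emulating a left outer join); otherwise pass the bag
    through unchanged."""
    if not bag:
        return [(None,) * num_fields]
    return bag
```

Without this substitution, FLATTEN of an empty bag would drop the row entirely, which is exactly the inner-join behavior the ternary expression on the previous slide was working around.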
  • 27. data.linkedin.com Learning More ©2012 LinkedIn Corporation. All Rights Reserved.

Editor's Notes

  1. Today I'm going to talk about how we use Hadoop at LinkedIn to build products with data.
  2. So far covered building data products at a high level. Now let's look more at the tools we use to work with the data.
  3. This is a non-exhaustive list of some of the tools we use to develop data products at LinkedIn. I'm going to only focus on Pig here.
  4. Mention that will focus on Pig for the remainder, because it is used so heavily within LinkedIn for building data products.
  5. Will talk about DataFu. The thing I want you to get out of this is that UDFs are very useful and you can write them yourselves. When you are writing Pig code think about whether a problem could best be solved with a UDF. The advantage of UDFs is that they are reusable.
  6. Will talk about DataFu. The thing I want you to get out of this is that UDFs are very useful and you can write them yourselves. When you are writing Pig code think about whether a problem could best be solved with a UDF. The advantage of UDFs is that they are reusable.
  7. We use Coalesce because with endorsements we are joining in features to candidates for ranking purposes. There may not be a feature corresponding to a candidate, in which case we want to replace with zero.
  8. CountEach is used by endorsements. We recommend items to members and want counts to improve our algorithms.
  9. There are also non-streaming versions of median and quantiles, but these are less efficient because they require the input data to be sorted.
  10. Left joins are used quite often. We use it a lot in endorsements. Again, we have candidates and need to join in features for ranking. We don't want to eliminate a candidate if there isn't a corresponding feature.