SlideShare ist ein Scribd-Unternehmen logo
1 von 44
Downloaden Sie, um offline zu lesen
Let your Big Data Processing take flight with
Apache Falcon
{pallavi.rao, pragya.mittal}@inmobi.com
$whoami
❖ Pallavi Rao
➢ Architect, InMobi
➢ Committer, Apache Falcon
➢ Contributor, Apache PIG
❖ Pragya Mittal
➢ Engineer, InMobi
➢ Committer, Apache Falcon
What is in store for you?
● Some history
● Motivation for Apache Falcon
● Overview of Apache Falcon
● How Apache Falcon is used @ InMobi
● Roadmap
● Q&A
Once upon a time…
Life was simple...
Click Logs Click Enhancer Enhanced
Clicks
Hourly
Aggregation Hourly
Clicks
Daily
clicks
Daily Aggregation
Metadata
Retention : 2 hours Frequency : 5 mins
Late Data arrival
Retention : 2 days
Replication required
Retention : 1 day
Retention : 7 days
Replication required
● Cron jobs to delete/archive data.
● Cron jobs to copy data from one cluster to
another.
● Email notifications and manual retries in case of
failures.
But, didn’t remain that way...
❏ Failures
❏ Data arriving late
❏ Re-processing
❏ Varied Data Replication
❏ Varied Data Retention
❏ Data Archival
❏ Lineage
❏ SLA monitoring
The problem pattern
Problem areas
Data
Management
Data
Governance
Process
Management
● Relays
● Late Data Handling
● Failure Retries
● Reruns
● Data Import/Export
● Retention
● Replication
● Archival
● Lineage
● Audit
● SLA
Overview
The concoction
Sample pipeline in Falcon
Click Logs Click Enhancer Enhanced
Clicks
Hourly
Aggregation Hourly
Clicks
Daily
clicks
Daily Aggregation
Metadata
Retention : 2 hours Frequency : 5 mins
Late Data arrival
Retention : 2 days
Replication required
Retention : 1 day
Retention : 7 days
Replication required
Falcon Feed
Falcon Process CLUSTER
Cluster Specification
<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
<tags>consumer=consumer@xyz.com, owner=producer@xyz.com,
_department_type=forecasting</tags>
<interfaces>
<interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
<interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
<interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
<interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
<interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
</interfaces>
<locations>
<location name="staging" path="/projects/falcon/staging"/>
<location name="temp" path="/tmp"/>
<location name="working" path="/projects/falcon/working"/>
</locations>
<properties>
<property name="field1" value="value1"/>
</properties>
</cluster>
How to access data
Where to execute
HCat
Falcon cache
User defined props
Concoction.. distributed
Process Management
Process Relays
● Data dependency among the process in the pipeline.
● Can wait on imported data or replicated data.
Late Data Arrival
● Defines how late data is handled.
● How long to wait for the data and how to check for late data.
● Optional separate logic to handle late data processing.
Process Specification
<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
<clusters>
<cluster name="corp">
<validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
</cluster>
<parallel>1</parallel>
<order>LIFO</order>
<frequency>hours(1)</frequency>
<inputs>
<input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)"
partition="*/US"/>
</inputs>
<outputs>
<output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
</outputs>
<workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="
/user/guest/workflowlib"/>
<retry policy="periodic" delay="hours(10)" attempts="3"/>
<late-process policy="exp-backoff" delay="hours(1)">
<late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
</late-process>
</process>
Where should the process
run?
How should the process
run?
What to consume?
What to produce?
Processing logic
Late Data processing
Process Scheduling
Data Management
Data Retention As a Service
● Data grows rapidly and cannot be kept around for ever. Needs to be deleted
or archived.
● Retention policy is dictated by the kind of data.
Data Replication As a Service
● Replication required for
○ Disaster Recovery.
○Local/Global Aggregation.
● Configurable resource consumption.
Feed Specification
<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:
feed:0.1">
<frequency>minutes(5)</frequency>
<late-arrival cut-off="hours(1)"/>
<sla slaLow="hours(2)" slaHigh="hours(3)"/>
<clusters>
<cluster name="corp" type="source">
<validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
<cluster name="secondary" type="target">
<validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
<retention limit="days(2)" action="delete"/>
<locations>
<location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}
/${DAY}/${HOUR}/${MINUTE}"/>
</locations>
</cluster>
</clusters>
…..
</feed>
Frequency
Location
SLA Monitoring
Data Retention
Data Replication
Feed Scheduling
Data Governance
● Dependency Graph
● Lineage - Data flow
○Current
○Historical
● SLA - Data monitoring
Capabilities
Dependency Graph
● Relationship existing between various elements in a pipeline.
● Entity Dependency
● Instance Dependency
● Instance - one run of a scheduled entity
● Feed instance
● Process instance
Entity dependency
● It returns the graph depicting the relationship between the various
processes and feeds in a given pipeline.
Click Process
Impression
Process
Click Enriched
Process
Summary
Click Feed
Impression
Feed
Enriched
Feed
Produces
Consumes
Produces
Produces
Consumes
Consumes
Instance dependency
Gives information about the producer and consumer for a particular instance.
Click
Enriched
Process 02:
00
Summary 02:
00
Click Feed 02:00
Impression Feed
02:00
Click Enriched
Feed 02:00
Produces
Consumes
Consumes
Consumes
Billing
Process
02:00
Consumes
Billing Feed 02:00
Produces
Consumes
Lineage
● Logical flow/movement of data from source
to destination.
● Ability to filter out data on certain attributes
of the entities
● Captures the metadata tags associated with
each of the entities as relationships.
● Falcon adds the ability to capture lineage for
both entities and its associated instances.
● Implemented using Graph DB
Graph DB
● DAG notation.
● Stores all CRUD operations.
● Used to store information about entities , instances and their
dependencies.
● Graph APIs exposed to analyse metrics.
● Example : Pipeline Health check.
Output
Entity Graph
Instance Graph
How to query...
Lineage data can be queried in 3 ways:
● REST API
● CLI
● Dashboard (upcoming)
SLA Monitoring
● Feed - Alerts based on data Availability
● Process - Triggering of process instances.
● Two types of SLA
● SLA Low
● SLA High
● Dashboard
● Pluggable Alerting System
● Email
● JMS Notifications
<feed description="click logs" name="Yoda-Minutely" xmlns="uri:falcon:
feed:0.1">
<frequency>minutes(1)</frequency>
<late-arrival cut-off="hours(1)"/>
<sla slaLow="minutes(3)" slaHigh="minutes(5)"/>
<clusters>
<cluster name="uh1-gold" type="source">
<validity start="2015-11-29T10:28Z" end="2015-11-29T10:38Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/tmp/falcon-
regression/FeedSLA/input/${YEAR}/${MONTH}/${DAY}/${HOUR}
/${MINUTE}"/>
</locations>
…..
</feed>
Feed SLA
Case Study
Distributed Processing
Clusters @ InMobi
● Hadoop usage at InMobi
● ~ 6 Clusters
● > 1PB of storage
● > 5TB new data ingested each day
● > 20TB data crunched each day
● > 200 nodes in HDFS/MR clusters & > 40 nodes
in Hbase
● > 175K hadoop jobs / day
● > 60K Oozie workflows / day
● 300+ Falcon feed definitions
● 100+ Falcon process definitions
Processing – Single Data Center
Global Aggregation
et cetera...
● Security
○ Authentication and Authorization.
○ Ability to interface with secure endpoints.
● Recipes
○ A template which multiple process can use.
○ HiveDR, HDFS Replication recipes come OOB.
● Falcon Unit
○ Unit test framework for testing pipelines.
● Triage
● Audit
Other Notable Features
What else is out there?
gobblin - https://github.com/linkedin/gobblin
nifi - https://nifi.apache.org/
Compare and contrast - https://www.linkedin.com/pulse/nifi-vs-falcon-oozie-
birender-saini
● Pluggable lifecycle
○Data Acquisition as a Service.
● A more powerful scheduler
○pipeline recovery
○data availability/manual trigger support
● Improved application packaging/deployment.
● Better UI
● Improved monitoring
Towards the Future
Eager for more?
Learn - http://falcon.apache.org/
Discuss - user@falcon.apache.org, dev@falcon.apache.org
Contribute - http://github.com/apache/falcon
http://issues.apache.org/jira/browse/FALCON

Weitere ähnliche Inhalte

Kürzlich hochgeladen

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyAnusha Are
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxalwaysnagaraju26
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 

Kürzlich hochgeladen (20)

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 

Empfohlen

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 

Empfohlen (20)

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 

Let your data processing take flight with Apache Falcon

  • 1. Let your Big Data Processing take flight with Apache Falcon {pallavi.rao, pragya.mittal}@inmobi.com
  • 2. $whoami ❖ Pallavi Rao ➢ Architect, InMobi ➢ Committer, Apache Falcon ➢ Contributor, Apache PIG ❖ Pragya Mittal ➢ Engineer, InMobi ➢ Committer, Apache Falcon
  • 3. What is in store for you? ● Some history ● Motivation for Apache Falcon ● Overview of Apache Falcon ● How Apache Falcon is used @ InMobi ● Roadmap ● Q&A
  • 4. Once upon a time…
  • 5. Life was simple... Click Logs Click Enhancer Enhanced Clicks Hourly Aggregation Hourly Clicks Daily clicks Daily Aggregation Metadata Retention : 2 hours Frequency : 5 mins Late Data arrival Retention : 2 days Replication required Retention : 1 day Retention : 7 days Replication required ● Cron jobs to delete/archive data. ● Cron jobs to copy data from one cluster to another. ● Email notifications and manual retries in case of failures.
  • 6. But, didn’t remain that way... ❏ Failures ❏ Data arriving late ❏ Re-processing ❏ Varied Data Replication ❏ Varied Data Retention ❏ Data Archival ❏ Lineage ❏ SLA monitoring
  • 8. Problem areas Data Management Data Governance Process Management ● Relays ● Late Data Handling ● Failure Retries ● Reruns ● Data Import/Export ● Retention ● Replication ● Archival ● Lineage ● Audit ● SLA
  • 11. Sample pipeline in Falcon Click Logs Click Enhancer Enhanced Clicks Hourly Aggregation Hourly Clicks Daily clicks Daily Aggregation Metadata Retention : 2 hours Frequency : 5 mins Late Data arrival Retention : 2 days Replication required Retention : 1 day Retention : 7 days Replication required Falcon Feed Falcon Process CLUSTER
  • 12. Cluster Specification <cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1"> <tags>consumer=consumer@xyz.com, owner=producer@xyz.com, _department_type=forecasting</tags> <interfaces> <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/> <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/> <interface type="execute" endpoint="localhost:8032" version="1.1.2"/> <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/> <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/> </interfaces> <locations> <location name="staging" path="/projects/falcon/staging"/> <location name="temp" path="/tmp"/> <location name="working" path="/projects/falcon/working"/> </locations> <properties> <property name="field1" value="value1"/> </properties> </cluster> How to access data Where to execute HCat Falcon cache User defined props
  • 15. Process Relays ● Data dependency among the process in the pipeline. ● Can wait on imported data or replicated data.
  • 16. Late Data Arrival ● Defines how late data is handled. ● How long to wait for the data and how to check for late data. ● Optional separate logic to handle late data processing.
  • 17. Process Specification <process name="clicks-hourly" xmlns="uri:falcon:process:0.1"> <clusters> <cluster name="corp"> <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/> </cluster> <parallel>1</parallel> <order>LIFO</order> <frequency>hours(1)</frequency> <inputs> <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/> </inputs> <outputs> <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/> </outputs> <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib=" /user/guest/workflowlib"/> <retry policy="periodic" delay="hours(10)" attempts="3"/> <late-process policy="exp-backoff" delay="hours(1)"> <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/> </late-process> </process> Where should the process run? How should the process run? What to consume? What to produce? Processing logic Late Data processing
  • 20. Data Retention As a Service ● Data grows rapidly and cannot be kept around for ever. Needs to be deleted or archived. ● Retention policy is dictated by the kind of data.
  • 21. Data Replication As a Service ● Replication required for ○ Disaster Recovery. ○Local/Global Aggregation. ● Configurable resource consumption.
  • 22. Feed Specification <feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon: feed:0.1"> <frequency>minutes(5)</frequency> <late-arrival cut-off="hours(1)"/> <sla slaLow="hours(2)" slaHigh="hours(3)"/> <clusters> <cluster name="corp" type="source"> <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/> <retention limit="days(2)" action="delete"/> </cluster> <cluster name="secondary" type="target"> <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/> <retention limit="days(2)" action="delete"/> <locations> <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH} /${DAY}/${HOUR}/${MINUTE}"/> </locations> </cluster> </clusters> ….. </feed> Frequency Location SLA Monitoring Data Retention Data Replication
  • 25. ● Dependency Graph ● Lineage - Data flow ○Current ○Historical ● SLA - Data monitoring Capabilities
  • 26. Dependency Graph ● Relationship existing between various elements in a pipeline. ● Entity Dependency ● Instance Dependency ● Instance - one run of a scheduled entity ● Feed instance ● Process instance
  • 27. Entity dependency ● It returns the graph depicting the relationship between the various processes and feeds in a given pipeline. Click Process Impression Process Click Enriched Process Summary Click Feed Impression Feed Enriched Feed Produces Consumes Produces Produces Consumes Consumes
  • 28. Instance dependency Gives information about the producer and consumer for a particular instance. Click Enriched Process 02: 00 Summary 02: 00 Click Feed 02:00 Impression Feed 02:00 Click Enriched Feed 02:00 Produces Consumes Consumes Consumes Billing Process 02:00 Consumes Billing Feed 02:00 Produces Consumes
  • 29. Lineage ● Logical flow/movement of data from source to destination. ● Ability to filter out data on certain attributes of the entities ● Captures the metadata tags associated with each of the entities as relationships. ● Falcon adds the ability to capture lineage for both entities and its associated instances. ● Implemented using Graph DB
  • 30. Graph DB ● DAG notation. ● Stores all CRUD operations. ● Used to store information about entities , instances and their dependencies. ● Graph APIs exposed to analyse metrics. ● Example : Pipeline Health check.
  • 33. How to query... Lineage data can be queried in 3 ways: ● REST API ● CLI ● Dashboard (upcoming)
  • 34. SLA Monitoring ● Feed - Alerts based on data Availability ● Process - Triggering of process instances. ● Two types of SLA ● SLA Low ● SLA High ● Dashboard ● Pluggable Alerting System ● Email ● JMS Notifications
  • 35. <feed description="click logs" name="Yoda-Minutely" xmlns="uri:falcon: feed:0.1"> <frequency>minutes(1)</frequency> <late-arrival cut-off="hours(1)"/> <sla slaLow="minutes(3)" slaHigh="minutes(5)"/> <clusters> <cluster name="uh1-gold" type="source"> <validity start="2015-11-29T10:28Z" end="2015-11-29T10:38Z"/> <retention limit="days(2)" action="delete"/> </cluster> </clusters> <locations> <location type="data" path="/tmp/falcon- regression/FeedSLA/input/${YEAR}/${MONTH}/${DAY}/${HOUR} /${MINUTE}"/> </locations> ….. </feed> Feed SLA
  • 37. Clusters @ InMobi ● Hadoop usage at InMobi ● ~ 6 Clusters ● > 1PB of storage ● > 5TB new data ingested each day ● > 20TB data crunched each day ● > 200 nodes in HDFS/MR clusters & > 40 nodes in Hbase ● > 175K hadoop jobs / day ● > 60K Oozie workflows / day ● 300+ Falcon feed definitions ● 100+ Falcon process definitions
  • 38. Processing – Single Data Center
  • 41. ● Security ○ Authentication and Authorization. ○ Ability to interface with secure endpoints. ● Recipes ○ A template which multiple process can use. ○ HiveDR, HDFS Replication recipes come OOB. ● Falcon Unit ○ Unit test framework for testing pipelines. ● Triage ● Audit Other Notable Features
  • 42. What else is out there? gobblin - https://github.com/linkedin/gobblin nifi - https://nifi.apache.org/ Compare and contrast - https://www.linkedin.com/pulse/nifi-vs-falcon-oozie- birender-saini
  • 43. ● Pluggable lifecycle ○Data Acquisition as a Service. ● A more powerful scheduler ○pipeline recovery ○data availability/manual trigger support ● Improved application packaging/deployment. ● Better UI ● Improved monitoring Towards the Future
  • 44. Eager for more? Learn - http://falcon.apache.org/ Discuss - user@falcon.apache.org, dev@falcon.apache.org Contribute - http://github.com/apache/falcon http://issues.apache.org/jira/browse/FALCON