What Is Apache Airflow?
● A workflow management platform
● Workflows are defined in Python
● Schedules workflows by time or by event
● Open source, Apache 2.0 license
● Written in Python
● Workflows are monitored in a web UI
● Has a wide range of integration options
● Originally developed at Airbnb
What Is Apache Airflow?
● Uses SQLite as the default back-end database, but can also use
– MySQL, Postgres and other SQLAlchemy-supported databases
● Install extra packages with the pip command
– A wide variety is available, including
– Many databases and cloud services
– Hadoop ecosystem
– Security, web services, queues
– Many more
Airflow Pipelines
● Pipelines are Python-based workflows
● Each pipeline is a directed acyclic graph (DAG)
● Pipelines use Jinja templating
● Pipelines contain user-defined tasks
● Tasks can run on different workers at different times
● Jinja templates can be embedded in tasks
● Comments can be added to tasks in various formats
● Dependencies between tasks can be defined
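
To make the "directed acyclic graph" idea concrete, here is a minimal sketch that uses only the Python standard library (graphlib, Python 3.9+). It is not Airflow's API, and the task names are hypothetical. It only shows how dependencies between tasks fix a valid run order, and why a cycle cannot be scheduled.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependency graph has no cycle, i.e. it is a DAG."""
    try:
        TopologicalSorter(deps).prepare()  # raises CycleError when a cycle exists
    except CycleError:
        return False
    return True


order = execution_order(dependencies)
print(order)
```

The two transform tasks do not depend on each other, so either may come first in the printed order; this is the same freedom that lets independent tasks run on different workers.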
Airflow Pipelines
Airflow Tasks
● Tasks have a life cycle
● Tasks are executed by operators; the operator used depends on the task type
– For instance MySqlOperator
● Hooks are used to access external systems, e.g. databases
● Worker-specific queues can be used for tasks
● XCom allows tasks to exchange messages
● Pipelines (DAGs) allow
– Branching
– Sub-DAGs
– Service level agreements (SLAs)
– Trigger rules
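
As an illustration of trigger rules, the sketch below decides whether a task may run once all of its upstream tasks have finished. The rule names (all_success, all_failed, one_success, one_failed, all_done) match Airflow's trigger rule names, but the function should_run is hypothetical and much simpler than Airflow's own scheduling logic, which also handles upstream tasks that are still running or were skipped.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a task may run, given the final states of its upstream tasks.

    Simplified assumption: every upstream task has already finished, and each
    state is a plain string such as "success", "failed" or "skipped".
    """
    succeeded = [state == "success" for state in upstream_states]
    failed = [state == "failed" for state in upstream_states]

    if trigger_rule == "all_success":  # Airflow's default rule
        return all(succeeded)
    if trigger_rule == "all_failed":
        return all(failed)
    if trigger_rule == "one_success":
        return any(succeeded)
    if trigger_rule == "one_failed":
        return any(failed)
    if trigger_rule == "all_done":
        return True  # upstream tasks are finished, whatever their outcome
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")


# A clean-up task with "all_done" still runs after an upstream failure.
print(should_run("all_done", ["success", "failed"]))
print(should_run("all_success", ["success", "failed"]))
```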
Airflow Task Stages
● Tasks have life cycle stages
Airflow Task Life Cycle
Airflow UI
● The Airflow UI provides views
– DAG, Tree, Graph, Variables, Gantt Chart
– Task duration, Code view
● Select a task instance in any view to manage it
● Monitor and troubleshoot pipelines in the views
● Monitor DAGs by owner, schedule, run time etc.
● Use the views to find pipeline problem areas
● Use the views to find bottlenecks
Airflow UI
Airflow Integration
● Airflow integrates with
– Azure: Microsoft Azure
– AWS: Amazon Web Services
– Databricks
– GCP: Google Cloud Platform
– Cloud Speech Translate Operators
– Qubole
● Kubernetes
– Run tasks as pods
Airflow Metrics
● Airflow can send metrics to StatsD
– A network daemon that runs on Node.js
– Listens for statistics such as counters, gauges and timers
– Statistics are sent over UDP or TCP
● Install StatsD support with the pip command
● Specify which stats to record, e.g.
– scheduler,executor,dagrun
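
The counters, gauges and timers mentioned above travel as short text lines of the form name:value|type. The following sketch builds such lines and sends them over UDP using only the standard library. The helper names and the metric name are hypothetical, and port 8125 is assumed as the usual StatsD default; this is not the client code Airflow itself uses.

```python
import socket

# StatsD type codes: counter "c", gauge "g", timer "ms" (milliseconds).
TYPE_CODES = {"counter": "c", "gauge": "g", "timer": "ms"}


def format_metric(name, value, metric_type):
    """Build one StatsD metric line of the form <name>:<value>|<type>."""
    return f"{name}:{value}|{TYPE_CODES[metric_type]}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP; fire-and-forget, no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric name, for illustration only.
line = format_metric("example.dagrun.duration", 320, "timer")
print(line)
```

Calling send_metric(line) would deliver the datagram to a StatsD daemon listening on the given host and port; because UDP is connectionless, the call does not wait for an acknowledgement.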
Available Books
● See “Big Data Made Easy”
– Apress, Jan 2015
● See “Mastering Apache Spark”
– Packt, Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress, Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration