Production-Ready Flink and Hive Integration
- what story you can tell now
Bowen Li, Committer@Flink
Flink Forward SF, April 2020
Agenda
● Background
● Flink 1.10 - State of the Union
● Flink 1.11 - Upcoming Features
● Stories You’d be able to tell now
● Q&A
Background
● Apache Flink is a framework and distributed processing engine for stateful
computations over unbounded (streaming) and bounded (batch) data streams.
● Flink’s philosophy: Batch is a special case of Streaming
Background
We announced the Flink + Hive integration as beta in Flink 1.9, aiming to
● Help businesses shift to a more real-time fashion
● Bring Flink’s cutting-edge Streaming capabilities to Batch
● Unify companies’ tech stack of online and offline data infrastructure
● Simplify deployments and operations
● Lower the entry bar and learning cost for end users
○ developers, data scientists, analysts, etc
Flink 1.10 - State of the Union
more on https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/hive/
Unified Metadata Management
Background
● Hive Metastore (HMS) has become the de facto metadata hub in the Hadoop ecosystem
● Many companies have a single HMS to manage all of their metadata, both Hive and
non-Hive metadata, as the single source of truth
Unified Metadata Management
Flink’s HiveCatalog
● introduced in Flink 1.9
● production-ready in Flink 1.10
○ compatible with almost all HMS versions
○ can store Flink’s own metadata of tables, views, UDFs, statistics in HMS
○ also being the bridge for Flink to leverage Hive’s own metadata
HiveCatalog E2E Example
Example: store a Flink Kafka table through HiveCatalog into HMS
and later query the table with Flink SQL
HiveCatalog E2E Example
execution:
planner: blink
type: streaming
...
current-catalog: myhive
current-database: mydatabase
catalogs:
- name: myhive
type: hive
hive-conf-dir: /opt/hive-conf # contains hive-site.xml
Step 1: Configure the HiveCatalog in Flink SQL CLI
HiveCatalog E2E Example
Flink SQL> CREATE TABLE mykafka (name String, age Int) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'test',
'connector.properties.zookeeper.connect' = 'localhost:2181',
'connector.properties.bootstrap.servers' = 'localhost:9092',
'format.type' = 'csv',
'update-mode' = 'append'
);
[INFO] Table has been created.
Step 2: Create a Kafka table in HMS from Flink SQL CLI
HiveCatalog E2E Example
hive> describe formatted mykafka;
Database: default
Table Parameters:
flink.connector.properties.bootstrap.servers localhost:9092
flink.connector.properties.zookeeper.connect localhost:2181
flink.connector.topic test
flink.connector.type kafka
flink.format.type csv
flink.generic.table.schema.0.data-type VARCHAR(2147483647)
flink.generic.table.schema.0.name name
flink.generic.table.schema.1.data-type INT
flink.generic.table.schema.1.name age
....
Step 3: Describe the Kafka table in HMS via Hive CLI
HiveCatalog E2E Example
Flink SQL> select * from mykafka;
SQL Query Result (Table)
Refresh: 1 s Page: Last of 1
name age
tom 15
john 21
kitty 30
amy 24
kaiky 18
Step 4: Query the Kafka table in Flink SQL CLI
● Flink 1.9 - Hive 2.3.4 and 1.2.1
● Flink 1.10 - Most Hive versions, including
○ 1.0, 1.1, 1.2
○ 2.0, 2.1, 2.2, 2.3
○ 3.1
Supported Hive Versions
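The Hive version can be pinned in the catalog definition so Flink picks the matching HMS client. A minimal sketch of the SQL CLI YAML (path and version number are examples, not from the slides):

```yaml
catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /opt/hive-conf   # contains hive-site.xml
    hive-version: 2.3.4             # pin to your HMS version, e.g. 1.2.1 or 3.1.1
```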
● Flink 1.9: Supports reusing all Hive UDFs
● Flink 1.10: Supports reusing all Hive built-in functions
○ Hive has developed a few hundred super-handy built-in functions over the years
○ supported via the Module mechanism, introduced in 1.10 to improve Flink SQL’s extensibility
■ e.g. lays the groundwork for custom data types in, say, geo and machine learning
○ provided by HiveModule
Reuse Hive Functions
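To make Hive built-ins resolvable, the HiveModule is loaded alongside Flink’s core module. A sketch of how this might look in the SQL CLI YAML (the module names and version here are assumptions for illustration):

```yaml
modules:
  - name: core    # keep Flink's own built-in functions
    type: core
  - name: myhive  # adds Hive's built-in functions, e.g. get_json_object
    type: hive
    hive-version: 2.3.4
```

Once loaded, a Hive built-in such as get_json_object can be called directly from Flink SQL, even on non-Hive tables.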
● Flink 1.9
○ read non-partitioned and partitioned Hive tables
● Flink 1.10
○ read Hive views
○ vectorized reader of ORC files
○ enhanced optimizations:
■ partition-pruning
■ projection pushdown
■ limit pushdown
Enhanced Read of Hive Data
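As an illustration of what these optimizations buy you, consider a query against a hypothetical partitioned Hive table (all names are made up):

```sql
-- Assuming a Hive table mydb.web_logs partitioned by dt:
-- partition pruning:   only the dt = '2020-04-20' partition is scanned
-- projection pushdown: only user_id and page_url are read from the ORC files
-- limit pushdown:      the scan stops early once 100 rows are produced
SELECT user_id, page_url
FROM mydb.web_logs
WHERE dt = '2020-04-20'
LIMIT 100;
```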
● Flink 1.9
○ write to non-partitioned Hive tables
● Flink 1.10
○ write to partitioned tables, both static and dynamic partitions
○ support “INSERT OVERWRITE” syntax
Enhanced Write of Hive Data
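A sketch of the two partition-write styles (table and column names are hypothetical):

```sql
-- static partition: the partition value is spelled out in the statement,
-- and INSERT OVERWRITE replaces that partition's existing data
INSERT OVERWRITE mytable PARTITION (dt = '2020-04-20')
SELECT user_id, page_url FROM staging;

-- dynamic partition: the partition value comes from the trailing query column
INSERT INTO mytable
SELECT user_id, page_url, dt FROM staging;
```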
● Hive near-real-time streaming sink
○ bring a bit more real-time experience to Hive
● Native Parquet reader for better performance
● Better interoperability
○ How about creating Hive tables, views, functions within Flink? :D
What’s coming in Flink 1.11
● Better out-of-the-box dependency management
○ Yes, we are fully aware of the mess of Hive’s dependencies
● JDBC Driver/Gateway so users can reuse existing tools to run SQL jobs on Flink
● Support a bit more Hive syntax
What’s coming in Flink 1.11
What use cases do these Flink capabilities enable?
Story 1 - Unify Compute Infrastructure
(diagram: online streams flow through a shared Compute and Storage layer, producing results)
Story 2 - Join real-time data with offline data
(diagram: an MQ stream joined with offline data in Flink, producing results)
* Only if the offline data volume is small to medium
Otherwise, Flink will be overwhelmed by keeping all the history data in state
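The join in Story 2 could be written as a plain Flink SQL join between a Kafka-backed table and a Hive table (names are hypothetical); note that the offline side ends up in Flink state, which is exactly why the small-to-medium caveat applies:

```sql
-- orders: a Kafka table registered via HiveCatalog
-- mydb.users: an offline Hive dimension table
SELECT o.order_id, o.amount, u.region
FROM orders AS o
JOIN mydb.users AS u
  ON o.user_id = u.id;
```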
Story 3 - Backfill data on demand or on failures
(diagram: MQ → online Flink job → Results, with raw data also kept in Hive)
1) Business logic needs to evolve!
2) The online job failed!
In the past, you would need a 2nd Hive job to do the backfill
● Use the same job and code to reprocess historical data on demand
● No need to maintain a separate Hive job
The online pipeline resumes and continues from the latest record, because you don’t
want it to process piled-up data, which would increase pipeline latency and further
sacrifice SLAs
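A hedged sketch of the on-demand backfill in Story 3: the same aggregation SQL as the online job, re-run in batch mode over the raw data retained in Hive (all names are hypothetical):

```sql
-- recompute one day's results from the raw events archived in Hive,
-- overwriting the affected partition instead of running a separate Hive job
INSERT OVERWRITE results PARTITION (dt = '2020-04-19')
SELECT user_id, COUNT(*) AS cnt
FROM mydb.raw_events
WHERE dt = '2020-04-19'
GROUP BY user_id;
```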
Story 4 - Near-real-time data ingestion to Hive (upcoming)
(diagram: MQ → file-based streaming sink → Hive)
A file-based streaming sink that can commit to Hive every 5-10 min
Is this too good to be true, or are there any pitfalls?
A general question at the end though:
Are traditional data warehouses and data lakes really a good fit for a
streaming-native engine like Flink?
A Few Unsolved Problems - Example 1
What if you want to join real-time data with a huge amount of historical data?
● You can’t load it into Flink, as the volume is too big for Flink state to handle
● Maybe point lookups?
○ Well, traditional data warehouses/data lakes don’t support point lookups; they only scan
● Maybe bring in Cassandra/HBase/Redis/…?
○ You’re indeed going in the opposite direction of unification
by diligently adding one additional piece of infra at a time
A Few Unsolved Problems - Example 2
File-based streaming sink that can commit to
Hive every 5-10 min
Does near-real-time come without a limit and a price?
● Pipeline latency cannot go below 5-10 min, otherwise too many small files are created
○ any latency you’ve saved in Kafka and Flink is capped by the file commit interval in the data warehouse/data lake
● Still a lot of small files even if Flink commits every 5-10 min. How do you handle them?
○ maybe commit every 1h? Please, how dare you call it near-real-time when there’s a 1h delay
○ maybe have some hourly + daily Airflow jobs? Good luck going back to the batch world
Is there a Solution?
Yes, we are launching a new purpose-built product, Hologres
check out the session presented by Xiaowei Jiang
“Data Warehouse, Data Lakes, What’s Next?”
April 23 11am-11:40am PDT, Room 5
● Flink 1.10 brings production-ready integration with Hive on both the metadata
management and data handling sides
● There are still many unsolved challenges when integrating Flink with
traditional data warehouses and data lakes
Conclusions
Thanks!
Q&A
Twitter: @Bowen__Li

Weitere ähnliche Inhalte

Was ist angesagt?

Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Flink Forward
 
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Flink Forward
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Flink Forward
 
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward
 

Was ist angesagt? (20)

Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
 
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
 
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
 
A look at Flink 1.2
A look at Flink 1.2A look at Flink 1.2
A look at Flink 1.2
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
 
Flink Forward San Francisco 2019: Developing and operating real-time applicat...
Flink Forward San Francisco 2019: Developing and operating real-time applicat...Flink Forward San Francisco 2019: Developing and operating real-time applicat...
Flink Forward San Francisco 2019: Developing and operating real-time applicat...
 
data Artisans Product Announcement
data Artisans Product Announcementdata Artisans Product Announcement
data Artisans Product Announcement
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 

Ähnlich wie Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - what story you can tell now? - Bowen LI

Ähnlich wie Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - what story you can tell now? - Bowen LI (20)

Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
 
Integrating Flink with Hive - Flink Forward SF 2019
Integrating Flink with Hive - Flink Forward SF 2019Integrating Flink with Hive - Flink Forward SF 2019
Integrating Flink with Hive - Flink Forward SF 2019
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker series
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Flink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systemsFlink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systems
 
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen LiTowards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache Flink
 
Continus sql with sql stream builder
Continus sql with sql stream builderContinus sql with sql stream builder
Continus sql with sql stream builder
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a Service
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit London
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache Flink Online Training
Apache Flink Online TrainingApache Flink Online Training
Apache Flink Online Training
 
Apache flink
Apache flinkApache flink
Apache flink
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System OverviewApache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
 

Mehr von Flink Forward

Mehr von Flink Forward (20)

Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - what story you can tell now? - Bowen LI

  • 1. Production-Ready Flink and Hive Integration - what story you can tell now Bowen Li, Committer@Flink Flink Forward SF, April 2020
  • 2. Agenda ● Background ● Flink 1.10 - State of Union ● Flink 1.11 - Upcoming Features ● Stories You’d be able to tell now ● Q&A
  • 3. Background ● Apache Flink is a framework and distributed processing engine for stateful computations over unbounded (streaming) and bounded (batch) data streams. ● Flink’s philosophy: Batch is a special case of Streaming
  • 4. Background We annouced Flink + Hive integration as beta in Flink 1.9 to aim at ● Help to shift businesses to more real-time fashion ● Bring Flink’s cutting-edge Streaming capabilities to Batch ● Unify companies’ tech stack of online and offline data infrastructure ● Simplify deployments and operations ● Lower entry bar and learning cost for end users ○ developers, data scientists, analysts, etc
  • 5. Flink 1.10 - State of the Union more on https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/hive/
  • 6. Unified Metadata Management Background ● Hive Metastore (HMS) has become the de facto metadata hub in the Hadoop ● Many companies have a single HMS to manage all of their metadata, either Hive or non-Hive metadata, as the single source of truth
  • 7. Unified Metadata Management Flink’s HiveCatalog ● introduced in Flink 1.9 ● production-ready in Flink 1.10 ○ compatible with almost all HMS versions ○ can store Flink’s own metadata of tables, views, UDFs, statistics in HMS ○ also being the bridge for Flink to leverage Hive’s own metadata
  • 8. HiveCatalog E2E Example Example to store a Flink’s Kafka table thru HiveCatalog into HMS and later query the table with Flink SQL
  • 9. HiveCatalog E2E Example execution: planner: blink type: streaming ... current-catalog: myhive current-database: mydatabase catalogs: - name: myhive type: hive hive-conf-dir: /opt/hive-conf # contains hive-site.xml Step 1: Configure the HiveCatalog in Flink SQL CLI
  • 10. HiveCatalog E2E Example Flink SQL> CREATE TABLE mykafka (name String, age Int) WITH ( 'connector.type' = 'kafka', 'connector.version' = 'universal', 'connector.topic' = 'test', 'connector.properties.zookeeper.connect' = 'localhost:2181', 'connector.properties.bootstrap.servers' = 'localhost:9092', 'format.type' = 'csv', 'update-mode' = 'append' ); [INFO] Table has been created. Step 2: Create a Kafka table in HMS from Flink SQL CLI
  • 11. HiveCatalog E2E Example hive> describe formatted mykafka; Database: default Table Parameters: flink.connector.properties.bootstrap.servers localhost:9092 flink.connector.properties.zookeeper.connect localhost:2181 flink.connector.topic test flink.connector.type kafka flink.format.type csv flink.generic.table.schema.0.data-typeVARCHAR(2147483647) flink.generic.table.schema.0.name name flink.generic.table.schema.1.data-typeINT flink.generic.table.schema.1.name age .... Step 3: Describe the Kafka table in HMS via Hive CLI
  • 12. HiveCatalog E2E Example Flink SQL> select * from mykafka; SQL Query Result (Table) Refresh: 1 s Page: Last of 1 name age tom 15 john 21 kitty 30 amy 24 kaiky 18 Step 4: Query the Kafka table in Flink SQL CLI
  • 13. ● Flink 1.9 - Hive 2.3.4 and 1.2.1 ● Flink 1.10 - Most Hive versions include ○ 1.0, 1.1, 1.2 ○ 2.0, 2.1, 2.2, 2.3 ○ 3.1 Supported Hive Versions
  • 14. Reuse Hive Functions ● Flink 1.9: supports reusing all Hive UDFs ● Flink 1.10: supports reusing all Hive built-in functions ○ Hive has developed a few hundred handy built-in functions over the years ○ supported via the Module mechanism, introduced in 1.10 to improve Flink SQL’s extensibility ■ e.g., it paves the way for custom data types in, say, geo and machine learning ○ provided by HiveModule
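As a sketch of how this looks in the SQL CLI (the module name `myhive` and the Hive version here are assumptions for illustration), the HiveModule is listed in the YAML configuration alongside the core module:

```yaml
modules:            # functions are resolved module by module, in this order
  - name: core
    type: core
  - name: myhive    # hypothetical name; exposes Hive's built-in functions
    type: hive
    hive-version: 2.3.4
```

Once loaded, Hive built-ins such as get_json_object can be called directly from Flink SQL queries.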
  • 15. Enhanced Read of Hive Data ● Flink 1.9 ○ read non-partitioned and partitioned Hive tables ● Flink 1.10 ○ read Hive views ○ vectorized reading of ORC files ○ enhanced optimizations: ■ partition pruning ■ projection pushdown ■ limit pushdown
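For illustration (the table and column names below are made up), a query like the following lets those optimizations kick in: only the matching partitions are scanned, only the projected columns are read, and the scan stops early:

```sql
-- partition pruning: only partition dt='2020-04-01' is read
-- projection pushdown: only user_id and amount are read from the files
-- limit pushdown: the scan stops after enough rows are produced
SELECT user_id, amount
FROM myhive.mydb.orders
WHERE dt = '2020-04-01'
LIMIT 100;
```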
  • 16. Enhanced Write of Hive Data ● Flink 1.9 ○ write to non-partitioned Hive tables ● Flink 1.10 ○ write to partitioned tables, both static and dynamic partitions ○ support for the “INSERT OVERWRITE” syntax
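A minimal sketch of both partition write paths (table, column, and partition names are hypothetical):

```sql
-- static partition: the partition value is fixed in the statement,
-- and INSERT OVERWRITE replaces that partition's existing data
INSERT OVERWRITE myhive.mydb.orders PARTITION (dt='2020-04-01')
SELECT user_id, amount FROM staging_orders;

-- dynamic partition: the partition value comes from the trailing
-- column(s) of the query, so each row routes itself to a partition
INSERT INTO myhive.mydb.orders
SELECT user_id, amount, dt FROM staging_orders;
```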
  • 17. What’s coming in Flink 1.11 ● Hive near-real-time streaming sink ○ brings a bit more of a real-time experience to Hive ● native Parquet reader for better performance ● better interoperability ○ how about creating Hive tables, views, and functions from within Flink? :D
  • 18. What’s coming in Flink 1.11 ● better out-of-the-box dependency management ○ yes, we are fully aware of the mess of Hive’s dependencies ● a JDBC driver/gateway so users can reuse existing tools to run SQL jobs on Flink ● support for a bit more Hive syntax
  • 19. What use cases do these Flink capabilities enable?
  • 20. Story 1 - Unify Compute Infrastructure Online Streams Results Compute Storage
  • 21. Story 2 - Join real-time data with offline data MQ Results * Only if the offline data volume is small to medium; otherwise, your Flink job will be strained by keeping all the historical data in state
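A minimal sketch of such an enrichment join, assuming a Kafka-backed stream table and a Hive dimension table registered through HiveCatalog (all names here are hypothetical):

```sql
-- join each streaming order against offline user profiles stored in Hive;
-- viable only while the Hive side is small enough to hold in Flink state
SELECT o.order_id, o.amount, u.country
FROM kafka_orders AS o
JOIN myhive.mydb.user_profiles AS u
  ON o.user_id = u.user_id;
```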
  • 22. Story 3 - Backfill data on demand or on failures MQ Raw data Results
  • 23. MQ Raw data Results 1) Business logic needs to evolve! 2) The online job failed!
  • 24. MQ Raw data Results 1) Business logic needs to evolve! 2) The online job failed! In the past, you would need a second Hive job to do the backfill
  • 25. MQ Raw data Results ● Use the same job and code to reprocess historical data on demand ● No need to maintain a separate Hive job ● The online pipeline resumes from the latest record, because you don’t want it to process the piled-up data, increase pipeline latency, and further sacrifice SLAs
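Conceptually, the backfill reuses the streaming query’s logic and only swaps the source, e.g. (all table names here are invented for illustration):

```sql
-- online pipeline: continuous aggregation over the message queue
INSERT INTO results
SELECT user_id, count(*) FROM kafka_events GROUP BY user_id;

-- backfill: the same statement run in batch mode over the raw data in Hive,
-- overwriting the affected results instead of appending
INSERT OVERWRITE results
SELECT user_id, count(*) FROM myhive.mydb.raw_events GROUP BY user_id;
```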
  • 26. MQ Story 4 - Near-real-time data ingestion to Hive (upcoming) File-based streaming sink that can commit to Hive every 5-10 min
  • 27. MQ Near-real-time data ingestion to Hive File-based streaming sink that can commit to Hive every 5-10 min Is this too good to be true, or are there any pitfalls?
  • 28. A general question at the end though: Are traditional data warehouses and data lakes really a good fit for a streaming-native engine like Flink?
  • 29. A Few Unsolved Problems - Example 1 What if you want to join real-time data with a huge amount of historical data? ● You can’t load it all into Flink, as the volume is too big for Flink state to handle ● Maybe point lookups? ○ Well, traditional data warehouses/data lakes don’t support point lookups; they only scan ● Maybe bring in Cassandra/HBase/Redis/…? ○ Then you’re going in the opposite direction of unification, diligently adding one additional piece of infra at a time MQ Results
  • 30. MQ A Few Unsolved Problems - Example 2 File-based streaming sink that can commit to Hive every 5-10 min Does near-real-time come without limits and costs? ● Pipeline latency cannot go below 5-10 minutes; otherwise too many small files are created ○ any latency you’ve saved in Kafka and Flink is capped by the file commit interval in the dw/dl ● Still a lot of small files even if Flink commits every 5-10 minutes. How do you handle them? ○ maybe commit every 1h? Please, how dare you call it near-real-time when there’s a 1h delay ○ maybe have some hourly + daily Airflow jobs? Good luck going back to the batch world
  • 31. Is there a Solution?
  • 32. Is there a Solution? Yes, we are launching a new purpose-built product, Hologres. Check out the session presented by Xiaowei Jiang, “Data Warehouse, Data Lakes, What’s Next?”, April 23, 11am-11:40am PDT, Room 5
  • 33. Conclusions ● Flink 1.10 brings production-ready integration with Hive on both the metadata management and data handling sides ● There are still many unsolved challenges when integrating Flink with traditional data warehouses and data lakes