1 © Hortonworks Inc. Confidential 2011 – 2017. All Rights Reserved
DISASTER RECOVERY AND CLOUD MIGRATION FOR YOUR APACHE HIVE WAREHOUSE
Sankar Hariappan
Senior Software Engineer, Hortonworks

Agenda
• About Apache Hive
• Disaster Recovery
• Replication Modes
• Fail Over
• Fail Back
• Replication at Hive-Scale
• Event Based Replication
• Change Management
• Bootstrapping
• REPL Commands
• Demonstration
• Cloud Migration Challenges
• Future Work

About Apache Hive
• Data warehouse tool built on top of Apache Hadoop.
• Handles data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
• Manages large datasets residing in distributed storage.
• SQL with Hive-specific extensions.
• Query optimization powered by Apache Calcite and execution via Apache Tez, Apache Spark, or MapReduce.
• Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
• Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

Apache Hive Architecture
[Diagram. Components shown: Hive Thrift Server, JDBC/ODBC, HiveServer2 (Driver, Compiler, Optimizer, Executor), MS Client, Hive Metastore, RDBMS, HDFS, YARN.]

Disaster Recovery
• Deployment of clusters in more than one data center for business continuity or geo-localization.
• Hybrid cloud deployment for off-premise processing.
• Robust replication solution to achieve seamless disaster recovery.
– Prevent severe data loss.
– Eliminate single point of failure.
– Fault-tolerant.

Replication Modes
• Master-Slave
– Unidirectional replication from Master to Slave.
– Master: Read and Write. Slave: Read.
• Master-Master
– Bidirectional replication between two Masters.
– Both Masters: Read and Write.

Replication Modes
• Hub and Spoke pattern
– One Master (Read, Write) replicates to four Slaves (Read).
• Relay pattern
– A Master (Read, Write) replicates to a Slave (Read), which relays to a second Slave (Read).

Fail Over
• The Slave takes over the Master responsibilities instantaneously.
• Ensures business continuity with minimal data loss based on the Recovery Point Objective (RPO).
• Almost zero downtime.
[Diagram: the Master (Read, Write) replicates unidirectionally to the Slave; on fail over, the Slave serves both Read and Write.]

Fail Back
• The Slave cluster usually has minimal processing capabilities, which makes Fail Back an important requirement.
• The original Master comes alive with the latest data.
• Ensure removal of stale data which was not replicated to the Slave.
• Reverse replicate the delta of data loaded into the Slave after Fail Over.
[Diagram: on fail back, the Master again serves Read and Write and replicates unidirectionally to the Slave, which serves Read.]

Replication at Hive-Scale
• Event based replication.
• First version of Hive Replication (Replv1) uses EXPORT-IMPORT semantics to replicate data.
– Inefficient mechanism.
– 4X copy problem.
– Rubber-banding issue.
– Depends on external tools such as Falcon/Oozie to manage replication state.
• Second version of Hive Replication (Replv2) uses REPL commands.
– Point-in-time replication.
– Reduces the number of copies.
– Hive maintains the replicated state.
– Additional support for replicating functions and constraints.

Event Logging
[Diagram: a JDBC/ODBC client runs a query on HiveServer2; the Hive Metastore manages metadata and records events in the Events Table of the Metastore RDBMS.]

Event Logging
• Capture events: Create/Alter/Drop on DB/Table/Partition/Function/Constraint objects.
• Events are stored in the Metastore RDBMS.
• Each event is self-contained, so it can recover the state of the object (metadata + data).
• Events are serialized using a sequence number (event id).
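The event log described above can be pictured with a small in-memory sketch. It is only an illustration: the `Event` and `EventLog` names, the field names and the `payload` dictionary are hypothetical and do not reflect the actual Metastore schema.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional


@dataclass
class Event:
    event_id: int             # sequence number that orders the events
    operation: str            # e.g. "CREATE", "ALTER", "DROP"
    object_type: str          # e.g. "DB", "TABLE", "PARTITION"
    object_name: str
    payload: Dict[str, str]   # hypothetical stand-in for metadata + data file info


class EventLog:
    """In-memory stand-in for the events table in the Metastore RDBMS."""

    def __init__(self) -> None:
        self._ids = count(1)              # event ids start at 1 and only grow
        self._events: List[Event] = []

    def capture(self, operation: str, object_type: str, object_name: str,
                payload: Optional[Dict[str, str]] = None) -> Event:
        """Record one event and assign it the next sequence number."""
        event = Event(next(self._ids), operation, object_type,
                      object_name, dict(payload or {}))
        self._events.append(event)
        return event

    def newer_than(self, event_id: int) -> List[Event]:
        """Return all events with an id greater than event_id, in order."""
        return [e for e in self._events if e.event_id > event_id]
```

Because every event carries its own id, a consumer only needs to remember the last id it processed in order to ask for everything that happened afterwards.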

Event Based Replication
[Diagram. Master Cluster: HiveServer2 runs REPL DUMP, which serializes the batch of new events from the Events Table in the Metastore RDBMS into a dump (metadata + data) on HDFS. Slave Cluster: HiveServer2 runs REPL LOAD, which reads the repl dump dir, copies the data files with DistCp to HDFS, and uses the Metastore API to write objects into the Metastore RDBMS.]

Event Based Replication
• Read a batch of events from the Metastore RDBMS in the generated sequence.
• "repl dump <db name> from <event id>"
– Gets events newer than <event id>.
– Includes data file information.
– "<event id>" is the last replicated event id for the DB, obtained from the destination cluster.
• "repl load <db name> from <hdfs URI>"
– Applies the events on the destination.
• State is currently replicated in batches; this can be optimized in the future.
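The dump/load cycle above can be sketched as a toy simulation. This is not Hive code: events are reduced to hypothetical `(event_id, operation, table)` tuples, and the `Destination` class and the two operation names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Hypothetical simplification: an event is (event_id, operation, table_name).
Event = Tuple[int, str, str]


@dataclass
class Destination:
    """Toy model of the destination database and its replication state."""
    tables: Set[str] = field(default_factory=set)
    last_event_id: int = 0    # last replicated event id


def repl_dump(source_events: List[Event], from_event_id: int) -> List[Event]:
    """Return the events newer than from_event_id, ordered by event id."""
    return sorted(e for e in source_events if e[0] > from_event_id)


def repl_load(destination: Destination, batch: List[Event]) -> None:
    """Apply a dumped batch in order and record the last applied event id."""
    for event_id, operation, table in batch:
        if operation == "CREATE_TABLE":
            destination.tables.add(table)
        elif operation == "DROP_TABLE":
            destination.tables.discard(table)
        else:
            raise ValueError(f"unsupported operation: {operation}")
        destination.last_event_id = event_id
```

Each cycle passes the destination's `last_event_id` to `repl_dump` and feeds the result to `repl_load`, so a second cycle with no new source events produces an empty batch.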

Change Management
• Consider replicating the following batch:
– Insert to table
– Drop table
• The inserted files are still needed for replication after the drop.
• A trash-like directory (CM dir) captures such files.
• Use the checksum to verify a file; if it does not match, look the file up in the CM dir using the checksum.
• Necessary for ordered replication: the state in the destination DB corresponds to the state in the source some duration X back.
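The checksum lookup described above might look roughly like the following sketch. It makes two assumptions that are not stated on the slide: SHA-256 is used as the checksum, and a backed-up file is stored in the CM dir under its checksum as the file name. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Checksum of a file's content (SHA-256 is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def resolve_data_file(original: Path, expected_checksum: str, cm_dir: Path) -> Path:
    """Return a path whose content matches the checksum recorded in the event.

    The original location is used when the file is still there and unchanged.
    Otherwise the file is looked up in the CM dir by its checksum.
    """
    if original.is_file() and file_checksum(original) == expected_checksum:
        return original
    backup = cm_dir / expected_checksum   # assumed naming scheme inside the CM dir
    if backup.is_file():
        return backup
    raise FileNotFoundError(
        f"{original} changed or was removed, and no copy exists in {cm_dir}"
    )
```

With this lookup, an insert event can still be replayed after a later drop removed the table's files, as long as the dropped files were moved into the CM dir and have not yet expired.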

Bootstrapping
• What about data generated before event capturing was enabled?
• Bootstrapping uses the same repl dump/load commands, but is not event based.
• Incremental replication catches up with the events generated during bootstrap, making the changes consistent with the state of the source at a time X in the past.
• Optimized for large databases.
• Parallel dump of a large number of partitions.
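The parallel partition dump can be illustrated with a thread pool. This is a generic sketch, not Hive's implementation: `dump_partitions` and the `dump_one` callback are hypothetical, and the default of 5 workers simply mirrors the `hive.repl.partitions.dump.parallelism=5` setting listed in the references.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def dump_partitions(partitions: Iterable[str],
                    dump_one: Callable[[str], str],
                    parallelism: int = 5) -> Dict[str, str]:
    """Dump partitions concurrently and map each partition to its dump result.

    dump_one is a caller-supplied function that dumps a single partition and
    returns, for example, the location it wrote to.
    """
    names = list(partitions)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Executor.map yields results in the same order as the inputs.
        results = list(pool.map(dump_one, names))
    return dict(zip(names, results))
```

A call such as `dump_partitions(["ds=2017-01-01", "ds=2017-01-02"], my_dump_function)` would run up to five dumps at a time, which is where the benefit for tables with many partitions comes from.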

REPL Commands
• REPL DUMP <db-name> [FROM <start-evid> [TO <end-evid>] [LIMIT <num-evids>] ];
– Execute this command in the source cluster.
– REPL DUMP <db-name>; bootstraps the whole database.
– REPL DUMP <db-name> FROM <start-evid>; replicates all events after start-evid.
– REPL DUMP <db-name> FROM <start-evid> TO <end-evid>; replicates a range of events.
– REPL DUMP <db-name> FROM <start-evid> LIMIT <num-evids>; replicates a limited set of events.
• REPL LOAD <db-name> FROM <dump-dir>;
– dump-dir is the HDFS URI returned by the REPL DUMP command.
– Execute this command in the destination cluster.
• REPL STATUS <db-name>;
– Execute this command in the destination cluster.
– Gets the last replicated state of the database in the destination, which should be the input to REPL DUMP as start-evid.
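As a small illustration of the syntax above, the following helpers assemble the three statements as strings. The helper names are hypothetical, and passing the dump directory as a single-quoted string literal is an assumption; the slide shows only the bare `<dump-dir>` placeholder.

```python
from typing import Optional


def repl_dump(db: str, start_evid: Optional[int] = None,
              end_evid: Optional[int] = None,
              limit: Optional[int] = None) -> str:
    """Build a REPL DUMP statement; without start_evid it is a bootstrap dump."""
    if start_evid is None:
        if end_evid is not None or limit is not None:
            raise ValueError("TO and LIMIT are only valid together with FROM")
        return f"REPL DUMP {db};"
    statement = f"REPL DUMP {db} FROM {start_evid}"
    if end_evid is not None:
        statement += f" TO {end_evid}"
    if limit is not None:
        statement += f" LIMIT {limit}"
    return statement + ";"


def repl_load(db: str, dump_dir: str) -> str:
    """Build a REPL LOAD statement (quoting of dump_dir is an assumption)."""
    return f"REPL LOAD {db} FROM '{dump_dir}';"


def repl_status(db: str) -> str:
    """Build a REPL STATUS statement, to be run on the destination cluster."""
    return f"REPL STATUS {db};"
```

One incremental cycle would then run `repl_status` on the destination, pass the returned event id to `repl_dump` on the source, and hand the returned dump directory to `repl_load` on the destination.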

Demonstration

Cloud Migration Challenges
• Move is expensive
– Cloud file systems have implemented "move" as "copy".
– Repl load does an atomic move/rename from a temp directory to the warehouse location.
– ACID and micro-managed tables can potentially help avoid the move operation.
• On-Prem to Cloud
– Run distcp from the on-prem cluster to avoid resource overhead on the cloud.
– Need to depend on the checksum of the source files to verify the copied files.
• Cloud to Cloud
– Optimize distcp to use vendor-specific tools to copy between cloud file systems.
– Checksums are not consistent or available on all file systems.

Future Work
• Replicate ACID/Micro-managed tables.
• Replication to/from cloud storage such as S3 or WASB.
• Hot Data Replication.
• Faster Bootstrapping.
• Optimize Fail Back.
• Replicate column statistics, indexes, etc.
• Table level replication.

References: REPL Configurations
• hive.metastore.transactional.event.listeners=org.apache.hive.hcatalog.listener.DbNotificationListener (Capture events)
• hive.repl.rootdir (Root directory used by repl dump)
• hive.metastore.dml.events=true (Enable event generation for DML operations)
• hive.repl.cm.enabled=true (Change Manager to be enabled in source cluster)
• hive.repl.cm.retain=24hr (Expiry time for CM backed-up data files)
• hive.repl.cm.interval=3600s (Time interval to validate expired data files in CM)
• hive.repl.cmrootdir (Root directory for Change Manager)
• hive.repl.replica.functions.root.dir (Root directory to store UDFs/UDAFs jars)
• hive.repl.approx.max.load.tasks=1000 (Limit the DAG size to control the memory consumption)
• hive.repl.partitions.dump.parallelism=5 (Number of threads to concurrently dump partitions)

References: Hive Doc
https://cwiki.apache.org/confluence/display/Hive/Home
https://cwiki.apache.org/confluence/display/Hive/HiveReplicationv2Development
https://cwiki.apache.org/confluence/display/Hive/HiveReplicationDevelopment
https://cwiki.apache.org/confluence/display/Hive/Replication
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
https://issues.apache.org/jira/browse/HIVE-14841

THANK YOU!

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 



Disaster Recovery and Cloud Migration for your Apache Hive Warehouse

  • 1. 1 © Hortonworks Inc. Confidential 2011 – 2017. All Rights Reserved DISASTER RECOVERY AND CLOUD MIGRATION FOR YOUR APACHE HIVE WAREHOUSE Sankar Hariappan Senior Software Engineer, Hortonworks
  • 2. Agenda
      - About Apache Hive
      - Disaster Recovery
      - Replication Modes
      - Fail Over
      - Fail Back
      - Replication at Hive-Scale
      - Event Based Replication
      - Change Management
      - Bootstrapping
      - REPL Commands
      - Demonstration
      - Cloud Migration Challenges
      - Future Work
  • 3. About Apache Hive
      - Data warehouse tool built on top of Apache Hadoop.
      - Handle data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
      - Manage large datasets residing in distributed storage.
      - SQL with Hive specific extensions.
      - Query optimization powered by Apache Calcite and execution via Apache Tez, Apache Spark, or MapReduce.
      - Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
      - Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
  • 4. Apache Hive Architecture
      [Diagram labels: JDBC/ODBC Driver, Hive Thrift Server, HiveServer2, Compiler, Optimizer, Executor, Hive Metastore, MS Client, RDBMS, HDFS, YARN]
  • 5. Disaster Recovery
      - Deployment of clusters in more than one data center for business continuity or geo localization.
      - Hybrid cloud deployment for off-premise processing.
      - Robust replication solution to achieve seamless disaster recovery.
          - Prevent severe data loss.
          - Eliminate single point of failure.
          - Fault-tolerant.
  • 6. Replication Modes
      - Master-Slave [diagram: unidirectional replication from Master (Read, Write) to Slave (Read)]
      - Master-Master [diagram: bidirectional replication between two Masters (Read, Write each)]
  • 7. Replication Modes
      - Hub and Spoke pattern [diagram: one Master (Read, Write) and four Slaves (Read)]
      - Relay pattern [diagram: Master (Read, Write), Slave (Read), Slave (Read)]
  • 8. Fail Over
      - The Slave takes over the Master's responsibilities instantaneously.
      - Ensure business continuity with minimal data loss based on Recovery Point Objective (RPO).
      - Almost zero down-time.
      [Diagram: unidirectional replication from Master (Read, Write) to Slave; after fail over the Slave serves Read and Write]
  • 9. Fail Back
      - The Slave cluster usually has minimal processing capabilities, which makes Fail Back an important requirement.
      - Original Master comes alive with latest data.
      - Ensure removal of stale data which was not replicated to the Slave.
      - Reverse replicate the delta of data loaded into the Slave after Fail Over.
      [Diagram: Master (Read, Write) and Slave (Read), unidirectional replication, fail back to the Master]
  • 10. Replication at Hive-Scale
      - Event based replication.
      - First version of Hive Replication (Replv1) uses EXPORT-IMPORT semantics to replicate data.
          - Inefficient mechanism.
          - 4X copy problem.
          - Rubber-banding issue.
          - Depends on external tools such as Falcon/Oozie to manage replication state.
      - Second version of Hive Replication (Replv2) uses REPL commands.
          - Point-in-time replication.
          - Reduces the number of copies.
          - Hive maintains the replicated state.
          - Additional support for replicating functions and constraints.
  • 11. Event Logging
      [Diagram labels: JDBC/ODBC (Runs Query), HiveServer2, Hive Metastore (Manage Metadata), Metastore RDBMS, Events Table]
  • 12. Event Logging
      - Capture event: Create/Alter/Drop on DB/Table/Partition/Function/Constraint objects.
      - Stored in Metastore RDBMS.
      - Event is self-contained to recover the state of the object (metadata + data).
      - Events are serialized using sequence number (event id).
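The event log described on this slide can be pictured with a small in-memory model. This is only a sketch: the `Event` and `EventLog` names, the field layout, and the use of a Python list in place of the events table in the Metastore RDBMS are all assumptions made for illustration, not Hive's actual schema.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    event_id: int            # sequence number that orders the events
    operation: str           # e.g. "CREATE", "ALTER", "DROP"
    object_type: str         # e.g. "DB", "TABLE", "PARTITION"
    object_name: str
    payload: Dict[str, str]  # metadata (and data file info) needed to replay the event


class EventLog:
    """Hypothetical stand-in for the events table kept in the Metastore RDBMS."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self._events: List[Event] = []

    def record(self, operation: str, object_type: str, object_name: str,
               payload: Optional[Dict[str, str]] = None) -> Event:
        # Every captured change gets the next sequence number.
        event = Event(next(self._next_id), operation, object_type,
                      object_name, dict(payload or {}))
        self._events.append(event)
        return event

    def newer_than(self, event_id: int) -> List[Event]:
        # Events are returned in the order in which they were generated.
        return [e for e in self._events if e.event_id > event_id]
```

Because each event carries its own payload, a consumer that knows the last event id it applied can ask for everything newer and replay it in order.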
  • 13. Event Based Replication
      [Diagram: on the Master Cluster, HiveServer2 runs REPL DUMP, serializes the new events batch from the Events Table in the Metastore RDBMS, and writes the dump (metadata + data) to HDFS. On the Slave Cluster, HiveServer2 runs REPL LOAD, reads the repl dump dir, writes objects through the Metastore API into the Metastore RDBMS, and copies the data files to HDFS with Distcp.]
  • 14. Event Based Replication
      - Read batch of events from the Metastore RDBMS in the generated sequence.
      - "repl dump <db name> from <event id>"
          - get events newer than <event id>.
          - includes data files information.
          - "<event id>" is last replicated event id for DB from the destination cluster
      - "repl load <db name> from <hdfs URI>"
          - apply the events on destination
      - State replicated in batches currently, can be optimized in future
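The dump-and-load cycle on this slide can be sketched as a pair of functions over plain Python data. Everything here is a simplified assumption: events are reduced to `(event_id, operation, table)` tuples, the destination is a dictionary, and the function names merely echo the REPL commands without calling Hive.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified event shape: (event_id, operation, table_name).
Event = Tuple[int, str, str]


def repl_dump(source_events: List[Event], from_event_id: int) -> List[Event]:
    """Return the events newer than from_event_id, ordered by event id."""
    return sorted(e for e in source_events if e[0] > from_event_id)


def repl_load(destination: Dict[str, object], batch: List[Event]) -> None:
    """Apply a dumped batch in order and remember the last applied event id."""
    tables = destination["tables"]
    for event_id, operation, table in batch:
        if operation == "CREATE_TABLE":
            tables.add(table)
        elif operation == "DROP_TABLE":
            tables.discard(table)
        destination["last_repl_id"] = event_id


def replicate_once(source_events: List[Event], destination: Dict[str, object]) -> int:
    """One incremental cycle: dump from the destination's last id, then load."""
    batch = repl_dump(source_events, destination["last_repl_id"])
    repl_load(destination, batch)
    return len(batch)
```

The destination's last applied event id is the only state the loop needs, which mirrors the slide's point that the starting event id comes from the destination cluster.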
  • 15. Change Management
      - Replicating the following batch:
          - Insert to table
          - Drop table
      - Need inserted files after drop for replication
      - Trash-like directory for capturing such files (CM dir)
      - Use checksum to verify file, else look up from CM dir using checksum
      - Necessary for ordered replication: state in destination DB would correspond to state in source X duration back.
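The checksum-then-CM-dir lookup on this slide might look roughly like the following. It is a sketch under stated assumptions: the warehouse and the CM dir are modelled as dictionaries rather than HDFS paths, SHA-256 stands in for whatever checksum Hive actually records, and `resolve_file` and `drop_table_files` are hypothetical helper names.

```python
import hashlib
from typing import Dict, Optional


def checksum(data: bytes) -> str:
    # SHA-256 is an assumption for illustration; it is not claimed to be what Hive uses.
    return hashlib.sha256(data).hexdigest()


def resolve_file(path: str, expected: str,
                 warehouse: Dict[str, bytes],
                 cm_dir: Dict[str, bytes]) -> Optional[bytes]:
    """Return the file content an event refers to, or None if it cannot be recovered."""
    data = warehouse.get(path)
    if data is not None and checksum(data) == expected:
        return data              # file is still in place and unchanged
    return cm_dir.get(expected)  # otherwise fall back to the CM dir, keyed by checksum


def drop_table_files(prefix: str, warehouse: Dict[str, bytes],
                     cm_dir: Dict[str, bytes]) -> None:
    """On drop, move the table's files into the CM dir instead of deleting them."""
    for path in [p for p in warehouse if p.startswith(prefix)]:
        data = warehouse.pop(path)
        cm_dir[checksum(data)] = data
```

With this arrangement, an insert event that is replicated after the table was dropped can still find the inserted file by its checksum.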
  • 16. Bootstrapping
      - What about data generated before event capturing was enabled?
      - Bootstrapping - Uses same repl dump/load commands, but is not event based
      - Incremental replication catches up with events during bootstrap to make change consistent with state of source at time X in past.
      - Optimized for large database.
      - Parallel dump of large number of partitions.
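The parallel partition dump mentioned on this slide can be illustrated with a thread pool. This is a minimal sketch, not Hive's implementation: `dump_partition` is a hypothetical placeholder for the per-partition work, and the default of 5 workers simply mirrors the value the deck lists for hive.repl.partitions.dump.parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def dump_partition(partition: str) -> Dict[str, str]:
    # Placeholder for writing one partition's metadata and data file list.
    return {"partition": partition, "status": "dumped"}


def bootstrap_dump(partitions: List[str], parallelism: int = 5) -> List[Dict[str, str]]:
    """Dump all partitions using a bounded number of worker threads."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Executor.map returns results in the same order as the input.
        return list(pool.map(dump_partition, partitions))
```

Bounding the pool size keeps the dump from overwhelming the Metastore or the file system when a table has a very large number of partitions.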
  • 17. REPL Commands
      - REPL DUMP <db-name> [FROM <start-evid> [TO <end-evid>] [LIMIT <num-evids>] ];
          - Execute this command in source cluster.
          - REPL DUMP <db-name>; bootstrap the whole database.
          - REPL DUMP <db-name> FROM <start-evid>; to replicate all events after start-evid.
          - REPL DUMP <db-name> FROM <start-evid> TO <end-evid>; to replicate a range of events.
          - REPL DUMP <db-name> FROM <start-evid> LIMIT <num-evids>; to replicate a limited set of events.
      - REPL LOAD <db-name> FROM <dump-dir>;
          - dump-dir is the HDFS URI returned by REPL DUMP command.
          - Execute this command in destination cluster.
      - REPL STATUS <db-name>;
          - Execute this command in destination cluster.
          - Gets the last replicated state of the database in destination which should be the input for REPL DUMP as start-evid.
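To make the optional clauses of the grammar above concrete, the following sketch builds the statement text for each command. The helper names are hypothetical, the functions only produce strings and do not talk to Hive, and quoting the dump directory as a string literal is an assumption that the slide itself does not show.

```python
from typing import Optional


def repl_dump(db: str, start: Optional[int] = None,
              end: Optional[int] = None, limit: Optional[int] = None) -> str:
    """Build a REPL DUMP statement following the grammar shown on the slide."""
    if start is None:
        if end is not None or limit is not None:
            raise ValueError("TO and LIMIT are only valid together with FROM")
        return f"REPL DUMP {db};"  # no FROM clause: bootstrap the whole database
    stmt = f"REPL DUMP {db} FROM {start}"
    if end is not None:
        stmt += f" TO {end}"
    if limit is not None:
        stmt += f" LIMIT {limit}"
    return stmt + ";"


def repl_load(db: str, dump_dir: str) -> str:
    # Assumption: the dump directory is passed as a quoted string literal.
    return f"REPL LOAD {db} FROM '{dump_dir}';"


def repl_status(db: str) -> str:
    return f"REPL STATUS {db};"
```

A typical cycle would run the REPL STATUS text on the destination, feed the returned id into `repl_dump` as `start` on the source, and pass the dump directory returned by the dump to `repl_load` on the destination.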
  • 18. Demonstration
  • 19. Cloud Migration Challenges
      - Move is expensive
          - Cloud file systems have implemented "move" as "copy".
          - Repl load does atomic move/rename from temp directory to warehouse location.
          - ACID and micro-managed tables can potentially help avoid the move operation.
      - On-Prem to Cloud
          - Shall run distcp from on-prem cluster to avoid resource overhead on cloud.
          - Need to depend on checksum of source files to verify the copied files.
      - Cloud to Cloud
          - Optimize distcp to use vendor-specific tool to copy between cloud file systems.
          - Checksum is not consistent/available on all file systems.
  • 20. Future Work
      - Replicate ACID/Micro-managed tables.
      - Replication to/from cloud storage such as S3 or WASB etc.
      - Hot Data Replication.
      - Faster Bootstrapping
      - Optimize Fail Back.
      - Replicate Column Statistics, Index etc
      - Table level replication.
  • 21. References: REPL Configurations
      - hive.metastore.transactional.event.listeners=org.apache.hive.hcatalog.listener.DbNotificationListener (Capture events)
      - hive.repl.rootdir (Root directory used by repl dump)
      - hive.metastore.dml.events=true (Enable event generation for DML operations)
      - hive.repl.cm.enabled=true (Change Manager to be enabled in source cluster)
      - hive.repl.cm.retain=24hr (Expiry time for CM backed-up data files)
      - hive.repl.cm.interval=3600s (Time interval to validate expired data files in CM)
      - hive.repl.cmrootdir (Root directory for Change Manager)
      - hive.repl.replica.functions.root.dir (Root directory to store UDFs/UDAFs jars)
      - hive.repl.approx.max.load.tasks=1000 (Limit the DAG size to control the memory consumption)
      - hive.repl.partitions.dump.parallelism=5 (Number of threads to concurrently dump partitions)
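As an illustration of how the settings above could be written out, this sketch renders them in the `<configuration>`/`<property>` layout used by Hadoop-style site files such as hive-site.xml. Only the properties for which the slide gives a value are included; the directory properties are left out because the deck does not state their values. The `to_hive_site` helper is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Values copied from the slide; properties without a stated value are omitted.
REPL_SETTINGS: Dict[str, str] = {
    "hive.metastore.transactional.event.listeners":
        "org.apache.hive.hcatalog.listener.DbNotificationListener",
    "hive.metastore.dml.events": "true",
    "hive.repl.cm.enabled": "true",
    "hive.repl.cm.retain": "24hr",
    "hive.repl.cm.interval": "3600s",
    "hive.repl.approx.max.load.tasks": "1000",
    "hive.repl.partitions.dump.parallelism": "5",
}


def to_hive_site(settings: Dict[str, str]) -> str:
    """Render settings as <configuration><property>...</property></configuration>."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `to_hive_site(REPL_SETTINGS)` yields one XML string that can be reviewed before any of it is merged into a real cluster configuration.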
  • 22. References: Hive Doc
      - https://cwiki.apache.org/confluence/display/Hive/Home
      - https://cwiki.apache.org/confluence/display/Hive/HiveReplicationv2Development
      - https://cwiki.apache.org/confluence/display/Hive/HiveReplicationDevelopment
      - https://cwiki.apache.org/confluence/display/Hive/Replication
      - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
      - https://issues.apache.org/jira/browse/HIVE-14841
  • 23. THANK YOU!