4. Approach
1. Created a table in HBase
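A minimal sketch of the table-creation step, using the HBase 1.x client Admin API from Scala (the table name `dummy_table` and column family `cf` are placeholder names, not from the original; assumes `hbase-site.xml` is on the classpath):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object CreateDummyTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()            // picks up hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin
    // Placeholder table name and column family
    val descriptor = new HTableDescriptor(TableName.valueOf("dummy_table"))
    descriptor.addFamily(new HColumnDescriptor("cf"))
    if (!admin.tableExists(descriptor.getTableName)) admin.createTable(descriptor)
    connection.close()
  }
}
```

The same table could equally be created interactively with `create 'dummy_table', 'cf'` in the HBase shell.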
2. Generated dummy data in Spark as an RDD and saved it to HBase
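The generate-and-save step can be sketched as follows, assuming Spark 1.x and the HBase `TableOutputFormat` (row keys `row-<n>` and the `cf:value` column are illustrative placeholders):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object GenerateDummyData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-dummy-writer"))

    // Point the Hadoop output format at the target HBase table
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "dummy_table")
    val job = Job.getInstance(hbaseConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // 1,000 dummy rows: row key "row-<n>", single column cf:value
    val dummy = sc.parallelize(1 to 1000).map { n =>
      val put = new Put(Bytes.toBytes(s"row-$n"))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(n.toString))
      (new ImmutableBytesWritable, put)
    }
    dummy.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}
```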
3. Read the HBase table as a HadoopRDD in Spark
4. Traversed the HadoopRDD elements and printed them out
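Steps 3 and 4 together can be sketched as a `newAPIHadoopRDD` read over `TableInputFormat` (table and column names are the same placeholders as above):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object ReadHBaseTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-reader"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "dummy_table")

    // Each element is a (rowKey, Result) pair
    val hbaseRDD = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Traverse the elements and print one column value per row
    hbaseRDD.foreach { case (key, result) =>
      val rowKey = Bytes.toString(key.get())
      val value  = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value")))
      println(s"$rowKey -> $value")
    }
    sc.stop()
  }
}
```

Note that `foreach` prints on the executors; for a small table, `hbaseRDD.take(n)` on the driver makes the output visible in the submitting terminal.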
5. Coded the entire Spark transformation in Scala
6. Generated an executable uber ("fat") JAR with SBT
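A typical `build.sbt` for this setup, assuming the sbt-assembly plugin (version numbers and artifact names here are illustrative, not from the original):

```scala
// build.sbt — assumes project/plugins.sbt contains:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
name := "spark-hbase-demo"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // "provided" keeps Spark itself out of the uber JAR, since the cluster supplies it
  "org.apache.spark" %% "spark-core"   % "1.6.3" % "provided",
  "org.apache.hbase" %  "hbase-client" % "1.2.6",
  "org.apache.hbase" %  "hbase-server" % "1.2.6"
)
```

Running `sbt assembly` then produces the uber JAR under `target/scala-2.11/`.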
7. Submitted the Spark job with spark-submit
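A representative submit command for the assembled JAR (class name, master, and JAR path are placeholders):

```shell
# Submit the uber JAR built by sbt assembly; names/paths are illustrative
spark-submit \
  --class com.example.ReadHBaseTable \
  --master yarn \
  target/scala-2.11/spark-hbase-demo-assembly-0.1.jar
```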
8. TODO
1. Read the HBase table and store the HBaseRDD in Apache Hive with Spark.
a. Incremental load from HBase to Hive
b. SparkSQL / Spark RDD and Apache Drill performance benchmark for reading HBase.
2. Save summarized Spark aggregations to PostgreSQL
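The planned HBase-to-Hive step (TODO 1) could be sketched with a Spark 1.x `HiveContext`; the sample rows, column names, and Hive table name below are all hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

object HBaseToHive {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-to-hive"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Placeholder: in practice these (rowKey, value) pairs come from the HBase read above
    val rows = sc.parallelize(Seq(("row-1", "1"), ("row-2", "2")))
    val df = rows.toDF("row_key", "value")

    // Append mode lets repeated runs serve as a crude incremental load
    df.write.mode(SaveMode.Append).saveAsTable("hbase_snapshot")
    sc.stop()
  }
}
```

A true incremental load would additionally need a watermark (e.g. an HBase timestamp range scan) to avoid re-reading old rows.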
9. Assumed: High-Level Architecture
The architecture diagram (figure omitted) comprises the following components:
a. Event queuing: Kafka
b. Data ingestion pipeline: batch incremental load / transformations
c. Processed DW: ad-hoc analysis / 1st-level aggregation
d. Summarized DB: KPI reporting
e. RESTful APIs
f. Reporting tool (dashboard)