SlideShare ist ein Scribd-Unternehmen logo
1 von 29
EMR Zeppelin & Livy
AWS BIG DATA demystified
Omid Vahdaty, Big Data Ninja
Agenda
● What is Zeppelin?
● Motivation?
● Features?
● Performance?
● Demo?
Zeppelin
A completely open web-based notebook that enables interactive data analytics. Apache Zeppelin is a new and incubating multi-
purposed web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features
Zeppelin out of the box features
● Web Based GUI.
● Supported languages
○ Spark SQL
○ PySpark
○ Scala
○ SparkR
○ JDBC (Redshift,Athena, Presto,MySql ...)
○ Bash
● Visualization
● Users, Sharing and Collaboration
● Advanced Security features
● Built in AWS S3 support
● Orchestration
Why Zeppelin?
● Sexy Look and Feel of any SQL web client
● Backup your SQL easily automatically via S3
● Share and collaborate your notebooks
● Orchestration & Scheduler for your nightly job
● Combine system commands + sql + scala spark visualization.
● Advanced Security features
● Combine all the DB’s you need in one place including data transfer.
● Get one step closer to pyspark and scala and sparkR
● Visualize your data easily.
Getting started - Provisioning EMR
● Zeppelin is installed on the master node of the EMR cluster ( choose the right
installation for you )
● Don't forget to add the AWS glue connectors
● Dont forget to add Spark …
● https://zeppelin.apache.org/docs/0.7.3/
● ML notebook example with zeppelin
● https://raw.githubusercontent.com/hortonworks-gallery/zeppelin-notebooks/hdp-2.6/2CCBNZ5YY/note.json
Interpreter
● The concept of Zeppelin interpreter allows any language/data-processing-backend to be plugged into Zeppelin. Currently, Zeppelin
supports many interpreters such as Scala ( with Apache Spark ), Python ( with Apache Spark ), Spark SQL, JDBC, Markdown, Shell
and so on.
● SparkContext, SQL context , Zeppelin cotext Z SparkContext, SQLContext and ZeppelinContext are automatically
created and exposed as variable names sc, sqlContext and z, respectively, in Scala, Python and R environments. Staring from 0.6.1
SparkSession is available as variable spark when you are using Spark 2.x.
● https://zeppelin.apache.org/docs/latest/manual/interpreters.html
Binding modes
1. In Scoped mode, Zeppelin still runs single interpreter JVM process but multiple
Interpreter Group serve each Note
2. In Shared mode, single JVM process and single Interpreter Group serves all
Notes.
3. Isolated mode runs separate interpreter process for each Note. So, each Note
have absolutely isolated session.
Binding modes
Binding modes
Binding modes - share mode
In Shared mode, single JVM process
and a single session serves all notes.
As a result, note A can access
variables (e.g python, scala, ..) directly
created from other notes..
Binding modes - scoped mode
In Scoped mode, Zeppelin still runs a
single interpreter JVM process but, in
the case of per note scope, each note
runs in its own dedicated session. (Note
it is still possible to share objects
between these notes via ResourcePool)
Binding modes - Isolated mode
Isolated mode runs a separate
interpreter process for each note in the
case of per note scope. So, each note
has an absolutely isolated session. (But
it is still possible to share objects via
ResourcePool)
When to use each binding mode?
● Isolated means high utilization of resources but less availability to share
options to share objects
● In Scoped mode, each note has its own Scala REPL. So variable defined in a
note can not be read or overridden in another note. However, a single
SparkContext still serves all the sessions. And all the jobs are submitted to this
SparkContext and the fair scheduler schedules the jobs. This could be useful
when user does not want to share Scala session, but want to keep single
Spark application and leverage its fair scheduler.
● In Shared mode, a SparkContext and a Scala REPL is being shared among all
interpreters in the group. So every note will be sharing single SparkContext and
single Scala REPL
Import/Export Notebooks
● U can import /export notebooks into from Url, local disk or Zeppelin Storage: S3 and GI
● Zeppelin storage s3 notes.
○ Need to import from local disk the first time
○ U can use roles to provide access to S3 instead of access key / secret key
○ Each notebook is saved on s3 in a specific path (see docs)
○ Can’t open directly from S3- bug?
○ Yes, you can use encryption of S3...
Zeppelin storage s3 (use role instead of accesskey/secretkey)
https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/
{
"Classification": "zeppelin-env",
"Properties": {
},
"Configurations": [
{
"Classification": "export",
"Properties": {
"ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
"ZEPPELIN_NOTEBOOK_S3_BUCKET":"my-zeppelin-bucket-name",
"ZEPPELIN_NOTEBOOK_USER":"user"
},
"Configurations": [
]
}
Advanced Security
● Basic authentication (via Apache SHIRO): user management
(user,pass,groups), even LDAP
https://zeppelin.apache.org/docs/0.7.3/security/shiroauthentication.html
● notebook permissions management: read/write/share
https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html
● Data source authorization (e.g 3rd party DB):
https://zeppelin.apache.org/docs/0.7.3/security/datasource_authorization.html
● Zeppelin with kerberos
● https://zeppelin.apache.org/docs/latest/interpreter/spark.html#setting-up-zeppelin-with-kerberos
HTTPS/SSL
● You can use a tunnel as used in EMR GUI websites (secured by default)
● Authentication and SSL via nginx
○ https://zeppelin.apache.org/docs/0.7.3/security/authentication.html#http-basic-authentication-using-nginx
○ https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_zeppelin-component-guide/content/config-ssl-zepp.html
● you can add ELB on top of EMR , in 443, out 8890 for the zeppelin gui via
HTTPS
User management
Now in order to manage groups/roles, you could create the groups/roles under the
"[roles]" section in the "shiro.ini" file. For example, I could have a set of groups like:
[roles]
admin = *
readonly = *
poweruser = *
scientist = *
engineer = *
```
User management
Then in the "[users]" sections, it could be looking like the below:
```
[users]
admin = password>, admin
user1 = password>, scientist, poweruser
user2 = password>, engineer, poweruser
user3 = password>, readonly
```
User management
For example, the above means that:
- user "admin" is in "admin" group;
- user "user1" is in "poweruser" and "scientist" group
- etc.
Once the groups/roles are created, the authorization setting will be similar to what described in
https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html . For instance, when in a notebook permission
page, you can put the group name, instead of the individual users:
```
Owners admin
Writers scientist,engineer,poweruser
Readers readonly
```
Orchestration & Scheduling
You can go to any Zeppelin notebook and click on clock icon to setup scheduling
using CRON. You can use this link to generate the CRON expression for the time
interested - http://www.cronmaker.com/.
Orchestration & Scheduling
You can ran any job if our have permission and see their status
Bootstrapping EMR zeppelin
● For launching EMR cluster with a pre-defined notebook, we can make use of
Amazon S3 for persistent storage of the notebook and EMR steps since EMR
Bootstrap Actions are run before Zeppelin is installed on the cluster.
● sudo aws s3 cp s3://<my bucket name>/<location>/zeppelin-site.xml
/etc/zeppelin/conf/
● aws s3 cp /etc/zeppelin/conf.dist/shiro.ini s3://my-zeppelin/config/
● sudo stop zeppelin
● sudo start zeppelin
Apache Livy
rest api to manage spark jobs
● Interactive Scala, Python and R shells
● Batch submissions in Scala, Java, Python
● Multi users can share the same zeppelin server (impersonation support)
● Can be used for submitting jobs from anywhere with REST
● Does not require any code change to your programs
Livy + Zeppelin use case
Multi tenant users/jobs:
● Sharing of Spark context across multiple Zeppelin instances.
● When the Zeppelin server runs with authentication enabled, the Livy interpreter
propagates user identity to the Spark job so that the job runs as the originating
user. This is especially useful when multiple users are expected to connect to
the same set of data repositories within an enterprise.
EMR bootstrap of zeppelin in an EMR STEP
If you want, you can automate the above process by using an EMR step. Please find attached a simple shell script that will download your
zeppelin-site.xml file from S3 onto your EMR cluster and restart the Zeppelin service.
To run it, simply copy the script to an S3 bucket and then use the script-runner.jar process as outlined in [2] below with the script s3 location as
its only argument.
To do this via the AWS EMR Console:
1 - Under the "Add steps (optional)" section, select "Custom JAR" for the "Step type" and click the "Configure" button.
2 - In the pop-up window, for us-east-1 the JAR location for script-runner.jar is:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
3 - For the argument, you would pass in your S3 bucket and location of the "setupZeppelin.sh" file e.g.,:
s3://mybucket/mylocation/setupZeppelin.sh
Once done, click "Add" and continue on with your EMR cluster creation (this process is included when cloning an EMR cluster).
Livy + Zeppelin Architecture
Resources
● https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
● https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/usage/interpreter/interpreter_binding_mode.html
● https://aws.amazon.com/blogs/big-data/import-zeppelin-notes-from-github-or-json-in-zeppelin-0-5-6-on-amazon-emr/
● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-local-git-repository
● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-s3
● encryption on s3 :https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#data-encryption-in-s3
● Reference - https://community.hortonworks.com/questions/98101/scheduler-in-zeppelin.html.
● https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_zeppelin-component-guide/content/zepp-with-spark.html
● https://zeppelin.apache.org/docs/0.6.1/interpreter/livy.html
● https://hortonworks.com/blog/recent-improvements-apache-zeppelin-livy-integration/
● https://www.slideshare.net/HadoopSummit/apache-zeppelin-livy-bringing-multi-tenancy-to-interactive-data-analysis

Weitere ähnliche Inhalte

Was ist angesagt?

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in sparkPeng Cheng
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Standalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and PerformanceStandalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and PerformanceRomi Kuntsman
 
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaAbhinav Singh
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Spark Summit
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
 

Was ist angesagt? (20)

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Standalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and PerformanceStandalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and Performance
 
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache Kafka
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 

Ähnlich wie Emr zeppelin & Livy demystified

Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
Introduction to node.js
Introduction to node.jsIntroduction to node.js
Introduction to node.jsSu Zin Kyaw
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1datamantra
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1AjayRawat971036
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Faster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverFaster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverDatabricks
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Openstack India May Meetup
Openstack India May MeetupOpenstack India May Meetup
Openstack India May MeetupDeepak Garg
 
New c sharp4_features_part_vi
New c sharp4_features_part_viNew c sharp4_features_part_vi
New c sharp4_features_part_viNico Ludwig
 
Cloud computing & lamp applications
Cloud computing & lamp applicationsCloud computing & lamp applications
Cloud computing & lamp applicationsCorley S.r.l.
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @WindwardDemi Ben-Ari
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Clusterphanleson
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2datamantra
 

Ähnlich wie Emr zeppelin & Livy demystified (20)

Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Introduction to node.js
Introduction to node.jsIntroduction to node.js
Introduction to node.js
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Faster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverFaster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-Jobserver
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
AWS Elastic Compute Cloud (EC2)
AWS Elastic Compute Cloud (EC2) AWS Elastic Compute Cloud (EC2)
AWS Elastic Compute Cloud (EC2)
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Openstack India May Meetup
Openstack India May MeetupOpenstack India May Meetup
Openstack India May Meetup
 
New c sharp4_features_part_vi
New c sharp4_features_part_viNew c sharp4_features_part_vi
New c sharp4_features_part_vi
 
Cloud computing & lamp applications
Cloud computing & lamp applicationsCloud computing & lamp applications
Cloud computing & lamp applications
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @Windward
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Cluster
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
 
Final Report - Spark
Final Report - SparkFinal Report - Spark
Final Report - Spark
 

Mehr von Omid Vahdaty

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedOmid Vahdaty
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...Omid Vahdaty
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...Omid Vahdaty
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedOmid Vahdaty
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...Omid Vahdaty
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
AWS Big Data Demystified #4 data governance demystified [security, networ...
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...Omid Vahdaty
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis Omid Vahdaty
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo dbOmid Vahdaty
 

Mehr von Omid Vahdaty (20)

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
AWS Big Data Demystified #4 data governance demystified [security, networ...
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo db
 

Kürzlich hochgeladen

College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 

Kürzlich hochgeladen (20)

College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 

Emr zeppelin & Livy demystified

  • 1. EMR Zeppelin & Livy AWS BIG DATA demystified Omid Vahdaty, Big Data Ninja
  • 2. Agenda ● What is Zeppelin? ● Motivation? ● Features? ● Performance? ● Demo?
  • 3. Zeppelin A completely open web-based notebook that enables interactive data analytics. Apache Zeppelin is a new and incubating multi- purposed web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features
  • 4. Zeppelin out of the box features ● Web Based GUI. ● Supported languages ○ Spark SQL ○ PySpark ○ Scala ○ SparkR ○ JDBC (Redshift,Athena, Presto,MySql ...) ○ Bash ● Visualization ● Users, Sharing and Collaboration ● Advanced Security features ● Built in AWS S3 support ● Orchestration
  • 5. Why Zeppelin? ● Sexy Look and Feel of any SQL web client ● Backup your SQL easily automatically via S3 ● Share and collaborate your notebooks ● Orchestration & Scheduler for your nightly job ● Combine system commands + sql + scala spark visualization. ● Advanced Security features ● Combine all the DB’s you need in one place including data transfer. ● Get one step closer to pyspark and scala and sparkR ● Visualize your data easily.
  • 6. Getting started - Provisioning EMR ● Zeppelin is installed on the master node of the EMR cluster ( choose the right installation for you ) ● Don't forget to add the AWS glue connectors ● Dont forget to add Spark … ● https://zeppelin.apache.org/docs/0.7.3/ ● ML notebook example with zeppelin ● https://raw.githubusercontent.com/hortonworks-gallery/zeppelin-notebooks/hdp-2.6/2CCBNZ5YY/note.json
  • 7. Interpreter ● The concept of Zeppelin interpreter allows any language/data-processing-backend to be plugged into Zeppelin. Currently, Zeppelin supports many interpreters such as Scala ( with Apache Spark ), Python ( with Apache Spark ), Spark SQL, JDBC, Markdown, Shell and so on. ● SparkContext, SQL context , Zeppelin cotext Z SparkContext, SQLContext and ZeppelinContext are automatically created and exposed as variable names sc, sqlContext and z, respectively, in Scala, Python and R environments. Staring from 0.6.1 SparkSession is available as variable spark when you are using Spark 2.x. ● https://zeppelin.apache.org/docs/latest/manual/interpreters.html
  • 8. Binding modes 1. In Scoped mode, Zeppelin still runs single interpreter JVM process but multiple Interpreter Group serve each Note 2. In Shared mode, single JVM process and single Interpreter Group serves all Notes. 3. Isolated mode runs separate interpreter process for each Note. So, each Note have absolutely isolated session.
  • 11. Binding modes - share mode In Shared mode, single JVM process and a single session serves all notes. As a result, note A can access variables (e.g python, scala, ..) directly created from other notes..
  • 12. Binding modes - scoped mode In Scoped mode, Zeppelin still runs a single interpreter JVM process but, in the case of per note scope, each note runs in its own dedicated session. (Note it is still possible to share objects between these notes via ResourcePool)
  • 13. Binding modes - Isolated mode Isolated mode runs a separate interpreter process for each note in the case of per note scope. So, each note has an absolutely isolated session. (But it is still possible to share objects via ResourcePool)
  • 14. When to use each binding mode? ● Isolated means high utilization of resources but less availability to share options to share objects ● In Scoped mode, each note has its own Scala REPL. So variable defined in a note can not be read or overridden in another note. However, a single SparkContext still serves all the sessions. And all the jobs are submitted to this SparkContext and the fair scheduler schedules the jobs. This could be useful when user does not want to share Scala session, but want to keep single Spark application and leverage its fair scheduler. ● In Shared mode, a SparkContext and a Scala REPL is being shared among all interpreters in the group. So every note will be sharing single SparkContext and single Scala REPL
  • 15. Import/Export Notebooks ● U can import /export notebooks into from Url, local disk or Zeppelin Storage: S3 and GI ● Zeppelin storage s3 notes. ○ Need to import from local disk the first time ○ U can use roles to provide access to S3 instead of access key / secret key ○ Each notebook is saved on s3 in a specific path (see docs) ○ Can’t open directly from S3- bug? ○ Yes, you can use encryption of S3...
  • 16. Zeppelin storage s3 (use role instead of accesskey/secretkey) https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/ { "Classification": "zeppelin-env", "Properties": { }, "Configurations": [ { "Classification": "export", "Properties": { "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo", "ZEPPELIN_NOTEBOOK_S3_BUCKET":"my-zeppelin-bucket-name", "ZEPPELIN_NOTEBOOK_USER":"user" }, "Configurations": [ ] }
  • 17. Advanced Security ● Basic authentication (via Apache SHIRO): user management (user,pass,groups), even LDAP https://zeppelin.apache.org/docs/0.7.3/security/shiroauthentication.html ● notebook permissions management: read/write/share https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html ● Data source authorization (e.g 3rd party DB): https://zeppelin.apache.org/docs/0.7.3/security/datasource_authorization.html ● Zeppelin with kerberos ● https://zeppelin.apache.org/docs/latest/interpreter/spark.html#setting-up-zeppelin-with-kerberos
  • 18. HTTPS/SSL ● You can use a tunnel as used in EMR GUI websites (secured by default) ● Authentication and SSL via nginx ○ https://zeppelin.apache.org/docs/0.7.3/security/authentication.html#http-basic-authentication-using-nginx ○ https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_zeppelin-component-guide/content/config-ssl-zepp.html ● you can add ELB on top of EMR , in 443, out 8890 for the zeppelin gui via HTTPS
  • 19. User management Now in order to manage groups/roles, you could create the groups/roles under the "[roles]" section in the "shiro.ini" file. For example, I could have a set of groups like: [roles] admin = * readonly = * poweruser = * scientist = * engineer = * ```
  • 20. User management Then in the "[users]" sections, it could be looking like the below: ``` [users] admin = password>, admin user1 = password>, scientist, poweruser user2 = password>, engineer, poweruser user3 = password>, readonly ```
  • 21. User management For example, the above means that: - user "admin" is in "admin" group; - user "user1" is in "poweruser" and "scientist" group - etc. Once the groups/roles are created, the authorization setting will be similar to what described in https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html . For instance, when in a notebook permission page, you can put the group name, instead of the individual users: ``` Owners admin Writers scientist,engineer,poweruser Readers readonly ```
  • 22. Orchestration & Scheduling You can go to any Zeppelin notebook and click on clock icon to setup scheduling using CRON. You can use this link to generate the CRON expression for the time interested - http://www.cronmaker.com/.
  • 23. Orchestration & Scheduling You can ran any job if our have permission and see their status
  • 24. Bootstrapping EMR zeppelin ● For launching EMR cluster with a pre-defined notebook, we can make use of Amazon S3 for persistent storage of the notebook and EMR steps since EMR Bootstrap Actions are run before Zeppelin is installed on the cluster. ● sudo aws s3 cp s3://<my bucket name>/<location>/zeppelin-site.xml /etc/zeppelin/conf/ ● aws s3 cp /etc/zeppelin/conf.dist/shiro.ini s3://my-zeppelin/config/ ● sudo stop zeppelin ● sudo start zeppelin
  • 25. Apache Livy rest api to manage spark jobs ● Interactive Scala, Python and R shells ● Batch submissions in Scala, Java, Python ● Multi users can share the same zeppelin server (impersonation support) ● Can be used for submitting jobs from anywhere with REST ● Does not require any code change to your programs
  • 26. Livy + Zeppelin use case Multi tenant users/jobs: ● Sharing of Spark context across multiple Zeppelin instances. ● When the Zeppelin server runs with authentication enabled, the Livy interpreter propagates user identity to the Spark job so that the job runs as the originating user. This is especially useful when multiple users are expected to connect to the same set of data repositories within an enterprise.
  • 27. EMR bootstrap of zeppelin in an EMR STEP If you want, you can automate the above process by using an EMR step. Please find attached a simple shell script that will download your zeppelin-site.xml file from S3 onto your EMR cluster and restart the Zeppelin service. To run it, simply copy the script to an S3 bucket and then use the script-runner.jar process as outlined in [2] below with the script s3 location as its only argument. To do this via the AWS EMR Console: 1 - Under the "Add steps (optional)" section, select "Custom JAR" for the "Step type" and click the "Configure" button. 2 - In the pop-up window, for us-east-1 the JAR location for script-runner.jar is: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar 3 - For the argument, you would pass in your S3 bucket and location of the "setupZeppelin.sh" file e.g.,: s3://mybucket/mylocation/setupZeppelin.sh Once done, click "Add" and continue on with your EMR cluster creation (this process is included when cloning an EMR cluster).
  • 28. Livy + Zeppelin Architecture
  • 29. Resources ● https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555 ● https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/usage/interpreter/interpreter_binding_mode.html ● https://aws.amazon.com/blogs/big-data/import-zeppelin-notes-from-github-or-json-in-zeppelin-0-5-6-on-amazon-emr/ ● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-local-git-repository ● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-s3 ● encryption on s3 :https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#data-encryption-in-s3 ● Reference - https://community.hortonworks.com/questions/98101/scheduler-in-zeppelin.html. ● https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_zeppelin-component-guide/content/zepp-with-spark.html ● https://zeppelin.apache.org/docs/0.6.1/interpreter/livy.html ● https://hortonworks.com/blog/recent-improvements-apache-zeppelin-livy-integration/ ● https://www.slideshare.net/HadoopSummit/apache-zeppelin-livy-bringing-multi-tenancy-to-interactive-data-analysis