EMR Zeppelin & Livy Demystified


  1. EMR Zeppelin & Livy - AWS Big Data Demystified. Omid Vahdaty, Big Data Ninja
  2. Agenda
     ● What is Zeppelin?
     ● Motivation
     ● Features
     ● Performance
     ● Demo
  3. Zeppelin: a completely open, web-based notebook that enables interactive data analytics. Apache Zeppelin is a multi-purpose web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features.
  4. Zeppelin out-of-the-box features
     ● Web-based GUI
     ● Supported languages:
       ○ Spark SQL
       ○ PySpark
       ○ Scala
       ○ SparkR
       ○ JDBC (Redshift, Athena, Presto, MySQL, ...)
       ○ Bash
     ● Visualization
     ● Users, sharing and collaboration
     ● Advanced security features
     ● Built-in AWS S3 support
     ● Orchestration
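As a minimal sketch of how these languages are selected per paragraph in a notebook (the bucket path and table name below are hypothetical placeholders, not from the deck):

```
%sh
# Shell paragraph: inspect raw data on S3 (hypothetical bucket)
aws s3 ls s3://my-bucket/events/ | head

%sql
-- Spark SQL paragraph: aggregate and let Zeppelin chart the result
SELECT event_date, COUNT(*) AS events
FROM my_events
GROUP BY event_date
ORDER BY event_date

%pyspark
# PySpark paragraph: the same table as a DataFrame
df = spark.table("my_events")
df.groupBy("event_date").count().show()
```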
  5. Why Zeppelin?
     ● Sexy look and feel of a SQL web client
     ● Back up your SQL automatically via S3
     ● Share and collaborate on your notebooks
     ● Orchestration & scheduler for your nightly jobs
     ● Combine system commands + SQL + Scala Spark visualization
     ● Advanced security features
     ● Combine all the databases you need in one place, including data transfer
     ● Get one step closer to PySpark, Scala and SparkR
     ● Visualize your data easily
  6. Getting started - provisioning EMR
     ● Zeppelin is installed on the master node of the EMR cluster (choose the right installation for you)
     ● Don't forget to add the AWS Glue connectors
     ● Don't forget to add Spark ...
     ● https://zeppelin.apache.org/docs/0.7.3/
     ● ML notebook example with Zeppelin:
       https://raw.githubusercontent.com/hortonworks-gallery/zeppelin-notebooks/hdp-2.6/2CCBNZ5YY/note.json
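A sketch of provisioning such a cluster from the AWS CLI. The cluster name, release label, key pair, subnet and bucket are placeholders; the "spark-hive-site" classification is the usual way to point Spark at the Glue Data Catalog on EMR, but verify it against the EMR docs for your release:

```
# Sketch: EMR cluster with Spark + Zeppelin and the Glue Data Catalog enabled for Spark SQL
aws emr create-cluster \
  --name "zeppelin-demo" \
  --release-label emr-5.20.0 \
  --applications Name=Spark Name=Zeppelin \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key,SubnetId=subnet-12345678 \
  --configurations '[{
      "Classification": "spark-hive-site",
      "Properties": {
        "hive.metastore.client.factory.class":
          "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }
    }]' \
  --log-uri s3://my-bucket/emr-logs/ \
  --use-default-roles
```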
  7. Interpreter
     ● The concept of a Zeppelin interpreter allows any language/data-processing backend to be plugged into Zeppelin. Currently, Zeppelin supports many interpreters such as Scala (with Apache Spark), Python (with Apache Spark), Spark SQL, JDBC, Markdown, Shell and so on.
     ● SparkContext, SQLContext and ZeppelinContext are automatically created and exposed as the variables sc, sqlContext and z, respectively, in the Scala, Python and R environments. Starting from 0.6.1, SparkSession is available as the variable spark when you are using Spark 2.x.
     ● https://zeppelin.apache.org/docs/latest/manual/interpreters.html
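A small sketch of using those automatically exposed variables in a Scala (%spark) paragraph; the view name is arbitrary:

```
%spark
// sc, sqlContext and z are created by Zeppelin; spark (SparkSession) is available with Spark 2.x
val df = spark.range(0, 1000).toDF("id")

// register a temp view and query it through the shared session
df.createOrReplaceTempView("numbers")
val counted = spark.sql("SELECT COUNT(*) AS n FROM numbers")

// z (ZeppelinContext) renders a DataFrame as a Zeppelin table/chart
z.show(counted)
```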
  8. Binding modes
     1. In Scoped mode, Zeppelin still runs a single interpreter JVM process, but a separate Interpreter Group serves each Note.
     2. In Shared mode, a single JVM process and a single Interpreter Group serve all Notes.
     3. Isolated mode runs a separate interpreter process for each Note, so each Note has a completely isolated session.
  9. Binding modes (diagram)
  10. Binding modes (diagram)
  11. Binding modes - Shared mode
     In Shared mode, a single JVM process and a single session serve all notes. As a result, note A can directly access variables (e.g. Python, Scala, ...) created from other notes.
  12. Binding modes - Scoped mode
     In Scoped mode, Zeppelin still runs a single interpreter JVM process but, in the case of per-note scope, each note runs in its own dedicated session. (Note: it is still possible to share objects between these notes via the ResourcePool.)
  13. Binding modes - Isolated mode
     Isolated mode runs a separate interpreter process for each note in the case of per-note scope. So each note has a completely isolated session. (But it is still possible to share objects via the ResourcePool, as in the sketch below.)
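A minimal sketch of sharing an object through the ResourcePool using ZeppelinContext's put/get; the key name and value are arbitrary examples:

```
%spark
// Note A: publish a value to the interpreter group's ResourcePool
z.put("trainingCutoff", "2018-01-01")

%spark
// Note B (separate session in scoped/isolated mode): read it back from the ResourcePool
val cutoff = z.get("trainingCutoff").toString
println(s"cutoff = $cutoff")
```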
  14. When to use each binding mode?
     ● Isolated mode means higher resource utilization but fewer options to share objects.
     ● In Scoped mode, each note has its own Scala REPL, so a variable defined in one note cannot be read or overridden in another note. However, a single SparkContext still serves all the sessions: all jobs are submitted to this SparkContext and the fair scheduler schedules them. This is useful when users do not want to share a Scala session but do want to keep a single Spark application and leverage its fair scheduler.
     ● In Shared mode, one SparkContext and one Scala REPL are shared among all interpreters in the group, so every note shares a single SparkContext and a single Scala REPL.
  15. Import/Export notebooks
     ● You can import/export notebooks from a URL, from local disk, or from Zeppelin storage: S3 and Git.
     ● Zeppelin S3 storage notes:
       ○ Need to import from local disk the first time
       ○ You can use roles to provide access to S3 instead of an access key / secret key
       ○ Each notebook is saved on S3 under a specific path (see docs)
       ○ Can't open directly from S3 - bug?
       ○ Yes, you can use S3 encryption...
  16. Zeppelin storage on S3 (use a role instead of access key/secret key)
     https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/
     {
       "Classification": "zeppelin-env",
       "Properties": {},
       "Configurations": [
         {
           "Classification": "export",
           "Properties": {
             "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
             "ZEPPELIN_NOTEBOOK_S3_BUCKET": "my-zeppelin-bucket-name",
             "ZEPPELIN_NOTEBOOK_USER": "user"
           },
           "Configurations": []
         }
       ]
     }
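This classification is typically passed at cluster creation. A sketch, assuming the JSON above is wrapped in a JSON array and saved locally as zeppelin-s3-storage.json (a hypothetical file name); the EC2 instance profile used by the cluster then needs read/write access to the bucket, which is the "role instead of keys" point above:

```
aws emr create-cluster \
  --release-label emr-5.20.0 \
  --applications Name=Spark Name=Zeppelin \
  --configurations file://zeppelin-s3-storage.json \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles
```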
  17. Advanced security
     ● Basic authentication (via Apache Shiro): user management (user, password, groups), even LDAP
       https://zeppelin.apache.org/docs/0.7.3/security/shiroauthentication.html
     ● Notebook permissions management: read/write/share
       https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html
     ● Data source authorization (e.g. 3rd-party DBs):
       https://zeppelin.apache.org/docs/0.7.3/security/datasource_authorization.html
     ● Zeppelin with Kerberos:
       https://zeppelin.apache.org/docs/latest/interpreter/spark.html#setting-up-zeppelin-with-kerberos
  18. HTTPS/SSL
     ● You can use an SSH tunnel, as used for the EMR web UIs (secured by default)
     ● Authentication and SSL via nginx (see the sketch below):
       ○ https://zeppelin.apache.org/docs/0.7.3/security/authentication.html#http-basic-authentication-using-nginx
       ○ https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_zeppelin-component-guide/content/config-ssl-zepp.html
     ● You can add an ELB in front of EMR, listening on 443 and forwarding to 8890, to reach the Zeppelin GUI via HTTPS
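A minimal nginx sketch for basic auth plus TLS in front of Zeppelin on its default EMR port 8890. The server name, certificate paths and htpasswd file are placeholders; see the linked docs for the full setup:

```
# /etc/nginx/conf.d/zeppelin.conf (sketch)
server {
    listen 443 ssl;
    server_name zeppelin.example.com;

    ssl_certificate     /etc/nginx/ssl/zeppelin.crt;
    ssl_certificate_key /etc/nginx/ssl/zeppelin.key;

    location / {
        auth_basic           "Zeppelin";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:8890;
        # Zeppelin's UI uses websockets for live paragraph updates
        proxy_http_version   1.1;
        proxy_set_header     Upgrade $http_upgrade;
        proxy_set_header     Connection "upgrade";
    }
}
```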
  19. User management
     In order to manage groups/roles, you can create them under the "[roles]" section of the "shiro.ini" file. For example, a set of groups like:
     ```
     [roles]
     admin = *
     readonly = *
     poweruser = *
     scientist = *
     engineer = *
     ```
  20. User management
     Then the "[users]" section could look like the following:
     ```
     [users]
     admin = <password>, admin
     user1 = <password>, scientist, poweruser
     user2 = <password>, engineer, poweruser
     user3 = <password>, readonly
     ```
  21. User management
     The above means that:
     - user "admin" is in the "admin" group;
     - user "user1" is in the "poweruser" and "scientist" groups;
     - etc.
     Once the groups/roles are created, the authorization setup is similar to what is described in https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html . For instance, on a notebook's permission page you can put the group name instead of individual users:
     ```
     Owners  admin
     Writers scientist, engineer, poweruser
     Readers readonly
     ```
  22. Orchestration & scheduling
     You can go to any Zeppelin notebook and click the clock icon to set up scheduling using cron. You can use http://www.cronmaker.com/ to generate the cron expression for the interval you are interested in.
  23. Orchestration & scheduling
     You can run any job you have permission for and see its status.
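Zeppelin's scheduler takes a Quartz-style cron expression (second, minute, hour, day-of-month, month, day-of-week). As an example sketch, this expression would run a notebook nightly at 03:00:

```
# second minute hour day-of-month month day-of-week
0 0 3 * * ?
```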
  24. Bootstrapping EMR Zeppelin
     ● To launch an EMR cluster with a pre-defined notebook, we can use Amazon S3 for persistent storage of the notebook plus EMR steps, since EMR bootstrap actions run before Zeppelin is installed on the cluster.
     ● sudo aws s3 cp s3://<my bucket name>/<location>/zeppelin-site.xml /etc/zeppelin/conf/
     ● aws s3 cp /etc/zeppelin/conf.dist/shiro.ini s3://my-zeppelin/config/
     ● sudo stop zeppelin
     ● sudo start zeppelin
  25. Apache Livy: a REST API to manage Spark jobs
     ● Interactive Scala, Python and R shells
     ● Batch submissions in Scala, Java, Python
     ● Multiple users can share the same Zeppelin server (impersonation support)
     ● Can be used for submitting jobs from anywhere via REST
     ● Does not require any code changes to your programs
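A sketch of that REST interaction, assuming Livy is listening on the EMR master's default port 8998 and "emr-master" is a placeholder hostname:

```
# Create an interactive Scala (Spark) session
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"kind": "spark"}' \
  http://emr-master:8998/sessions

# Submit a statement to session 0 ...
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"code": "sc.parallelize(1 to 100).sum()"}' \
  http://emr-master:8998/sessions/0/statements

# ... and poll its result
curl -s http://emr-master:8998/sessions/0/statements/0
```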
  26. Livy + Zeppelin use case
     Multi-tenant users/jobs:
     ● Sharing of a Spark context across multiple Zeppelin instances.
     ● When the Zeppelin server runs with authentication enabled, the Livy interpreter propagates the user identity to the Spark job so that the job runs as the originating user. This is especially useful when multiple users are expected to connect to the same set of data repositories within an enterprise.
  27. EMR bootstrap of Zeppelin in an EMR step
     If you want, you can automate the above process by using an EMR step: a simple shell script that downloads your zeppelin-site.xml file from S3 onto your EMR cluster and restarts the Zeppelin service (a sketch is shown below). To run it, copy the script to an S3 bucket and then use the script-runner.jar process with the script's S3 location as its only argument. To do this via the AWS EMR console:
     1 - Under the "Add steps (optional)" section, select "Custom JAR" for the "Step type" and click the "Configure" button.
     2 - In the pop-up window, for us-east-1 the JAR location for script-runner.jar is: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
     3 - For the argument, pass in the S3 bucket and location of the "setupZeppelin.sh" file, e.g.: s3://mybucket/mylocation/setupZeppelin.sh
     Once done, click "Add" and continue with your EMR cluster creation (this process is included when cloning an EMR cluster).
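The referenced script is not included in this deck; a minimal sketch of what setupZeppelin.sh could look like, assuming the S3 path is a placeholder and that Zeppelin is managed by upstart on the EMR release in use (as on slide 24):

```
#!/bin/bash
# setupZeppelin.sh (sketch): pull a custom zeppelin-site.xml from S3 and restart Zeppelin.
set -euo pipefail

# Placeholder: replace with your own bucket/prefix
CONFIG_S3_PATH="s3://my-bucket/my-location/zeppelin-site.xml"

sudo aws s3 cp "$CONFIG_S3_PATH" /etc/zeppelin/conf/zeppelin-site.xml

# Restart the Zeppelin service (upstart-managed on this EMR release)
sudo stop zeppelin || true
sudo start zeppelin
```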
  28. Livy + Zeppelin architecture (diagram)
  29. Resources
     ● https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
     ● https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/usage/interpreter/interpreter_binding_mode.html
     ● https://aws.amazon.com/blogs/big-data/import-zeppelin-notes-from-github-or-json-in-zeppelin-0-5-6-on-amazon-emr/
     ● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-local-git-repository
     ● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-s3
     ● Encryption on S3: https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#data-encryption-in-s3
     ● Scheduler in Zeppelin: https://community.hortonworks.com/questions/98101/scheduler-in-zeppelin.html
     ● https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_zeppelin-component-guide/content/zepp-with-spark.html
     ● https://zeppelin.apache.org/docs/0.6.1/interpreter/livy.html
     ● https://hortonworks.com/blog/recent-improvements-apache-zeppelin-livy-integration/
     ● https://www.slideshare.net/HadoopSummit/apache-zeppelin-livy-bringing-multi-tenancy-to-interactive-data-analysis
