
Running Apache Spark & Apache Zeppelin in Production


  1. Title: Running Apache Spark & Apache Zeppelin in Production. Vinay Shukla, Director, Product Management. August 31, 2016. Twitter: @neomythos
  2. Who am I? (© Hortonworks Inc. 2011–2016. All Rights Reserved.) Vinay Shukla: Product Management; Spark for 2.5+ years, Hadoop for 3+ years; recovering programmer; blogs at www.vinayshukla.com; Twitter: @neomythos; addicted to yoga, hiking, and coffee; smallest contributor to Apache Zeppelin.
  3. Security: Rings of Defense. Perimeter-level security: network security (e.g. firewalls). OS security. Authentication: Kerberos, Knox (and other gateways). Authorization: Apache Ranger/Sentry, HDFS permissions, HDFS ACLs, YARN ACLs. Data protection: wire encryption, HDFS TDE/DARE, and others.
  4. Interacting with Spark: clients such as Zeppelin, spark-shell, the Spark Thrift Server, and a REST server each run a Spark driver that launches executors on Spark on YARN.
  5. Context: Spark Deployment Modes. Spark on YARN runs in two modes: in yarn-cluster mode the Spark driver (SparkContext) runs inside the YARN ApplicationMaster; in yarn-client mode the driver runs locally in the client process. The Spark shell and the Spark Thrift Server run in yarn-client mode only.
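The two modes above differ only in where the driver runs. A minimal sketch, using the stock SparkPi example class (the jar path may differ per install):

```shell
# yarn-cluster: the driver (SparkContext) runs inside the YARN ApplicationMaster;
# the client can disconnect after submission.
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10

# yarn-client: the driver runs in the local client process; this is what
# interactive tools (spark-shell, Spark Thrift Server) require.
./bin/spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10
```

In cluster mode, driver logs end up in the YARN container logs rather than on the client console.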
  6. Spark on YARN: John Doe submits a job with spark-submit to the YARN ResourceManager (1), which launches the Spark ApplicationMaster (2); the AM requests containers from the NodeManagers (3), which launch executors (4); executors read their data from HDFS.
  7. Spark Security: Four Pillars. Authentication, authorization, audit, and encryption. Spark leverages Kerberos on YARN; also ensure the network itself is secure.
  8. Kerberos authentication within Spark: (1) John Doe runs kinit against the KDC, (2) gets a service ticket for Spark, (3) uses the Spark service ticket to submit the Spark job, (4) Spark gets a NameNode (NN) service ticket, (5) YARN launches Spark executors using John Doe's identity, (6, 7) executors read from HDFS using John Doe's delegation token.
  9. Spark and Kerberos, an example:
      kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
      ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster \
        --num-executors 3 --driver-memory 512m --executor-memory 512m \
        --executor-cores 1 lib/spark-examples*.jar 10
  10. Spark Authorization: John Doe gets a service ticket for Spark from the KDC and submits the Spark job to the YARN cluster; Apache Ranger checks whether John may launch the job. Spark then gets a NameNode (NN) service ticket, and the executors read files (A, B, C) from HDFS; Ranger checks whether John may read each file.
  11. Encryption: Spark Communication Channels. The channels to protect: control/RPC between spark-submit, the RM, the AM/driver, and the NMs; shuffle data between executors and the shuffle service; shuffle block transfer; data source read/write; and the file system channel used for broadcast variables and file download.
  12. Spark Communication Encryption Settings:
      – Control/RPC: spark.authenticate = true (leverages YARN to distribute the keys).
      – Shuffle data: spark.authenticate.enableSaslEncryption = true.
      – Shuffle block transfer (NM to executor): leverages YARN-based SSL.
      – Data source read/write: depends on the data source; for HDFS, RPC encryption (RC4 | 3DES), or SSL for WebHDFS.
      – FS (broadcast, file download): spark.ssl.enabled = true.
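The settings above can be collected in spark-defaults.conf. A sketch, assuming the keystore and truststore have already been provisioned (the paths shown are placeholders):

```properties
# Control/RPC authentication; YARN distributes the shared secret
spark.authenticate                       true
# SASL encryption for shuffle data
spark.authenticate.enableSaslEncryption  true
# SSL for the file server / broadcast channel
spark.ssl.enabled                        true
spark.ssl.keyStore                       /etc/spark/conf/keystore.jks
spark.ssl.trustStore                     /etc/spark/conf/truststore.jks
```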
  13. Sharp Edges with Spark Security:
      – SparkSQL has only coarse-grained access control today.
      – Client > Spark Thrift Server > Spark executors: no identity propagation on the second hop. This lowers security and forces STS to run as the hive user to read all data. Workaround: use SparkSQL via the shell or the programmatic API. See https://issues.apache.org/jira/browse/SPARK-5159.
      – Spark Streaming + Kafka + Kerberos: issues fixed in HDP 2.4.x; no SSL support yet.
      – Spark shuffle: only SASL, no SSL support; no encryption for spill-to-disk or intermediate data.
      – Fine-grained security for SparkSQL: http://bit.ly/2bLghGz and http://bit.ly/2bTX7Pm
  14. Apache Zeppelin Security (section divider).
  15. Apache Zeppelin: Authentication + SSL. John Doe connects to Zeppelin over SSL through the firewall (1), Zeppelin authenticates him against LDAP (2), and Zeppelin runs his work on executors on Spark on YARN (3).
  16. Zeppelin + Livy End-to-End Security: John Doe authenticates to Zeppelin against LDAP; Zeppelin calls the Livy APIs using SPNEGO (Kerberos) via the Spark group interpreter; Livy submits to Spark on YARN over Kerberos/RPC, and the job runs as John Doe.
  17. Apache Zeppelin: Authorization.
      – Note-level authorization: grant permissions (owner, reader, writer) to users/groups on notes; LDAP group integration.
      – Zeppelin UI authorization: allow only admins to configure interpreters, configured in shiro.ini:
        [urls]
        /api/interpreter/** = authc, roles[admin]
        /api/configurations/** = authc, roles[admin]
        /api/credential/** = authc, roles[admin]
      – Authorization at the data level: for Spark, Zeppelin > Livy > Spark with identity propagation, so jobs run as the end user; for Hive, via the JDBC interpreter; the shell interpreter also runs as the end user.
  18. Apache Zeppelin: Credentials. Interpreter credentials (e.g. an LDAP/AD account per interpreter) are not solved yet; this is still an open issue. Zeppelin leverages the Hadoop Credential API.
  19. Apache Zeppelin: AD Authentication. Zeppelin leverages Apache Shiro for authentication/authorization. Configure Zeppelin to authenticate users in /etc/zeppelin/conf/shiro.ini:
      [urls]
      /api/version = anon
      /** = authc
  20. Apache Zeppelin: AD Authentication (continued).
      1. Create an entry for the AD credential (skip this step if securing the LDAP password is not a concern). Zeppelin leverages the Hadoop Credential API:
         hadoop credential create activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks
         Then chmod 400 the file so that only the Zeppelin process can read it; no other user is allowed access.
      2. Configure Zeppelin to use AD in shiro.ini:
         activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
         activeDirectoryRealm.systemUsername = CN=Administrator,CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
         #activeDirectoryRealm.systemPassword = Password1!
         activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://etc/zeppelin/conf/credentials.jceks
         activeDirectoryRealm.searchBase = CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
         activeDirectoryRealm.url = ldap://ad-nano.qe.hortonworks.com:389
         activeDirectoryRealm.groupRolesMap = "CN=admin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"admin","CN=finance,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"finance","CN=zeppelin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"zeppelin"
         activeDirectoryRealm.authorizationCachingEnabled = true
  21. Spark Performance (section divider).
  22. Question #1: big or small executors? How should --num-executors, --executor-cores, and --executor-memory be chosen?
      spark-submit --master yarn --deploy-mode client \
        --num-executors ? --executor-cores ? --executor-memory ? \
        --class MySimpleApp mySimpleApp.jar arg1 arg2
  23. The setup: the test cluster has three nodes, each with 24 cores and 96 GB of memory (72 cores and 288 GB in total).
  24. Benchmark with different combinations, using 18 cores and 108 GB of memory per node (E stands for executor; lower runtimes are better):
      Cores per E | Memory per E (GB) | E per node
               1  |                6  |        18
               2  |               12  |         9
               3  |               18  |         6
               6  |               36  |         3
               9  |               54  |         2
              18  |              108  |         1
      Except with too-small or too-big executors, performance is relatively close.
  25. Big or small executors? Avoid both extremes: small executors decrease CPU and memory efficiency, while big executors introduce heavy GC overhead. Usually 3–6 cores and 10–40 GB of memory per executor is a preferable choice.
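Given the per-node budget from the benchmark (18 cores and 108 GB per node), picking a cores-per-executor value in the recommended range fixes the rest of the layout. A small sketch of the arithmetic:

```shell
# Per-node budget after leaving headroom for the OS and other services
NODE_CORES=18
NODE_MEM_GB=108

# Pick cores per executor within the recommended 3-6 range
CORES_PER_EXECUTOR=6

# Executors per node and memory per executor follow from the budget
EXECUTORS_PER_NODE=$(( NODE_CORES / CORES_PER_EXECUTOR ))
MEM_PER_EXECUTOR_GB=$(( NODE_MEM_GB / EXECUTORS_PER_NODE ))

echo "${EXECUTORS_PER_NODE} executors per node, ${MEM_PER_EXECUTOR_GB} GB each"
# prints: 3 executors per node, 36 GB each
```

This reproduces the 6-core row of the benchmark table: 3 executors per node with 36 GB each.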
  26. Anything else? Executor memory != container memory: container memory = executor memory + overhead memory (10% of executor memory). Leave some resources for the OS and other services. Enable CPU scheduling if you want to constrain CPU usage; see http://hortonworks.com/blog/managing-cpu-resources-in-y
  27. Multi-tenancy for Spark (cluster resource utilization):
      – Leverage YARN queues: set user quotas, and set a default YARN queue in spark-defaults; users can override it per job.
      – Leverage dynamic resource allocation: specify the range of executors a job may use; this requires the external shuffle service.
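The queue and dynamic-allocation settings above live in spark-defaults.conf. A sketch, with a hypothetical queue name and executor range:

```properties
# Default YARN queue; users can override per job with --queue
spark.yarn.queue                      analytics
# Let YARN scale executors up and down with the workload
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  1
spark.dynamicAllocation.maxExecutors  20
# Dynamic allocation requires the external shuffle service
spark.shuffle.service.enabled         true
```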
  28. Thank You. Vinay Shukla, @neomythos
