Zeppelin and Spark SQL Demystified


  1. Zeppelin and Spark SQL: AWS Big Data Demystified
     Omid Vahdaty, Big Data Ninja
  2. Agenda
     ● What is Zeppelin?
     ● What is Spark SQL?
     ● Motivation
     ● Features
     ● Performance
     ● Demo
  3. Zeppelin
     A completely open, web-based notebook that enables interactive data
     analytics. Apache Zeppelin is a multi-purpose web-based notebook that
     brings together data ingestion, data exploration, visualization,
     sharing, and collaboration features.
  4. Zeppelin out-of-the-box features
     ● Web-based GUI
     ● Supported languages/interpreters:
        ○ Spark SQL
        ○ PySpark
        ○ Scala
        ○ SparkR
        ○ JDBC (Redshift, Athena, Presto, MySQL, ...)
        ○ Bash
     ● Visualization
     ● Users, sharing, and collaboration
     ● Advanced security features
     ● Built-in AWS S3 support
     ● Orchestration
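As a concrete sketch of how these interpreters combine, a single Zeppelin note might mix paragraphs like the following. The `%sh` and `%spark.sql` prefixes are Zeppelin's interpreter directives; the bucket and table names here are made up for illustration:

```
%sh
# shell paragraph: peek at the raw input (bucket name is hypothetical)
aws s3 ls s3://my-bucket/raw/ | head

%spark.sql
-- SQL paragraph: the result set renders as a table or chart in the notebook
SELECT region, COUNT(*) AS events FROM access_logs GROUP BY region
```

Each paragraph runs independently, which is what makes the "system commands + SQL + visualization" combination on a later slide work in practice.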
  5. What is Spark SQL?
     ● Spark SQL is a Spark module for structured data processing. Unlike the
       basic Spark RDD API, the interfaces provided by Spark SQL give Spark
       more information about the structure of both the data and the
       computation being performed. Internally, Spark SQL uses this extra
       information to perform additional optimizations.
     ● HiveQL support
  6. Why Spark SQL?
     ● Simple
     ● Scalable
     ● Performance: faster than Hive
     ● External tables on S3
     ● Cost reduction
     ● Narrows the gap between data science and data engineering: HiveQL for all
     ● Gets us one step closer to using SparkR / PySpark / Scala
     ● JDBC connections enabled via the Thrift server
     ● Concurrency via the YARN scheduler :)
     ● Joins run better here than in Hive (though still not at Redshift level)
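The "external tables on S3" point can be sketched in Spark SQL DDL; the bucket path and schema below are illustrative, not from the deck:

```sql
-- Illustrative external table over S3 (bucket and schema are invented):
-- Spark queries the Parquet files in place, with no load step.
CREATE TABLE IF NOT EXISTS events (
  region STRING,
  clicks INT
)
USING PARQUET
LOCATION 's3a://my-bucket/events/';

SELECT region, SUM(clicks) AS total FROM events GROUP BY region;
```

Because the data stays in S3, the EMR cluster can be sized (or terminated) independently of storage, which is where the cost reduction comes from.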
  7. Why not Spark SQL?
     ● Buggy
     ● Not as fast as Scala
     ● No easy round trip between code and SQL
     ● Known issues:
        ○ Performance over S3 → bad
        ○ INSERT OVERWRITE → not overwriting... bug?
        ○ Chunk size control → bug?
        ○ Dynamic partitions → non-trivial
        ○ Beeline client/server version mismatch (CLI)
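The "dynamic partitions are non-trivial" point usually comes down to settings like these; the property names are the standard Hive/Spark ones, while the table and column names are hypothetical:

```sql
-- Settings typically required before a dynamic-partition insert
-- (property names are standard; table names are hypothetical).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (dt) must come last in the SELECT list.
INSERT OVERWRITE TABLE events_by_day PARTITION (dt)
SELECT region, clicks, dt FROM staging_events;
```

Getting the column ordering and partition mode wrong is a common source of the surprises the slide alludes to.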
  8. Why Spark SQL + Zeppelin?
     ● Sexy look and feel of a SQL web client
     ● Back up your SQL easily and automatically via S3
     ● Share your work
     ● Orchestration and a scheduler for your nightly jobs
     ● Combine system commands + SQL + visualization
     ● Advanced security features
     ● Combine all the databases you need in one place, including data transfer
     ● Get one step closer to Spark and Scala
     ● Visualize your data easily
  9. Performance of Spark SQL
     ● EMR comes pre-configured in terms of Spark settings:
        ○ spark.executor.instances (--num-executors)
        ○ spark.executor.cores (--executor-cores)
        ○ spark.executor.memory (--executor-memory)
     ● ~10x faster than Hive on SELECT aggregations
     ● ~5x faster than Hive when working on top of S3
     ● The performance penalty is greatest on:
        ○ INSERT OVERWRITE
        ○ Writes to S3
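Those three settings correspond to a `spark-defaults.conf` fragment like the sketch below. The values are placeholders to size for your own cluster, not recommendations; on EMR they are pre-computed from the instance type:

```properties
# spark-defaults.conf sketch: the knobs EMR pre-sets, with placeholder values.
# Equivalent spark-submit flags: --num-executors, --executor-cores, --executor-memory
spark.executor.instances  10
spark.executor.cores      4
spark.executor.memory     8g
```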
  10. Connecting via JDBC
      Jars required on the client classpath:
      ● hive_metastore.jar
      ● hive_service.jar
      ● HiveJDBC41.jar
      ● libfb303-0.9.0.jar
      ● libthrift-0.9.0.jar
      ● log4j-1.2.14.jar
      ● ql.jar
      ● slf4j-api-1.5.11.jar
      ● slf4j-log4j12-1.5.11.jar
      ● TCLIServiceClient.jar
      ● zookeeper-3.4.6.jar
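With those jars on the client classpath, the connection itself is a HiveServer2-style JDBC URL pointed at the Spark Thrift Server; the host and database below are placeholders (10001 is the usual Spark Thrift Server port on EMR, versus 10000 for HiveServer2):

```
# Illustrative JDBC URL for the Spark Thrift Server
# (host and database are placeholders)
jdbc:hive2://emr-master-host:10001/default
```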
  11. Tuning Spark SQL
      For writes:
      ● spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
      For reads:
      ● spark.hadoop.parquet.enable.summary-metadata false
      ● spark.sql.parquet.mergeSchema false
      ● spark.sql.parquet.filterPushdown true
      ● spark.sql.hive.metastorePartitionPruning true
  12. Performance testing: data transformation, read/write from AWS S3

      Workload                              Hive      Spark SQL
      Aggregation query                     10 min    1 min
      Text gzip → Parquet                   10 min    ~2 min
      Same as above with S3DistCp           10 min    ~4 min
      Text gzip → Parquet gzip              10 min    ~18 min
      Text Parquet → Parquet gzip           -         ~2 min
      Parquet gzip → Parquet gzip           -         ~2 min

      ● Observations:
        ○ Penalty on S3 writes
        ○ No penalty on S3 reads, even if uncompressed
        ○ Compression is not always good...
  13. Take-away messages on the performance challenges
      ● Chunk size has no impact on performance, but helps parallelism.
      ● Most Spark configs have only minor impact.
      ● Use s3a.
      ● Use compression carefully:
        ○ set it in the CREATE TABLE definition;
        ○ choose the compression algorithm carefully.
      ● S3DistCp is:
        ○ slower than a direct write to S3 with compression;
        ○ painful when you work with dynamic partitions.
      ● Bottom-line takeaways:
        ○ Don't compress when transforming gzip text files → Parquet.
        ○ Compress when transforming uncompressed Parquet → gzip Parquet.
        ○ No configuration is needed beyond s3a and compression.
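For reference, an S3DistCp step of the kind benchmarked above looks like this on EMR; the paths are placeholders, while `--src` and `--dest` are the tool's standard flags:

```
# Illustrative EMR step: copy job output from HDFS to S3 as a separate step
# (paths are placeholders)
s3-dist-cp --src hdfs:///output/events --dest s3://my-bucket/events/
```

This is the indirection that makes it slower than writing to S3 directly: the job writes to HDFS first, then a second copy step moves the files out.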
  14. Tuning resources
      https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
      https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zeppelin.html
      https://zeppelin.apache.org/docs/latest/interpreter/spark.html
      https://community.hortonworks.com/questions/33484/spark-sql-query-execution-is-very-very-slow-when-c.html
      https://stackoverflow.com/questions/42822483/extremely-slow-s3-write-times-from-emr-spark
      https://docs.databricks.com/spark/latest/faq/append-slow-with-spark-2.0.0.html
      https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
      https://www.slideshare.net/JasonHubbard10/spark-meetup-73215961
      https://hortonworks.github.io/hdp-aws/s3-spark/index.html
      https://hortonworks.github.io/hdp-aws/s3-performance/index.html
      http://agrajmangal.in/blog/big-data/spark-parquet-s3/
  15. Resources
      [1] http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
      [2] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html#emr-hadoop-task-jvm
      [3] http://spark.apache.org/docs/latest/submitting-applications.html
      [4] http://spark.apache.org/docs/latest/cluster-overview.html
      [5] http://spark.apache.org/docs/latest/running-on-yarn.html
      [6] https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_running_spark_on_yarn.html
      [7] http://docs.aws.amazon.com/emr/latest/ReleaseGuide/latest/emr-spark-configure.html
      [8] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zeppelin.html#zeppelin-considerations
      [9] http://agrajmangal.in/blog/big-data/spark-parquet-s3/
      [10] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
