
Operational Tips For Deploying Apache Spark

Spark provides a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most out of their Spark deployment. Finally, I’ll discuss frequently encountered issues in customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc.)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?


  1. Operational Tips for Deploying Apache Spark® Miklos Christine, Solutions Architect, Databricks™
  2. $ whoami • Previously Systems Engineer @ Cloudera • Deep Knowledge of Big Data Stack • Apache Spark Expert • Solutions Architect @ Databricks! (Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation)
  3. What Will I Learn? • Quick Apache Spark Overview • Configuration Systems • Pipeline Design Best Practices • Debugging Techniques
  4. Apache Spark
  5. Apache Spark Configuration
  6. Apache Spark Core Configurations • Command Line: spark-defaults.conf, spark-env.sh • Programmatically: SparkConf() • Hadoop Configs: core-site.xml, hdfs-site.xml // Print Spark config sc.getConf.toDebugString // Print Hadoop config val hdConf = sc.hadoopConfiguration.iterator() while (hdConf.hasNext) { println(hdConf.next().toString()) }
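For context, a minimal PySpark sketch of setting core properties programmatically through SparkConf() before the context is created; the app name and property values below are illustrative assumptions, not recommendations from the talk.

from pyspark import SparkConf, SparkContext

# Illustrative values only; tune these for your own cluster
conf = (SparkConf()
        .setAppName("ops-tips-demo")
        .set("spark.executor.memory", "4g")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

# Properties set programmatically override spark-defaults.conf for this application
print(sc.getConf().toDebugString())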
  7. SparkSQL Configurations • Set SQL configs through the SQL interface: SET key=value; sqlContext.sql("SET spark.sql.shuffle.partitions=10;") • Tools to see current configurations: // View SparkSQL config properties val sqlConf = sqlContext.getAllConfs sqlConf.foreach(x => println(x._1 + " : " + x._2))
  8. Apache Spark Pipeline Design • File Formats • Compression Codecs • Apache Spark APIs • Job Profiles
  9. File Formats • Text File Formats – CSV – JSON • Avro Row Format • Parquet Columnar Format • User Story: 260 GB of CSV data converted to 23 GB of Parquet
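As a rough sketch of the CSV-to-Parquet conversion described above (the paths are hypothetical, and the built-in CSV reader assumes Spark 2.0+; earlier versions need the spark-csv package):

# Read CSV with a header row, letting Spark infer column types
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/events_csv/"))

# Write the same data back out as columnar Parquet
df.write.mode("overwrite").parquet("/data/events_parquet/")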
  10. Compression • Choose and Analyze Compression Codecs – Snappy, Gzip, LZO • Configuration Parameters – io.compression.codecs – spark.sql.parquet.compression.codec
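A small sketch of switching the Parquet codec through the SQL config named above; the sqlContext handle, output path, and codec choice are assumptions for illustration.

# Snappy typically trades a slightly larger file for faster (de)compression than gzip
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/data/out_snappy/")

# The same property can also be set through the SQL interface
sqlContext.sql("SET spark.sql.parquet.compression.codec=gzip")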
  11. Small Files Problem • Small files problem still exists • Metadata loading • Use coalesce() Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
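For illustration, a minimal sketch of using coalesce() before a write so the job does not emit one tiny file per partition; the target partition count and path are made up.

# Suppose df ended up with hundreds of partitions after a wide transformation;
# coalesce() merges them without a full shuffle, so this write produces ~8 files
df.coalesce(8).write.parquet("/data/events_compact/")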
  12. What are Partitions? • 2 Types of Partitioning – Spark – Table Level # Get number of Spark partitions > df.rdd.getNumPartitions() 40
  13. Partition Controls • Apache Spark APIs – repartition() – coalesce() # Re-partition a DataFrame > df_10 = df.repartition(10) df = sqlContext.read.jdbc(url=jdbcUrl, table='employees', column='emp_no', lowerBound=1, upperBound=100000, numPartitions=100) df.repartition(20).write.parquet('/mnt/mwc/jdbc_part/')
  14. Table Partitions • Partition by a column value within the table > df.write.partitionBy("colName").saveAsTable("tableName")
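A sketch of how that partitioned layout pays off at read time, using a hypothetical date column and paths; a filter on the partition column lets Spark prune directories instead of scanning the whole dataset.

# Writes one directory per value: .../event_date=2016-06-01/, .../event_date=2016-06-02/, ...
df.write.partitionBy("event_date").parquet("/data/events_by_date/")

# Only the matching partition directories are read for this query
daily = sqlContext.read.parquet("/data/events_by_date/").where("event_date = '2016-06-01'")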
  15. Those Other Partitions • SparkSQL Shuffle Partitions: spark.sql.shuffle.partitions sqlCtx.sql("set spark.sql.shuffle.partitions=600") sqlCtx.sql("select a1.name, a2.name from adult a1 join adult a2 where a1.age = a2.age") sqlCtx.sql("select count(distinct(name)) from adult")
  16. How Does This Help? • Q: Will increasing my cluster size help with my job? • A: It depends.
  17. Apache Spark Job Profiles • Leverage the Spark UI – SQL – Streaming
  18. Apache Spark Job Profiles
  19. Apache Spark Job Profiles
  20. Apache Spark Job Profile: Metrics • Monitoring & Metrics – Spark – Servers • Toolset – Ganglia – Graphite Ref: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
  21. Debugging Apache Spark • Analyze the driver's stacktrace. • Analyze the executors' stacktraces – Find the initial executor's failure. • Review metrics – Memory – Disk – Networking
  22. Debugging Apache Spark • Know your tools: JDBC vs ODBC. How to test? What can I test? – Redshift / MySQL / Tableau to Apache Spark, etc. • JSON: use SparkSQL to find corrupt records sqlCtx.read.json("/jsonFiles/").registerTempTable("jsonTable") sqlCtx.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL") • Steps to debug SQL issues – Where's the data, what's the DDL?
  23. Top Support Issues • OutOfMemoryErrors – Driver – Executors • Out of Disk Space Issues • Long GC Pauses • API Usage
  24. Top Support Issues • Use built-in functions instead of custom UDFs – import pyspark.sql.functions – import org.apache.spark.sql.functions • Examples: – to_date() – get_json_object() – monotonically_increasing_id() – hour() / minute() Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
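To make the point concrete, a small PySpark sketch that uses the built-in functions listed above instead of a Python UDF (the column names are hypothetical); built-ins run inside the JVM and avoid per-row Python serialization overhead.

from pyspark.sql import functions as F

df2 = (df
       .withColumn("event_date", F.to_date("event_ts"))                # instead of a date-parsing UDF
       .withColumn("event_hour", F.hour("event_ts"))
       .withColumn("row_id", F.monotonically_increasing_id()))         # instead of a hand-rolled counter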
  25. Top Support Issues • SQL query not returning new data – REFRESH TABLE <table_name> • Exported Parquet from external systems – spark.sql.parquet.binaryAsString • Tune the number of shuffle partitions – spark.sql.shuffle.partitions
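The three fixes above translate roughly to the following; the table name and partition count are placeholders.

# Re-read table metadata so files added since the last read show up in query results
sqlContext.sql("REFRESH TABLE events")

# Treat Parquet binary columns written by other systems as strings
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

# Raise or lower the default of 200 shuffle partitions to match your data volume
sqlContext.setConf("spark.sql.shuffle.partitions", "400")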
  26. Try Apache Spark with Databricks • Download the notebook for this talk at: dbricks.co/xyz • Try the latest version of Apache Spark and a preview of Spark 2.0 http://databricks.com/try
  27. Thank you. mwc@databricks.com https://www.linkedin.com/in/mrchristine @Miklos_C
