
Spark hands-on tutorial (rev. 002)

I strongly believe in the combination of Apache Spark with Java. In this tutorial, prepared for NCDevCon, we are going through the basics of Spark as well as 2 examples: a basic ingestion and an analytics example based on joins & group by. Follow me @jgperrin.

  1. Spark & Java, NCDevCon, Raleigh, NC, October (5+2)th 2017
  2. Jean Georges Perrin, software whatever since 1983, x9, @jgperrin, http://jgp.net [blog], http://oplo.io [oplo]
  3. Who art thou? ๏ Experience with Spark? ๏ Experience with Hadoop? ๏ Experience with Scala? ๏ Java? ๏ PHP guru? ๏ Front-end developer?
  4. But most importantly… ๏ … who is not a developer?
  5. Agenda ๏ What is Spark? ๏ What can I do with Spark? ๏ What is a Spark app, anyway? ๏ Install a bunch of software ๏ A first example ๏ Understand what just happened ๏ Another example, slightly more complex, because you are now ready ๏ But now, sincerely, what just happened? ๏ More and more examples (time permitting!)
  6. Caution ๏ First time I am doing a hands-on tutorial ๏ Tons of content ๏ Unknown crowd ๏ Unknown setting
  7. Analytics Operating System
  8. An Analytics Operating System? [slides 8–13: a diagram sequence that starts from a single Hardware → OS → Apps stack, grows to several machines each with its own hardware and OS, then inserts a Distributed OS layer and an Analytics OS layer between the nodes' operating systems and the applications; that combined layer is the role Spark plays]
  14. Use Cases ๏ NCEatery.com ๏ Restaurant analytics ๏ 1.57×10^21 datapoints analyzed ๏ (they are hiring!) ๏ General compute ๏ Distributed data transfer ๏ IBM ๏ DSX (Data Science Experience) ๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/ ๏ Z ๏ Data wrangling solution
  15. What Does a Typical Spark App Look Like? ๏ Connect to the cluster ๏ Load data ๏ Do something with the data ๏ Share the results
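      These four steps map almost one-to-one onto Spark's Java API. Below is a minimal sketch of that skeleton, assuming a local run; the class name, file path, and column name are placeholders for illustration, not anything from the workshop repository.

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      // Hypothetical class, for illustration only.
      public class TypicalAppSketch {
        public static void main(String[] args) {
          // 1. Connect to the cluster (here, an in-process local "cluster")
          SparkSession spark = SparkSession.builder()
              .appName("Typical app skeleton")
              .master("local")
              .getOrCreate();

          // 2. Load data (placeholder path)
          Dataset<Row> df = spark.read().format("csv")
              .option("header", "true")
              .load("data/some-file.csv");

          // 3. Do something with the data (placeholder column name)
          Dataset<Row> result = df.groupBy("someColumn").count();

          // 4. Share the results (here, simply print them)
          result.show();

          spark.stop();
        }
      }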
  16. Convinced? On y va! (Let's go!)
  17. http://bit.ly/spark-clego
  18. Java Development Tools ๏ Java JDK 1.8: http://bit.ly/javadk8 (http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) ๏ Eclipse Oxygen: http://bit.ly/eclipseo2 (http://www.eclipse.org/downloads/eclipse-packages/) ๏ Other nice to have: Maven, SourceTree or git (command line)
  19. Get the C O D E ๏ GitHub: http://bit.ly/SparkJavaCookbookCode ๏ https://github.com/jgperrin/net.jgp.labs.spark ๏ git clone https://github.com/jgperrin/net.jgp.labs.spark.git
  20. Getting Deeper ๏ Go to net.jgp.labs.spark.l000_ingestion.l000_csv ๏ Open CsvToDatasetApp.java ๏ Right click, Run As, Java Application
  21. Working directory = /Users/jgp/git/net.jgp.labs.spark
      +---+---+
      |_c0|_c1|
      +---+---+
      |  1|  5|
      |  2| 13|
      |  3| 27|
      |  4| 39|
      |  5| 41|
      |  6| 55|
      +---+---+
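      For reference, the output above implies that data/tuple-data-file.csv is a headerless two-column file along these lines (which is why Spark generates the _c0/_c1 column names):

      1,5
      2,13
      3,27
      4,39
      5,41
      6,55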
  22. package net.jgp.labs.spark.l000_ingestion.l000_csv;

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      public class CsvToDatasetApp {

        public static void main(String[] args) {
          System.out.println("Working directory = " + System.getProperty("user.dir"));
          CsvToDatasetApp app = new CsvToDatasetApp();
          app.start();
        }

        private void start() {
          // Connect: create a Spark session on a local master
          SparkSession spark = SparkSession.builder()
              .appName("CSV to Dataset")
              .master("local")
              .getOrCreate();

          // Load: read the CSV file, letting Spark infer the column types;
          // the file has no header row
          String filename = "data/tuple-data-file.csv";
          Dataset<Row> df = spark.read().format("csv")
              .option("inferSchema", "true")
              .option("header", "false")
              .load(filename);

          // Show the resulting dataframe
          df.show();
        }
      }
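      One possible variation on the code above: instead of letting Spark infer the schema (an extra pass over the file), you can declare it explicitly and pass it to the reader, replacing the inferSchema read in start(). The column names x and y are made up for illustration; the imports go at the top of the file.

      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructField;
      import org.apache.spark.sql.types.StructType;

      // Declare both columns up front; "x" and "y" are arbitrary names.
      StructType schema = DataTypes.createStructType(new StructField[] {
          DataTypes.createStructField("x", DataTypes.IntegerType, false),
          DataTypes.createStructField("y", DataTypes.IntegerType, false) });

      Dataset<Row> df = spark.read().format("csv")
          .schema(schema)             // no inferSchema pass needed
          .option("header", "false")
          .load("data/tuple-data-file.csv");
      df.show();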
  23. So what happened? Let’s try to understand a little more.
  24. [diagram: the Apache Spark stack, with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph) sitting on top of Apache Spark]
  25. [diagram: your application uses Spark's unified API (Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), GraphX), which runs across many nodes, each with its own hardware and OS]
  26. [diagram: the same picture, with the DataFrame as the interface between your application and the unified API spread over the nodes]
  27. [diagram: the DataFrame on top of Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), and GraphX]
  28. A bit of analytics (but really just a bit)
  29. Basic Analytics ๏ Go to net.jgp.labs.spark.l200_join.l030_count_books ๏ Open AuthorsAndBooksCountBooksApp.java ๏ Right click, Run As, Java Application
  30. package net.jgp.labs.spark.l200_join.l030_count_books;

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      public class AuthorsAndBooksCountBooksApp {

        public static void main(String[] args) {
          AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
          app.start();
        }

        private void start() {
          SparkSession spark = SparkSession.builder()
              .appName("Authors and Books")
              .master("local").getOrCreate();

          // Load the authors file; the first line is a header
          String filename = "data/authors.csv";
          Dataset<Row> authorsDf = spark.read()
              .format("csv")
              .option("inferSchema", "true")
              .option("header", "true")
              .load(filename);
          authorsDf.show();

  31. // ... the start() method continues from the previous slide

          // Load the books file
          filename = "data/books.csv";
          Dataset<Row> booksDf = spark.read()
              .format("csv")
              .option("inferSchema", "true")
              .option("header", "true")
              .load(filename);
          booksDf.show();

          // Left-join the books to their authors, rename the book id column to
          // bookId, then count the rows per author
          Dataset<Row> libraryDf = authorsDf
              .join(
                  booksDf,
                  authorsDf.col("id").equalTo(booksDf.col("authorId")),
                  "left")
              .withColumn("bookId", booksDf.col("id"))
              .drop(booksDf.col("id"))
              .groupBy(
                  authorsDf.col("id"),
                  authorsDf.col("name"),
                  authorsDf.col("link"))
              .count();

          libraryDf.show();
          libraryDf.printSchema();
        }
      }
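      For comparison, the same left join and aggregation can be expressed in SQL once the two Datasets are registered as temporary views. This sketch reuses spark, authorsDf, and booksDf from the code above; COUNT(*) matches the row-per-group semantics of the DataFrame .count() call.

      // Register the Datasets as temporary views so they can be queried with SQL.
      authorsDf.createOrReplaceTempView("authors");
      booksDf.createOrReplaceTempView("books");

      Dataset<Row> libraryViaSqlDf = spark.sql(
          "SELECT a.id, a.name, a.link, COUNT(*) AS count "
              + "FROM authors a LEFT JOIN books b ON a.id = b.authorId "
              + "GROUP BY a.id, a.name, a.link");
      libraryViaSqlDf.show();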
  32. The Art of Delegating
  33. [diagram: your app's Driver talks to the Master (Cluster Manager), which assigns work to the Slaves (Workers); each worker runs an Executor that executes Tasks]
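      Every example in this tutorial uses .master("local"), so all of Spark runs inside your own JVM with a single worker thread. As a sketch of how the same session could target a standalone cluster instead (the master URL and resource values are placeholders, not settings used in the workshop):

      SparkSession spark = SparkSession.builder()
          .appName("Authors and Books")
          // Placeholder standalone-cluster master URL; "local[*]" would instead
          // use every core of the local machine.
          .master("spark://my-master-host:7077")
          // Per-executor resources (illustrative values only).
          .config("spark.executor.memory", "4g")
          .config("spark.executor.cores", "2")
          .getOrCreate();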
  34. Conclusion
  35. A (Big) Data Scenario [pipeline: Raw Data → Ingestion → Data → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish]
  36. What You Learned ๏ Big Data is easier than one might think ๏ Java is the way to go (or Python) ๏ New vocabulary for using Spark ๏ You have a friend to help (ok, me) ๏ Spark is fun
  37. Going Further ๏ Run more code from the examples (I add some weekly) ๏ Contact me @jgperrin ๏ Join the Spark User mailing list ๏ Get help from Stack Overflow ๏ Watch for my upcoming book on Spark + Java!
  38. Thanks @jgperrin
