Spark meetup TCHUG

  1. LARGE-SCALE ANALYTICS WITH APACHE SPARK • THOMSON REUTERS R&D • TWIN CITIES HADOOP USER GROUP • FRANK SCHILDER • SEPTEMBER 22, 2014
  2. THOMSON REUTERS • The Thomson Reuters Corporation – 50,000+ employees – 2,000+ journalists at news desks worldwide – Offices in more than 100 countries – $12 billion revenue/year • Products: intelligent information for professionals and enterprises – Legal: WestlawNext legal search engine – Financial: Eikon financial platform; Datastream real-time share price data – News: REUTERS news – Science: EndNote, ISI journal impact factor, Derwent World Patent Index – Tax & Accounting: OneSource tax information • Corporate R&D – Around 40 researchers and developers (NLP, IR, ML) – R&D sites in the US and the UK: Eagan, MN; Rochester, NY; NYC; and London – We are hiring… email me at frank.schilder@thomsonreuters.com
  3. OVERVIEW • Speed – Data locality, scalability, fault tolerance • Ease of Use – Scala, interactive shell • Generality – Spark SQL, MLlib • Comparing ML frameworks – Vowpal Wabbit (VW) – Sparkling Water • The Future
  4. WHAT IS SPARK? Apache Spark is a fast and general engine for large-scale data processing. • Speed: runs iterative MapReduce-style jobs faster thanks to in-memory computation with Resilient Distributed Datasets (RDDs) • Ease of use: enables interactive data analysis in Scala, Python, or Java; interactive shell • Generality: offers libraries for SQL, streaming, and large-scale analytics (graph processing and machine learning) • Integrated with Hadoop: runs on Hadoop 2’s YARN cluster
  5. ACKNOWLEDGMENTS • Matei Zaharia and the AMPLab and Databricks teams for fantastic learning material and tutorials on Spark • Hiroko Bretz, Thomas Vacek, Dezhao Song, and Terry Heinze for Spark and Scala support and for running experiments • Adam Glaser for his time as a TSAP intern • Mahadev Wudali and Mike Edwards for letting us play in the “sandbox” (cluster)
  6. SPEED
  7. PRIMARY GOALS OF SPARK • Extend the MapReduce model to better support two common classes of analytics apps: – Iterative algorithms (machine learning, graphs) – Interactive data mining (R, Python) • Enhance programmability: – Integrate into Scala programming language – Allow interactive use from Scala interpreter – Make Spark easily accessible from other languages (Python, Java)
  8. MOTIVATION • Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: – Iterative algorithms (machine learning, graphs) – Interactive data mining tools (R, Python) • With current frameworks, apps reload data from stable storage on each query
  9. HADOOP MAPREDUCE VS SPARK
  10. SOLUTION: Resilient Distributed Datasets (RDDs) • Allow apps to keep working sets in memory for efficient reuse • Retain the attractive properties of MapReduce – Fault tolerance, data locality, scalability • Support a wide range of applications
  11. PROGRAMMING MODEL Resilient distributed datasets (RDDs) – Immutable, partitioned collections of objects – Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage – Functions follow the same patterns as Scala operations on lists – Can be cached for efficient reuse • 80+ operators on RDDs, including actions such as count, reduce, save, take, first, … (see the sketch below)
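     A minimal sketch of this model (assuming a Spark 1.x shell where the SparkContext sc is already created; the input path is a placeholder): transformations only build up lineage lazily, and the actions at the end trigger the computation.

     val lines  = sc.textFile("hdfs:///data/example.txt")   // base RDD backed by stable storage
     val words  = lines.flatMap(_.split(" "))                // transformation: defines a new RDD, no work yet
     val longer = words.filter(_.length > 3)                 // another lazy transformation
     longer.cache()                                          // mark the partitions for in-memory reuse
     val total  = longer.count()                             // action: runs the whole pipeline
     val sample = longer.take(5)                             // later actions reuse the cached data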
  12. EXAMPLE: LOG MINING Load error messages from a log into memory, then interactively search for various patterns:
     val lines = spark.textFile("hdfs://...")              // base RDD
     val errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
     val messages = errors.map(_.split('\t')(2))
     val cachedMsgs = messages.cache()
     cachedMsgs.filter(_.contains("timeout")).count
     cachedMsgs.filter(_.contains("license")).count
     [Diagram: the driver ships tasks to workers; each worker reads a block (Block 1-3), caches its partition (Cache 1-3), and sends results back to the driver.]
     Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
  13. BEHAVIOR WITH NOT ENOUGH RAM [Chart: iteration time (s) vs. fraction of the working set in memory: 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]
  14. RDD Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex: messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split('\t')(2)) [Lineage diagram: HDFS file -> filter(...) -> filtered RDD -> map(...) -> mapped RDD]
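     To inspect that lineage in a shell session, a small sketch (toDebugString is a standard RDD method; the log path is a placeholder):

     val messages = sc.textFile("hdfs:///logs/app.log")
                      .filter(_.startsWith("ERROR"))
                      .map(_.split('\t')(2))
     println(messages.toDebugString)   // prints the chain of parent RDDs, i.e. the lineage used for recovery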
  15. Fault Recovery Results [Chart: iteration time (s) for iterations 1-10 of a job with a failure in the 6th iteration vs. no failure: most iterations take 56-59 s, the first takes 119 s, and the iteration after the failure takes 81 s while lost partitions are recomputed]
  16. EASE OF USE
  17. INTERACTIVE SHELL • Data analysis can be done in the interactive shell. – Start from the local machine or a cluster – Access a multi-core processor with local[n] – The Spark context is already set up for you: SparkContext sc • Load data from anywhere (local, HDFS, Cassandra, Amazon S3, etc.) • Start analyzing your data [Screenshot: loading a local data file; processing starts when an action is called]
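     As a small sketch of that workflow (assuming a Spark 1.x download; the master URL and file path are placeholders):

     # start the shell on the local machine with 4 cores
     ./bin/spark-shell --master local[4]

     // inside the shell, sc is already defined:
     val data = sc.textFile("/tmp/sample.txt")   // could also be hdfs://, s3n://, or Cassandra via a connector
     data.first()                                // processing starts when an action is called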
  18. ANALYZE YOUR DATA • Word count in one line • List the word counts • Broadcast variables (e.g. a dictionary or stop word list), because local variables need to be distributed to the workers (see the sketch below)
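     The screenshots for this slide were lost in extraction; a hedged reconstruction of what such snippets typically look like (the file path and stop-word list are placeholders):

     val counts = sc.textFile("/tmp/sample.txt")
                    .flatMap(_.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _)                      // word count "in one line"
     counts.take(10).foreach(println)                        // list some of the word counts

     val stopWords = sc.broadcast(Set("the", "a", "of"))     // shipped once to every worker
     val filtered  = counts.filter { case (w, _) => !stopWords.value.contains(w) }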
  19. RUN A SPARK SCRIPT
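     The slide's screenshot is not recoverable; as an illustrative sketch (the application name, class, and paths are hypothetical), a standalone Spark 1.x script is packaged as a jar and launched with spark-submit:

     // SimpleApp.scala
     import org.apache.spark.{SparkConf, SparkContext}

     object SimpleApp {
       def main(args: Array[String]) {
         val conf = new SparkConf().setAppName("Simple Application")
         val sc   = new SparkContext(conf)
         val errors = sc.textFile(args(0)).filter(_.contains("ERROR")).count()
         println("Lines with ERROR: " + errors)
         sc.stop()
       }
     }

     # after building the jar:
     ./bin/spark-submit --class SimpleApp --master local[4] simple-app.jar /tmp/sample.txt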
  20. PYTHON SHELL & IPYTHON • The interactive shell can also be started as a Python shell called pyspark • Start analyzing your data in Python now • Since it’s Python, you may want to use IPython (a command shell for interactive programming in your browser)
  21. IPYTHON AND SPARK • The IPython notebook environment and pySpark let you: – Document data analysis results – Carry out machine learning experiments – Visualize results with matplotlib or other visualization libraries – Combine with NLP libraries such as NLTK • PySpark does not offer the full functionality of the Scala Spark shell (yet) • Some bugs (e.g. problems with Unicode)
  22. PROJECTS AT R&D USING SPARK • Entity linking – Alternative name extraction from Wikipedia, Freebase, free text, and ClueWeb12, a web collection several TB in size (planned) • Large-scale text data analysis: – Creating fingerprints for entities/events – Temporal slot filling: assigning a begin and end time stamp to a slot filler (e.g. A is an employee of company B from BEGIN to END) – Large-scale text classification of Reuters News Archive articles (10 years) • Language model computation used for search query analysis
  23. SPARK MODULES • Spark Streaming: – Processing real-time data streams • Spark SQL: – Support for structured data (JSON, Parquet) and relational queries (SQL) • MLlib: – Machine learning library • GraphX: – New graph processing API
  24. SPARK SQL
  25. SPARK SQL • Relational queries expressed in – SQL – HiveQL – a Scala domain-specific language (DSL) • New type of RDD: SchemaRDD – RDD composed of Row objects – Schema defined explicitly or inferred from a Parquet file, a JSON data set, or data stored in Hive • Spark SQL is in alpha: the API may change in the future!
  26. DEFINING A SCHEMA
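     The code screenshot did not survive extraction; a minimal sketch of the Spark 1.1-era way to define a schema by reflection over a case class (the Person class, file path, and query are illustrative):

     import org.apache.spark.sql.SQLContext

     case class Person(name: String, age: Int)

     val sqlContext = new SQLContext(sc)
     import sqlContext.createSchemaRDD                       // implicitly converts RDDs of case classes to SchemaRDDs

     val people = sc.textFile("/tmp/people.txt")             // lines such as "Alice,29"
                    .map(_.split(","))
                    .map(p => Person(p(0), p(1).trim.toInt))
     people.registerTempTable("people")                      // registerAsTable in Spark 1.0

     val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
     teens.map(row => "Name: " + row(0)).collect().foreach(println)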
  27. MLLIB
  28. MLLIB • A machine learning module that comes with Spark • Shipped since Spark 0.8.0 • Provides various machine learning algorithms for classification and clustering • Sparse vector representation since 1.0.0 • New features in recently released version 1.1.0: – Includes a standard statistics library (e.g. correlation, hypothesis testing, sampling) – More algorithms ported to Java and Python – More feature engineering: TF-IDF, Singular Value Decomposition (SVD)
  29. MLLIB • Provides various machine learning algorithms: – Classification: • Logistic regression, support vector machine (SVM), naïve Bayes, decision trees – Regression: • Linear regression, regression trees – Collaborative filtering: • Alternating least squares (ALS) – Clustering: • K-means – Decomposition: • Singular value decomposition (SVD), principal component analysis (PCA)
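     A short sketch of what training one of these classifiers looks like with the MLlib 1.x RDD API (the LIBSVM-format input path is a placeholder):

     import org.apache.spark.mllib.classification.SVMWithSGD
     import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
     import org.apache.spark.mllib.util.MLUtils

     val data  = MLUtils.loadLibSVMFile(sc, "/tmp/sample_libsvm_data.txt")   // RDD[LabeledPoint], sparse vectors
     val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

     val model = SVMWithSGD.train(train.cache(), numIterations = 100)
     model.clearThreshold()                                                  // return raw scores instead of 0/1 labels

     val scoreAndLabel = test.map(p => (model.predict(p.features), p.label))
     val auROC = new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()
     println("Area under ROC = " + auROC)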
  30. OTHER ML FRAMEWORKS • Mahout • LIBLINEAR • MATLAB • scikit-learn • GraphLab • R • Weka • Vowpal Wabbit • BigML
  31. LARGE-SCALE ML INFRASTRUCTURE • More data implies bigger training sets and richer feature sets. • More data with a simple ML algorithm often beats less data with a complicated ML algorithm. • Large-scale ML requires big data infrastructure: – Faster processing: Hadoop, Spark – Feature engineering: principal component analysis, the hashing trick, Word2Vec (see the sketch below)
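     For the feature-engineering point, a small sketch of the hashing trick plus TF-IDF using MLlib 1.1's feature package (the corpus path and feature count are placeholders):

     import org.apache.spark.mllib.feature.{HashingTF, IDF}

     val docs  = sc.textFile("/tmp/news.txt").map(_.split(" ").toSeq)   // one document per line
     val tf    = new HashingTF(numFeatures = 1 << 18).transform(docs)   // hashing trick: fixed-size sparse vectors
     tf.cache()
     val tfidf = new IDF().fit(tf).transform(tf)                        // reweight by inverse document frequency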
  32. PREDICTIVE ANALYTICS WITH MLLIB
  33. PREDICTIVE ANALYTICS WITH MLLIB http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
  34. VW AND MLLIB COMPARISON • We compared Vowpal Wabbit and MLlib in December 2013 (work with Tom Vacek) • Vowpal Wabbit (VW) is a large-scale ML tool developed by John Langford (Microsoft) • Task: binary text classification on Reuters articles • Comparison criteria: – Ease of implementation – Feature extraction – Parameter tuning – Speed – Accessibility of programming languages
  35. VW VS. MLLIB • Ease of implementation – VW: a user tool designed for ML, not a programming language – MLlib: requires programming, with some convenience support now (e.g. regularization) • Feature extraction – VW: specific capabilities for bi-grams, prefixes, etc. – MLlib: no limit in terms of creating features • Parameter tuning – VW: no parameter search capability, but multiple parameters can be hand-tuned – MLlib: offers cross-validation • Speed – VW: highly optimized, very fast even on a single machine with multiple cores – MLlib: fast with lots of machines • Accessibility of programming languages – VW: written in C++, a few wrappers (e.g. Python) – MLlib: Scala, Python, Java • Conclusion at the end of 2013: VW had a slight advantage, but MLlib has caught up in at least some of the areas (e.g. sparse feature representation)
  36. FINDINGS SO FAR • Large-scale extraction is a great fit for Spark when working with large data sets (> 1 GB) • Ease of use makes Spark an ideal framework for rapid prototyping • MLlib is a fast growing ML library, but “under development” • Vowpal Wabbit has been shown to crunch even large data sets with ease [Chart: 0/1 loss and training time for VW, LIBLINEAR, and Spark local[4]]
  37. OTHER ML FRAMEWORKS • Internship by Adam Glaser compared various ML frameworks on 5 standard data sets (NIPS) – Mass-spectrometric data (cancer), handwritten digit detection, Reuters news classification, synthetic data sets – Data sets were not very big, but had up to 1,000,000 features • Evaluated the accuracy of the generated models and the speed of training • H2O, GraphLab, and Microsoft Azure showed strong performance in terms of accuracy and training time.
  38. ACCURACY
  39. SPEED
  40. WHAT IS NEXT? • 0xdata plans to release Sparkling Water in October 2014 • Microsoft Azure also offers a strong platform with multiple ML algorithms and an intuitive user interface • GraphLab has GraphLab Canvas ™ for visualizing your data and plans to incorporate more ML algorithms.
  41. CAN’T DECIDE?
  42. CONCLUSIONS
  43. CONCLUSIONS • Apache Spark is the most active project in the Hadoop ecosystem • Spark offers speed and ease of use because of – RDDs – Interactive shell and – Easy integration of Scala, Java, Python scripts • Integrated in Spark are modules for – Easy data access via Spark SQL – Large-scale analytics via MLlib • Other ML frameworks enable analytics as well • Evaluate which framework is the best fit for your data problem
  44. THE FUTURE? • Apache Spark will be a unified platform to run under various workloads: – Batch – Streaming – Interactive • And connect with different runtime systems – Hadoop – Cassandra – Mesos – Cloud – …
  45. THE FUTURE? • Spark will extend its offering of large-scale algorithms for doing complex analytics: – Graph processing – Classification – Clustering – … • Other frameworks will continue to offer similar capabilities. • If you can’t beat them, join them.
  46. http://labs.thomsonreuters.com/about-rd-careers/ FRANK.SCHILDER@THOMSONREUTERS.COM
  47. EXTRA SLIDES
  48. Example: Logistic Regression • Goal: find best line separating two sets of points [Diagram: scatter plot of + and – points with the target separating line and a random initial line]
  49. Example: Logistic Regression
     val data = spark.textFile(...).map(readPoint).cache()   // parse the input points once and keep them in memory
     var w = Vector.random(D)                                // random initial separating plane
     for (i <- 1 to ITERATIONS) {
       val gradient = data.map(p =>
         (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
       ).reduce(_ + _)
       w -= gradient                                         // gradient step on the driver
     }
     println("Final w: " + w)
  50. Logistic Regression Performance [Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for further iterations]
  51. Spark Scheduler • Dryad-like DAGs • Pipelines functions within a stage • Cache-aware work reuse & locality • Partitioning-aware to avoid shuffles [Diagram: an RDD graph (A-G) split into Stages 1-3 at map, union, groupBy, and join boundaries, with cached data partitions marked]
  52. Spark Operations • Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues • Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
