
Sparkly Notebook: Interactive Analysis and Visualization with Spark

Sparkly Notebook: Interactive Analysis and Visualization with Spark - presented Apr 15, 2015 at Seattle Spark Meetup
http://www.meetup.com/Seattle-Spark-Meetup/events/208711962/

Published in: Data & Analytics

Sparkly Notebook: Interactive Analysis and Visualization with Spark

  1. SPARKLY NOTEBOOK: INTERACTIVE ANALYSIS AND VISUALIZATION WITH SPARK FELIX CHEUNG APRIL 2015 HTTP://WWW.MEETUP.COM/SEATTLE-SPARK-MEETUP/EVENTS/208711962/
  2. SETUP • Spark on CDH cluster • Vagrant - 2-nodes - custom provisioning
  3. AGENDA • IPython + PySpark cluster • Zeppelin • Spark’s Streaming k-means • Lightning
  4. SPARK - 10 SEC INTRODUCTION • Spark • Spark SQL + Data Frame + data source • Spark Streaming • MLlib • GraphX
  5. It’s a lot of time looking at data…
  6. REPL • Read-Eval-Print-Loop
  7. A set of REPLs related to Spark…
  8. $ spark-shell
     Welcome to
           ____              __
          / __/__  ___ _____/ /__
         _\ \/ _ \/ _ `/ __/  '_/
        /___/ .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
           /_/

     Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
     Type in expressions to have them evaluated.
     Type :help for more information.
     15/04/15 11:31:28 INFO SparkILoop: Created spark context..
     Spark context available as sc.

     scala> val a = sc.parallelize(1 to 100)
     a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

     scala> a.collect.foreach(x => println(x))
     1
     2
     3
     4
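The PySpark shell gives the same instant feedback; a minimal analogue of the session above, assuming a stock pyspark shell where sc is predefined:

    $ pyspark
    >>> a = sc.parallelize(range(1, 101))
    >>> a.collect()[:4]
    [1, 2, 3, 4]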
  9. GOOD • See results instantly
  10. NOT SO GOOD • Ok as an IDE • No Save / Repeat • No visualization
  11. NOTEBOOK
  12. IPYTHON/JUPYTER • “IPython” http://ipython.org/ • interactive shell • browser-based notebook • 'Kernel' • great support for visualization libraries (e.g. matplotlib) • built on pyzmq, tornado • Jupyter: “IPython will continue to exist as a Python kernel for Jupyter, but the notebook and other language-agnostic parts of IPython will move to new projects under the Jupyter name. IPython 3.0 will be the last monolithic release of IPython.”
  13. IPYTHON NOTEBOOK • NOTEBOOK == BROWSER-BASED REPL • IPython Notebook is a web-based interactive computational environment for creating IPython notebooks. An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media.
  14. MATPLOTLIB matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc, with just a few lines of code, with familiar MATLAB APIs.
     plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
     plt.yticks(y_pos, people)
     plt.xlabel('Performance')
     plt.title('How fast do you want to go today?')
     plt.show()
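For reference, a self-contained version of that snippet; the names and data values are made up for illustration, not taken from the talk:

    import numpy as np
    import matplotlib.pyplot as plt

    # illustrative data, not from the talk
    people = ('Tom', 'Dick', 'Harry', 'Slim', 'Jim')
    y_pos = np.arange(len(people))
    performance = 3 + 10 * np.random.rand(len(people))
    error = np.random.rand(len(people))

    plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
    plt.yticks(y_pos, people)
    plt.xlabel('Performance')
    plt.title('How fast do you want to go today?')
    plt.show()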
  15. PYSPARK • Spark on Python; this serves as the Kernel, integrating with IPython • Each notebook spins up a new instance of the Kernel (i.e. PySpark running as the Spark Driver, in any of the deploy modes Spark/PySpark supports)
  16. (All notebook examples are a subset of those from the Meetup, reconstructed here)
  17. Markdown
  18. Spark in Python
  19. Source: http://nbviewer.ipython.org/github/ResearchComputing/scientific_computing_tutorials/blob/master/spark/02_word_count.ipynb
  20. WORD2VEC EXAMPLE Word2Vec computes distributed vector representations of words. Distributed vector representations have been shown to be useful in many natural language processing applications such as named entity recognition, disambiguation, parsing, tagging and machine translation. https://code.google.com/p/word2vec/ Spark MLlib implements the Skip-gram approach. With Skip-gram we want to predict a window of words given a single word.
  21. WORD2VEC DATASET Wikipedia dump http://mattmahoney.net/dc/textdata
     grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
     then randomly sampled to ~200k lines
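A minimal PySpark sketch of training MLlib's Word2Vec on the sampled text8 lines; the file name matches the grep output above, while the vector size and query word are illustrative assumptions:

    from pyspark.mllib.feature import Word2Vec

    # each element is a list of tokens; text8_lines is the sampled file from above
    corpus = sc.textFile("text8_lines").map(lambda line: line.split(" "))

    model = Word2Vec().setVectorSize(100).setSeed(42).fit(corpus)

    # nearest neighbours in the learned vector space (query word is just an example)
    for word, cosine in model.findSynonyms("week", 10):
        print(word, cosine)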
  22. MORE VISUALIZATIONS • matplotlib: http://matplotlib.org • Seaborn: http://stanford.edu/~mwaskom/software/seaborn/ • Bokeh: http://bokeh.pydata.org/en/latest/
  23. SETUP To set up IPython • Python 2.7.9 (separate from CentOS default 2.6.6), on all nodes • matplotlib, on the host running IPython
     To run IPython with the PySpark Kernel, set these in the environment (please check out my handy script on github):
     PYSPARK_PYTHON - command to run python, e.g. “python2.7”
     PYSPARK_DRIVER_PYTHON - command to run ipython
     PYSPARK_DRIVER_PYTHON_OPTS - “notebook --profile”
     PYSPARK_SUBMIT_ARGS - pyspark commandline, e.g. --master --deploy-mode
     YARN_CONF_DIR - if YARN mode
     LD_LIBRARY_PATH - for matplotlib
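As a rough illustration only, the same variables could be set from a small Python launcher before invoking pyspark; every value below is a placeholder, not the talk's actual configuration (see the author's script on github for that):

    import os
    import subprocess

    # placeholder values; adjust to your cluster and IPython install
    os.environ["PYSPARK_PYTHON"] = "python2.7"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "ipython"
    os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook --profile=pyspark"
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--master yarn-client"
    os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"

    # launch the pyspark script, which picks up the environment above
    subprocess.call(["pyspark"])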
  24. IPYTHON/JUPYTER KERNELS • IPython • IGo • Bash • IR • IHaskell • IMatlab • ICSharp • IScala • IRuby • IJulia .. and more https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
  25. ZEPPELIN
  26. Apache Zeppelin (incubating) is an interactive data analytics environment for distributed data processing systems. It provides a beautiful, interactive web-based interface, data visualization, a collaborative work environment and many other nice features to make your data analytics more fun and enjoyable. Zeppelin has been incubating since Dec 2014. https://zeppelin.incubator.apache.org/
  27. Shell script & calling library package • Load and process data with Spark
  28. SQL query powered by Spark SQL - progress & parameterization via dynamic form
  29. Python & data passing across languages (interpreters)
  30. ZEPPELIN ARCHITECTURE • Realtime collaboration - enabled by websocket communications • Frontend: AngularJS • Backend server: Java • Interpreters: Java • Visualization: NVD3
  31. INTERPRETERS • Spark group • Spark (Scala) • PySpark • Spark SQL • Dependency • Markdownjs • Shell • Hive • Coming: jdbc, Tajo, etc.
  32. CLUSTERING • Clustering tries to find natural groupings in data. It puts objects into groups in which those within a group are more similar to each other than to those in other groups. • Unsupervised learning
  33. K-MEANS • First, given an initial set of k cluster centers, we find which cluster each data point is closest to • Then, we compute the average of each of the new clusters and use the result to update our cluster centers
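Those two steps are the heart of the classic (Lloyd's) k-means iteration; a compact NumPy sketch on toy data, not from the talk:

    import numpy as np

    def kmeans_step(points, centers):
        """One k-means iteration: assign points to nearest center, then recompute centers."""
        # squared distance from every point to every center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assignment = d2.argmin(axis=1)
        # new center = mean of the points assigned to it (keep old center if a cluster is empty)
        new_centers = np.array([
            points[assignment == j].mean(axis=0) if np.any(assignment == j) else centers[j]
            for j in range(len(centers))
        ])
        return assignment, new_centers

    points = np.random.rand(200, 2)                             # toy data
    centers = points[np.random.choice(200, 3, replace=False)]   # k = 3 initial centers
    for _ in range(10):
        assignment, centers = kmeans_step(points, centers)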
  34. K-MEANS|| IN MLLIB • a parallelized variant of k-means++ http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
     Parameters: • k is the number of desired clusters. • maxIterations is the maximum number of iterations to run. • initializationMode specifies either random initialization or initialization via k-means||. • runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result). • initializationSteps determines the number of steps in the k-means|| algorithm. • epsilon determines the distance threshold within which we consider k-means to have converged.
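A hedged sketch of how those parameters map onto the PySpark MLlib API; the data and parameter values are illustrative, not the case-study notebook (sc is the predefined SparkContext in a pyspark shell or notebook):

    from numpy import array
    from pyspark.mllib.clustering import KMeans

    # toy 2-D points; in practice this would be the parsed dataset
    data = sc.parallelize([
        array([0.0, 0.0]), array([1.0, 1.0]),
        array([9.0, 8.0]), array([8.0, 9.0]),
    ])

    model = KMeans.train(
        data,
        k=2,                              # number of desired clusters
        maxIterations=10,                 # cap on iterations
        runs=5,                           # best of several runs (Spark 1.x-era parameter)
        initializationMode="k-means||",   # or "random"
    )

    print(model.clusterCenters)
    print(model.predict(array([0.5, 0.5])))   # which cluster a new point falls into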
  35. CASE STUDY: K-MEANS - ZEPPELIN
  36. ANOMALY DETECTION WITH K-MEANS • Using Spark DataFrame and the csv data source to process KDDCup’99 data • Scoring with different k values • Details on github at: http://bit.ly/1JWOPh8
  37. COMING SOON (NOW!)
  38. Realtime updates
  39. Dashboard
  40. OTHER NOTEBOOKS • Spark-notebook: https://github.com/andypetrella/spark-notebook • ISpark: https://github.com/tribbloid/ISpark • Spark Kernel: https://github.com/ibm-et/spark-kernel • Jove Notebook: https://github.com/jove-sh/jove-notebook • Beaker: https://github.com/twosigma/beaker-notebook • Databricks Cloud notebook
  41. PART 2 STREAMING K-MEANS
  42. WHY STREAMING? • Train - model - predict works well on static data • What if data is • Coming in streams • Changing over time?
  43. STREAMING K-MEANS DESIGN • Proposed by Dr Jeremy Freeman (here)
  44. STREAMING K-MEANS • key concept: forgetfulness • balances the relative importance of new data versus past history • half-life • time it takes before past data contributes to only one half of the current model
  45. STREAMING K-MEANS • time unit • batches (which have a fixed duration in time), or points • eliminate dying clusters
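The talk's demo drives streaming k-means from a Scala Spark Streaming job (see the demo slide later); purely to illustrate the forgetfulness, half-life and time-unit knobs, here is a toy PySpark sketch using a queue-backed stream, with all values as placeholders:

    import random
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.clustering import StreamingKMeans
    from pyspark.mllib.linalg import Vectors

    ssc = StreamingContext(sc, 5)   # 5-second batches

    # forgetfulness: a half-life of 10 batches means data seen 10 batches ago
    # contributes half as much as fresh data to the current model
    model = (StreamingKMeans(k=2, timeUnit="batches")
             .setHalfLife(10, "batches")
             .setRandomCenters(2, 0.0, 42))   # dim=2, weight=0.0, seed=42

    # toy queue-backed stream standing in for a real socket/Kafka source
    batches = [sc.parallelize([Vectors.dense([random.random(), random.random()])
                               for _ in range(100)]) for _ in range(5)]
    stream = ssc.queueStream(batches)

    model.trainOn(stream)
    model.predictOn(stream).pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop(stopSparkContext=False)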

  46. VISUALIZING STREAMING K-MEANS - LIGHTNING
  47. LIGHTNING
  48. VISUALIZING STREAMING K-MEANS ON IPYTHON + LIGHTNING • Lightning - data visualization server http://lightning-viz.org • provides API-based access to reproducible, web-based, interactive visualizations. It includes a core set of visualization types, but is built for extendability and customization. Lightning supports modern libraries like d3.js and three.js, and is designed for interactivity over large data sets and continuously updating data streams.
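A minimal sketch of creating a plot from Python with the lightning-python client; the server address and data are placeholders:

    import random
    from lightning import Lightning

    # assumes a Lightning server is running at this address
    lgn = Lightning(host="http://localhost:3000")
    lgn.create_session("streaming-kmeans-demo")

    # scatter plot of toy 2-D points; in a notebook the figure renders inline
    x = [random.random() for _ in range(200)]
    y = [random.random() for _ in range(200)]
    viz = lgn.scatter(x, y)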
  49. RUNNING LIGHTNING • API: node.js, Python, Scala • Extension support for custom charts (e.g. d3.js) • Requirements: • Postgres recommended (SQLite ok) • node.js (npm, gulp)
  50. The Freeman Lab at Janelia Research Campus uses Lightning to visualize large-scale neural recordings from zebrafish, in collaboration with the Ahrens Lab
  51. SPARK STREAMING K-MEANS DEMO Environment • requires: numpy, scipy, scikit-learn • IPython/Python requires: lightning-python package
     The demo consists of 3 parts: https://github.com/felixcheung/spark-ml-streaming • Python driver script, data generator • Scala job - Spark Streaming & Streaming k-means • IPython notebook to process results, visualize with Lightning
     Originally this was part of the Python driver script - it has been modified for this talk to run within IPython
  52. CHALLENGES • Package management • Version/build conflicts!
  53. YOU CAN RUN THIS TOO! • Notebooks available at http://bit.ly/1JWOPh8 • Everything is heavily scripted and automated • Vagrant config for a local, virtual environment available at http://bit.ly/1DB3OLw
  54. QUESTIONS? https://github.com/felixcheung linkedin: http://linkd.in/1OeZDb7 blog: http://bit.ly/1E2z6OI
