Distributed processing of large graphs in python

Graph theory could make a big impact on how we conduct business. Imagine that you wish to maximize the reach of a promotion by leveraging your customers' influence, so that they advocate your products and bring their friends on board. The same logic of harnessing one's network can be applied to purchase recommendation, customer behavior, and fraud detection.

Running analyses on large graphs was not trivial for many companies until recently. The field has taken significant steps in the last five years, and scalable graph computations are now the norm. You can now run graph computations out-of-core (not limited by available memory) and in parallel (across multiple machines), especially in Spark, which is spreading like wildfire.

A lot of people are familiar with GraphX, a pretty solid implementation of scalable graphs in Spark. GraphX is interesting, but the project seems to be orphaned. The good news is that there is now an alternative: GraphFrames, a new data structure in Spark 2.0 that takes the best parts of DataFrames and graphs.
In this talk, I will explain how to use GraphFrames from Python, with an example using personalized PageRank for recommendations.
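
As a rough illustration of what using GraphFrames from Python with personalized PageRank can look like, here is a minimal sketch. It is not the code from the talk: the vertex and edge data, the function name `personalized_pagerank_demo`, and the default `source_id` are made up for illustration. It assumes a running SparkSession and that the separately distributed graphframes package is available to Spark, which is why the import sits inside the function.

```python
def personalized_pagerank_demo(spark, source_id="a"):
    """Build a tiny GraphFrame and rank vertices relative to one source vertex.

    `spark` is an existing SparkSession. The graph below is hypothetical.
    """
    # Imported lazily: graphframes is an add-on package, not part of core PySpark.
    from graphframes import GraphFrame

    # Vertices need an "id" column; edges need "src" and "dst" columns.
    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol"), ("d", "Dave")],
        ["id", "name"],
    )
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")],
        ["src", "dst"],
    )
    graph = GraphFrame(vertices, edges)

    # Passing sourceId makes the ranking personalized to that vertex.
    ranked = graph.pageRank(resetProbability=0.15, maxIter=10, sourceId=source_id)

    # The result is itself a GraphFrame; its vertices gain a "pagerank" column.
    return ranked.vertices.select("id", "pagerank").orderBy(
        "pagerank", ascending=False
    )
```

Because vertices and edges are ordinary DataFrames, the returned ranking can be filtered or joined like any other DataFrame, for example by calling `personalized_pagerank_demo(spark).show()` in a pyspark shell started with the graphframes package loaded.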

Published in: Data & Analytics

Distributed processing of large graphs in python

  1. 1. Jose Quesada Director, Data Science Retreat jose@datascienceretreat.com @quesada
  2. 2. • Mentors are world-class. CTOs, library authors, inventors, founders of fast-growing companies, etc • DSR accepts fewer than 5% of the applications • Strong focus on commercial awareness • 5 years of working experience on average • 30+ partner companies in Europe
  3. 3. DSR participants do a portfolio project
  4. 4. Why is DSR talking about Scala/Spark? They are b IBM is behind this They hired
  5. 5. What is a good question?
  6. 6. What is a good question? • Business case • Data available • Technology to answer the question is available • We know when the solution worked
  7. 7. Does he look like a bitch?
  8. 8. What is a good question? • Business case • Data available • Technology to answer the question is available • We know when the solution worked
  9. 9. The question: When should I tweet to influence the right account?
 Or ‘beat Buffer at their own game’
  10. 10. What is a good question? • Business case
  11. 11. DJ J & MAX RECORDS
  12. 12. DJ J & MAX RECORDS
  13. 13. DJ J & MAX RECORDS
  14. 14. DJ J & MAX RECORDS
  15. 15. DJ J & MAX RECORDS
  16. 16. DJ J & MAX RECORDS
  17. 17. Overlap Tweet hours Tweet frequency per UTC hour
  18. 18. What is a good question? • Business case • Data available
  19. 19. 24GB
  20. 20. What is a good question? • Business case • Data available • Technology to answer the question is available
  21. 21. What is a good question? • Business case • Data available • Technology to answer the question is available • We know when the solution worked
  22. 22. Graph theory parts we can use to solve this problem
  23. 23. Graph theory primer • Random walk • Shortest path • Sampling
  24. 24. Sampling in networks
  25. 25. Sampling in Networks: Note that sampling in networks is fraught with difficulties. One cannot simply sample the edges and nodes and expect the sample to be representative of the original network. In the graph below, a sample that missed node 1 or node 2 would disconnect the two clusters and would not have the same properties as the original.
  26. 26. Random surfer
  27. 27. Random surfer A B C D
  28. 28. Random surfer A B C D
  29. 29. Random surfer A B C D E Visited more often: • Nodes with many links • Coming from frequently visited nodes
  30. 30. Computing PageRank A B C D E
  31. 31. Computing PageRank A B C D E
  32. 32. Computing PageRank A B C D E
  33. 33. Computing PageRank A B C D E
  34. 34. Computing PageRank A B C D E
  35. 35. Computing PageRank A B C D E
  36. 36. Teleport A B C D E
  37. 37. Teleport A B C D E
  38. 38. Teleport A B C D E
  39. 39. Teleport A B C D E
  40. 40. Teleport A B C D E At regular node: invoke teleport operation with probability α and standard random walk with probability (1-α)
  41. 41. Personalized PageRank A B C D E At regular node: invoke teleport operation with probability α and standard random walk with probability (1-α). When teleporting, go to target node
  42. 42. Personalized PageRank A B C D E At regular node: invoke teleport operation with probability α and standard random walk with probability (1-α). When teleporting, go to target node
  43. 43. Personalized PageRank • Special case of PageRank with priors (distribution of weights over the nodes)
  44. 44. Implementation
  45. 45. A partitioned, distributed graph processing engine is significantly more complex and difficult to build
  46. 46. GraphX and GraphFrames (new in Spark 2.0) • GraphX is to RDD as GraphFrame is to DataFrame • GraphX is lower level, and the API is Scala-only • GraphFrames is very new • It is not designed to be a graph database like Neo4j. Nodes and edges can contain metadata, but the query engine is not as complete as Cypher
  47. 47. Advantages of GraphFrames • GraphFrames have a Python API • GraphFrames give you simple querying for free: GraphFrame vertices and edges are stored as DataFrames, so many queries are just DataFrame (or SQL) queries • They contain most of the algorithms in GraphX, but the API is less well-tested • Pyspark shell instead of spark-shell
  48. 48. Distributed PageRank • Problem: Computing PageRank on a graph too large for one machine • Algorithm: – Shard edges randomly – Compute on each machine – Average results • Basic idea: Duplicate edges from low-degree nodes. Gives an unbiased estimator
  49. 49. • Nodes: 41,652,230 • Edges: 1,468,365,182
  50. 50. Summary of implementation, benefits • Graph theory is a really flexible way to represent a problem • Data structures to represent graphs are mature • You can now do out-of-core, distributed graph analysis cheaply • Implementations exist even for state-of-the-art methods
  51. 51. Summary, finding a problem • We live in an age of abundance (methods, data, hardware, ideas) • Finding the question is more than half of the battle • I had about a week to prepare this talk, but I managed to put together something that showcases what you can do with large graphs today, and it could be effective as a startup idea • My question is not great because you cannot demonstrate that it works till you use it (common problem for unsupervised methods)
  52. 52. The question: When should I tweet to influence the right account?
 Or ‘beat Buffer at their own game’
  53. 53. References: Drawing graphs • Graphs in this slide set have been drawn with Gephi • If you use Zeppelin notebook, you can draw graphs with: drawGraph(org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 32, 60))
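
The random surfer, teleport, and personalized PageRank slides (26 to 43) can be summarized in a small pure-Python sketch. This is an illustration of the standard power-iteration formulation, not code from the talk. The edge list is hypothetical (the edges drawn on the slides are not recoverable from the transcript), and the function name, parameters, and the choice to let dangling nodes teleport are assumptions.

```python
def pagerank(edges, alpha=0.15, target=None, iters=100):
    """Power iteration for PageRank with teleport probability alpha.

    edges:  list of (src, dst) pairs.
    target: if given, every teleport lands on this node (personalized
            PageRank); it must appear in the edge list. If None, teleports
            land on a uniformly random node (standard PageRank).
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Teleport distribution: uniform, or all mass on the target node.
    if target is None:
        teleport = {n: 1.0 / len(nodes) for n in nodes}
    else:
        teleport = {n: 1.0 if n == target else 0.0 for n in nodes}

    rank = dict(teleport)
    for _ in range(iters):
        # With probability alpha the surfer teleports.
        new_rank = {n: alpha * teleport[n] for n in nodes}
        for n in nodes:
            follow_mass = (1 - alpha) * rank[n]
            if out_links[n]:
                # With probability (1 - alpha) follow a random outgoing link.
                for dst in out_links[n]:
                    new_rank[dst] += follow_mass / len(out_links[n])
            else:
                # Assumption: a node without outgoing links teleports instead.
                for m in nodes:
                    new_rank[m] += follow_mass * teleport[m]
        rank = new_rank
    return rank


# Hypothetical five-node graph using the slide labels A to E.
example_edges = [("A", "B"), ("A", "C"), ("B", "C"),
                 ("C", "A"), ("D", "C"), ("E", "D")]

global_rank = pagerank(example_edges)                 # uniform teleport
personal_rank = pagerank(example_edges, target="A")   # teleport always to A
```

In the personalized run, nodes that cannot be reached from the target (here D and E) receive no score, which is what makes the ranking relative to the target account.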

  54. 54. 25 videos explaining ML on Spark, 50 more to come. A bunch on GraphX • For people who already know ML • http://datascienceretreat.com/videos/data-science-with-scala-and-spark
  55. 55. About learning new tech over seven weekends…
  56. 56. About learning new tech over seven weekends • You have time and enjoy using it to learn alone: learn it ‘the hard way’ • You are extremely motivated and talented, have money: Apply for DSR • You want your weekends for yourself. You are already very good but want to switch jobs. Apply for codekitt
  57. 57. Thanks! Jose Quesada Director, Data Science Retreat jose@datascienceretreat.com @quesada http://datascienceretreat.com/ codekitt.com
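
The "shard edges randomly, compute on each machine, average results" outline from slide 48 can be sketched on a single machine as follows. This is only a toy illustration under stated assumptions: the shards are processed in a plain loop rather than on separate machines, the step of duplicating edges from low-degree nodes is omitted, and all names and the example edge list are hypothetical. The averaged scores are therefore an approximation and should not be expected to match PageRank on the full graph.

```python
import random


def pagerank(nodes, edges, alpha=0.15, iters=50):
    """Standard PageRank by power iteration over a fixed node set."""
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Mass on nodes without outgoing links is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (alpha + (1 - alpha) * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for n in nodes:
            for dst in out_links[n]:
                new_rank[dst] += (1 - alpha) * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


def sharded_pagerank(nodes, edges, num_shards=3, seed=0):
    """Assign each edge to a random shard, rank each shard, average the ranks."""
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for edge in edges:
        shards[rng.randrange(num_shards)].append(edge)
    # Each call below stands in for the work of one machine.
    partial = [pagerank(nodes, shard) for shard in shards]
    return {n: sum(p[n] for p in partial) / num_shards for n in nodes}


# Hypothetical small graph.
example_nodes = ["A", "B", "C", "D", "E"]
example_edges = [("A", "B"), ("A", "C"), ("B", "C"),
                 ("C", "A"), ("D", "C"), ("E", "D")]
averaged = sharded_pagerank(example_nodes, example_edges)
```

Every shard sees the full node list but only its own edges, so each partial result is a valid probability distribution and the average is one too.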