Hadoop and beyond: power tools for data mining

A brief survey of great tools for dealing with big datasets. Given as an invited lecture for students taking the Cloud Computing module at Birkbeck and UCL.

  1. Hadoop and beyond: power tools for data mining
     Mark Levy, 13 March 2013
     Cloud Computing Module, Birkbeck/UCL
  2. Hadoop and beyond
     Outline:
      • the data I work with
      • Hadoop without Java
      • Map-Reduce unfriendly algorithms
      • Hadoop without Map-Reduce
      • alternatives in the cloud
      • alternatives on your laptop
  3. NB
      • all software mentioned is Open Source
      • won't cover key-value stores
      • I don't use all of these tools
  4. Last.fm: scrobbling
  5. Last.fm: scrobbling
  6. Last.fm: tagging
  7. Last.fm: personalised radio
  8. Last.fm: recommendations
  9. Last.fm: recommendations
  10. Last.fm datasets
      Core datasets:
       • 45M users, many active
       • 60M artists
       • 100M audio fingerprints
       • 600M tracks (hmm...)
       • 19M physical recordings
       • 3M distinct tags
       • 2.5M <user,item,tag> taggings per month
       • 1B <user,time,track> scrobbles per month
       • full user-track graph has ~50B edges (I more often work with ~500M edges)
  11. Problem Scenario 1
      Need Hadoop, don't want Java:
       • need to build prototypes, fast
       • need to do interactive data analysis
       • want terse, highly readable code
       • improve maintainability
       • improve correctness
  12. Hadoop without Java
      Some options:
       • Hive (Facebook)
       • Pig (Yahoo!)
       • Cascading (OK, it's still Java...)
       • Scalding (Twitter)
       • Hadoop streaming (various)
      not to mention 11 more listed here:
      http://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html
  13. Apache Hive
      SQL access to data on Hadoop
      pros:
       • minimal learning curve
       • interactive shell
       • easy to check correctness of code
      cons:
       • can be inefficient
       • hard to fix when it is
  14. Word count in Hive
          CREATE TABLE input (line STRING);
          LOAD DATA LOCAL INPATH '/input' OVERWRITE INTO TABLE input;
          SELECT word, COUNT(*) FROM input
          LATERAL VIEW explode(split(line, ' ')) wTable AS word
          GROUP BY word;
      [but would you use SQL to count words?]
  15. Apache Pig
      High level scripting language for Hadoop
      pros:
       • more primitive operations than Hive (and UDFs)
       • more flexible than Hive
       • interactive shell
      cons:
       • steeper learning curve than Hive
       • tempting to write longer programs, but there is no code modularity beyond functions
  16. Word count in Pig
          A = load '/input';
          B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
          C = filter B by word matches '\\w+';
          D = group C by word;
          E = foreach D generate COUNT(C), group;
          store E into '/output/wordcount';
      [operations are applied to "relations" (bags of tuples)]
  17. Cascading
      Java data pipelining for Hadoop
      pros:
       • as flexible as Pig
       • uses a real programming language
       • ideal for longer workflows
      cons:
       • new concepts to learn ("spout", "sink", "tap", ...)
       • still verbose (full wordcount example code > 150 lines)
  18. Word count in Cascading
          Scheme sourceScheme = new TextLine(new Fields("line"));
          Tap source = new Hfs(sourceScheme, "/input");
          Scheme sinkScheme = new TextLine(new Fields("word", "count"));
          Tap sink = new Hfs(sinkScheme, "/output/wordcount", SinkMode.REPLACE);
          Pipe assembly = new Pipe("wordcount");
          String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
          Function function = new RegexGenerator(new Fields("word"), regex);
          assembly = new Each(assembly, new Fields("line"), function);
          assembly = new GroupBy(assembly, new Fields("word"));
          Aggregator count = new Count(new Fields("count"));
          assembly = new Every(assembly, count);
          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass(properties, Main.class);
          FlowConnector flowConnector = new FlowConnector(properties);
          Flow flow = flowConnector.connect("word-count", source, sink, assembly);
          flow.complete();
  19. Scalding
      Scala data pipelining for Hadoop
      pros:
       • as flexible as Pig
       • uses a real programming language
       • much terser than Java
      cons:
       • community still small (but in use at Twitter)
       • ???
  20. Word count in Scalding
          import com.twitter.scalding._

          class WordCountJob(args : Args) extends Job(args) {
            TextLine(args("input"))
              .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
              .groupBy('word) { _.size }
              .write(Tsv(args("output")))
          }
      [and a one-liner to run it]
  21. Hadoop streaming
      Map-reduce in any language, e.g. the Dumbo wrapper for Python
      pros:
       • use your favourite language for map-reduce
       • easy to mix local and cloud processing
      cons:
       • limited community
       • limited functionality beyond map-reduce
  22. Word count in Dumbo
          def map(key, text):
              # the key is ignored
              for word in text.split():
                  yield word, 1

          def reduce(word, counts):
              yield word, sum(counts)

          import dumbo
          dumbo.run(map, reduce, combiner=reduce)
      [and a one-liner to run it]
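To see what the framework does around those two functions, the map, shuffle and reduce phases can be imitated in memory. This is a minimal sketch rather than Dumbo itself: `run_local` and the sample lines are hypothetical, and the functions are renamed `mapper` and `reducer` to avoid shadowing Python built-ins.

```python
from collections import defaultdict

def mapper(key, text):
    # Emit (word, 1) for every whitespace-separated token; the key is ignored.
    for word in text.split():
        yield word, 1

def reducer(word, counts):
    # Sum the partial counts collected for one word.
    yield word, sum(counts)

def run_local(records, mapper, reducer):
    """Hypothetical stand-in for a streaming runner: map, shuffle, reduce in memory."""
    shuffled = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            shuffled[out_key].append(out_value)  # the "shuffle": group values by key
    results = {}
    for out_key in sorted(shuffled):
        for k, v in reducer(out_key, shuffled[out_key]):
            results[k] = v
    return results

lines = [(0, "to be or not to be"), (1, "to do is to be")]
word_counts = run_local(lines, mapper, reducer)
print(word_counts)
```

Because the reducer only sums, it can also serve as the combiner, which is why the slide passes `combiner=reduce`.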
  23. Problem Scenario 1b
      Need Hadoop, don't want Java:
       • drive native code in parallel
      E.g. audio analysis for:
       • beat locations, bpm
       • key estimation
       • chord sequence estimation
       • energy
       • music/speech?
       • ...
  24. Audio Analysis
      Problem:
       • millions of audio tracks on own dfs
       • long-running C++ analysis code
       • depends on numerous libraries
       • verbose output
  25. Audio Analysis
      Solution:
       • bash + Dumbo Hadoop streaming
      Outline:
       • build C++ code
       • zip up binary and libs
       • send zipfile and some track IDs to each machine
       • extract and run binary in map task with subprocess.Popen()
  26. Audio Analysis
          class AnalysisMapper:
              init():
                  extract("analyzer.tar.bz2", "bin")
              map(key, trackID):
                  file = fetch_audio_file(trackID)
                  proc = subprocess.Popen(["bin/analyzer", file],
                                          stdout=subprocess.PIPE)
                  (out, err) = proc.communicate()
                  yield trackID, out
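The core of that mapper, running a native program once per track and capturing its output, can be tried without Hadoop. The sketch below makes several assumptions: `run_analyzer`, `analysis_map` and the `fetch` callback are hypothetical names, and a Python one-liner stands in for the real C++ analyzer binary so the example runs anywhere.

```python
import subprocess
import sys

def run_analyzer(command, track_path):
    """Run an external analyzer on one file and return its stdout as text."""
    proc = subprocess.Popen(command + [track_path],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("analyzer failed: " + err.decode(errors="replace"))
    return out.decode()

def analysis_map(track_ids, command, fetch):
    # One (trackID, output) pair per track, as a streaming map task would emit.
    for track_id in track_ids:
        yield track_id, run_analyzer(command, fetch(track_id))

# Stand-in for the real binary: a Python one-liner that echoes its argument.
fake_analyzer = [sys.executable, "-c",
                 "import sys; print('bpm=120 file=' + sys.argv[1])"]
results = dict(analysis_map(["t1", "t2"], fake_analyzer, lambda tid: tid + ".mp3"))
print(results)
```

Checking the return code is an addition to the slide's outline; with long-running native code it is usually worth failing the task loudly rather than emitting empty output.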
  27. Problem Scenario 2
      Map-reduce unfriendly computation:
       • iterative algorithms on same data
       • huge mapper output ("map-increase")
       • curse of slowest reducer
  28. Graph Recommendations
      Random walk on user-item graph
      [figure: user-item graph containing user U and item t]
  29. Graph Recommendations
      Many short routes from U to t ⇒ recommend!
      [figure: user-item graph containing user U and item t]
  30. Graph Recommendations
      The random walk is equivalent to Label Propagation (Baluja et al., 2008):
       • belongs to a family of algorithms that are easy to code in map-reduce
  31. Label Propagation
      User-track graph, edge weights = scrobbles
      [figure: graph linking users U, V, W, X to tracks a to f, with a scrobble count on each edge]
  32. Label Propagation
      User nodes are labelled with scrobbled tracks:
       • U: (a,0.2), (b,0.4), (c,0.4)
       • V: (b,0.5), (d,0.5)
       • W: (b,0.2), (d,0.3), (e,0.5)
       • X: (a,0.3), (d,0.3), (e,0.4)
      [figure: the same user-track graph with these labels attached to the user nodes]
  33. Label Propagation
      Propagate, accumulate, normalise:
      [figure: the same graph; at one track node the incoming label sets 1 × (b,0.5),(d,0.5) and 3 × (b,0.2),(d,0.3),(e,0.5) are accumulated and normalised, shown as ⇒ (b,0.37),(d,0.47),(e,0.17)]
      On the next iteration e will propagate to user V.
  34. Label Propagation
      After some iterations:
       • labels at item nodes = similar items
       • new labels at user nodes = recommendations
  35. Map-Reduce Graph Algorithms
      General approach, assuming:
       • no global state
       • state at node recomputed from scratch from incoming messages on each iteration
      Other examples:
       • breadth-first search
       • page rank
  36. Map-Reduce Graph Algorithms
      inputs:
       • adjacency lists, state at each node
      output:
       • updated state at each node
      Adjacency list for node U: U,[(a,2),(b,4),(c,4)]
      [figure: node U with edges of weight 2, 4 and 4 to nodes a, b and c]
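Breadth-first search, one of the other examples named above, shows the pattern in a few lines: every node's state (here its distance from a start node) is rebuilt on each iteration purely from incoming messages. This is a sketch under assumptions: the toy graph, the in-memory `bfs_iteration` driver and the function names are all hypothetical, and edge weights are left out for brevity.

```python
from collections import defaultdict

INF = float("inf")

def bfs_map(node, value):
    # value = (distance, adjacency list); a node messages itself and its neighbours.
    dist, adj = value
    yield node, dist
    if dist != INF:
        for neighbour in adj:
            yield neighbour, dist + 1

def bfs_reduce(node, msgs):
    # The new state is recomputed from the incoming messages alone.
    yield node, min(msgs)

def bfs_iteration(graph, dists):
    # In-memory stand-in for one map-reduce job over the whole graph.
    inbox = defaultdict(list)
    for node, adj in graph.items():
        for target, msg in bfs_map(node, (dists[node], adj)):
            inbox[target].append(msg)
    return {node: next(bfs_reduce(node, inbox[node]))[1] for node in graph}

graph = {"U": ["a", "b"], "a": ["U", "V"], "b": ["U"], "V": ["a", "W"], "W": ["V"]}
dists = {node: INF for node in graph}
dists["U"] = 0
for _ in range(len(graph)):  # one full pass over the data per iteration
    dists = bfs_iteration(graph, dists)
print(dists)
```

Each pass of the loop corresponds to a complete job over the graph, which is the cost the later slides describe as map-reduce unfriendly.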
  37. Label Propagation
          class PropagatingMapper:
              map(nodeID, value):
                  # value holds label-weight pairs
                  # and adjacency list for node
                  labels, adj_list = value
                  for node, weight in adj_list:
                      # send a "stripe" of label-weight
                      # pairs to each neighbouring node
                      msg = [(label, prob*weight) for label, prob in labels]
                      yield node, msg
  38. Label Propagation
          class Reducer:
              reduce(nodeID, msgs):
                  # accumulate
                  labels = defaultdict(lambda: 0)
                  for msg in msgs:
                      for label, w in msg:
                          labels[label] += w
                  # normalise, prune
                  normalise(labels, MAX_LABELS_PER_NODE)
                  yield nodeID, labels
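The mapper and reducer above can be combined into one runnable propagation step. The sketch below is illustrative only: the `normalise` implementation (keep the heaviest labels, then rescale to sum to 1) is an assumption, since the slides do not show it, and the two-user graph with edge weights 2 and 1 is made up rather than taken from the figures.

```python
from collections import defaultdict

MAX_LABELS_PER_NODE = 2  # assumed pruning limit for this toy example

def propagate_map(node_id, value):
    # value holds (label, prob) pairs and the adjacency list for the node.
    labels, adj_list = value
    for neighbour, weight in adj_list:
        yield neighbour, [(label, prob * weight) for label, prob in labels]

def normalise(labels, max_labels):
    # Assumed behaviour: keep the heaviest labels, then rescale to sum to 1.
    top = sorted(labels.items(), key=lambda kv: (-kv[1], kv[0]))[:max_labels]
    total = sum(w for _, w in top)
    return [(label, w / total) for label, w in top]

def accumulate_reduce(node_id, msgs):
    labels = defaultdict(float)
    for msg in msgs:
        for label, w in msg:
            labels[label] += w
    yield node_id, normalise(labels, MAX_LABELS_PER_NODE)

# Hypothetical data: users V and W both scrobbled track d (weights 2 and 1).
nodes = {
    "V": ([("b", 0.5), ("d", 0.5)], [("d", 2)]),
    "W": ([("b", 0.2), ("d", 0.3), ("e", 0.5)], [("d", 1)]),
}
inbox = defaultdict(list)
for node_id, value in nodes.items():
    for target, msg in propagate_map(node_id, value):
        inbox[target].append(msg)
new_labels = {n: next(accumulate_reduce(n, msgs))[1] for n, msgs in inbox.items()}
print(new_labels)
```

Here track d accumulates b = 1.2, d = 1.3 and e = 0.5; pruning to two labels drops e, and the survivors are rescaled to roughly d = 0.52 and b = 0.48.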
  39. Label Propagation
      Not map-reduce friendly:
       • send graph over network on every iteration
       • huge mapper output:
          • mappers soon send MAX_LABELS_PER_NODE updates along every edge
       • some reducers receive huge input:
          • too slow if reducer streams the data, OOM otherwise
       • NB can't partition real graphs to avoid this:
          • many natural graphs are scale-free, e.g. in the AltaVista web graph the top 1% of nodes are adjacent to 53% of edges
  40. Problem Scenario 2b
      Map-reduce unfriendly computation:
       • shared memory
      Examples:
       • almost all machine learning:
          • split training examples between machines
          • all machines need to read/write many shared parameter values
  41. Hadoop without map-reduce
      Graph processing
       • Apache Giraph (Facebook)
      Hadoop YARN
       • Knitting Boar, Iterative Reduce
         http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
       • ???
  42. Alternatives in the cloud
      Graph Processing:
       • GraphLab (CMU)
      Task-specific:
       • Yahoo! LDA
      General:
       • HPCC
       • Spark (Berkeley)
  43. Spark and Shark
      In-memory cluster computing
      pros:
       • fast!! (Shark is claimed to be up to 100x faster than Hive)
       • code in Scala or Java or Python
       • can run on Hadoop YARN or Apache Mesos
       • ideal for iterative algorithms, nearline analytics
       • includes a Pregel clone & stream processing
      cons:
       • hardware requirements???
  44. GraphLab
      Distributed graph processing
      pros:
       • vertex-centric programming model
       • handles true web-scale graphs
       • many toolkits already:
          • collaborative filtering, topic modelling, graphical models, machine vision, graph analysis
      cons:
       • new applications require non-trivial C++ coding
  45. Word count in Spark
          val file = spark.textFile("hdfs://input")
          val counts = file.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
          counts.saveAsTextFile("hdfs://output/wordcount")
  46. Logistic regression in Spark
          val points = spark.textFile(…).map(parsePoint).cache()
          var w = Vector.random(D) // current separating plane
          for (i <- 1 to ITERATIONS) {
            val gradient = points.map(p =>
              (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
            ).reduce(_ + _)
            w -= gradient
          }
          println("Final separating plane: " + w)
      [points remain in memory for all iterations]
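The same update rule can be followed on a single machine in plain Python, which makes it clear why the loop rereads every point on every iteration. This is a sketch under stated assumptions: labels y are +1 or -1, the four training points are invented, and w starts at zero rather than at a random vector so the run is repeatable.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gradient(points, w):
    # Sum over all points of (1 / (1 + exp(-y * w.x)) - 1) * y * x.
    grad = [0.0] * len(w)
    for x, y in points:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    return grad

def train(points, dims, iterations):
    w = [0.0] * dims  # assumption: zero start instead of Vector.random(D)
    for _ in range(iterations):
        g = gradient(points, w)  # full pass over the data, every iteration
        w = [wj - gj for wj, gj in zip(w, g)]
    return w

# Invented, linearly separable toy data: (features, label).
points = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.0], -1), ([-2.0, -0.5], -1)]
w = train(points, dims=2, iterations=20)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(w, predictions)
```

In the Spark version the inner loop over points is the distributed `map` and the summation is the `reduce`; caching keeps the points in memory so that each of those passes avoids rereading the input.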
  47. Alternatives on your laptop
      Graph processing
       • GraphChi (CMU)
      Machine learning
       • sofia-ml (Google)
       • vowpal wabbit (Yahoo!, Microsoft)
  48. GraphChi
      Graph processing on your laptop
      pros:
       • still handles graphs with billions of edges
       • graph structure can be modified at runtime
       • Java/Scala ports under active development
       • some toolkits available:
          • collaborative filtering, graph analysis
      cons:
       • existing C++ toolkit code is hard to extend
  49. vowpal wabbit
      classification, regression, LDA, bandits, ...
      pros:
       • handles huge ("terafeature") training datasets
       • very fast
       • state of the art algorithms
       • can run in distributed mode on Hadoop streaming
      cons:
       • hard-core documentation
  50. Take homes
      Think before you use Hadoop
       • use your laptop for most problems
       • use a graph framework for graph data
      Keep your Hadoop code simple
       • if you're just querying data use Hive
       • if not use a workflow framework
      Check out the competition
       • Spark and HPCC look impressive
  51. Thanks for listening!
      Goodbye Hello
      gamboviol@gmail.com
      @gamboviol