
Harnessing the power of Nutch with Scala


  1. Crawling the web, Nutch with Scala. Vikas Hazrati @
  2. about: CTO at Knoldus Software; Co-Founder at MyCellWasStolen.com; Community Editor at InfoQ.com; dabbling with Scala – last 40 months; enterprise grade implementations on Scala – 18 months
  3. nutch: web search software; crawler, link-graph, parsing, solr, lucene
  4. nutch – but we have google! transparent, understanding, extensible
  5. nutch – basic architecture: crawler, searcher
  6. nutch - architecture: recursive, segments, crawler, links, web database, pages, fetchlists, crawl db
  7. nutch – crawl cycle (generate – fetch – update cycle)
     Create crawldb
     Inject root URLs in crawldb
     Generate fetchlist
     Fetch content
     Update crawldb
     (repeat until depth reached)
     Update segments
     Index fetched pages
     Deduplication
     Merge indexes for searching
     bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  8. nutch - plugins (generate – fetch – update cycle)
     Cycle: Create crawldb; Inject root URLs in crawldb; Generate fetchlist; Fetch content; Update crawldb
     Plugins: parser, HTMLParserFilter, URL Filter, scoring filter
  9. nutch – extension points
     plugin.xml // tells Nutch about the plugin
     build.xml // build the plugin
     ivy.xml // plugin dependencies
     src // plugin source
  10. nutch - example
          <plugin id="KnoldusAggregator" name="Knoldus Parse Filter" version="1.0.0" provider-name="nutch.org">
            <runtime>
              <library name="kdaggregator.jar">
                <export name="*" />
              </library>
            </runtime>
            <requires>
              <import plugin="nutch-extensionpoints" />
            </requires>
            <extension id="org.apache.nutch.parse.headings" name="Nutch Headings Parse Filter" point="org.apache.nutch.parse.HtmlParseFilter">
              <implementation id="KDParseFilter" class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
            </extension>
          </plugin>
  11. // tags and TAG_KEY are defined elsewhere in the plugin class (not shown on the slide)
          public ParseResult filter(Content content, ParseResult parseResult,
                                    HTMLMetaTags metaTags, DocumentFragment doc) {
              LOG.debug("Parsing URL: " + content.getUrl());
              Parse parse = parseResult.get(content.getUrl());
              Metadata metadata = parse.getData().getParseMeta();
              // copy every tag into the parse metadata
              for (String tag : tags) {
                  metadata.add(TAG_KEY, tag);
              }
              return parseResult;
          }
  12. scala – I have Java! concurrency, verbose, popular, strongly typed, jvm, OO, library
  13. scala
      Java:
          class Person {
              private String firstName;
              private String lastName;
              private int age;

              public Person(String firstName, String lastName, int age) {
                  this.firstName = firstName;
                  this.lastName = lastName;
                  this.age = age;
              }

              public void setFirstName(String firstName) { this.firstName = firstName; }
              public String getFirstName() { return this.firstName; }
              public void setLastName(String lastName) { this.lastName = lastName; }
              public String getLastName() { return this.lastName; }
              public void setAge(int age) { this.age = age; }
              public int getAge() { return this.age; }
          }
      Scala:
          class Person(var firstName: String, var lastName: String, var age: Int)
      Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
  14. scala
      Java – everything is an object unless it is primitive
      Scala – everything is an object. period.
      Java – has operators (+, -, < ..) and methods
      Scala – operators are methods
      Java – statically typed: Thing thing = new Thing()
      Scala – statically typed but uses type inference: val thing = new Thing
  15. evolution
  16. scala and concurrency: fine grained, coarse grained, actors
  17. actors
  18.
  19. problem context: Aggregator, UGC
  20. solution: Aggregator; Supplier 1, Supplier 2, Supplier 3
  21. Create crawldb; Inject root URLs in crawldb (Supplier URLs); Generate fetchlist; Fetch content; Update crawldb; plugins written in Scala
  22. logic
      Crawl the supplier
      Parse
      Is URL interesting
      Pass extraction to actor
      Seed database
  23. plugin - scala
          class DetailParserFilter extends HtmlParseFilter {

            def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
              if (isDetailURL(content.getUrl)) {
                val rawHtml = content.getContent
                if (rawHtml.length > 0) processContent(rawHtml)
              }
              parseResult
            }

            // true when the URL matches the configured detail-page regex
            private def isDetailURL(url: String): Boolean =
              url.matches(AggregatorConfiguration.regexEventDetailPages)

            // start a DetailProcessor actor and send it the raw HTML
            private def processContent(rawHtml: Array[Byte]) = {
              (new DetailProcessor).start ! rawHtml
            }
          }
  24. result
      5 suppliers crawled
      Crawl cycles run continuously for a few days
      > 500K seed data collected
      All with Nutch and 823 lines of Scala code
  25. demo in action ….
  26. resources
      http://blog.knoldus.com
      http://wiki.apache.org/nutch/NutchTutorial
      http://www.scala-lang.org/
      vikas@knoldus.com
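
The generate, fetch, update cycle from slide 7 can be sketched as a toy in-memory model. This is not Nutch code: the link graph, the object name CrawlCycleSketch and the crawl function are made up for illustration, and only the depth and topN parameters mirror the -depth and -topN flags shown on the slide.

```scala
// Toy model of the crawl cycle. Nothing here is Nutch API;
// all names and the tiny link graph are hypothetical.
object CrawlCycleSketch {

  // Stand-in for the web: each page name maps to the pages it links to.
  val web: Map[String, List[String]] = Map(
    "a" -> List("b", "c"),
    "b" -> List("d"),
    "c" -> List("d", "e"),
    "d" -> Nil,
    "e" -> List("a")
  )

  // Runs `depth` rounds and returns the set of pages that were fetched.
  def crawl(seeds: List[String], depth: Int, topN: Int): Set[String] = {
    var crawlDb: Set[String] = seeds.toSet // inject root URLs into the crawl db
    var fetched: Set[String] = Set.empty

    for (_ <- 1 to depth) {
      // generate: up to topN known but not yet fetched URLs
      val fetchList = (crawlDb -- fetched).toList.sorted.take(topN)
      // fetch: "download" each page and collect its outlinks
      val outlinks = fetchList.flatMap(url => web.getOrElse(url, Nil))
      fetched ++= fetchList
      // update: add newly discovered links to the crawl db
      crawlDb ++= outlinks
    }
    fetched
  }
}
```

With seed "a", depth 3 and topN 5 the sketch reaches all five toy pages; lowering depth or topN cuts the crawl short, which is the role those two flags play in the command on slide 7.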

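The logic on slides 22 and 23 (check whether the URL is interesting, then pass extraction on) might look roughly like this outside Nutch. The talk hands the raw HTML to an actor; this sketch swaps in scala.concurrent.Future instead, because the old scala.actors library no longer ships with current Scala releases. The URL pattern and the names DetailFilterSketch, extract and process are assumptions for illustration; the talk reads its pattern from AggregatorConfiguration.regexEventDetailPages, whose value is not shown.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-in for the plugin logic; not the talk's actual code.
object DetailFilterSketch {

  // Assumed pattern for "interesting" detail pages.
  val detailPagePattern: String = """https?://[^/]+/events/\d+"""

  // String.matches requires the whole URL to match the pattern.
  def isDetailURL(url: String): Boolean = url.matches(detailPagePattern)

  // Placeholder extraction step: just reports how many bytes it received.
  def extract(rawHtml: Array[Byte]): Int = rawHtml.length

  // Returns Some(future) when extraction was handed off, None when the page was skipped.
  def process(url: String, rawHtml: Array[Byte])(implicit ec: ExecutionContext): Option[Future[Int]] =
    if (isDetailURL(url) && rawHtml.nonEmpty) Some(Future(extract(rawHtml)))
    else None
}
```

As in the slide code, the filter itself stays cheap: it only decides whether to hand the page off, and the extraction work runs asynchronously.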