Elastic Web Mining   01 November 2009








                     Web Mining in the Cloud
                     Hadoop/Cascading/Bixo in EC2



                     Ken Krugler, Bixo Labs, Inc.
                     ACM Silicon Valley Data Mining Camp
                     01 November 2009








                     About me

                          Background in vertical web crawl
                           – Krugle search engine for open source code
                           – Bixo open source web mining toolkit
                          Consultant for companies using EC2
                           – Web mining
                           – Data processing
                          Founder of Bixo Labs
                           – Elastic web mining platform
                           – http://bixolabs.com




             For the previous four years I ran a startup called Krugle, which provided code
             search for open source projects and inside large companies.


             We did a large, 100M page crawl of the “programmer’s web” to find out
             information about open source projects.


             Based on what I learned from that experience, I started the Bixo open source
             project.
             It’s a toolkit for building web mining workflows, and I’ll be talking more
             about that later.


             Several companies paid me to integrate Bixo into an existing data processing
             environment.


             And that in turn led to Bixo Labs, which is a platform for quickly creating
             web mining apps.
             Elastic means the size of the system can easily be changed to match the web
             mining task.








                     Typical Data Mining




             This is the world that many of you live in.
             Analyzing data to find important patterns.


             Here’s an example of output from the QlikView business intelligence tool.
             It was used to help analyze the relative prevalence of keywords on two
             competing web sites.
             Here you see two-word terms that often occur on McAfee’s site, but not on
             Symantec’s.
             That is very useful data for anybody who worries about search engine
             optimization.








                     Data Mining Victory!




             You all know about analyzing data to find important patterns that get
             managers all worked up…








                     Meanwhile, Over at McAfee…




             But how do you get to this point?
             How do you use the web as the source for data that you’re analyzing?
             That’s what I’m going to be talking about here.








                     Web Mining 101

                       Extracting & Analyzing Web Data
                       More Than Just Search
                       Business intelligence, competitive
                         intelligence, events, people, companies,
                         popularity, pricing, social graphs, Twitter
                         feeds, Facebook friends, support forums,
                         shopping carts…



             A quick intro to web mining, so we’re on the same page.


             Most people think about the big search companies when they think about web
             mining.
             Search is clearly the biggest web mining category, and generates the most
             revenue.
             But other types of web mining have value that is high and growing.








                     4 Steps in Web Mining

                        Collect - fetch content from web
                        Parse - extract data from formats
                        Analyze - tokenize, rate, classify, cluster
                        Produce - “useful data”




             It’s common to confuse web crawling with fetching.
             Crawling is the process of automatically finding new pages by extracting links
             from fetched pages.
             But for many web mining applications, you have a “white list” of pre-defined
             URLs.
             In either case, though, you need to reliably, efficiently and politely fetch
             pages.


             Content comes in a variety of formats - typically HTML, but also PDF, Word,
             zip archives, etc.
             You need to parse these formats to extract key data - typically text, but it
             could be image data.


             Often the analyze step will include aspects of machine learning - classification,
             clustering.


             “useful data” covers a lot of ground, because there are a lot of ways to use the
             output of web mining.
             Generating an index is one of the most common, because people think about
             search as the goal.
             But for data mining, the end result at this point is often highly reduced data
             that is input to traditional data mining tools.




                     Web Mining versus Data Mining

                        Scale - 10 million isn’t a big number
                        Access - public but restricted
                           – Special implicit rules apply

                        Structure - not much




             What are the key differences between web mining and traditional data mining?
             I’m saying “traditional” because the face of data mining is clearly changing.
             But if you look at most vendor tools, the focus is on what I’d call “traditional
             data mining”.


             Scale - 10M is big for data mining, but not for web mining.


             Access - with DM, once you defeated Mongor, keeper of database access keys,
             you were golden.
             Web pages are typically public, but the web is a shared resource, so implicit
             rules apply.
             Like “don’t bring my web site to its knees”.
             Data mining breaks the traditional implicit contract, so extra caution applies.
             The implicit contract is that I let you crawl me, and you drive traffic to me when
             your search index goes live.
             But with DM, there often isn’t an index as the end result.


             When mining databases, there’s explicit structure, which is mostly lacking from
             web pages.






                     How to Mine Large Scale Web Data?

                        Start with scalable map-reduce platform
                        Add a workflow API layer
                        Mix in a web crawling toolkit
                        Write your custom data processing code
                        Run in an elastic cloud environment




             If it doesn’t scale, then it won’t handle the quantity of data you’ll ultimately
             want to process from the web.


             If you can’t create real workflows, it will never be reliable or efficient.


             If you don’t use specialized web crawling code, you’ll get blacklisted.


             Because you’re trying to distill down large data, there’s often some custom
             processing.


             If you don’t run it in a cloud environment, you’ll be wasting money - and I’ll
             explain why in a few slides.








                     One Solution - the HECB Stack

                       Bixo

                       Cascading

                       Hadoop

                       EC2




             I’m focusing on one particular solution to the challenges of web mining that I
             just described.


             It’s the “HECB” stack.


             I’m going to talk about these from the bottom up, which is EC2 first, then
             Hadoop…but the acronym didn’t work as well.








                     EC2 - Amazon Elastic Compute Cloud

                        True cost of non-cloud environment
                           – Cost of servers & networking (2 year life)
                           – Cost of colo (6 servers/rack)
                           – Cost of OPS salary (15% of FTE/cluster)
                           – Managing servers is no fun

                        Web mining is perfect for the cloud
                           – “bursty” => savings are even greater
                           – Data is distilled, so no transfer $$$ pain



             At Krugle we ran two clusters, one of 11 servers and a smaller 4-server cluster.
             In the end, our actual utilization ratio was probably < 20%.
             Even with close to 100% utilization, the break-even point for EC2 vs. colo is
             somewhere between 50 and 200 servers, depending on who you talk to.
             If utilization were 20%, then break-even would be 250 to 1000 servers.
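
             The arithmetic behind those numbers can be sketched as follows. This is a
             minimal illustration that assumes break-even cluster size scales inversely
             with utilization (owned servers cost the same whether busy or idle), which is
             how 50 to 200 becomes 250 to 1000. The function name is made up for this example.

```python
def break_even_servers(break_even_at_full_utilization, utilization):
    """Scale a break-even server count by the fraction of time servers are busy.

    Assumption: owned servers cost the same busy or idle, so the break-even
    point grows as 1 / utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return break_even_at_full_utilization / utilization


low = break_even_servers(50, 0.20)
high = break_even_servers(200, 0.20)
print(f"break-even at 20% utilization: {low:.0f} to {high:.0f} servers")
```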


             Mining for search doesn’t work so well in this model - the cluster should be
             always crawling (ABC), so it’s not as bursty.
             And transferring raw content, parsed data, and indexes will generate lots of
             transfer charges.
             But for web mining that’s focused on data mining, data is distilled so this isn’t
             an issue.








                     Why Hadoop?

                        Perfect for processing lots of data
                           – Map-reduce
                           – Distributed file system
                        Open source, large community, etc.
                        Runs well in EC2 clusters
                        Elastic Map Reduce as option




             Map-reduce - how do you parallelize the processing of lots of data so that you
             can do the work on many servers? The answer is map-reduce.
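
             To make the idea concrete, here is a single-process sketch of the map, shuffle
             and reduce phases using word counting. It only illustrates the model; it is not
             the Hadoop API, and all function names are invented for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different server; here it is one loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["web mining in the cloud", "mining the web"])
```

             Because each map call sees only one document and each reduce call sees only one
             key, both phases can be spread across many servers.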


             HDFS - how do you store lots of data in a fault-tolerant, cost-effective manner?
             How do you make sure the data (the big stuff) moves as little as possible
             during processing?
             The answer is the Hadoop distributed file system.


             It’s open source, so lots of support, consultants, rapid bug fixes, etc.


             Large companies are using it, especially Yahoo.


             Elastic Map Reduce is a special service built on top of EC2 that makes it easier
             to run Hadoop jobs, because you have access to pre-configured Hadoop clusters,
             special tools, etc.








                     Why Cascading?

                        API on top of Hadoop
                        Supports efficient, reliable workflows
                        Reduces painful low-level MR details
                        Build workflow using “pipe” model




             If you ever had to write a complex workflow using Hadoop, you know the
             answer.
             It frees you from the lower-level details of thinking in map-reduce.
             You can think about the workflow as operations on records with fields.
             And in data mining, the workflow often winds up being very complex.


             Because you can build workflows out of a mix of pre-defined & custom pipes,
             it’s a real toolkit.


             Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels
             more like C++ :)


             A key aspect of reliable workflows is Cascading’s ability to check your
             workflow (the DAG it builds).
             It finds cases where fields aren’t available for operations.
             This solves a key problem we ran into when customizing Nutch at Krugle.
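
             The kind of up-front field check described above can be sketched like this.
             It is a toy model of the idea only, not Cascading’s API: the Operation class,
             check_plan and run are hypothetical names, and a real planner works on a DAG
             rather than a straight list of steps.

```python
class Operation:
    """One workflow step that reads some record fields and adds others."""

    def __init__(self, name, needs, adds, func):
        self.name = name
        self.needs = set(needs)
        self.adds = set(adds)
        self.func = func  # takes a record dict, returns a dict of new fields


def check_plan(source_fields, operations):
    """Raise before any data is processed if a step needs a missing field."""
    available = set(source_fields)
    for op in operations:
        missing = op.needs - available
        if missing:
            raise ValueError(f"{op.name} needs missing fields: {sorted(missing)}")
        available |= op.adds
    return available


def run(records, source_fields, operations):
    check_plan(source_fields, operations)
    results = []
    for record in records:
        row = dict(record)
        for op in operations:
            row.update(op.func(row))
        results.append(row)
    return results


parse = Operation("parse", ["content"], ["text"],
                  lambda r: {"text": r["content"].strip()})
count = Operation("count", ["text"], ["words"],
                  lambda r: {"words": len(r["text"].split())})

rows = run([{"url": "http://example.com/", "content": " hello web "}],
           ["url", "content"], [parse, count])
```

             Swapping the order of the two steps makes check_plan fail immediately, instead
             of failing partway through a long-running job.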








                     Why Bixo?

                        Plugs into Cascading-based workflow
                           – Scales with Hadoop cluster
                           – Runs well in EC2
                        Handles grungy web crawling details
                           – Polite yet efficient fetching
                           – Errors, web servers that lie
                           – Parsing lots of formats, broken HTML
                        Open source toolkit for web mining apps



             Does the world really need yet another web crawler?
             No, but it does need a web mining toolkit


             Two companies agreed to sponsor work on Bixo as an open source project.


             Polite yet efficient - tension between those two goals that’s hard to resolve.
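
             One way to picture that tension is a scheduler that spaces out requests to the
             same host while letting different hosts proceed in parallel. The sketch below
             is an illustration only, not how Bixo does it; the function name and the
             30-second default delay are assumptions.

```python
from urllib.parse import urlsplit


def schedule_fetches(urls, delay_seconds=30.0):
    """Assign each URL the earliest start time that keeps a per-host delay.

    Returns (start_time_in_seconds, url) pairs sorted by start time.
    """
    next_free = {}  # host -> earliest time the next request to it may start
    schedule = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_free.get(host, 0.0)
        schedule.append((start, url))
        next_free[host] = start + delay_seconds
    return sorted(schedule)


plan = schedule_fetches([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
```

             Here the two hosts are fetched at the same time (efficient), while the second
             page from a.example waits for the delay to pass (polite).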


             If you do a crawl of any reasonable size, you’ll run into lots of errors.


             Even if a web server says “I swear to you, I’m sending you a 20K HTML file
             in English”,
             it’s a 50K text file in Russian using the Cyrillic character set.
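
             A defensive decoder treats the declared character set as a hint rather than a
             fact. The sketch below assumes a simple fallback list (UTF-8, then
             Windows-1251); that list and the function name are illustrative choices, and a
             production system would more likely use a proper charset detector.

```python
def decode_body(raw, declared_charset, fallbacks=("utf-8", "cp1251")):
    """Decode response bytes, trusting the declared charset only if it works.

    Returns (text, charset_actually_used).
    """
    for charset in (declared_charset, *fallbacks):
        if not charset:
            continue
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Last resort: keep going with replacement characters rather than failing.
    return raw.decode("utf-8", errors="replace"), "utf-8"


# The server claims UTF-8, but the bytes are really Windows-1251 Cyrillic.
body = "Привет".encode("cp1251")
text, used = decode_body(body, "utf-8")
```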


             And because it’s open source, you get the benefit of a community of users.
             They contribute re-usable toolkit components.








                     SEO Keyword Data Mining

                       Example of typical web mining task
                       Find common keywords (1,2,3 word terms)
                        – Do domain-centric web crawl
                        – Parse pages to extract title, meta, h1, links
                        – Output keywords sorted by frequency
                       Compare to competitor site(s)
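
             The keyword steps above can be sketched as follows, assuming the title, meta
             and h1 text has already been extracted from each page. The tokenizer and the
             function names are simplifications invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_counts(texts, max_words=3):
    """Count every 1-, 2- and 3-word term across the given text snippets."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for n in range(1, max_words + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def terms_only_on(site, other):
    """Terms found on `site` but never on `other`, most frequent first."""
    return [(term, count) for term, count in site.most_common() if term not in other]


site_a = keyword_counts(["Virus Scan", "Free virus scan download"])
site_b = keyword_counts(["Virus removal tool"])
gap = terms_only_on(site_a, site_b)
```

             The gap list is the kind of highly reduced output that can then be loaded into
             a traditional data mining tool.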








                     Workflow




             Whenever I show a workflow diagram like this, I make a joke about it being
             intuitively obvious.


             Which, obviously, it’s not.


             And in fact the full workflow is a bit bigger, as I left out the second stage that
             describes more of the keyword analysis.


             But the key point is that the blue color items are provided by Cascading.
             And the green color items are provided by Bixo.
             So what’s left are two yellow items, which represent the two points of
             customization.








                     Custom Code for Example

                       Filtering URLs inside domain
                          – Non-English content
                          – User-generated content (forums, etc)
                       Generating keywords from text
                          – Special tokenization
                          – One, two, three word phrases

                       But 95% of code was generic



             There were two main pieces of custom code that needed to be written.


             One was some URL filtering to focus on the right content inside the web sites.
             Avoiding non-English pages by specific URL patterns.
             Same kind of thing for forums and such, since these pages weren’t part of what
             could easily be optimized.
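
             A URL filter of this kind might look like the sketch below. The patterns are
             hypothetical; real ones depend on how each site lays out its URLs, and the
             function name is invented for this example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns, for illustration only.
EXCLUDE_PATTERNS = [
    re.compile(r"^/(de|fr|ja|es)(/|$)"),       # non-English sections by path prefix
    re.compile(r"/(forums?|community)(/|$)"),  # user-generated content
]


def keep_url(url, domain):
    """Keep URLs inside the domain that match none of the exclusion patterns."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host != domain and not host.endswith("." + domain):
        return False
    return not any(pattern.search(parts.path) for pattern in EXCLUDE_PATTERNS)


urls = [
    "http://www.example.com/products/antivirus",
    "http://www.example.com/fr/products",
    "http://www.example.com/forums/thread/1",
    "http://other.example.org/products",
]
kept = [url for url in urls if keep_url(url, "example.com")]
```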


             And if enough people need this type of support, since Bixo is open source it
             will likely become part of the toolkit.








                     End Result in Data Mining Tool




             Finally we can actually use a traditional data mining tool to help make sense of
             the digested data.


             There are many more things we could do:
             Clustering of results, to improve keyword analysis.
             Larger sites have “areas of interest”.


             Identifying broken links, typos.
             Identifying personal data - email addresses, phone numbers.








                     What Next?

                        Another example - mining mailing lists
                        Go straight to Summary/Q&A
                        Talk about Public Terabyte Dataset
                        Write tweets, posts & emails
                        Find people to meet in the lobby




             I try to limit presentations to 20 slides - so I’ve hit that limit


             In the spirit of the unconference - let me know what you’d like to do next.








                     Another Example - HUGMEE

                       Hadoop
                       Users who
                       Generate the
                       Most
                       Effective
                       Emails




             Let’s use a real example now of using Bixo to do web mining.

             Imagine that the Apache Foundation decided to honor people who make
             significant contributions to the Hadoop community.


             In a typical company, determining the winner would depend on political
             maneuvering, bribes, and sucking up.


             But the Apache Foundation could decide to go for a quantitative approach for
             the HUGMEE award.








                     Helpful Hadoopers

                       Use mailing list archives for data (collect)
                       Parse mbox files and emails (parse)
                       Score based on key phrases (analyze)
                       End result is score/name pair (produce)




             How do you figure out the most helpful Hadoopers?
             As we discussed previously, it’s a classic web mining problem.


             Luckily the Hadoop mailing lists are all nicely archived as monthly mbox
             files.


             How do we score based on key phrases (next slide)?








                     Scoring Algorithm

                       Very sophisticated point system
                       “thanks” == 5
                       “owe you a beer” == 50
                       “worship the ground you walk on” == 100








                     High Level Steps

                        Collect emails
                          – Fetch mod_mbox generated page
                          – Parse it to extract links to mbox files
                          – Fetch mbox files
                          – Split into separate emails

                        Parse emails
                          – Extract key headers (messageId, email, etc)
                          – Parse body to identify quoted text



             Parsing the mod_mbox page is simple with Tika’s HtmlParser.


             I cheated a bit when parsing emails - some users like Owen have many aliases,
             so I hand-generated an alias resolution table.
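
             The “split into separate emails” and “extract key headers” steps can be
             sketched with Python’s standard email package. This is an illustration, not the
             code used in the talk: it assumes the common mbox convention that each message
             starts with a line beginning “From ”, and the sample addresses are made up.

```python
import email


def split_mbox(text):
    """Split mbox text into messages at envelope lines starting with "From "."""
    chunks, current = [], None
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            if current is not None:
                chunks.append("".join(current))
            current = []  # start a new message; drop the envelope line itself
        elif current is not None:
            current.append(line)
    if current is not None:
        chunks.append("".join(current))
    return [email.message_from_string(chunk) for chunk in chunks]


def key_headers(message):
    """Pull out the headers needed later for scoring and grouping."""
    return {
        "messageId": message["Message-ID"],
        "from": message["From"],
        "inReplyTo": message["In-Reply-To"],  # None for a thread starter
    }


SAMPLE = (
    "From alice@example.org Sun Nov  1 10:00:00 2009\n"
    "Message-ID: <1@example.org>\n"
    "From: Alice <alice@example.org>\n"
    "Subject: Block size question\n"
    "\n"
    "Any pointers?\n"
    "\n"
    "From bob@example.org Sun Nov  1 11:00:00 2009\n"
    "Message-ID: <2@example.org>\n"
    "From: Bob <bob@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "\n"
    "Try raising the block size.\n"
)

headers = [key_headers(message) for message in split_mbox(SAMPLE)]
```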








                     High Level Steps

                       Analyze emails
                          – Find key phrases in replies (ignore signoff)
                          – Score emails by phrases
                          – Group & sum by message ID
                          – Group & sum by email address

                       Produce ranked list
                          – Toss email addresses with no love
                          – Sort by summed score



             Need to ignore “thanks” in “thanks in advance for doing my job for me”
             signoff.
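
             A minimal sketch of the scoring step, using the point values from the scoring
             slide. Skipping quoted lines (those starting with “>”) and the exact signoff
             pattern are assumptions made for this example.

```python
import re

PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}

SIGNOFF = re.compile(r"thanks in advance")


def score_reply(body):
    """Sum points for key phrases in the new (unquoted) text of a reply."""
    # Assumption: quoted text is marked with a leading ">" and is not scored.
    lines = [line for line in body.splitlines() if not line.lstrip().startswith(">")]
    # Lowercase and collapse whitespace so phrases match across line breaks.
    text = " ".join(" ".join(lines).lower().split())
    text = SIGNOFF.sub(" ", text)
    return sum(points * text.count(phrase) for phrase, points in PHRASE_POINTS.items())


helpful = score_reply("Thanks, that fixed it! I owe you a beer.")
signoff_only = score_reply("How do I do X?\nThanks in advance for doing my job for me")
```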


             Generate two tuples for each email:
             - One with messageId/name/address
             - One with reply-to messageId/score


             The group/sum aspect is a classic reduce operation.
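
             In plain Python, the two tuples and the two group-and-sum passes might look
             like the sketch below. It is a single-process illustration of the logic, not
             the Cascading flow from the talk; the field names and sample data are
             assumptions.

```python
from collections import defaultdict


def rank_helpers(emails):
    """Credit each reply's score to the author of the message it replies to."""
    author_of = {}                        # tuple 1: messageId -> address
    score_for_message = defaultdict(int)  # tuple 2, summed by replied-to messageId
    for mail in emails:
        author_of[mail["messageId"]] = mail["address"]
        if mail["inReplyTo"] is not None:
            score_for_message[mail["inReplyTo"]] += mail["score"]

    # Second group & sum: roll message scores up to email addresses.
    score_for_address = defaultdict(int)
    for message_id, score in score_for_message.items():
        address = author_of.get(message_id)
        if address is not None:
            score_for_address[address] += score

    # Toss addresses with no love, then sort by summed score.
    ranked = [(addr, score) for addr, score in score_for_address.items() if score > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


emails = [
    {"messageId": "q1", "address": "asker@example.org", "inReplyTo": None, "score": 0},
    {"messageId": "a1", "address": "helper@example.org", "inReplyTo": "q1", "score": 0},
    {"messageId": "t1", "address": "asker@example.org", "inReplyTo": "a1", "score": 55},
    {"messageId": "t2", "address": "lurker@example.org", "inReplyTo": "a1", "score": 5},
]
ranking = rank_helpers(emails)
```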








                     Workflow




             I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom
             Cascading operations, 6 MR jobs.


             OK, actually not so clear, but…
             The key point is that only the purple stuff is what I actually had to create.
             Some lines are purple as well, since that workflow (DAG) is also something I
             defined - see next page.
             But only two custom operations were actually needed - parsing the mod_mbox page
             and calculating the score.


             Running took about 30 minutes - mostly politely waiting until it was OK to
             politely do another fetch.
             It downloaded 150MB of mbox files.
             It found 409 unique email addresses with at least one positive reply.








                     Building the Flow




             This is most of the code needed to create the workflow for this data mining app.


             Lots of oatmeal code - which is good. Don’t want to be writing tricky code
             here.


             Could optimize, but that would be a mistake…most web mining is
             programmer-constrained.
             So just use more servers in EC2 - cheaper & faster.








                     mod_mbox Page




             Example of the top-level pages that were fetched in the first phase.


             These then needed to be parsed to extract links to mbox files.








                     Custom Operation




             Example of one of the two custom operations:
             Parsing the mod_mbox page
             Uses Tika to extract IDs
             Emits a tuple with the URL for each mbox ID
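
             The same idea can be sketched with Python’s standard HTML parser standing in
             for Tika. The link format (“200910.mbox/thread”), the class and function
             names, and the base URL are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MboxLinkParser(HTMLParser):
    """Collect distinct mbox IDs from the links on an archive index page."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: archive links look like "200910.mbox/thread".
        mbox_id = href.split("/")[0]
        if mbox_id.endswith(".mbox") and mbox_id not in self.ids:
            self.ids.append(mbox_id)


def mbox_urls(page_html, base_url):
    """Emit one URL per mbox ID found on the page."""
    parser = MboxLinkParser()
    parser.feed(page_html)
    return [urljoin(base_url, mbox_id) for mbox_id in parser.ids]


PAGE = (
    '<a href="200910.mbox/thread">Thread</a> '
    '<a href="200910.mbox/date">Date</a> '
    '<a href="200909.mbox/thread">Thread</a> '
    '<a href="/about">About</a>'
)
found = mbox_urls(PAGE, "http://lists.example.org/mod_mbox/hadoop-user/")
```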








                     Validate




             Curve looks right - exponential decay.
             409 unique email addresses that got some love from somebody.








                     This Hug’s for Ted!




             And the winner is…Ted Dunning


             I know - I should have colored the elephant yellow.








                     Produce




             A list of the usual suspects


             Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.








                     Public Terabyte Dataset

                       Sponsored by Concurrent/Bixolabs
                       High quality crawl of top domains
                        – HECB Stack using Elastic Map Reduce

                       Hosted by Amazon in S3, free to EC2 users
                       Crawl & processing code available
                       Questions, input? http://bixolabs.com/PTD/









                     Summary

                       HECB stack works well for web mining
                        – Cheaper than typical colo option
                        – Scales to hundreds of millions of pages
                        – Reliable and efficient workflow
                       Web mining has high & increasing value
                        – Search engine optimization, advertising
                        – Social networks, reputation
                        – Competitive pricing
                        – Etc, etc, etc.








                     Any Questions?

                         My email:

                          ken@bixolabs.com
                         Bixo mailing list:
                          http://tech.groups.yahoo.com/group/bixo-dev/




Big Data = Big DecisionsInnoTech
 
Skb web2.0
Skb web2.0Skb web2.0
Skb web2.0animove
 
Semantic Web Landscape 2009
Semantic Web Landscape 2009Semantic Web Landscape 2009
Semantic Web Landscape 2009LeeFeigenbaum
 
Semantic Web Mining
Semantic Web MiningSemantic Web Mining
Semantic Web MiningAnil Mishra
 
Morning with MongoDB Paris 2012 - Making Big Data Small
Morning with MongoDB Paris 2012 - Making Big Data SmallMorning with MongoDB Paris 2012 - Making Big Data Small
Morning with MongoDB Paris 2012 - Making Big Data SmallMongoDB
 
Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...
Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...
Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...CA API Management
 
Schema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowSchema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowRichard Wallis
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
Science and Web2.0
Science and Web2.0Science and Web2.0
Science and Web2.0Ian Mulvany
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 
Top 5 Web Trends Of 2009 Structured Data
Top 5 Web Trends Of 2009  Structured DataTop 5 Web Trends Of 2009  Structured Data
Top 5 Web Trends Of 2009 Structured Datachmingl
 
II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...
II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...
II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...Dr. Haxel Consult
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28korusamol
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Jonathan Seidman
 

Ähnlich wie Elastic Web Mining (20)

The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining Toolkit
 
Semantic Web For Dummies
Semantic Web For DummiesSemantic Web For Dummies
Semantic Web For Dummies
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
Skb web2.0
Skb web2.0Skb web2.0
Skb web2.0
 
Semantic Web Landscape 2009
Semantic Web Landscape 2009Semantic Web Landscape 2009
Semantic Web Landscape 2009
 
Search V Next Final
Search V Next FinalSearch V Next Final
Search V Next Final
 
Semantic Web Mining
Semantic Web MiningSemantic Web Mining
Semantic Web Mining
 
Morning with MongoDB Paris 2012 - Making Big Data Small
Morning with MongoDB Paris 2012 - Making Big Data SmallMorning with MongoDB Paris 2012 - Making Big Data Small
Morning with MongoDB Paris 2012 - Making Big Data Small
 
Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...
Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...
Scaling the API Economy - with Scale-Free Networks API Days Keynote from Laye...
 
Schema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowSchema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & How
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
Science and Web2.0
Science and Web2.0Science and Web2.0
Science and Web2.0
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 
Top 5 Web Trends Of 2009 Structured Data
Top 5 Web Trends Of 2009  Structured DataTop 5 Web Trends Of 2009  Structured Data
Top 5 Web Trends Of 2009 Structured Data
 
II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...
II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...
II-SDV 2017: Approaches of Web Information Analysis in a Day to Day Work Envi...
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28
 
Cloud based Web Intelligence
Cloud based Web IntelligenceCloud based Web Intelligence
Cloud based Web Intelligence
 
Semtech2006
Semtech2006Semtech2006
Semtech2006
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 

Mehr von Ken Krugler

Faster Workflows, Faster
Faster Workflows, FasterFaster Workflows, Faster
Faster Workflows, FasterKen Krugler
 
Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scaleKen Krugler
 
Suicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and CassandraSuicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and CassandraKen Krugler
 
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrFaster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrKen Krugler
 
Strata web mining tutorial
Strata web mining tutorialStrata web mining tutorial
Strata web mining tutorialKen Krugler
 
A (very) short intro to Hadoop
A (very) short intro to HadoopA (very) short intro to Hadoop
A (very) short intro to HadoopKen Krugler
 
A (very) short history of big data
A (very) short history of big dataA (very) short history of big data
A (very) short history of big dataKen Krugler
 
Thinking at scale with hadoop
Thinking at scale with hadoopThinking at scale with hadoop
Thinking at scale with hadoopKen Krugler
 

Mehr von Ken Krugler (8)

Faster Workflows, Faster
Faster Workflows, FasterFaster Workflows, Faster
Faster Workflows, Faster
 
Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scale
 
Suicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and CassandraSuicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and Cassandra
 
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrFaster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
 
Strata web mining tutorial
Strata web mining tutorialStrata web mining tutorial
Strata web mining tutorial
 
A (very) short intro to Hadoop
A (very) short intro to HadoopA (very) short intro to Hadoop
A (very) short intro to Hadoop
 
A (very) short history of big data
A (very) short history of big dataA (very) short history of big data
A (very) short history of big data
 
Thinking at scale with hadoop
Thinking at scale with hadoopThinking at scale with hadoop
Thinking at scale with hadoop
 

Kürzlich hochgeladen

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Kürzlich hochgeladen (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Elastic Web Mining

  • 1. Elastic Web Mining 01 November 2009
  • 2. Web Mining in the Cloud: Hadoop/Cascading/Bixo in EC2
    Ken Krugler, Bixo Labs, Inc.
    ACM Silicon Valley Data Mining Camp, 01 November 2009
  • 3. About me
    • Background in vertical web crawl
      – Krugle search engine for open source code
      – Bixo open source web mining toolkit
    • Consultant for companies using EC2
      – Web mining
      – Data processing
    • Founder of Bixo Labs
      – Elastic web mining platform
      – http://bixolabs.com

    Over the prior 4 years I had a startup called Krugle, which provided code search for open source projects and inside large companies. We did a large, 100M page crawl of the “programmer’s web” to find information about open source projects.
    Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later.
    Several companies paid me to integrate Bixo into an existing data processing environment. And that in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. Elastic means the size of the system can easily be changed to match the web mining task.
  • 4. Typical Data Mining

    This is the world that many of you live in: analyzing data to find important patterns. Here’s an example of output from the QlikView business intelligence tool. It was used to help analyze the relative prevalence of keywords in two competing web sites. Here you see two-word terms that often occur on McAfee’s site, but not on Symantec’s, which is very useful data for anybody who worries about search engine optimization.
  • 5. Data Mining Victory!

    You all know about analyzing data to find important patterns that get managers all worked up…
  • 6. Meanwhile, Over at McAfee…

    But how do you get to this point? How do you use the web as the source for the data that you’re analyzing? That’s what I’m going to be talking about here.
  • 7. Web Mining 101
    • Extracting & Analyzing Web Data
    • More Than Just Search
    • Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…

    A quick intro to web mining, so we’re on the same page. Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing.
  • 8. 4 Steps in Web Mining
    • Collect - fetch content from web
    • Parse - extract data from formats
    • Analyze - tokenize, rate, classify, cluster
    • Produce - “useful data”

    It’s common to confuse web crawling with fetching. Crawling is the process of automatically finding new pages by extracting links from fetched pages. But for many web mining applications, you have a “white list” of pre-defined URLs. In either case, though, you need to reliably, efficiently and politely fetch pages.
    Content comes in a variety of formats - typically HTML, but also PDF, Word, zip archives, etc. You need to parse these formats to extract key data - typically text, but it could be image data.
    Often the analyze step will include aspects of machine learning - classification, clustering.
    “Useful data” covers a lot of ground, because there are a lot of ways to use the output of web mining. Generating an index is one of the most common, because people think about search as the goal. But for data mining, the end result at this point is often highly reduced data that is input to traditional data mining tools.
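The four steps above can be sketched as a tiny pipeline. This is only an illustration in plain Python, not Bixo code: the `PAGES` dictionary is a hypothetical stand-in for real HTTP fetching, and the function names (`collect`, `parse`, `analyze`, `produce`) are made up to mirror the slide.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical in-memory "web": stands in for real, polite HTTP fetches.
PAGES = {
    "http://example.com/a": "<html><body><h1>Web mining</h1><p>Mining the web.</p></body></html>",
    "http://example.com/b": "<html><body><p>Data mining tools</p></body></html>",
}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(urls):
    """Collect: fetch content for a white list of URLs."""
    return {url: PAGES[url] for url in urls if url in PAGES}


def parse(html):
    """Parse: extract plain text from the HTML format."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def analyze(text):
    """Analyze: tokenize and count words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w)


def produce(counts, top_n=3):
    """Produce: highly reduced output, here the most frequent words."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def run(urls):
    total = Counter()
    for html in collect(urls).values():
        total += analyze(parse(html))
    return produce(total)
```

Calling `run(list(PAGES))` returns a short ranked list of words, which is the kind of reduced data that could be handed to a traditional data mining tool.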
  • 9. Web Mining versus Data Mining
    • Scale - 10 million isn’t a big number
    • Access - public but restricted
      – Special implicit rules apply
    • Structure - not much

    What are the key differences between web mining and traditional data mining? I’m saying “traditional” because the face of data mining is clearly changing. But if you look at most vendor tools, the focus is on what I’d call “traditional data mining”.
    Scale - 10M is big for data mining, but not for web mining.
    Access - with DM, once you defeated Mongor, keeper of database access keys, you were golden. Web pages are typically public, but the web is a shared resource so implicit rules apply, like “don’t bring my web site to its knees”. Data mining breaks the traditional implicit contract, so extra cautions apply. The implicit contract is that I let you crawl me, and you drive traffic to me when your search index goes live. But with DM, there often isn’t an index as the end result.
    Structure - with mining DBs, there’s explicit structure, which is mostly lacking from web pages.
  • 10. How to Mine Large Scale Web Data?
    • Start with scalable map-reduce platform
    • Add a workflow API layer
    • Mix in a web crawling toolkit
    • Write your custom data processing code
    • Run in an elastic cloud environment

    If it doesn’t scale, then it won’t handle the quantity of data you’ll ultimately want to process from the web. If you can’t create real workflows, it will never be reliable or efficient. If you don’t use specialized web crawling code, you’ll get blacklisted. Because you’re trying to distill down large data, there’s often some custom processing. If you don’t run it in a cloud environment, you’ll be wasting money - and I’ll explain why in a few slides.
  • 11. One Solution - the HECB Stack
    • Bixo
    • Cascading
    • Hadoop
    • EC2

    I’m focusing on one particular solution to the challenges of web mining that I just described. It’s the “HECB” stack. I’m going to talk about these from the bottom up, which is EC2 first, then Hadoop…but the acronym didn’t work as well.
  • 12. EC2 - Amazon Elastic Compute Cloud
    • True cost of non-cloud environment
      – Cost of servers & networking (2 year life)
      – Cost of colo (6 servers/rack)
      – Cost of OPS salary (15% of FTE/cluster)
      – Managing servers is no fun
    • Web mining is perfect for the cloud
      – “bursty” => savings are even greater
      – Data is distilled, so no transfer $$$ pain

    At Krugle we ran two clusters, one of 11 servers, and a smaller 4 server cluster. In the end, our actual utilization ratio was probably < 20%.
    Even with close to 100% utilization, the break-even point for EC2 vs. colo is somewhere between 50 and 200 servers, depending on who you talk to. If utilization was 20%, then break-even would be 250 to 1000 servers.
    Mining for search doesn’t work so well in this model - the cluster should be always crawling (ABC), so it’s not as bursty. And transferring raw content, parse, and index will generate lots of transfer charges. But for web mining that’s focused on data mining, data is distilled so this isn’t an issue.
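The break-even figures above are consistent with a simple rule of thumb: the break-even cluster size scales inversely with utilization. The sketch below only reproduces that arithmetic; the inverse-scaling model is an assumption read off the slide's numbers, and `break_even_servers` is a made-up helper name.

```python
def break_even_servers(break_even_at_full_use, utilization_percent):
    """Scale a 100%-utilization break-even size to a lower utilization.

    Assumption: owned servers cost the same whether busy or idle, so the
    break-even size grows by a factor of 100 / utilization_percent.
    """
    if not 0 < utilization_percent <= 100:
        raise ValueError("utilization_percent must be in (0, 100]")
    return break_even_at_full_use * 100 / utilization_percent


# 50 to 200 servers at ~100% utilization, rescaled to 20% utilization.
low = break_even_servers(50, 20)
high = break_even_servers(200, 20)
```

With the talk's inputs this gives the same 250 to 1000 server range quoted in the notes.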
  • 13. Why Hadoop?
    • Perfect for processing lots of data
      – Map-reduce
      – Distributed file system
    • Open source, large community, etc.
    • Runs well in EC2 clusters
    • Elastic Map Reduce as option

    Map-reduce - how do you parallelize the processing of lots of data so that you can do the work on many servers? The answer is map-reduce.
    HDFS - how do you store lots of data in a fault-tolerant, cost-effective manner? How do you make sure the data (the big stuff) moves as little as possible during processing? The answer is the Hadoop distributed file system.
    It’s open source, so lots of support, consultants, rapid bug fixes, etc. Large companies are using it, especially Yahoo.
    Elastic Map Reduce is a special service built on top of EC2, where it’s easier to run Hadoop jobs because you have access to pre-configured Hadoop clusters, special tools, etc.
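For readers new to the map-reduce idea mentioned above, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example. It is plain Python for illustration only and does not use the Hadoop API; in Hadoop the same three phases run in parallel across many servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
```

Because each document is mapped independently and each key is reduced independently, both phases can be split across machines, which is what makes the model scale.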
  • 14. Why Cascading?
    • API on top of Hadoop
    • Supports efficient, reliable workflows
    • Reduces painful low-level MR details
    • Build workflow using “pipe” model

    If you ever had to write a complex workflow using Hadoop, you know the answer. It frees you from the lower-level details of thinking in map-reduce. You can think about the workflow as operations on records with fields. And in data mining, the workflow often winds up being very complex.
    Because you can build workflows out of a mix of pre-defined & custom pipes, it’s a real toolkit. Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :)
    A key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds). It finds cases where fields aren’t available for operations. That solves a key problem we ran into when customizing Nutch at Krugle.
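The idea of checking a workflow for missing fields before running it can be illustrated with a toy validator. This is not the Cascading API; `Step` and `check_workflow` are hypothetical names, and the sketch handles only a linear chain of steps rather than a full DAG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One operation in a linear workflow (hypothetical, not a Cascading class)."""
    name: str
    needs: frozenset  # fields the step reads
    adds: frozenset   # fields the step appends to each record


def check_workflow(source_fields, steps):
    """Fail fast if any step reads a field that no earlier step provides."""
    available = set(source_fields)
    for step in steps:
        missing = step.needs - available
        if missing:
            raise ValueError(f"{step.name} needs missing fields: {sorted(missing)}")
        available |= step.adds
    return available


steps = [
    Step("parse", frozenset({"url", "content"}), frozenset({"text"})),
    Step("score", frozenset({"text"}), frozenset({"score"})),
]
```

Running `check_workflow({"url", "content"}, steps)` succeeds, while starting from only `{"url"}` raises an error before any data is processed, which is the benefit the notes describe.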
  • 15. Why Bixo?
    • Plugs into Cascading-based workflow
      – Scales with Hadoop cluster
      – Runs well in EC2
    • Handles grungy web crawling details
      – Polite yet efficient fetching
      – Errors, web servers that lie
      – Parsing lots of formats, broken HTML
    • Open source toolkit for web mining apps

    Does the world really need yet another web crawler? No, but it does need a web mining toolkit. Two companies agreed to sponsor work on Bixo as an open source project.
    Polite yet efficient - there’s a tension between those two goals that’s hard to resolve.
    If you do a crawl of any reasonable size, you’ll run into lots of errors. Even if a web server says “I swear to you, I’m sending you a 20K HTML file in English”, it’s a 50K text file in Russian using the Cyrillic character set.
    And because it’s open source, you get the benefit of a community of users. They contribute re-usable toolkit components.
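One common way to balance politeness and efficiency is to enforce a minimum delay between requests to the same host while interleaving requests to different hosts. The sketch below only plans fetch times and makes no network calls; it is not Bixo's scheduler, and the 30-second delay and the `schedule_fetches` name are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def schedule_fetches(urls, delay_seconds=30.0):
    """Plan (start_time, url) pairs with at least delay_seconds between
    consecutive requests to the same host. Times are offsets in seconds
    from the start of the crawl.
    """
    next_free = {}  # host -> earliest time the next request may start
    plan = []
    for url in urls:
        host = urlparse(url).netloc
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + delay_seconds
    return sorted(plan)


plan = schedule_fetches([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
```

In this plan the two hosts are fetched in parallel at time 0, and only the second request to `a.example` waits, so adding more hosts raises throughput without hitting any single site harder.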
  • 16. SEO Keyword Data Mining
    • Example of typical web mining task
    • Find common keywords (1,2,3 word terms)
      – Do domain-centric web crawl
      – Parse pages to extract title, meta, h1, links
      – Output keywords sorted by frequency
    • Compare to competitor site(s)
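Finding one-, two-, and three-word terms and sorting them by frequency can be sketched with a simple n-gram counter. The tokenization rule here (lowercase runs of letters, digits, and apostrophes) is an assumption, not the special tokenization used in the talk's example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all n-word phrases, in order, from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keyword_counts(texts, max_n=3):
    """Count every 1..max_n word term across a collection of page texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


counts = keyword_counts(["Virus scan", "Free virus scan"])
by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Running the same counter over a competitor's pages and comparing the two tables is one way to surface terms that are common on one site but rare on the other.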
  • 17. Workflow

    Whenever I show a workflow diagram like this, I make a joke about it being intuitively obvious. Which, obviously, it’s not. And in fact the full workflow is a bit bigger, as I left out the second stage that describes more of the keyword analysis.
    But the key point is that the blue color items are provided by Cascading. And the green color items are provided by Bixo. So what’s left are two yellow items, which represent the two points of customization.
  • 18. Custom Code for Example
    • Filtering URLs inside domain
      – Non-English content
      – User-generated content (forums, etc)
    • Generating keywords from text
      – Special tokenization
      – One, two, three word phrases
    • But 95% of code was generic

    There were two main pieces of custom code that needed to be written. One was some URL filtering to focus on the right content inside the web sites: avoiding non-English pages by specific URL patterns, and the same kind of thing for forums and such, since these pages weren’t part of what could easily be optimized.
    And if enough people need this type of support, since Bixo is open source it will likely become part of the toolkit.
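URL filtering by pattern, as described above, might look like the following sketch. The specific patterns (language prefixes such as `/de/`, and `/forum/` or `/community/` paths) are hypothetical examples; real sites would need their own lists.

```python
import re
from urllib.parse import urlparse

# Hypothetical exclusion patterns, matched against the URL path.
EXCLUDE_PATTERNS = [
    re.compile(r"^/(de|fr|ja|es)/"),  # non-English sections
    re.compile(r"/forums?/"),         # user-generated content
    re.compile(r"/community/"),
]


def keep_url(url, domain):
    """Keep URLs inside the target domain that match no exclusion pattern."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != domain and not host.endswith("." + domain):
        return False
    return not any(p.search(parsed.path) for p in EXCLUDE_PATTERNS)
```

A filter like this runs before fetching, so excluded pages cost nothing in crawl time or politeness budget.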
  • 19. End Result in Data Mining Tool

    Finally we can actually use a traditional data mining tool to help make sense of the digested data. There are many things we could do in addition:
    – Clustering of results, to improve keyword analysis (larger sites have “areas of interest”)
    – Identifying broken links, typos
    – Identifying personal data - email addresses, phone numbers
  • 20. What Next?
    • Another example - mining mailing lists
    • Go straight to Summary/Q&A
    • Talk about Public Terabyte Dataset
    • Write tweets, posts & emails
    • Find people to meet in the lobby

    I try to limit presentations to 20 slides - so I’ve hit that limit. In the spirit of the unconference - let me know what you’d like to do next.
  • 21. Another Example - HUGMEE
    • Hadoop
    • Users who
    • Generate the
    • Most
    • Effective
    • Emails

    Let’s use a real example now of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up. But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.
  • 22. Helpful Hadoopers
    - Use mailing list archives for data (collect)
    - Parse mbox files and emails (parse)
    - Score based on key phrases (analyze)
    - End result is score/name pair (produce)
    How do you figure out the most helpful Hadoopers? As we discussed previously, it’s a classic web mining problem. Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases? See the next slide.
  • 23. Scoring Algorithm
    - Very sophisticated point system
    - “thanks” == 5
    - “owe you a beer” == 50
    - “worship the ground you walk on” == 100
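The point system translates almost directly into a lookup table. The sketch below is plain Python rather than the Cascading operation used in the talk, and it assumes each phrase counts at most once per email and is matched as a case-insensitive substring; the slide does not say how repeats or word boundaries were handled.

```python
# Points per key phrase, taken from the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text):
    """Sum the points for every key phrase that appears in the text.

    Assumption: a phrase scores once per email, however often it repeats.
    """
    lowered = text.lower()
    return sum(points for phrase, points in PHRASE_POINTS.items()
               if phrase in lowered)
```

Substring matching is deliberately crude: "thanksgiving" would also earn 5 points, which is in keeping with the sophistication of the algorithm.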
  • 24. High Level Steps
    - Collect emails
      - Fetch mod_mbox generated page
      - Parse it to extract links to mbox files
      - Fetch mbox files
      - Split into separate emails
    - Parse emails
      - Extract key headers (messageId, email, etc)
      - Parse body to identify quoted text
    Parsing the mod_mbox page is simple with Tika’s HtmlParser. I cheated a bit when parsing emails: some users, like Owen, have many aliases, so I hand-generated an alias resolution table.
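The "split into separate emails" and "parse emails" steps might look roughly like the following. This is a sketch using Python's standard `email` package instead of the Java code from the talk. The function names, the returned field names, and the rule that a quoted line is one starting with ">" are assumptions; multipart messages and the alias resolution table are left out.

```python
from email import message_from_string
from email.utils import parseaddr


def split_mbox(mbox_text):
    """Split mbox text into raw messages at the 'From ' separator lines."""
    messages, current = [], None
    for line in mbox_text.splitlines(keepends=True):
        if line.startswith("From "):
            if current:
                messages.append("".join(current))
            current = []  # the separator line itself is dropped
        elif current is not None:
            current.append(line)
    if current:
        messages.append("".join(current))
    return messages


def parse_email(raw):
    """Extract key headers and the non-quoted part of the body."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg.get("From", ""))
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart bodies are skipped in this sketch
    # Assumption: quoted text is any line that starts with '>'.
    fresh = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        "name": name,
        "address": address.lower(),
        "new_text": "\n".join(fresh),
    }
```

Separating out the quoted text matters for the next step, since only what the replier actually wrote should be scored.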
  • 25. High Level Steps
    - Analyze emails
      - Find key phrases in replies (ignore signoff)
      - Score emails by phrases
      - Group & sum by message ID
      - Group & sum by email address
    - Produce ranked list
      - Toss email addresses with no love
      - Sort by summed score
    We need to ignore the “thanks” in a “thanks in advance for doing my job for me” signoff. We generate two tuples for each email:
    - One with messageId/name/address
    - One with reply-to messageId/score
    The group/sum aspect is a classic reduce operation.
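The two-tuple trick is easier to see in miniature. The sketch below does in memory, in plain Python, what the workflow does with Cascading group and sum operations over Hadoop. The dictionary field names and the `strip_signoff` rule (dropping only the literal phrase "thanks in advance") are assumptions for illustration.

```python
import re
from collections import defaultdict


def strip_signoff(text):
    """Remove 'thanks in advance' so it does not earn points for 'thanks'."""
    return re.sub(r"thanks\s+in\s+advance", "", text, flags=re.IGNORECASE)


def rank_helpers(emails):
    """Rank addresses by the total score of the replies to their emails.

    Each email is a dict with 'message_id', 'in_reply_to', 'address' and
    'score', where 'score' is the score of that email's own new text.
    """
    author_of = {}                # tuple 1: messageId -> address
    score_for = defaultdict(int)  # tuple 2: reply-to messageId -> summed score
    for e in emails:
        author_of[e["message_id"]] = e["address"]
        if e.get("in_reply_to"):
            score_for[e["in_reply_to"]] += e["score"]

    # Join on message ID, then group and sum by email address.
    totals = defaultdict(int)
    for message_id, score in score_for.items():
        address = author_of.get(message_id)
        if address is not None:
            totals[address] += score

    # Toss addresses with no love, then sort by summed score, highest first.
    ranked = [(score, address) for address, score in totals.items() if score > 0]
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked
```

The credit goes to the author of the message being replied to, not to the person saying thanks, which is why the score has to be keyed by the reply-to message ID before it can be joined back to an address.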
  • 26. Workflow
    I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but… The key point is that only the purple stuff is what I had to actually create. Some lines are purple as well, since that workflow (DAG) is also something I defined - see the next page. But only two custom operations were actually needed: parsing the mod_mbox page and calculating the score.
    Running it took about 30 minutes, mostly politely waiting until it was OK to do another fetch. It downloaded 150MB of mbox files and found 409 unique email addresses with at least one positive reply.
  • 27. Building the Flow
    This is most of the code needed to create the workflow for this data mining app. Lots of oatmeal code - which is good. You don’t want to be writing tricky code here. I could optimize, but that would be a mistake… most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.
  • 28. mod_mbox Page
    An example of the top-level pages that were fetched in the first phase. These then needed to be parsed to extract links to mbox files.
  • 29. Custom Operation
    An example of one of the two custom operations:
    - Parses the mod_mbox page
    - Uses Tika to extract IDs
    - Emits a tuple with the URL for each mbox ID
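For readers without the slide image, here is a rough idea of what such an operation does. The talk's version is a Cascading operation that uses Tika; this sketch swaps in Python's standard `html.parser` instead. It assumes mbox IDs look like `200910` and appear in links such as `200910.mbox/thread`, and that the raw file lives at `<id>.mbox` relative to the page URL. The class and function names are made up for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed ID shape: six digits (year and month) followed by ".mbox".
MBOX_ID = re.compile(r"(\d{6})\.mbox")


class MboxLinkParser(HTMLParser):
    """Collect the distinct mbox IDs found in anchor hrefs."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = MBOX_ID.search(href)
        if match and match.group(1) not in self.ids:
            self.ids.append(match.group(1))


def mbox_urls(page_url, html):
    """Return one mbox file URL per ID found on the page."""
    parser = MboxLinkParser()
    parser.feed(html)
    return [urljoin(page_url, mbox_id + ".mbox") for mbox_id in parser.ids]
```

Each returned URL corresponds to one emitted tuple, which then feeds the second fetch cycle.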
  • 30. Validate
    The curve looks right - exponential decay. 409 unique email addresses that got some love from somebody.
  • 31. This Hug’s for Ted!
    And the winner is… Ted Dunning. I know - I should have colored the elephant yellow.
  • 32. Produce
    A list of the usual suspects. Coincidentally, Ted helped me derive the scoring algorithm I used… hmm.
  • 33. Public Terabyte Dataset
    - Sponsored by Concurrent/Bixolabs
    - High quality crawl of top domains
      - HECB Stack using Elastic Map Reduce
    - Hosted by Amazon in S3, free to EC2 users
    - Crawl & processing code available
    - Questions, input? http://bixolabs.com/PTD/
  • 34. Summary
    - HECB stack works well for web mining
      - Cheaper than typical colo option
      - Scales to hundreds of millions of pages
      - Reliable and efficient workflow
    - Web mining has high & increasing value
      - Search engine optimization, advertising
      - Social networks, reputation
      - Competitive pricing
      - Etc, etc, etc.
  • 35. Any Questions?
    - My email: ken@bixolabs.com
    - Bixo mailing list: http://tech.groups.yahoo.com/group/bixo-dev/