I Built a Text Classifier with Hadoop/Mahout/HBase

                     2011/04/10
                   #TokyoWebmining10-2

                         yanaoki



2011   4   18
                • HBase
                • Mahout
                • Naive Bayes
                • Web

                • naoki yanai
                • Hadoop

HBase
                •   KeyValue
                    •   read/write
                        •   goal is the hosting of very large tables -- billions of rows, millions of columns ...
                    •   Hadoop
                •   CAP: C, P
                    •   C: Consistency, A: Availability, P: Partition tolerance
                •   Sharding
                •   Hadoop/MapReduce

HBase
                •
                    •   ―

                    •   ―

                    •



                            qualifier

Mahout
           •    Hadoop
                •   HBase
           •    Classifier / Clustering / Pattern Mining
           •    Recommenders / Collaborative Filtering
           •    Evolutionary Algorithms ...

Mahout
                •   Mahout
                •   Mahout in Action PDF
                •   hamadakoichi
                •   TokyoWebmining

Naive Bayes
           •        F1,...,Fn           C




           •    C




           •
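
The notation F1,...,Fn and C matches the textbook naive Bayes setup. As a sketch, assuming F_i are the features (for text, the words of a document) and C is the class, the usual formulation is the one below. It is the standard form, not something recovered from the slide itself.

```latex
\begin{align}
P(C \mid F_1,\dots,F_n)
  &= \frac{P(C)\,P(F_1,\dots,F_n \mid C)}{P(F_1,\dots,F_n)} \\
  &\propto P(C)\prod_{i=1}^{n} P(F_i \mid C)
  \qquad \text{(features assumed conditionally independent given } C\text{)} \\
\hat{C} &= \operatorname*{arg\,max}_{C}\; P(C)\prod_{i=1}^{n} P(F_i \mid C)
\end{align}
```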


•       Web

                    •
                    •
                    •
                •
                    •

•    Ruby

                •   ExtractContent

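           require "open-uri"
           require "kconv"            # provides String#toutf8
           require "extractcontent"

           # Fetch the page (URI.open is the open-uri entry point on current Ruby),
           # then extract the main body text and the title.
           html = URI.open("http://news.nifty.com/....htm").read
           body, title = ExtractContent::analyse(html)

           puts body.toutf8 # prints the extracted body text (HTML stripped)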


•    Ruby

                •   scrAPI


       require 'scrapi'
       require 'open-uri'

       # Collect the text of every div.tweet element into an array
       scr = Scraper.define do
         process "div.tweet", "tweets[]" => :text
         result :tweets
       end

       tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"),
                           :parser_options => {:char_encoding => 'utf8'})

       tweets.each { |tw| puts tw } #=>


•                                             RSS                      HBase


       (URL)                                             content    categories

       http://togetter/1.html                            ...        category:src=”togetter”
                                                                    category:cat=”social”
       http://news.nifty.com/....html                    AKB ...    category:src=”nifty”
                                                                    category:cat=”entertainment”
       http://groups.google.com/group/webmining-tokyo/   10 …
       http://ameblo.jp/....html                         KARA …
•    HBase

                    category_id <TAB>

           •    HBase           MapReduce   HDFS

                •
                    •
                    •
                        •   Wikipedia

                    •
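
A rough sketch of the `category_id <TAB> ...` line format, assuming that what follows the TAB is the document text as space-separated tokens (the slide elides that part, so this is a guess). Both helper names are hypothetical.

```ruby
# Build one training line: the label, a TAB, then space-separated tokens.
# Assumption: tokens contain no TAB or newline characters.
def training_line(category_id, tokens)
  "#{category_id}\t#{tokens.join(' ')}"
end

# Parse a line back into [label, tokens].
def parse_training_line(line)
  label, text = line.chomp.split("\t", 2)
  [label, text.to_s.split(" ")]
end

line = training_line("e", %w[AKB live report])
puts line
p parse_training_line(line)
```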
•    mahout

                    $ mahout trainclassifier ...
                    $ mahout testclassifier ...

           •    mahout
                •    --input/--output: input / output
                •    --dataSource: HDFS or HBase
                •    --gramSize: N-gram size
                •    --classifierType
                •    --alpha
                •    --minDF/--minSupport

•                            HBase


           •
           =======================================================
           Summary
           -------------------------------------------------------
           Correctly Classified Instances          :       1884       82.2348%
           Incorrectly Classified Instances        :        407       17.7652%
           Total Classified Instances              :       2291
           =======================================================
           Confusion Matrix
           -------------------------------------------------------
           a       b       c       d       e       <--Classified as
           216     32      22      155     0        |  425         a     = t
           0       514     13      70      0        |  597         b     = s
           0       2       514     9       0        |  525         c     = e
           1       8       13      638     0        |  660         d     = b
           0       0       67      15      2        |  84          e     = a
           Default Category: unknown: 5
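
The summary figures can be re-derived from the confusion matrix. A minimal Ruby sketch, assuming rows are the actual classes and columns the predicted ones (the usual reading of `<--Classified as`); the constant and method names are made up for illustration.

```ruby
# Confusion matrix copied from the output above.
# Assumption: rows = actual class, columns = predicted class.
LABELS = %w[t s e b a]
MATRIX = [
  [216, 32, 22, 155, 0],
  [0, 514, 13, 70, 0],
  [0, 2, 514, 9, 0],
  [1, 8, 13, 638, 0],
  [0, 0, 67, 15, 2]
]

# Share of all instances that sit on the diagonal.
def accuracy(matrix)
  correct = matrix.each_with_index.sum { |row, i| row[i] }
  total = matrix.sum { |row| row.sum }
  correct.to_f / total
end

# Per class: diagonal cell divided by the row total.
def recall_per_class(matrix, labels)
  labels.each_with_index.map do |label, i|
    [label, matrix[i][i].to_f / matrix[i].sum]
  end.to_h
end

puts format("accuracy: %.4f%%", accuracy(MATRIX) * 100)
recall_per_class(MATRIX, LABELS).each do |label, recall|
  puts format("recall %s: %.1f%%", label, recall * 100)
end
```

The accuracy line reproduces the 82.2348% in the summary. The per-class view shows what the single number hides: class a gets only 2 of its 84 documents right.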


•
           •                                      reducer                      HBase


            // Set up the classifier; the trained model is read from an HBase table
            BayesParameters params = new BayesParameters();
            params.set("alpha_i", "1");
            Algorithm algorithm = new CBayesAlgorithm();
            Datastore datastore = new HBaseBayesDatastore("model_table_name", params);
            ClassifierContext classifier = new ClassifierContext(algorithm, datastore);

            // Classify a tokenized document; "default" is the fallback category
            ClassifierResult category =
                classifier.classifyDocument(doc.toArray(new String[doc.size()]), "default");

            String label = category.getLabel();


•

       (URL)                                             content    categories

       http://togetter/1.html                            ...        category:src=”togetter”
                                                                    category:cat=”social”
       http://news.nifty.com/....html                    AKB ...    category:src=”nifty”
                                                                    category:cat=”entertainment”
       http://groups.google.com/group/webmining-tokyo/   10 …       category:cat=”technology”
       http://ameblo.jp/....html                         KARA …     category:cat=”entertainment”

Web




Web
                •   Google News    Togetter    RSS

                        a                   935        5.2M
                        b                  5,112       7.2M
                        e                  3,746       8.1M
                        s                  4,737       12M
                        t                  3,969       9.2M

4/18

                                 Web
                •
                      •
                =======================================================
                Summary
                -------------------------------------------------------
                Correctly Classified Instances          :      13388        91.6798%
                Incorrectly Classified Instances        :       1215         8.3202%
                Total Classified Instances              :      14603

                =======================================================
                Confusion Matrix
                -------------------------------------------------------
                a         b         c         d         e         <--Classified as
                2328      19        515       250       0          |  3112       a       =   t
                3         2939      54        20        0          |  3016       b       =   e
                32        3         3542      109       0          |  3686       c       =   s
                33        16        128       3877      0          |  4054       d       =   b
                1         27        2         3         702        |  735        e       =   a
                Default Category: unknown: 5
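
Reading the same kind of matrix column-wise gives per-class precision. A short Ruby sketch under the same assumption (rows actual, columns predicted); the names are illustrative only.

```ruby
# Confusion matrix copied from the output above.
# Assumption: rows = actual class, columns = predicted class.
WEB_LABELS = %w[t e s b a]
WEB_MATRIX = [
  [2328, 19, 515, 250, 0],
  [3, 2939, 54, 20, 0],
  [32, 3, 3542, 109, 0],
  [33, 16, 128, 3877, 0],
  [1, 27, 2, 3, 702]
]

# Per class: diagonal cell divided by the column total
# (everything the classifier assigned to that class).
def precision_per_class(matrix, labels)
  columns = matrix.transpose
  labels.each_with_index.map do |label, j|
    predicted = columns[j].sum
    [label, predicted.zero? ? nil : matrix[j][j].to_f / predicted]
  end.to_h
end

precision_per_class(WEB_MATRIX, WEB_LABELS).each do |label, precision|
  puts format("precision %s: %.1f%%", label, precision * 100)
end
```

The largest off-diagonal cell is the 515 documents of class t that were assigned to class s.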


Web


                •   alpha

                    1        0.5      0.1      0.01     0.001
                    65.38%   65.83%   66.73%   66.82%   67.02%
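
For intuition on what a smoothing parameter like alpha does, here is a generic additive-smoothing sketch in Ruby. It is the textbook estimate, not Mahout's actual weighting code, and the counts (1000 tokens, 5000-word vocabulary) are invented for illustration.

```ruby
# Additive (Lidstone) smoothing of a word probability within one class:
#   (count + alpha) / (total + alpha * vocabulary_size)
def smoothed_probability(count, total, vocabulary_size, alpha)
  (count + alpha) / (total + alpha * vocabulary_size).to_f
end

# Hypothetical class with 1000 tokens and a 5000-word vocabulary.
[1, 0.5, 0.1, 0.01, 0.001].each do |alpha|
  unseen = smoothed_probability(0, 1000, 5000, alpha)
  puts format("alpha=%-5s P(unseen word | class) = %.8f", alpha, unseen)
end
```

A smaller alpha moves less probability mass to words never seen in a class, so the observed counts dominate.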


4/18

                               Web


                •   N-Gram

                    unigram   bigram
                    63.57%    66.09%
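
As a reminder of what the two feature types are, a small Ruby sketch that turns a token list into n-grams. Mahout's own tokenization may differ (for example, it may also keep the unigrams when the gram size is 2); the method name is hypothetical.

```ruby
# Slide a window of n tokens over the list and join each window into one feature.
def ngrams(tokens, n)
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

tokens = %w[text classifier with mahout]
p ngrams(tokens, 1) # unigram features
p ngrams(tokens, 2) # bigram features
```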


Web


                •   +

                    56.8%   65.38%

4/18

                            Web


                    67.02%   67.88%

                •   HBase/Mahout
                    •   HBase
