Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums

Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma
Web Search & Mining Group
Microsoft Research Asia

2009-04



Saturday, May 22, 2010
Web Forum Data
      • An important information resource that holds a large amount of human knowledge.

      • Topics covered include recreation, sports, games, computers, art, society, science, home, and health.

      • 20% of the pages in search results come from forums.




Understanding Forum

      • Crawling
            – WWW’08: iRobot: An Intelligent Crawler for Web Forums
            – SIGIR’08: Exploring Traversal Strategy
            – KDD’09: Incremental Crawling

      • Data Extraction
            – WWW’09: Automation Data Extraction

      • Quality Assessment
            – SIGIR’09: Quality Assessment

Challenge

      • Leverage more site-level knowledge




Forum Sitemap
      • A sitemap is a directed graph consisting of a set of vertices and the corresponding links.

      • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
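
The directed-graph view of a sitemap can be sketched in a few lines of Python. This is only an illustration: the class name `Sitemap`, its methods, and the vertex names ("board_list", "thread_list", and so on) are hypothetical and do not come from the talk or from iRobot.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sitemap:
    """Directed graph: vertices are groups of pages, edges are links between groups."""
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_link(self, source: str, target: str) -> None:
        # Register both endpoints and record the directed edge source -> target.
        self.vertices.update((source, target))
        self.edges[source].add(target)

    def successors(self, vertex: str) -> set:
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self.edges.get(vertex, set()))


def build_example_sitemap() -> Sitemap:
    # Hypothetical vertex names, chosen for illustration only.
    sitemap = Sitemap()
    sitemap.add_link("board_list", "thread_list")
    sitemap.add_link("thread_list", "thread_list")  # pagination links back to the same vertex
    sitemap.add_link("thread_list", "post_page")
    sitemap.add_link("post_page", "user_profile")
    return sitemap
```

A self-loop such as thread_list to thread_list is one plausible way to model pagination inside a single vertex.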



Page Clustering
    • Forum pages are generated from a database and a template
    • Layout is a robust way to describe the template
          – Layout can be characterized by the HTML elements in different DOM paths




Page Clustering

    • DOM Path Feature Discovery
    • Clustering by Virtual Tables
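
As a rough illustration of describing layout by DOM paths, the sketch below collects the tag paths that occur in each page and groups pages whose path sets overlap strongly. It is not the virtual-table clustering named on the slide. It swaps in a simple greedy grouping by Jaccard similarity, and the names `DomPathCollector`, `dom_paths`, `cluster_pages` and the 0.7 threshold are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DomPathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/table/tr') seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html: str) -> frozenset:
    collector = DomPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages: dict, threshold: float = 0.7) -> list:
    """Greedy grouping: a page joins the first cluster whose seed page is similar enough."""
    clusters = []  # list of (seed path set, [page names])
    for name, html in pages.items():
        paths = dom_paths(html)
        for seed, members in clusters:
            if jaccard(seed, paths) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((paths, [name]))
    return [members for _, members in clusters]
```

Because only the set of paths is compared, two list pages with different numbers of rows still land in the same group, which is the property the slide attributes to layout features.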




Link Analysis

      • A link = URL pattern + location
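
One way to read "URL pattern + location" is as a two-part key for each link. The sketch below is an assumption-laden illustration, not the method from the talk: it takes the URL pattern to be the path with digit runs generalized plus the sorted query parameter names, and the location to be the DOM path of the anchor element. `LinkKey`, `url_pattern`, and `link_key` are hypothetical names.

```python
import re
from typing import NamedTuple
from urllib.parse import parse_qsl, urlsplit


class LinkKey(NamedTuple):
    url_pattern: str  # generalized form of the target URL
    location: str     # where the anchor sits in the page, e.g. a DOM path


def url_pattern(url: str) -> str:
    """Generalize a URL: digit runs in the path become {n}, query values become *."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    names = sorted(name for name, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{name}=*" for name in names)
    return f"{path}?{query}" if query else path


def link_key(url: str, dom_path: str) -> LinkKey:
    return LinkKey(url_pattern(url), dom_path)
```

Two links with the same key would then be treated as instances of the same kind of link, for example every thread title link in a thread list.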



Inner-Page Features
      • The inclusion relation. Data records usually contain their fields; for example, a post record contains its author, time, and content.

      • The alignment relation. Since data is generated from a database and presented via templates, data records with the same label may appear repeatedly in a page.

      • Time order. Since post records are generated sequentially along the timeline, the post times should be sorted in ascending or descending order.
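
The time-order feature is the easiest of the three to make concrete. A minimal sketch, assuming the candidate post times have already been extracted as strings in a known format (the format string and function names here are illustrative assumptions):

```python
from datetime import datetime


def parse_times(raw: list, fmt: str = "%Y-%m-%d %H:%M") -> list:
    """Parse timestamp strings; raises ValueError if one does not match fmt."""
    return [datetime.strptime(text, fmt) for text in raw]


def is_time_ordered(timestamps: list) -> bool:
    """True if the sequence is sorted ascending or descending (ties allowed)."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending
```

A set of candidate elements whose values fail this check is unlikely to be the post-time field, which is how such a feature could help disambiguate dates that appear elsewhere on the page.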




Inner-vertex Features




Inter-vertex Features




Problem Setting

                         Author     Title     Content




Formulas of list page
      • Formulas for identifying list record

      • Formulas for identifying list title




Formulas of post page
      • Formulas for identifying post record

      • Formulas for identifying post author




Formulas of post page
      • Formulas for identifying post time

      • Formulas for identifying post content




Markov Logic Networks
      • An MLN can be viewed as a template for constructing Markov Random Fields.

      • With a set of formulas and constants, an MLN defines a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
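
The equation on this slide did not survive extraction. In the standard formulation of Markov logic it reads P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) ), where w_i is the weight of formula i, n_i(x) is the number of its true groundings in x, and Z is the partition function. Whether the slide used this form or the equivalent product of potentials is not recoverable. The sketch below evaluates the expression by brute force on a toy network; the predicates `ContainsTime` and `IsPost` and the weight 1.5 are hypothetical and are not the formulas from the talk.

```python
import itertools
import math


def world_probabilities(atoms, weighted_formulas):
    """Return {world: P(world)} by enumerating every truth assignment.

    atoms: list of ground-atom names.
    weighted_formulas: list of (weight, count_fn), where count_fn(world)
    returns n_i(x), the number of true groundings of formula i in that world.
    """
    worlds = [dict(zip(atoms, values))
              for values in itertools.product([False, True], repeat=len(atoms))]
    scores = [math.exp(sum(w * count(world) for w, count in weighted_formulas))
              for world in worlds]
    z = sum(scores)  # partition function Z
    return {tuple(sorted(world.items())): score / z
            for world, score in zip(worlds, scores)}


# Toy example with one constant e1 and one weighted formula:
# ContainsTime(e1) => IsPost(e1), weight 1.5 (illustrative only).
ATOMS = ["ContainsTime(e1)", "IsPost(e1)"]


def implication_count(world):
    # The single grounding is false only when the premise holds
    # and the conclusion does not.
    return 0 if world["ContainsTime(e1)"] and not world["IsPost(e1)"] else 1


PROBS = world_probabilities(ATOMS, [(1.5, implication_count)])
```

Enumeration is exponential in the number of ground atoms, so this is only useful for checking intuition on tiny examples; practical MLN systems rely on approximate inference.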




Markov Logic Networks
      • Divide DOM tree elements into three categories:

            – Text element
            – Hyperlink element
            – Inner element

      • Benefit

            – Reduce the number of possible groundings in inference.

            – Reduce the ambiguity and achieve better performance.
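
The three-way split can be sketched as a small classifier over DOM nodes. The slide does not define the categories, so the rules below are assumptions: an `a` element counts as a hyperlink element, a leaf with non-empty text counts as a text element, and everything else counts as an inner element. The input is assumed to be well-formed XHTML so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def categorize(element: ET.Element) -> str:
    """Assign one of three categories under the assumed rules described above."""
    if element.tag == "a":
        return "hyperlink"
    if len(element) == 0 and (element.text or "").strip():
        return "text"
    return "inner"  # catch-all, including empty leaves such as <br/>


def category_counts(xhtml: str) -> Counter:
    root = ET.fromstring(xhtml)
    return Counter(categorize(el) for el in root.iter())
```

Typing elements this way lets each formula range over only the category it applies to, which is consistent with the stated benefit of fewer groundings during inference.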


Experiments

                         List Pages    Post Pages


Future Work

      http://discussions.apple.com/
Conclusion
      • A template-independent approach to extracting structured data from web forum sites.

      • We can leverage the power of site-level information, such as the mutual information among pages, within or across vertices of the sitemap.

      • http://research.microsoft.com/people/jmyang/


Saturday, May 22, 2010

Weitere ähnliche Inhalte

Ähnlich wie Incorporating site level knowledge to extract structured data from web forums - keynote

Developing Plugins on OpenVBX at Greater San Francisco Bay Area LAMP Group
Developing Plugins on OpenVBX at Greater San Francisco Bay Area LAMP GroupDeveloping Plugins on OpenVBX at Greater San Francisco Bay Area LAMP Group
Developing Plugins on OpenVBX at Greater San Francisco Bay Area LAMP Groupminddog
 
06 View Controllers
06 View Controllers06 View Controllers
06 View ControllersMahmoud
 
Sakai And The Academic Enterprise
Sakai And The Academic EnterpriseSakai And The Academic Enterprise
Sakai And The Academic EnterpriseMichael Feldstein
 
Web Typography with CSS3
Web Typography with CSS3Web Typography with CSS3
Web Typography with CSS3Matthew Smith
 
CSC 8101 Non Relational Databases
CSC 8101 Non Relational DatabasesCSC 8101 Non Relational Databases
CSC 8101 Non Relational Databasessjwoodman
 
Websockets - OMG! Someone broke the internet!
Websockets - OMG! Someone broke the internet!Websockets - OMG! Someone broke the internet!
Websockets - OMG! Someone broke the internet!James Lewis
 
Jim Webber R E S Tful Services
Jim  Webber    R E S Tful  ServicesJim  Webber    R E S Tful  Services
Jim Webber R E S Tful ServicesSOA Symposium
 
Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)
Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)
Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)John Adams
 
Smart Cities, Open Data and SMW - SMWCon Spring 2012 Keynote
Smart Cities, Open Data and SMW - SMWCon Spring 2012 KeynoteSmart Cities, Open Data and SMW - SMWCon Spring 2012 Keynote
Smart Cities, Open Data and SMW - SMWCon Spring 2012 KeynoteJoel Natividad
 
HTML 5: The Future of the Web
HTML 5: The Future of the WebHTML 5: The Future of the Web
HTML 5: The Future of the WebTim Wright
 
Movable Type 5 : 成長するプラットフォーム
Movable Type 5 : 成長するプラットフォームMovable Type 5 : 成長するプラットフォーム
Movable Type 5 : 成長するプラットフォームSix Apart KK
 
Web技術の現状と将来 (Open Source Conference 2011 Kyoto)
Web技術の現状と将来 (Open Source Conference 2011 Kyoto) Web技術の現状と将来 (Open Source Conference 2011 Kyoto)
Web技術の現状と将来 (Open Source Conference 2011 Kyoto) Rikkyo University
 
Data Collection and Integration, Linked Data Management
Data Collection and Integration, Linked Data ManagementData Collection and Integration, Linked Data Management
Data Collection and Integration, Linked Data ManagementRENDER project
 
Database Management for 
Real Estate Professionals
Database Management for 
Real Estate ProfessionalsDatabase Management for 
Real Estate Professionals
Database Management for 
Real Estate ProfessionalsDoug Devitre
 
全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版
全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版
全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版Rikkyo University
 
First look at SharePoint 2013
First look at SharePoint 2013First look at SharePoint 2013
First look at SharePoint 2013Adis Jugo
 
Jquery Introduction
Jquery IntroductionJquery Introduction
Jquery Introductioncabbiepete
 

Ähnlich wie Incorporating site level knowledge to extract structured data from web forums - keynote (20)

Developing Plugins on OpenVBX at Greater San Francisco Bay Area LAMP Group
Developing Plugins on OpenVBX at Greater San Francisco Bay Area LAMP GroupDeveloping Plugins on OpenVBX at Greater San Francisco Bay Area LAMP Group
Developing Plugins on OpenVBX at Greater San Francisco Bay Area LAMP Group
 
06 View Controllers
06 View Controllers06 View Controllers
06 View Controllers
 
Sakai And The Academic Enterprise
Sakai And The Academic EnterpriseSakai And The Academic Enterprise
Sakai And The Academic Enterprise
 
Search Engine Optimization
Search Engine OptimizationSearch Engine Optimization
Search Engine Optimization
 
Web Typography with CSS3
Web Typography with CSS3Web Typography with CSS3
Web Typography with CSS3
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web mining
Web miningWeb mining
Web mining
 
CSC 8101 Non Relational Databases
CSC 8101 Non Relational DatabasesCSC 8101 Non Relational Databases
CSC 8101 Non Relational Databases
 
Websockets - OMG! Someone broke the internet!
Websockets - OMG! Someone broke the internet!Websockets - OMG! Someone broke the internet!
Websockets - OMG! Someone broke the internet!
 
Jim Webber R E S Tful Services
Jim  Webber    R E S Tful  ServicesJim  Webber    R E S Tful  Services
Jim Webber R E S Tful Services
 
Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)
Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)
Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)
 
Smart Cities, Open Data and SMW - SMWCon Spring 2012 Keynote
Smart Cities, Open Data and SMW - SMWCon Spring 2012 KeynoteSmart Cities, Open Data and SMW - SMWCon Spring 2012 Keynote
Smart Cities, Open Data and SMW - SMWCon Spring 2012 Keynote
 
HTML 5: The Future of the Web
HTML 5: The Future of the WebHTML 5: The Future of the Web
HTML 5: The Future of the Web
 
Movable Type 5 : 成長するプラットフォーム
Movable Type 5 : 成長するプラットフォームMovable Type 5 : 成長するプラットフォーム
Movable Type 5 : 成長するプラットフォーム
 
Web技術の現状と将来 (Open Source Conference 2011 Kyoto)
Web技術の現状と将来 (Open Source Conference 2011 Kyoto) Web技術の現状と将来 (Open Source Conference 2011 Kyoto)
Web技術の現状と将来 (Open Source Conference 2011 Kyoto)
 
Data Collection and Integration, Linked Data Management
Data Collection and Integration, Linked Data ManagementData Collection and Integration, Linked Data Management
Data Collection and Integration, Linked Data Management
 
Database Management for 
Real Estate Professionals
Database Management for 
Real Estate ProfessionalsDatabase Management for 
Real Estate Professionals
Database Management for 
Real Estate Professionals
 
全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版
全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版
全てのエンジニアのためのWeb標準技術とのつきあい方 OSC福岡 2011版
 
First look at SharePoint 2013
First look at SharePoint 2013First look at SharePoint 2013
First look at SharePoint 2013
 
Jquery Introduction
Jquery IntroductionJquery Introduction
Jquery Introduction
 

Mehr von George Ang

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qqGeorge Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 

Mehr von George Ang (20)

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 

Kürzlich hochgeladen

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Kürzlich hochgeladen (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Incorporating site level knowledge to extract structured data from web forums - keynote

  • 1. Incorpora(ng  Site-­‐Level  Knowledge  to   Extract  Structured  Data  from  Web  Forums Jiang-­‐Ming  Yang,  Rui  Cai,  Yida  Wang,  Jun  Zhu,  Lei  Zhang,  and  Wei-­‐Ying  Ma Web  Search  &  Mining  Group Microso=  Research  Asia 2009-­‐04 Saturday, May 22, 2010
  • 2. Web  Forum  Data • An  important  informa,on  resource  with  a  lot  of  human   knowledge. • These  informa,on  include  recrea,on,  sports,  games,   computers,  art,  society,  science,  home,  health; • 20%  pages  on  the  search  results  are  from  forums Saturday, May 22, 2010
  • 3. Understanding  Forum Quality   Data   Crawling Assessmen ExtracIon t Saturday, May 22, 2010
  • 4. Understanding  Forum Quality   Data   Crawling Assessmen ExtracIon t WWW’08 WWW’09, SIGIR’09 iRobot:  An  Intelligent  Crawler  for   AutomaIon  Data  ExtracIon Quality  Assessment Web  Forums SIGIR’08 Exploring  Traversal  Strategy KDD’09 Incremental  Crawling Saturday, May 22, 2010
  • 13. Challenge • Leverage more site-level knowledge
  • 16. Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and the corresponding links
  • 17. Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and the corresponding links • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
  • 18. Page Clustering • Forum pages are generated from a database and templates • Layout is a robust way to describe a template – Layout can be characterized by the HTML elements in different DOM paths
  • 23. Page Clustering • DOM Path Feature Discovery
  • 25. Page Clustering • DOM Path Feature Discovery • Clustering by Virtual Tables
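The page-clustering slides describe layout in terms of the HTML elements found along different DOM paths. A minimal sketch of that idea follows, assuming a page's layout can be summarised as the set of its root-to-element tag paths and compared with Jaccard similarity. The names `DomPathCollector`, `dom_paths` and `layout_similarity`, and the choice of Jaccard similarity, are illustrative assumptions, not the authors' feature definition or their virtual-table clustering.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomPathCollector(HTMLParser):
    """Counts how often each root-to-element tag path occurs in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # DOM path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def dom_paths(html):
    """Return a Counter of the DOM paths found in an HTML string."""
    collector = DomPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def layout_similarity(paths_a, paths_b):
    """Jaccard similarity over two sets of DOM paths (an assumed measure)."""
    a, b = set(paths_a), set(paths_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Pages produced by the same template should share most of their DOM paths and score close to 1, while a list page and a post page should score much lower; a clustering step could then group pages whose similarity exceeds a chosen threshold.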
  • 26. Link Analysis • A Link = URL Pattern + Location
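The link-analysis slide defines a link as a URL pattern plus a location. The sketch below shows one way such a pair could be built, assuming the pattern is obtained by generalising digit runs and query values, and the location is the DOM path of the anchor element. The functions `url_pattern` and `link_signature`, the `{num}` placeholder and the example URLs are hypothetical; the slide does not say how the authors derive patterns.

```python
import re
from urllib.parse import urlsplit


def url_pattern(url):
    """Generalise a URL: digit runs in the path become {num}, query values become *.

    This is an assumed, simplified notion of "URL pattern".
    """
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{num}", parts.path)
    # Keep the query parameter names (sorted), drop their concrete values.
    keys = sorted(item.split("=", 1)[0] for item in parts.query.split("&") if item)
    query = "&".join(f"{key}=*" for key in keys)
    return path + ("?" + query if query else "")


def link_signature(url, dom_path):
    """Describe a link by where it points (pattern) and where it sits (location)."""
    return (url_pattern(url), dom_path)
```

Under these assumptions, two thread links that appear in the same table column of a list page receive the same signature, while a "next page" link in the footer receives a different one even if its URL looks similar.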
  • 28. Inner-Page Features • The inclusion relation. Data records usually have inclusion relations. • The alignment relation. Since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page. • Time order. Since post records are generated sequentially along a timeline, post times should be sorted in ascending or descending order.
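The time-order feature can be checked directly once candidate post times have been parsed. A minimal sketch, assuming the times are already available as comparable values such as `datetime` objects; the function name `is_time_ordered` is illustrative and parsing forum-specific date formats is left out.

```python
from datetime import datetime


def is_time_ordered(times):
    """True if the sequence is sorted ascending or descending (ties allowed)."""
    pairs = list(zip(times, times[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Example input: three post times read top to bottom from one page.
posts = [datetime(2009, 4, 1, 9, 0),
         datetime(2009, 4, 1, 9, 30),
         datetime(2009, 4, 2, 8, 15)]
```

A candidate set of time fields that fails this check is unlikely to be the real sequence of post times, which is what makes the feature useful as evidence.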
  • 37. Problem Setting • Author
  • 38. Problem Setting • Author • Title
  • 39. Problem Setting • Author • Title • Content
  • 40. Formulas of list page • Formulas for identifying list record • Formulas for identifying list title
  • 43. Formulas of post page • Formulas for identifying post record • Formulas for identifying post author
  • 45. Formulas of post page • Formulas for identifying post time • Formulas for identifying post content
  • 47. Markov Logic Networks • An MLN can be viewed as a template for constructing Markov Random Fields. • With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
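The formula on this slide is not present in the transcript. For reference, the distribution is usually written in the Markov logic literature as below; the symbols $w_i$, $n_i(x)$ and $Z$ follow that common notation and may differ from the notation used on the slide.

```latex
P(X = x) \;=\; \frac{1}{Z} \exp\!\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big( \sum_{i} w_i \, n_i(x') \Big)
```

Here $w_i$ is the weight attached to the $i$-th formula, $n_i(x)$ is the number of true groundings of that formula in state $x$, and $Z$ normalises over all possible states.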
  • 48. Markov Logic Networks • Divide DOM tree elements into three categories: – Text element – Hyperlink element – Inner element • Benefits – Reduce the number of possible groundings in inference. – Reduce ambiguity and achieve better performance.
  • 49. Experiments • List Pages • Post Pages
  • 57. Future work • http://discussions.apple.com/
  • 58. Conclusion • A template-independent approach to extracting structured data from web forum sites. • We can leverage the power of site-level information, such as the mutual information among pages within or across vertices of the sitemap. • http://research.microsoft.com/people/jmyang/