Opportunities and Challenges of Web Search and Mining
Lee-Feng Chien (簡立峰), Academia Sinica & National Taiwan University
Outline
WSE = Google Globalization!
WSE = Google
Problems of WSE
Inside WSE: Fast, Coverage, Accuracy
Business: Profitable, Models, Competition
Impacts: Web Computing, Knowledge Windows, New Paradigm of Civilization
I. Some Must-Know Statistics
Online Language Populations
Top Ten Languages in the Web: more and more non-English users!
TOP TEN LANGUAGES IN THE INTERNET | Internet Users, by Language | Average Penetration | World Population Estimate for Language | Language as % of Total Internet Users
English | 287,369,520 | 26.2 % | 1,098,654,265 | 35.9 %
Chinese | 105,484,112 | 8.0 % | 1,321,669,200 | 13.2 %
Japanese | 66,548,060 | 52.1 % | 127,853,600 | 8.3 %
German | 54,035,201 | 56.3 % | 95,893,300 | 6.8 %
Spanish | 53,670,063 | 13.9 % | 386,413,200 | 6.7 %
French | 35,034,269 | 9.3 % | 375,164,185 | 4.4 %
Korean | 30,670,000 | 41.0 % | 74,730,000 | 3.8 %
Italian | 28,610,000 | 49.3 % | 57,987,100 | 3.6 %
Portuguese | 23,058,254 | 10.3 % | 224,664,100 | 2.9 %
Dutch | 13,657,170 | 56.6 % | 24,125,950 | 1.7 %
TOP TEN LANGUAGES | 698,353,773 | 18.4 % | 3,787,154,900 | 87.3 %
Rest of the Languages | 101,686,725 | 3.9 % | 2,602,992,587 | 12.7 %
WORLD TOTAL | 800,040,498 | 12.5 % | 6,390,147,487 | 100.0 %
Web Content: more and more non-English pages. Source: Network Wizards, Jan 99 Internet Domain Survey
Web Users and Pages (5 years ago): challenge of scalability! Chinese users: 110M, including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others. Source: Global Reach, 2004
Number of Chinese Web Pages: 573,000,000 pages. Scalability problem!
Number of Web Pages: the world's largest search engine? (Chart: billions of textual documents indexed as of Sept 2, 2003. KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch)
The top 10 Internet trends 2004 predicted by eOneNet.com
II. Inside WSE
Components
Architecture (diagram): the Web, Spider, Archive, Indexer, Index servers, search-engine (SE) front ends, Browser, and Log, with steps numbered (1) to (5). Annotations: 1B queries/day, quality results, spam, freshness, 5B pages, scalable.
Spider
Index Server
System Anatomy
Data Structure
Lexicon: fits in memory; kept in two different forms
Hit list: accounts for most of the space; each hit uses 2 bytes to save space
Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
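The relationship between the forward index and the inverted index can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine's actual layout: the names `forward_index` and `invert`, the toy wordIDs and docIDs, and the use of in-memory dictionaries are all hypothetical, and a hit is reduced to a word position.

```python
from collections import defaultdict

# Toy forward index (hypothetical data): docID -> list of (wordID, position) hits.
forward_index = {
    1: [(10, 0), (11, 1), (10, 2)],
    2: [(11, 0), (12, 1)],
}

def invert(forward):
    """Build wordID -> [(docID, [positions])], with each doc list sorted by docID."""
    postings = defaultdict(dict)
    for doc_id, hits in forward.items():
        for word_id, pos in hits:
            postings[word_id].setdefault(doc_id, []).append(pos)
    # Sort the lexicon by wordID and every posting list by docID.
    return {
        word_id: sorted(docs.items())
        for word_id, docs in sorted(postings.items())
    }

inverted_index = invert(forward_index)
```

Both structures hold the same hits; only the sort key changes, which is why the inverted index can be produced by re-sorting the forward index.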
Query Server
PageRank
PageRank (Cont.)
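The slide bodies for PageRank did not survive extraction, so the following is a sketch of the commonly used normalized power-iteration form of PageRank, not a reconstruction of the slides. The damping factor 0.85, the iteration count, the uniform handling of pages without out-links, and the toy graph `toy_web` are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, targets in links.items():
            for q in targets:
                # Each page shares its rank equally among its out-links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page web: C is linked from A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph the page with the most in-links (C) ends up with the highest rank, and D, which nobody links to, keeps only the baseline share.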
Search Functions
Document Delivery
III. Business
What is Google?
Company Facts
Employees: 1,300+
Languages spoken: 34
Worldwide offices: 21 (mostly in US & Europe)
Annual revenues: $900m
Google Revenue. Source: Eric Schmidt interview, PCWorld.com (January 30, 2002)
Sources of Revenue
Challenges (cont.)
Competitors: eBay and Amazon
Competitors: Microsoft and Yahoo
IV. Impacts
Impacts
Web Computing
Knowledge Windows
New Web OS
V. New Generation of WSE
Advanced Google
New Features in Google
Other Search Tools
Clusty.com
Example on Vivisimo
Vivisimo (cont.)
New Directions
VI. Web Mining
Web Search/Information Retrieval (diagram): millions of users, Web search engine, information seeking
Improving Search via Mining (diagram): millions of users; Web texts, images, logs, …; search engine; knowledge discovery
Valuable Web Resources (for knowledge discovery from Web logs, texts, images, … of millions of users): hyperlinks, anchor texts, search-result pages, query logs, query session logs, click-stream logs, deep Web, …
Discovered Knowledge: users' preferences/needs (topic, location, timing, …); authority/popularity (site, file, people, company, product); clusters/associations/relations (site, page, people, company, product, query)
Web Mining for IR: search, classification, clustering, cross-language IR, information extraction, text mining, filtering
Computational Linguistics, 29(3), September 2003.
Research at Web Knowledge Discovery Lab
LiveTrans: Cross-language Web Search
LiveClassifier: Classifying search results into a user-defined classification tree
LiveClassifier: Paper Title Categorization. Note: no labeled training data
LiveCluster: Taxonomy Generation
Term Clustering
Query Clustering (figure: dendrogram of query terms, cut into five clusters)
Cluster 1 (job seeking): 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104 人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才
Cluster 2 (fortune-telling): 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情
Cluster 3 (airlines): eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
Cluster 4: 補帖, 大補帖, 泡麵, dbt
Cluster 5 (martial-arts fiction): 武俠, 金庸, 武俠小說, 黃易, 作家
Outline
Translating Unknown Queries. Note: first work dealing with online translation
Introduction (cont.)
English Terminologies | Chinese Translation
Louvre museum | 羅浮宮
NII Japan | 国立情報学研究所
Ishikawa | 石川県
Banff | 班夫 / 班芙
Digital library | 數位圖書館 / 數字圖書館
SARS | 嚴重急性呼吸道症候群 / 非典 / 沙士
Clinton | 柯林頓 / 克林頓
Bill Gates | 比爾蓋茲
Web Mining of Query Translations (diagram): source term, term translation, target translations; Web mining by anchor-text mining and search-result mining; OOD example: Yahoo <-> 雅虎
Anchor Text (Yahoo <-> 雅虎)
Search Result Page (National Palace Museum vs. 故宮博物院)
Problems
Term Extraction: SCPCD
Term Selection: Probabilistic Inference Model (co-occurrence, page authority, PageRank)
Observation of Anchor Text: source term (Ts) 雅虎 => translation (Tt) Yahoo
Observation of Anchor Text (diagram, built up over three slides): anchor texts pointing to www.yahoo.com and www.yahoo.com.tw form the anchor-text set. Fragments shown: "Yahoo", "雅虎", "- in USA", "Taiwan -", "台灣 -" (Taiwan), "- 搜尋引擎" (search engine). Labels: source query, translation candidates, co-occurrence, page authority (#in-link = 187 and #in-link = 21).
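The observation above (a source term and its translation tend to appear in anchor texts that point to the same page, and pages with more in-links are more trustworthy) can be sketched as follows. This is a simplified stand-in, not the probabilistic inference model named in the slides: the scoring rule, the function name `translation_candidates`, and the sample `anchor_links` data are all assumptions.

```python
from collections import defaultdict

# Hypothetical (anchor_text, target_url) pairs harvested by a crawler.
anchor_links = [
    ("Yahoo", "www.yahoo.com"),
    ("Yahoo", "www.yahoo.com"),
    ("雅虎", "www.yahoo.com"),
    ("雅虎", "www.yahoo.com.tw"),
    ("Yahoo", "www.yahoo.com.tw"),
    ("搜尋引擎", "www.yahoo.com.tw"),
    ("Google", "www.google.com"),
]

def translation_candidates(source, links):
    """Score terms that share an anchor-text set with `source`.

    Each URL votes for the co-occurring terms with a weight equal to its
    share of all in-links (a crude stand-in for page authority).
    """
    anchor_sets = defaultdict(set)
    in_links = defaultdict(int)
    for text, url in links:
        anchor_sets[url].add(text)
        in_links[url] += 1
    total = sum(in_links.values())
    scores = defaultdict(float)
    for url, texts in anchor_sets.items():
        if source in texts:
            for term in texts - {source}:
                scores[term] += in_links[url] / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

candidates = translation_candidates("雅虎", anchor_links)
```

With this toy data "Yahoo" ranks first because it co-occurs with 雅虎 on both pages, while 搜尋引擎 (search engine) co-occurs on only one. A real system would also filter candidates by target language, which this sketch omits.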
Search Result Mining (diagram): source query, search engine, Web pages, term extraction, term selection, target translations. Term extraction uses the PAT-tree-based term extraction method [Chien, SIGIR '97].
Term Selection (diagram): query S and candidate terms T1, T2, …, Tn
Chi-Square Test
a: # of pages containing both terms s and t
b: # of pages containing term s but not t
c: # of pages containing term t but not s
d: # of pages containing neither term s nor t
N: the total number of pages, i.e., N = a + b + c + d
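The formula on this slide was lost in extraction. Given the counts a, b, c, d and N defined above, the standard chi-square statistic for a 2x2 contingency table is N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)); assuming that is the form intended, a minimal sketch is shown below. The function name `chi_square` is hypothetical, and in practice the counts would come from search-engine page counts.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table of page counts.

    a: pages with both s and t    b: pages with s but not t
    c: pages with t but not s     d: pages with neither
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # a margin is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator
```

Independent terms score 0 (for example `chi_square(10, 10, 10, 10)`), while terms that always appear together score N (for example `chi_square(20, 0, 0, 20)` gives 40.0).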
Context Vector Analysis
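The bullets of this slide were lost, so the following only illustrates the general idea of context-vector comparison: each term is represented by the words found in its search-result snippets, and two terms are compared by the cosine of their vectors. Raw word counts are used as weights here for brevity (a weighted scheme such as tf-idf is a common alternative); the function names and the sample snippets are assumptions.

```python
import math
from collections import Counter

def context_vector(snippets):
    """Bag-of-words vector built from the snippets retrieved for a term."""
    return Counter(word for s in snippets for word in s.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical snippets for a source term and two translation candidates.
source = context_vector(["national palace museum taipei art"])
good = context_vector(["taipei museum art collection"])
poor = context_vector(["weather forecast taipei"])
```

Here `cosine(source, good)` is larger than `cosine(source, poor)`, so the first candidate would be preferred.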
Indirect Association Problem. Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors. Node labels in the figure: s, t, s1, t1; Cisco, 思科 (Cisco), system, 系統 (system).
Competitive Linking Algorithm. Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2. Node labels in the figure: s, t1, t2, St1, St2; Cisco, system, 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer).
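The source's Algorithm 2 is not reproduced in the slides. As a rough sketch of the greedy one-to-one linking usually meant by "competitive linking", the code below links the highest-scoring pair first and then removes both terms from further consideration, so that an indirect association cannot claim a term that already has a stronger partner. The association scores are invented for illustration.

```python
def competitive_linking(scores):
    """scores: dict mapping (source_term, target_term) -> association score."""
    linked_sources, linked_targets, links = set(), set(), []
    # Visit candidate pairs from the strongest association to the weakest.
    for (s, t), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s not in linked_sources and t not in linked_targets:
            links.append((s, t))
            linked_sources.add(s)
            linked_targets.add(t)
    return links

# Hypothetical scores; the two weaker pairs are indirect associations.
scores = {
    ("Cisco", "思科"): 0.9,
    ("Cisco", "系統"): 0.6,
    ("system", "系統"): 0.8,
    ("system", "思科"): 0.5,
}
links = competitive_linking(scores)
```

With these scores the result pairs Cisco with 思科 and system with 系統, and the indirect pair (Cisco, 系統) is discarded.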
Combined Method. R_m(s,t): the rank of the score of target candidate t for source term s under method m
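The combination formula itself is missing from the extracted slide. One plausible reading, given that only the ranks R_m(s,t) are defined, is a weighted sum of reciprocal ranks across methods; the sketch below implements that reading as an assumption, with hypothetical method names, equal default weights, and made-up candidate lists.

```python
def combine_by_rank(rankings, weights=None):
    """rankings: dict method -> list of candidates, best first.

    Each method contributes weight / rank to a candidate's combined score.
    """
    weights = weights or {m: 1.0 for m in rankings}
    combined = {}
    for method, ranked in rankings.items():
        for position, candidate in enumerate(ranked, start=1):
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] / position
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical ranked outputs of the three methods for one source term.
rankings = {
    "anchor_text": ["雅虎", "搜尋引擎"],
    "chi_square": ["雅虎", "奇摩", "搜尋引擎"],
    "context_vector": ["奇摩", "雅虎"],
}
best = combine_by_rank(rankings)
```

Working with ranks rather than raw scores avoids having to normalize scores that live on different scales in each method.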
Experiments
Random Query Test Set
Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set.
Method | Top-1 | Top-3 | Top-5 | Coverage
CV | 40.0% | 54.0% | 54.0% | 68%
χ² | 36.0% | 50.0% | 52.0% | 68%
AT | 20.0% | 32.0% | 32.0% | 32%
Combined | 44.0% | 64.0% | 66.0% | 72%
Other Experiments
Transitive Translation: top-n inclusion rates obtained with different models.
Model | Top1 | Top2 | Top3 | Top4 | Top5
Direct Translation | 35.7% | 43.0% | 46.9% | 49.6% | 51.2%
Indirect Translation | 44.2% | 55.1% | 58.0% | 59.7% | 60.5%
Transitive Translation | 49.2% | 58.1% | 60.9% | 61.6% | 62.0%
Transitive Translation Model
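The model's formula is not present in the extracted text. A common way to realize the idea, assumed here, is to use the direct score when it is strong enough and otherwise fall back to an indirect score obtained by going through an intermediate (pivot) language such as English. The threshold value, the function names, and the toy scores below are hypothetical.

```python
def indirect_score(s, t, to_pivot, from_pivot):
    """Sum over pivot terms m of score(s, m) * score(m, t)."""
    return sum(
        score * from_pivot.get(m, {}).get(t, 0.0)
        for m, score in to_pivot.get(s, {}).items()
    )

def transitive_score(s, t, direct, to_pivot, from_pivot, threshold=0.1):
    """Use the direct score if it exceeds the threshold, else the indirect one."""
    d = direct.get(s, {}).get(t, 0.0)
    return d if d > threshold else indirect_score(s, t, to_pivot, from_pivot)

# Hypothetical scores: traditional Chinese -> English -> Japanese.
direct = {"新力": {"ソニー": 0.02}}
to_pivot = {"新力": {"Sony": 0.8, "Nike": 0.1}}
from_pivot = {"Sony": {"ソニー": 0.9}, "Nike": {"ナイキ": 0.9}}
score = transitive_score("新力", "ソニー", direct, to_pivot, from_pivot)
```

In this example the direct evidence is weak (0.02), so the score comes from the pivot path through "Sony" (0.8 × 0.9).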
Chinese-Japanese Translation
Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
Model | Top1 | Top2 | Top3 | Top4 | Top5
Direct | 10.5% | 12.8% | 14.3% | 15.1% | 15.1%
Indirect | 40.2% | 49.4% | 56.6% | 58.6% | 59.6%
Transitive | 42.9% | 51.4% | 58.6% | 61.3% | 61.9%
Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
Source terms (Traditional Chinese) | English | Simplified Chinese | Japanese
新力 | Sony | 索尼 | ソニー
耐吉 | Nike | 耐克 | ナイキ
史丹佛 | Stanford | 斯坦福 | スタンフォード
雪梨 | Sydney | 悉尼 | シドニー
網際網路 | internet | 互联网 | インターネット
網路 | network | 网络 | ネットワーク
首頁 | homepage | 主页 | ホームページ
電腦 | computer | 计算机 | コンピューター
資料庫 | database | 数据库 | データベース
資訊 | information | 信息 | インフォメーション
Translation Lexicons with Regional Variations: (a) Taiwan (b) Mainland China (c) Hong Kong. Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words "George Bush" from Google.
Summary
LiveCluster: Generating Taxonomy from Terms or Documents
Taxonomy Generation from Terms
Hierarchical Query Clustering
The Steps
Feature Extraction
Example search-result snippets for the query "nude":
Creative Nude Photography Network -- Fine Art Nude and ... ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...
Nude Places ... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...
A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...
Co-occurred feature terms:
term | tf/df
photography | 2/2
art | 3/2
… | …
naked | 1/1
erotic photography | 3/2
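The tf/df figures above count how often a feature term occurs in the retrieved snippets (tf) and in how many snippets it occurs (df). A minimal sketch of that counting is given below; it uses whole-word, case-insensitive matching and neutral made-up snippets, both of which are assumptions, and it does not reproduce the slide's handling of overlapping terms.

```python
import re

def tf_df(term, snippets):
    """Return (tf, df): total occurrences and number of snippets containing term."""
    pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
    counts = [len(pattern.findall(s.lower())) for s in snippets]
    return sum(counts), sum(1 for c in counts if c > 0)

# Hypothetical search-result snippets for the query "jaguar".
snippets = [
    "Jaguar cars: luxury cars and sports cars from Jaguar",
    "The jaguar is a big cat native to the Americas",
    "Used cars for sale, including Jaguar models",
]
cars = tf_df("cars", snippets)
big_cat = tf_df("big cat", snippets)
```

For these snippets "cars" has tf/df = 4/2 and "big cat" has tf/df = 1/1.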
Term Weighting
Extraction of Basic Feature Terms
Task I: Query Clustering (Cont.)
Term Similarity
Hierarchical Term Clustering
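The details of this slide were lost in extraction. As a generic illustration of hierarchical (agglomerative) term clustering, the sketch below repeatedly merges the two most similar clusters and records the merges, which together form a binary tree. Average linkage, Jaccard similarity over feature sets, and the sample `features` are assumptions and are not claimed to be the method used in the talk.

```python
def hac(items, similarity):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, sim)."""
    clusters = [(item,) for item in items]
    merges = []

    def avg_link(a, b):
        # Average pairwise similarity between the members of two clusters.
        return sum(similarity(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(avg_link(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        sim, i, j = max(pairs)
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical feature sets extracted for four query terms.
features = {
    "求職": {"job", "resume", "hire"},
    "履歷表": {"job", "resume"},
    "華航": {"airline", "flight"},
    "長榮航空": {"airline", "flight", "eva"},
}

def jaccard(x, y):
    a, b = features[x], features[y]
    return len(a & b) / len(a | b)

merges = hac(list(features), jaccard)
```

The first two merges group the job-related terms and the airline terms; the last merge, with similarity 0, joins the two groups at the root of the tree.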
 
Clustering Results (figure): the dendrogram of query terms shown earlier under Query Clustering, cut into clusters 1 to 5
Cluster Partition
Quality Function
Quality Function (Cont.)
Preliminary Experiment
Evaluation: F-Measure
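The slide's own definition is not in the extracted text. The usual F-measure combines precision P and recall R as F = (1 + β²)PR / (β²P + R), which reduces to the harmonic mean 2PR / (P + R) when β = 1. A small sketch under that assumption follows; for clustering evaluation, P and R would typically be computed per cluster against a reference class, a step not shown here.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, `f_measure(0.5, 0.5)` is 0.5, and a cluster with perfect precision but recall 0.5 scores about 0.667.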
Obtained F-Measures
Results of Hierarchical Structure Generation

Web Search And Mining (Ntuim)

  • 1. Opportunities and Challenges of Web Search and Mining Lee-Feng Chien ( 簡立峰 ) Academia Sinica & National Taiwan University
  • 2.
  • 3. WSE = Google Globalization!
  • 5. Problems of WSE Inside WSE . Fast . Coverage . Accuracy
  • 6. Problems of WSE Inside WSE . Fast . Coverage . Accuracy Business . Profitable . Models . Competitions
  • 7. Problems of WSE Inside WSE . Fast . Coverage . Accuracy Business . Profitable . Models . Competition Impacts . Web Computing . Knowledge Windows . New Paradigm of Civilization
  • 8. I. Some Must-Know Statistics
  • 9.
  • 10.
  • 11. Web Content Source: Network Wizards Jan 99 Internet Domain Survey More and more non-English pages
  • 12. Web Users and Pages (5 years ago) Challenge of Scalability ! Chinese Users: 110M Including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others. Source: Global Reach, 2004
  • 13. Number of Chinese Web Pages 573,000,000 pages Scalability Problem !
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. II. Inside WSE
  • 19.
  • 20. Architecture SE SE SE Browser Web 1B queries/day Quality results Log .Spam . Freshness 5B pages Scalable Index Index Index Spider Indexer Archive (1) (2) (3) (4) (5)
  • 21.
  • 22.
  • 24. Data Structure Lexicon: fit in memory two different forms Hit list: account for most space use 2 bytes to save space Forward index: barrels are sorted by wordID. Inside barrel, sorted by docID Inverted Index: some content as the forward index, but sorted by wordID. doc list is sorted by docID
  • 25.
  • 27.
  • 28.
  • 29.
  • 31.
  • 32. Company Facts Employees: 1,300+ Languages spoken: 34 Worldwide Offices: 21 (Mostly in US & Europe) Annual Revenues: $900m
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44. V. New Gen. of WSE
  • 45.
  • 46.
  • 47.  
  • 48.
  • 52.
  • 53. VI. Web Mining
  • 54. Web Search/Information Retrieval Millions of Users Web Search Engine Information Seeking
  • 55. Improving Search via Mining Millions of Users Web texts, images, logs … Search Engine Knowledge Discovery
  • 56. Valuable Web Resources Web logs, texts, images , … Knowledge Discovery Millions of Users Hyper Links Anchor Texts Search Result Pages Query Logs Query Session Logs Clicked Stream Logs Deep Web, ….
  • 57. Discovered Knowledge Web logs, texts, images , … Knowledge Discovery Millions of Users Users’ Preferences/Need: Topic, Location, Timing, … Authority/Popularity: Site, File, People, Company, Product Clusters/Associations/ Relations: Site, Page, People, Company, Product, Query
  • 58. Web Mining for IR Web logs, texts, images , … Knowledge Discovery Millions of Users Search Classification Clustering Cross-language IR Information Extraction Text mining Filtering
  • 60. Computational Linguistics, 29, Issue 3, September 2003.
  • 61. Research at Web Knowledge Discovery Lab
  • 65. LiveClassifier: Classifying search results into a user-defined classification tree
  • 66. LiveClassifier: Paper Title Categorization. Note: no labeled training data
  • 67. LiveCluster: Taxonomy Generation
  • 69. Query Clustering [figure: query terms grouped by a cut into five clusters, numbered 1 to 5]
      ◦ 武俠, 金庸, 武俠小說, 黃易, 作家
      ◦ 補帖, 大補帖, 泡麵, dbt
      ◦ eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
      ◦ 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情
      ◦ 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104 人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才
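A grouping of this kind can be produced by agglomerative clustering that stops merging at a similarity cut. The sketch below is a generic average-link variant, not the method behind the slide: the feature vectors (here imagined as term counts taken from each query's search results), the cosine measure, the `cut` value and all function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_queries(features, cut=0.3):
    """Average-link agglomerative clustering of queries.

    features maps query -> sparse feature vector. Merging stops as soon as
    the best inter-cluster similarity drops below `cut`.
    """
    clusters = [[q] for q in features]

    def similarity(c1, c2):
        total = sum(cosine(features[a], features[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < cut:
            break
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters


# Hypothetical feature vectors, invented for this example.
features = {
    "華航": {"airline": 2, "flight": 1},
    "長榮航空": {"airline": 1, "flight": 2},
    "金庸": {"novel": 2, "author": 1},
    "武俠小說": {"novel": 1, "martial": 1},
}
clusters = cluster_queries(features, cut=0.3)
print(clusters)
```

Raising `cut` yields more, tighter clusters; lowering it merges further up the hierarchy.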
  • 80. Observation of Anchor Text: source term (Ts) 雅虎 => translation (Tt) Yahoo
  • 81. Observation of Anchor Text [diagram]: the source query "Yahoo" appears in anchor texts ("Yahoo", "- in USA", "Taiwan -") pointing to www.yahoo.com and www.yahoo.com.tw.
  • 82. Observation of Anchor Text [diagram]: the anchor-text set of www.yahoo.com and www.yahoo.com.tw ("Yahoo", "雅虎", "- in USA", "Taiwan -", "台灣 -" (Taiwan), "搜尋引擎" (search engine)) supplies the translation candidates.
  • 83. Observation of Anchor Text [diagram]: co-occurrence and page authority. Same anchor-text set for www.yahoo.com and www.yahoo.com.tw, with in-link counts shown (#in-link = 187 and #in-link = 21).
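The two signals named on these slides, co-occurrence within an anchor-text set and page authority from in-link counts, can be combined as in the following sketch. The scoring formula is a deliberate simplification chosen for illustration and is an assumption, not the model used in the talk; the anchor texts and in-link counts below are toy values, and `translation_scores` is a hypothetical name.

```python
def translation_scores(source_term, anchor_sets, in_links):
    """Rank translation candidates for `source_term`.

    anchor_sets maps URL -> list of anchor texts pointing at that URL.
    in_links maps URL -> number of in-links, used as a crude page authority.
    A candidate scores higher when it co-occurs with the source term in the
    anchor-text set of an authoritative page.
    """
    total_links = sum(in_links.values())
    scores = {}
    for url, anchors in anchor_sets.items():
        if source_term not in anchors:
            continue   # no co-occurrence evidence on this page
        authority = in_links[url] / total_links
        for candidate in set(anchors):
            if candidate == source_term:
                continue
            frequency = anchors.count(candidate) / len(anchors)
            scores[candidate] = scores.get(candidate, 0.0) + authority * frequency
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (assumed), loosely modeled on the Yahoo example in the slides.
anchor_sets = {
    "www.yahoo.com": ["Yahoo", "雅虎", "Yahoo", "搜尋引擎"],
    "www.yahoo.com.tw": ["雅虎", "Yahoo", "台灣"],
}
in_links = {"www.yahoo.com": 3, "www.yahoo.com.tw": 1}

ranked = translation_scores("雅虎", anchor_sets, in_links)
print(ranked[0][0])
```

Weighting by authority means that a candidate seen on a heavily linked page counts for more than the same candidate on an obscure one.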
  • 84. Search Result Mining [diagram]: a source query is sent to a search engine; term extraction runs over the returned Web pages using the PAT-tree based term extraction method [Chien, SIGIR '97]; term selection then yields the target translations.
  • 88. Indirect Association Problem. Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors. [Figure nodes: 思科 (Cisco) and 系統 (system) on the Chinese side; Cisco and system on the English side.]
  • 89. Competitive Linking Algorithm. Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2. [Figure nodes: 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer) on the Chinese side; Cisco and system on the English side.]
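The idea behind competitive linking can be shown with a short greedy matcher: candidate pairs are visited in order of decreasing association score, and a pair is linked only if neither term has been linked already. This is the generic greedy formulation, offered as a sketch; it is not a reproduction of the "Algorithm 2" mentioned in the figure caption, and the scores are invented for the example.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking.

    scores maps (source_term, target_term) -> association score.
    Returns a dict source_term -> target_term.
    """
    links = {}
    used_targets = set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (source, target), _score in ranked:
        if source not in links and target not in used_targets:
            links[source] = target
            used_targets.add(target)
    return links


# Invented scores: 系統 (system) co-occurs with "Cisco" more strongly than
# with "system", which is the indirect association error.
scores = {
    ("思科", "Cisco"): 0.9,
    ("系統", "Cisco"): 0.6,
    ("系統", "system"): 0.5,
    ("思科", "system"): 0.3,
}
links = competitive_linking(scores)
print(links)
```

Taken alone, 系統 would pick "Cisco" (0.6 beats 0.5). Because 思科 claims "Cisco" first with a higher score, 系統 falls back to "system", which removes the indirect association.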
  • 94. Transitive Translation: top-n inclusion rates obtained with different models.
      Model                   Top1   Top2   Top3   Top4   Top5
      Direct Translation      35.7%  43.0%  46.9%  49.6%  51.2%
      Indirect Translation    44.2%  55.1%  58.0%  59.7%  60.5%
      Transitive Translation  49.2%  58.1%  60.9%  61.6%  62.0%
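One plausible way to read the three models is sketched below. The slides do not give the formulas, so everything here is an assumption: the indirect score sums over intermediate-language terms, and the transitive score falls back to the indirect score whenever the direct score is too weak. The function names, the `threshold` parameter and the probabilities are all hypothetical.

```python
def indirect_score(source, target, src_to_mid, mid_to_tgt):
    """Score source -> target through every intermediate-language term."""
    return sum(
        p_source_mid * mid_to_tgt.get(mid, {}).get(target, 0.0)
        for mid, p_source_mid in src_to_mid.get(source, {}).items()
    )


def transitive_score(source, target, direct, src_to_mid, mid_to_tgt, threshold=0.1):
    """Use the direct score when it is strong enough, else the indirect one."""
    direct_value = direct.get(source, {}).get(target, 0.0)
    if direct_value > threshold:
        return direct_value
    return indirect_score(source, target, src_to_mid, mid_to_tgt)


# Invented scores: traditional Chinese -> Japanese, with English in between.
direct = {"新力": {"ソニー": 0.02}}
src_to_mid = {"新力": {"Sony": 0.8, "Sonic": 0.1}}
mid_to_tgt = {"Sony": {"ソニー": 0.9}, "Sonic": {"ソニック": 0.7}}

score = transitive_score("新力", "ソニー", direct, src_to_mid, mid_to_tgt)
print(round(score, 2))
```

A fallback of this shape would fit the table, where the transitive row is at least as good as both of the others at every rank.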
  • 96. Chinese-Japanese Translation
      Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
      Model       Top1   Top2   Top3   Top4   Top5
      Direct      10.5%  12.8%  14.3%  15.1%  15.1%
      Indirect    40.2%  49.4%  56.6%  58.6%  59.6%
      Transitive  42.9%  51.4%  58.6%  61.3%  61.9%
      Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
      Source term (Traditional Chinese) | Extracted target translations: English | Simplified Chinese | Japanese
      新力 | Sony | 索尼 | ソニー
      耐吉 | Nike | 耐克 | ナイキ
      史丹佛 | Stanford | 斯坦福 | スタンフォード
      雪梨 | Sydney | 悉尼 | シドニー
      網際網路 | internet | 互联网 | インターネット
      網路 | network | 网络 | ネットワーク
      首頁 | homepage | 主页 | ホームページ
      電腦 | computer | 计算机 | コンピューター
      資料庫 | database | 数据库 | データベース
      資訊 | information | 信息 | インフォメーション
  • 97. Translation Lexicons with Regional Variations: (a) Taiwan, (b) Mainland China, (c) Hong Kong. Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words "George Bush" from Google.
  • 99. LiveCluster: Generating Taxonomy from terms or documents
  • 110. Clustering Results [figure: query terms grouped by a cut into five clusters, numbered 1 to 5]
      ◦ 武俠, 金庸, 武俠小說, 黃易, 作家
      ◦ 補帖, 大補帖, 泡麵, dbt
      ◦ eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
      ◦ 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情
      ◦ 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104 人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才
  • 113. Quality Function (Cont.)
  • 114. Quality Function (Cont.)
  • 119. Results of Hierarchical Structure Generation