SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Downloaden Sie, um offline zu lesen
6/28/13
1
Big Data
in
The Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
- 3 -
Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks
6/28/13
2
- 4 -
4
Big Data
§  Capture, transfer, store, search, share, analyze,
and visualize large data in reasonable time
§  Large volume and growth
§  Petabytes to exabytes
§  Growth is estimated in 3 exabytes per day
§  Structured vs. non-structured data
§  Diversity
§  Types, formats, complexity, topics, etc.
§  Best Public Data Example: The Web
§  Content: text, multimedia
§  Structure: graphs
§  Usage: real time streams
- 5 -
5
Big Data
§  Focus on analytics
§  Many storage technologies:
§  DBs, DWs, distributed file systems, …
§  Many processing technologies:
§  Cloud computing, map-reduce (Hadoop), …
§  Data mining, clustering, classification, …
§  Machine learning, A/B testing, NLP, …
§  Simulation
§  Several technology providers
§  Initial best practices (see TDWI report, 2011)
§  Main challenges: scalability, online
6/28/13
3
- 6 -
6
Big Data: The Five V’s
Characteristic Data Issue Computing Issue
Volume Scale,
Redundancy
Scalability
Variety Heterogeneity,
Complexity
Adaptability,
Extensibility
Veracity Completeness, Bias,
Sparsity, Noise, Spam
Reliability,
Trust
Velocity Real time Online
Value Usefulness,
Privacy
Business
dependent
- 7 -
7
Asking the Right Questions
§  Problem Driven
§  What data we need? How much?
§  How we collect it? How we store and transfer it?
§  Understanding the Data
§  How sparse is the data? How much noise?
§  There is redundancy? There are biases?
§  There is spam? Any outliers?
§  Analyzing the Data
§  Any privacy issues? Do we need to anonymize?
§  How well our algorithms scale?
§  Can we visualize the results?
6/28/13
4
- 8 -
8
Too Much Data Available
§  The Web is a database!
§  Data does not imply information
§  Many analyses for the sake of it (data driven)
§  Analyzing data is not CS per se
§  Publish in the right forum!
§  Big Data or Right Data?
- 9 -
9
The Different Facets of the Web
6/28/13
5
- 11 -
11
The Structure of the Web
- 12 -
Big Data in the Web
Metadata
RDF
Wikipedia ODP
Flickr
Text
Anchors + links
Y! Answers
Logs (Clicks+Queries)
Explicit Implicit
Wordnet
UGC
Private
Scale
Blogs,
Groups
Quality?
6/28/13
6
- 13 -
Quantity
Quality
User-
generated
Traditional
publishing
What is in the Web? How Good it is?
- 14 -
14
What else is in the Web?
6/28/13
7
- 15 -
15
Noise and Spam
§  Noise may come from many places:
§  Instruments that measure
§  How we interpret the data (example later)
§  Spam is everywhere
- 16 -
16
Web Spam
Deceiving text, links, clicks…
due to an economic incentive
Depending on the goal and the data,
spam is easier to generate
Depending on the type & target data,
spam is easier to fight
Disincentives for spammers?
•  Social
•  Economical
Web Spam is NOT Mail Spam
6/28/13
8
- 17 -
17
- 18 -
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
6/28/13
9
- 19 -
Web Data Trends
•  User Generated Content
– Massive (quality vs. quantity)
– Social Networks
– Real time (people + physical sensors)
•  Impact
– Fragmentation of ownership
– Fragmentation of access (longer heavy tail)
– Fragmentation of right to access
•  Viability
– Business model based in advertising
- 20 -
The Wisdom of Crowds
•  James Surowiecki, a New Yorker columnist,
published this book in 2004
– “Under the right circumstances, groups are
remarkably intelligent”
•  Importance of diversity, independence and
decentralization
“large groups of people are smarter than an elite few,
no matter how brilliant—they are better at solving
problems, fostering innovation, coming to wise
decisions, even predicting the future”.
Aggregating data
6/28/13
10
- 21 -
21
Web Data Mining
•  Content: text & multimedia mining
•  Structure: link analysis, graph mining
•  Usage: log analysis, query mining
•  Relate all of the above
– Web characterization
– Particular applications
- 22 -
Flickr: Clustering Pictures
22
6/28/13
11
- 23 -
Popularity
- 24 -
Flickr: Geo-tagged pictures
24
24
6/28/13
12
- 27 -
“Crowd Sourcing”
Web-based “peer production” has produced a number of
successful products and communities
•  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
•  Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
•  Like outsourcing, but in a micro-distributed fashion
•  Thousands of “turkers” working on hundreds of “HITS” (tasks)
•  Rates are typically few cents per task
•  Quality of their work is positively evaluated (e.g. in IR)
- 28 -
The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
• not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
• Queries and actions (or no action!)‫‏‬
The crowd implicitly
knows the experts!
6/28/13
13
- 30 -
30
Scalability
§  How to scale?
§  Doubling the data in the best case will double the time
§  Time complexity vs. result quality trade-off
§  Example: entity detection in linear time at almost state
of the art quality
§  That implies that there exists a text size n* for which
the linear algorithm will produce more correct entities
§  Distributed parallel processing
§  Map-reduce not always works
§  Parallelism is problem dependent
§  Online processing needs a different approach
- 31 -
31
Redundancy and Bias
§  There is any dependency in the data?
§  There is any duplication?
§  Lexical duplication in the Web is around 25%
§  Semantic duplication is larger
§  Are there any biases?
§  Example 1: clicks in search engines
§  Bias to the ranking and the interface
§  There is a ranking bias in the Web content
§  Example 2: tag recommendation
6/28/13
14
- 32 -
We can suggest tags: nice but ....
- 33 -
Privacy Example:
AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a
three-month period on topics ranging from “numb
fingers” to “60 single men”.
Other queries: “landscapers in Lilburn, Ga,” several
people with the last name Arnold and “homes sold
in shadow lake subdivision gwinnett county
georgia.”
Data trail led to Thelma Arnold, a 62-year-old widow
who lives in Lilburn, Ga., frequently researches her
friends’ medical ailments and loves her three dogs.
A Face Is Exposed for AOL Searcher No. 4417749,
By MICHAEL BARBARO and TOM ZELLER Jr,
The New York Times, Aug 9 2006
33
6/28/13
15
- 34 -
Risks of Privacy
(ZIP code, date of birth, gender)
is enough to identify 87% of
US citizens using public DB
(Sweeney, 2001)
K-anonymity
Suppress or generalize attributes until
each entry is identical to at least k-1
other entries
Federal Trade Commission in
US: Privacy policies should
“address the collection of data
itself and not just how the
data is used”, Dec 2010.
Data Protection Directive in EU
34
- 35 -
Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tompkins, CIKM 2007]
•  Gender: 84%
•  Age (±10): 79%
•  Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
•  Partial name: 8.9%
•  Complete: 1.2%
More information:
•  A Survey of query log privacy-enhancing techniques
from a policy perspective [Cooper, ACM TWEB 2008]
A good anonymization is still an open problem
6/28/13
16
- 36 -
36
Sparsity
§  The Long Tail is always Sparse
§  Why there is a long tail?
§  When the crowd dominates
§  Empowering the tail
§  Example: Relations from Query Logs
- 38 -
The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
Long tail
Heavy tail
6/28/13
17
- 39 -
The Long Tail
Most measures in the Web follow a power law
- 42 -
People
Interests
42
Heavy tail of user interests
Many queries, each asked very few times, make
up a large fraction of all queries
Movies watched, blogs read, words used, …
Normal
people
Weirdos
One explanation
6/28/13
18
- 43 -
Many queries, each asked very few times, make
up a large fraction of all queries
Applies to word usage, web page access, …
We are all partially eclectic
People
Interests
Broder, Gabrilovich, Goel, Pang; WSDM 2009
The reality
Heavy tail of user interests
- 44 -
Example: Click Distribution
User interaction
is a
power law!
(Zipf’s principle
of minimal effort)
6/28/13
19
- 45 -
When the crowd dominates
Kills the long tail
See (obsolete now)
“shwarzneger” example
45
- 46 -
Empowering the Tail
The Filter “Bubble”, Eli Pariser
•  Avoid the Poor get Poorer Syndrome
Solutions:
•  Diversity
•  Novelty
•  Serendipity
46
Explore & Exploit
6/28/13
20
- 47 -
How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sparse
Aggregate users around same intent, task, facet, ….
Change granularity “ad hoc”
•  Middle age men
•  Fans of Messi
47
- 48 -
48
Example: Mining Geo/time Data
•  Optimal Touristic Paths from Flickr
•  Good for tourists and locals
De Choudhury et al, HT 2010
6/28/13
21
- 49 -
•  The long tail is important not only for e-
commerce, but because we are all there
•  Personalization vs. Contextualization
User interaction is another long tail
People
Interests
Aggregating in the Long Tail
- 69 -
69
Epilogue
l The Web is scientifically young
l The Web is intellectually diverse
l The technology mirrors the economic, legal and
sociological reality
l  Data must be interesting! (Gerhard Weikum)
l  Problem driven
l  Plenty of challenges
6/28/13
22
- 70 -
70
Mirror of Society
- 71 -
71
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006
6/28/13
23
Contact: rbaeza@acm.org
Thanks to many people at Yahoo! Labs
ASIST 2012
Book of the
Year Award
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Teaching information: from Google Search to Big Data
Teaching information: from Google Search to Big DataTeaching information: from Google Search to Big Data
Teaching information: from Google Search to Big DataMartin Patrick
 
Filtering for in and of on education
Filtering for in and of on educationFiltering for in and of on education
Filtering for in and of on educationCraig Cunningham
 
Researchers, Discovery and the Internet: What Next?
Researchers, Discovery and the Internet: What Next?Researchers, Discovery and the Internet: What Next?
Researchers, Discovery and the Internet: What Next?David Smith
 
HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?Brian Vetruba
 
Google & garbage lsta 2012
Google & garbage lsta 2012Google & garbage lsta 2012
Google & garbage lsta 2012Paige Jaeger
 
Enterprise Scale Knowledge Graphs
Enterprise Scale Knowledge GraphsEnterprise Scale Knowledge Graphs
Enterprise Scale Knowledge GraphsAnant Narayanan
 
Introduction to Digital Life (March 2017)
Introduction to Digital Life (March 2017)Introduction to Digital Life (March 2017)
Introduction to Digital Life (March 2017)KR_Barker
 
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...Frederick Zarndt
 
googlization of information
googlization of informationgooglization of information
googlization of informationrajat00001in
 
The Reputation Economy (March 2016)
The Reputation Economy (March 2016)The Reputation Economy (March 2016)
The Reputation Economy (March 2016)KR_Barker
 
Where the Social Web Meets the Semantic Web. Tom Gruber
Where the Social Web Meets the Semantic Web. Tom GruberWhere the Social Web Meets the Semantic Web. Tom Gruber
Where the Social Web Meets the Semantic Web. Tom GruberNelson Piedra
 
Beer and Branding for Graduate BioSciences (Oct 2016)
Beer and Branding for Graduate BioSciences (Oct 2016)Beer and Branding for Graduate BioSciences (Oct 2016)
Beer and Branding for Graduate BioSciences (Oct 2016)KR_Barker
 
Artificial Intelligence and the Coming Revolution of Family History - Present...
Artificial Intelligence and the Coming Revolution of Family History - Present...Artificial Intelligence and the Coming Revolution of Family History - Present...
Artificial Intelligence and the Coming Revolution of Family History - Present...bakers84
 
Sharing on the Net
Sharing on the NetSharing on the Net
Sharing on the NetYesha
 
Online Identity- Part 1
Online Identity- Part 1Online Identity- Part 1
Online Identity- Part 1KR_Barker
 
Interactive Data Visualization
Interactive Data VisualizationInteractive Data Visualization
Interactive Data VisualizationJournovationSU
 
The Reputation Economy: Managing Your Online Identity in the Age of Google- N...
The Reputation Economy: Managing Your Online Identity in the Age of Google- N...The Reputation Economy: Managing Your Online Identity in the Age of Google- N...
The Reputation Economy: Managing Your Online Identity in the Age of Google- N...KR_Barker
 
Your Online Identity: Discovering, Controlling, Managing (January 2016)
Your Online Identity: Discovering, Controlling, Managing (January 2016)Your Online Identity: Discovering, Controlling, Managing (January 2016)
Your Online Identity: Discovering, Controlling, Managing (January 2016)KR_Barker
 

Was ist angesagt? (19)

Teaching information: from Google Search to Big Data
Teaching information: from Google Search to Big DataTeaching information: from Google Search to Big Data
Teaching information: from Google Search to Big Data
 
Filtering for in and of on education
Filtering for in and of on educationFiltering for in and of on education
Filtering for in and of on education
 
Researchers, Discovery and the Internet: What Next?
Researchers, Discovery and the Internet: What Next?Researchers, Discovery and the Internet: What Next?
Researchers, Discovery and the Internet: What Next?
 
HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?
 
Google & garbage lsta 2012
Google & garbage lsta 2012Google & garbage lsta 2012
Google & garbage lsta 2012
 
Enterprise Scale Knowledge Graphs
Enterprise Scale Knowledge GraphsEnterprise Scale Knowledge Graphs
Enterprise Scale Knowledge Graphs
 
Introduction to Digital Life (March 2017)
Introduction to Digital Life (March 2017)Introduction to Digital Life (March 2017)
Introduction to Digital Life (March 2017)
 
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
 
googlization of information
googlization of informationgooglization of information
googlization of information
 
NCTI
NCTINCTI
NCTI
 
The Reputation Economy (March 2016)
The Reputation Economy (March 2016)The Reputation Economy (March 2016)
The Reputation Economy (March 2016)
 
Where the Social Web Meets the Semantic Web. Tom Gruber
Where the Social Web Meets the Semantic Web. Tom GruberWhere the Social Web Meets the Semantic Web. Tom Gruber
Where the Social Web Meets the Semantic Web. Tom Gruber
 
Beer and Branding for Graduate BioSciences (Oct 2016)
Beer and Branding for Graduate BioSciences (Oct 2016)Beer and Branding for Graduate BioSciences (Oct 2016)
Beer and Branding for Graduate BioSciences (Oct 2016)
 
Artificial Intelligence and the Coming Revolution of Family History - Present...
Artificial Intelligence and the Coming Revolution of Family History - Present...Artificial Intelligence and the Coming Revolution of Family History - Present...
Artificial Intelligence and the Coming Revolution of Family History - Present...
 
Sharing on the Net
Sharing on the NetSharing on the Net
Sharing on the Net
 
Online Identity- Part 1
Online Identity- Part 1Online Identity- Part 1
Online Identity- Part 1
 
Interactive Data Visualization
Interactive Data VisualizationInteractive Data Visualization
Interactive Data Visualization
 
The Reputation Economy: Managing Your Online Identity in the Age of Google- N...
The Reputation Economy: Managing Your Online Identity in the Age of Google- N...The Reputation Economy: Managing Your Online Identity in the Age of Google- N...
The Reputation Economy: Managing Your Online Identity in the Age of Google- N...
 
Your Online Identity: Discovering, Controlling, Managing (January 2016)
Your Online Identity: Discovering, Controlling, Managing (January 2016)Your Online Identity: Discovering, Controlling, Managing (January 2016)
Your Online Identity: Discovering, Controlling, Managing (January 2016)
 

Andere mochten auch

Lesson five
Lesson fiveLesson five
Lesson fivecoxx201
 
Lesson six
Lesson sixLesson six
Lesson sixcoxx201
 
Обзор принципов и технических решений многофазной расходометрии
Обзор принципов и технических решений многофазной расходометрииОбзор принципов и технических решений многофазной расходометрии
Обзор принципов и технических решений многофазной расходометрииktoropetsky
 

Andere mochten auch (8)

Problem b6
Problem b6Problem b6
Problem b6
 
Abyrvalg
AbyrvalgAbyrvalg
Abyrvalg
 
Lesson five
Lesson fiveLesson five
Lesson five
 
Lesson six
Lesson sixLesson six
Lesson six
 
Biosilver 22
Biosilver 22Biosilver 22
Biosilver 22
 
Homepure
HomepureHomepure
Homepure
 
Обзор принципов и технических решений многофазной расходометрии
Обзор принципов и технических решений многофазной расходометрииОбзор принципов и технических решений многофазной расходометрии
Обзор принципов и технических решений многофазной расходометрии
 
E guard كيونت
E guard كيونتE guard كيونت
E guard كيونت
 

Ähnlich wie Big data in the web

Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data ScienceThinkful
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Thinkful
 
Privacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebPrivacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebMatthew Russell
 
Data visualisations as a gateway to programming
Data visualisations as a gateway to programmingData visualisations as a gateway to programming
Data visualisations as a gateway to programmingMia
 
Kdd12 tutorial-inf-part-ii
Kdd12 tutorial-inf-part-iiKdd12 tutorial-inf-part-ii
Kdd12 tutorial-inf-part-iiLaks Lakshmanan
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Thinkful
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data ScienceTJ Stalcup
 
The Science Of Social Networks
The Science Of Social NetworksThe Science Of Social Networks
The Science Of Social NetworksEhren Foss
 
The technical case for a semantic web
The technical case for a semantic webThe technical case for a semantic web
The technical case for a semantic webTony Dobaj
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science TJ Stalcup
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorialeswcsummerschool
 
A Primer on Text Mining for Business
A Primer on Text Mining for BusinessA Primer on Text Mining for Business
A Primer on Text Mining for BusinessClement Levallois
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)James Hendler
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementationSandip Tipayle Patil
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4heyramzz
 
Big Data and the Social Sciences
Big Data and the Social SciencesBig Data and the Social Sciences
Big Data and the Social SciencesAbe Usher
 
Data Science Presentation.pdf
Data Science Presentation.pdfData Science Presentation.pdf
Data Science Presentation.pdfKayKay751113
 
Big data analytics by braj.pdf
Big data analytics by braj.pdfBig data analytics by braj.pdf
Big data analytics by braj.pdfBrajKishor45
 

Ähnlich wie Big data in the web (20)

Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
 
Privacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebPrivacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social Web
 
Data visualisations as a gateway to programming
Data visualisations as a gateway to programmingData visualisations as a gateway to programming
Data visualisations as a gateway to programming
 
Kdd12 tutorial-inf-part-ii
Kdd12 tutorial-inf-part-iiKdd12 tutorial-inf-part-ii
Kdd12 tutorial-inf-part-ii
 
Tf gsds
Tf gsdsTf gsds
Tf gsds
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data Science
 
The Science Of Social Networks
The Science Of Social NetworksThe Science Of Social Networks
The Science Of Social Networks
 
The technical case for a semantic web
The technical case for a semantic webThe technical case for a semantic web
The technical case for a semantic web
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
A Primer on Text Mining for Business
A Primer on Text Mining for BusinessA Primer on Text Mining for Business
A Primer on Text Mining for Business
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementation
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big Data and the Social Sciences
Big Data and the Social SciencesBig Data and the Social Sciences
Big Data and the Social Sciences
 
Data Science
Data Science Data Science
Data Science
 
Data Science Presentation.pdf
Data Science Presentation.pdfData Science Presentation.pdf
Data Science Presentation.pdf
 
Big data analytics by braj.pdf
Big data analytics by braj.pdfBig data analytics by braj.pdf
Big data analytics by braj.pdf
 

Kürzlich hochgeladen

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Big data in the web

  • 1. 6/28/13 1 Big Data in The Web Ricardo Baeza-Yates Yahoo! Labs Barcelona & Santiago de Chile - 3 - Agenda • Big Data • Asking the Right Questions • Wisdom of Crowds in the Web • The Long Tail • Issues and Examples • Concluding Remarks
  • 2. 6/28/13 2 - 4 - 4 Big Data §  Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time §  Large volume and growth §  Petabytes to exabytes §  Growth is estimated in 3 exabytes per day §  Structured vs. non-structured data §  Diversity §  Types, formats, complexity, topics, etc. §  Best Public Data Example: The Web §  Content: text, multimedia §  Structure: graphs §  Usage: real time streams - 5 - 5 Big Data §  Focus on analytics §  Many storage technologies: §  DBs, DWs, distributed file systems, … §  Many processing technologies: §  Cloud computing, map-reduce (Hadoop), … §  Data mining, clustering, classification, … §  Machine learning, A/B testing, NLP, … §  Simulation §  Several technology providers §  Initial best practices (see TDWI report, 2011) §  Main challenges: scalability, online
  • 3. 6/28/13 3 - 6 - 6 Big Data: The Five V’s Characteristic Data Issue Computing Issue Volume Scale, Redundancy Scalability Variety Heterogeneity, Complexity Adaptability, Extensibility Veracity Completeness, Bias, Sparsity, Noise, Spam Reliability, Trust Velocity Real time Online Value Usefulness, Privacy Business dependent - 7 - 7 Asking the Right Questions §  Problem Driven §  What data we need? How much? §  How we collect it? How we store and transfer it? §  Understanding the Data §  How sparse is the data? How much noise? §  There is redundancy? There are biases? §  There is spam? Any outliers? §  Analyzing the Data §  Any privacy issues? Do we need to anonymize? §  How well our algorithms scale? §  Can we visualize the results?
  • 4. 6/28/13 4 - 8 - 8 Too Much Data Available §  The Web is a database! §  Data does not imply information §  Many analyses for the sake of it (data driven) §  Analyzing data is not CS per se §  Publish in the right forum! §  Big Data or Right Data? - 9 - 9 The Different Facets of the Web
  • 5. 6/28/13 5 - 11 - 11 The Structure of the Web - 12 - Big Data in the Web Metadata RDF Wikipedia ODP Flickr Text Anchors + links Y! Answers Logs (Clicks+Queries) Explicit Implicit Wordnet UGC Private Scale Blogs, Groups Quality?
  • 6. 6/28/13 6 - 13 - Quantity Quality User- generated Traditional publishing What is in the Web? How Good it is? - 14 - 14 What else is in the Web?
  • 7. 6/28/13 7 - 15 - 15 Noise and Spam §  Noise may come from many places: §  Instruments that measure §  How we interpret the data (example later) §  Spam is everywhere - 16 - 16 Web Spam Deceiving text, links, clicks… due to an economic incentive Depending on the goal and the data, spam is easier to generate Depending on the type & target data, spam is easier to fight Disincentives for spammers? •  Social •  Economical Web Spam is NOT Mail Spam
  • 8. 6/28/13 8 - 17 - 17 - 18 - Content and Metadata Trends [Ramakrishnan and Tomkins 2007]
  • 9. 6/28/13 9 - 19 - Web Data Trends •  User Generated Content – Massive (quality vs. quantity) – Social Networks – Real time (people + physical sensors) •  Impact – Fragmentation of ownership – Fragmentation of access (longer heavy tail) – Fragmentation of right to access •  Viability – Business model based in advertising - 20 - The Wisdom of Crowds •  James Surowiecki, a New Yorker columnist, published this book in 2004 – “Under the right circumstances, groups are remarkably intelligent” •  Importance of diversity, independence and decentralization “large groups of people are smarter than an elite few, no matter how brilliant—they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”. Aggregating data
  • 10. 6/28/13 10 - 21 - 21 Web Data Mining •  Content: text & multimedia mining •  Structure: link analysis, graph mining •  Usage: log analysis, query mining •  Relate all of the above – Web characterization – Particular applications - 22 - Flickr: Clustering Pictures 22
  • 11. 6/28/13 11 - 23 - Popularity - 24 - Flickr: Geo-tagged pictures 24 24
  • 12. 6/28/13 12 - 27 - “Crowd Sourcing” Web-based “peer production” has produced a number of successful products and communities •  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ... Can this form of production be harnessed for other ends? •  Existing successes are hard to replicate at will Amazon Mechanical Turk (AMT) •  Like outsourcing, but in a micro-distributed fashion •  Thousands of “turkers” working on hundreds of “HITS” (tasks) •  Rates are typically few cents per task •  Quality of their work is positively evaluated (e.g. in IR) - 28 - The Wisdom of (Large) Crowds – Crucial for Search Ranking – Text: Web Writers & Editors • not only for the Web! – Links: Web Publishers – Tags: Web Taggers – Queries: All Web Users! • Queries and actions (or no action!)‫‏‬ The crowd implicitly knows the experts!
  • 13. 6/28/13 13 - 30 - 30 Scalability §  How to scale? §  Doubling the data in the best case will double the time §  Time complexity vs. result quality trade-off §  Example: entity detection in linear time at almost state of the art quality §  That implies that there exists a text size n* for which the linear algorithm will produce more correct entities §  Distributed parallel processing §  Map-reduce not always works §  Parallelism is problem dependent §  Online processing needs a different approach - 31 - 31 Redundancy and Bias §  There is any dependency in the data? §  There is any duplication? §  Lexical duplication in the Web is around 25% §  Semantic duplication is larger §  Are there any biases? §  Example 1: clicks in search engines §  Bias to the ranking and the interface §  There is a ranking bias in the Web content §  Example 2: tag recommendation
  • 14. 6/28/13 14 - 32 - We can suggest tags: nice but .... - 33 - Privacy Example: AOL Query Logs Release Incident No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”. Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.” Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. A Face Is Exposed for AOL Searcher No. 4417749, By MICHAEL BARBARO and TOM ZELLER Jr, The New York Times, Aug 9 2006 33
  • 15. 6/28/13 15 - 34 - Risks of Privacy (ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public DB (Sweeney, 2001) K-anonymity Suppress or generalize attributes until each entry is identical to at least k-1 other entries Federal Trade Commission in US: Privacy policies should “address the collection of data itself and not just how the data is used”, Dec 2010. Data Protection Directive in EU 34 - 35 - Risks of Privacy: Query Logs Profile: [Jones, Kumar, Pang, Tompkins, CIKM 2007] •  Gender: 84% •  Age (±10): 79% •  Location (ZIP3): 35% Vanity Queries: [Jones et al, CIKM 2008] •  Partial name: 8.9% •  Complete: 1.2% More information: •  A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008] A good anonymization is still an open problem
  • 16. 6/28/13 16 - 36 - 36 Sparsity §  The Long Tail is always Sparse §  Why there is a long tail? §  When the crowd dominates §  Empowering the tail §  Example: Relations from Query Logs - 38 - The Wisdom of Crowds – Popularity – Diversity – Quality – Coverage Long tail Heavy tail
  • 17. 6/28/13 17 - 39 - The Long Tail Most measures in the Web follow a power law - 42 - People Interests 42 Heavy tail of user interests Many queries, each asked very few times, make up a large fraction of all queries Movies watched, blogs read, words used, … Normal people Weirdos One explanation
  • 18. 6/28/13 18 - 43 - Many queries, each asked very few times, make up a large fraction of all queries Applies to word usage, web page access, … We are all partially eclectic People Interests Broder, Gabrilovich, Goel, Pang; WSDM 2009 The reality Heavy tail of user interests - 44 - Example: Click Distribution User interaction is a power law! (Zipf’s principle of minimal effort)
  • 19. 6/28/13 19 - 45 - When the crowd dominates Kills the long tail See (obsolete now) “shwarzneger” example 45 - 46 - Empowering the Tail The Filter “Bubble”, Eli Pariser •  Avoid the Poor get Poorer Syndrome Solutions: •  Diversity •  Novelty •  Serendipity 46 Explore & Exploit
  • 20. 6/28/13 20 - 47 - How to Circumvent Sparsity? Wisdom of “ad-hoc” crowds? Aggregate data in the “right way” When data is sparse Aggregate users around same intent, task, facet, …. Change granularity “ad hoc” •  Middle age men •  Fans of Messi 47 - 48 - 48 Example: Mining Geo/time Data •  Optimal Touristic Paths from Flickr •  Good for tourists and locals De Choudhury et al, HT 2010
  • 21. 6/28/13 21 - 49 - •  The long tail is important not only for e- commerce, but because we are all there •  Personalization vs. Contextualization User interaction is another long tail People Interests Aggregating in the Long Tail - 69 - 69 Epilogue l The Web is scientifically young l The Web is intellectually diverse l The technology mirrors the economic, legal and sociological reality l  Data must be interesting! (Gerhard Weikum) l  Problem driven l  Plenty of challenges
  • 22. 6/28/13 22 - 70 - 70 Mirror of Society - 71 - 71 Exports/Imports vs. Domain Links Baeza-Yates & Castillo, WWW2006
  • 23. 6/28/13 23 Contact: rbaeza@acm.org Thanks to many people at Yahoo! Labs ASIST 2012 Book of the Year Award Questions?