6/28/13
1
Big Data in the Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
- 3 -
Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks
- 4 -
4
Big Data
§  Capture, transfer, store, search, share, analyze,
and visualize large data in reasonable time
§  Large volume and growth
§  Petabytes to exabytes
§  Growth is estimated at 3 exabytes per day
§  Structured vs. unstructured data
§  Diversity
§  Types, formats, complexity, topics, etc.
§  Best Public Data Example: The Web
§  Content: text, multimedia
§  Structure: graphs
§  Usage: real time streams
- 5 -
5
Big Data
§  Focus on analytics
§  Many storage technologies:
§  DBs, DWs, distributed file systems, …
§  Many processing technologies:
§  Cloud computing, map-reduce (Hadoop), …
§  Data mining, clustering, classification, …
§  Machine learning, A/B testing, NLP, …
§  Simulation
§  Several technology providers
§  Initial best practices (see TDWI report, 2011)
§  Main challenges: scalability, online processing
- 6 -
6
Big Data: The Five V’s
Characteristic   Data Issue                                   Computing Issue
Volume           Scale, Redundancy                            Scalability
Variety          Heterogeneity, Complexity                    Adaptability, Extensibility
Veracity         Completeness, Bias, Sparsity, Noise, Spam    Reliability, Trust
Velocity         Real time                                    Online
Value            Usefulness, Privacy                          Business dependent
- 7 -
7
Asking the Right Questions
§  Problem Driven
§  What data do we need? How much?
§  How do we collect it? How do we store and transfer it?
§  Understanding the Data
§  How sparse is the data? How much noise?
§  Is there redundancy? Are there biases?
§  Is there spam? Any outliers?
§  Analyzing the Data
§  Any privacy issues? Do we need to anonymize?
§  How well do our algorithms scale?
§  Can we visualize the results?
- 8 -
8
Too Much Data Available
§  The Web is a database!
§  Data does not imply information
§  Many analyses for the sake of it (data driven)
§  Analyzing data is not CS per se
§  Publish in the right forum!
§  Big Data or Right Data?
- 9 -
9
The Different Facets of the Web
- 11 -
11
The Structure of the Web
- 12 -
Big Data in the Web
[Figure: Web data sources. Items: Metadata, RDF, Wikipedia, ODP, Flickr, Text, Anchors + links, Y! Answers, Logs (Clicks + Queries), WordNet, UGC, Private, Blogs, Groups. Labels: Explicit / Implicit, Scale, Quality?]
- 13 -
What is in the Web? How Good Is It?
[Figure: quantity vs. quality, contrasting user-generated content with traditional publishing]
- 14 -
14
What else is in the Web?
- 15 -
15
Noise and Spam
§  Noise may come from many places:
§  Instruments that measure
§  How we interpret the data (example later)
§  Spam is everywhere
- 16 -
16
Web Spam
Deceiving text, links, clicks… due to an economic incentive
Depending on the goal and the data, spam is easier to generate
Depending on the type & target data, spam is easier to fight
Disincentives for spammers?
•  Social
•  Economic
Web Spam is NOT Mail Spam
- 17 -
17
- 18 -
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
- 19 -
Web Data Trends
•  User Generated Content
– Massive (quality vs. quantity)
– Social Networks
– Real time (people + physical sensors)
•  Impact
– Fragmentation of ownership
– Fragmentation of access (longer heavy tail)
– Fragmentation of right to access
•  Viability
– Business model based on advertising
- 20 -
The Wisdom of Crowds
•  James Surowiecki, a New Yorker columnist,
published this book in 2004
– “Under the right circumstances, groups are
remarkably intelligent”
•  Importance of diversity, independence and
decentralization
“large groups of people are smarter than an elite few,
no matter how brilliant—they are better at solving
problems, fostering innovation, coming to wise
decisions, even predicting the future”.
Aggregating data
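A minimal sketch of why aggregation helps, under the assumption that guesses are independent and unbiased (the conditions of diversity and independence named above). The function name, the Gaussian noise model, and all numbers are illustrative assumptions, not taken from the talk.

```python
import random
import statistics


def crowd_estimate(true_value, n_people, noise_sd, seed=0):
    """Simulate independent, unbiased guesses and aggregate them by averaging."""
    rng = random.Random(seed)
    guesses = [rng.gauss(true_value, noise_sd) for _ in range(n_people)]
    crowd = statistics.mean(guesses)
    # Typical individual error: mean absolute distance of a guess from the truth.
    individual_error = statistics.mean(abs(g - true_value) for g in guesses)
    crowd_error = abs(crowd - true_value)
    return crowd, crowd_error, individual_error


crowd, crowd_error, individual_error = crowd_estimate(100.0, 1000, 20.0)
print(f"crowd error {crowd_error:.2f} vs. typical individual error {individual_error:.2f}")
```

If the guesses shared a common bias, averaging would not remove it, which is one reason independence matters.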
- 21 -
21
Web Data Mining
•  Content: text & multimedia mining
•  Structure: link analysis, graph mining
•  Usage: log analysis, query mining
•  Relate all of the above
– Web characterization
– Particular applications
- 22 -
Flickr: Clustering Pictures
- 23 -
Popularity
- 24 -
Flickr: Geo-tagged pictures
- 27 -
“Crowd Sourcing”
Web-based “peer production” has produced a number of
successful products and communities
•  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
•  Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
•  Like outsourcing, but in a micro-distributed fashion
•  Thousands of “turkers” working on hundreds of “HITs” (tasks)
•  Rates are typically a few cents per task
•  The quality of their work has been evaluated positively (e.g., in IR)
- 28 -
The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
• not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
• Queries and actions (or no action!)
The crowd implicitly
knows the experts!
- 30 -
30
Scalability
§  How to scale?
§  In the best case, doubling the data will double the time
§  Time complexity vs. result quality trade-off
§  Example: entity detection in linear time at near state-of-the-art quality
§  That implies that there exists a text size n* beyond which the linear algorithm will produce more correct entities in the same amount of time
§  Distributed parallel processing
§  Map-reduce does not always work
§  Parallelism is problem dependent
§  Online processing needs a different approach
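The time vs. quality trade-off above can be sketched with a toy cost model. Everything here is an assumption for illustration: the slower, more accurate algorithm is taken to cost c_slow * n * log2(n), the linear one c_fast * n, both see the same entity density, and enough extra text is available for the faster algorithm to process in the time it saves.

```python
import math


def correct_entities_slow(n, q_slow):
    """Correct entities (per unit of entity density) when the slower algorithm reads n tokens."""
    return q_slow * n


def correct_entities_fast(n, q_fast, c_fast, c_slow):
    """Correct entities the linear algorithm finds in the time the slower one spends on n tokens."""
    time_budget = c_slow * n * math.log2(n)   # assumed cost of the slower algorithm
    tokens_processed = time_budget / c_fast   # linear algorithm: cost c_fast per token
    return q_fast * tokens_processed


def crossover_size(q_fast, q_slow, c_fast, c_slow):
    """Text size n* above which the linear algorithm wins under this toy model.

    Solves q_fast * c_slow * n * log2(n) / c_fast > q_slow * n for n.
    """
    return 2 ** ((q_slow * c_fast) / (q_fast * c_slow))


# Hypothetical numbers: the linear algorithm is slightly less accurate (0.90 vs. 0.95)
# and has a larger constant factor (4 vs. 1).
n_star = crossover_size(q_fast=0.90, q_slow=0.95, c_fast=4.0, c_slow=1.0)
print(f"linear algorithm wins for n > {n_star:.1f} tokens")
```

With a different cost model for the slower algorithm (for example quadratic), the crossover formula changes but the argument is the same.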
- 31 -
31
Redundancy and Bias
§  Is there any dependency in the data?
§  Is there any duplication?
§  Lexical duplication in the Web is around 25%
§  Semantic duplication is larger
§  Are there any biases?
§  Example 1: clicks in search engines
§  Clicks are biased by the ranking and the interface
§  There is a ranking bias in the Web content
§  Example 2: tag recommendation
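One common way to measure lexical duplication is to compare documents by their word shingles. The sketch below uses Jaccard similarity over 3-word shingles; this is a standard technique chosen for illustration, not necessarily the method behind the 25% figure, and the threshold is an arbitrary assumption.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc1, doc2, k=3, threshold=0.5):
    """Flag two documents as lexical near-duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy cat"
print(jaccard(shingles(a), shingles(b)))  # 6 shared shingles out of 8 distinct
```

Semantic duplication (the same content in different words) is not caught by this kind of lexical comparison.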
- 32 -
We can suggest tags: nice but ....
- 33 -
Privacy Example:
AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a
three-month period on topics ranging from “numb
fingers” to “60 single men”.
Other queries: “landscapers in Lilburn, Ga,” several
people with the last name Arnold and “homes sold
in shadow lake subdivision gwinnett county
georgia.”
Data trail led to Thelma Arnold, a 62-year-old widow
who lives in Lilburn, Ga., frequently researches her
friends’ medical ailments and loves her three dogs.
A Face Is Exposed for AOL Searcher No. 4417749,
By MICHAEL BARBARO and TOM ZELLER Jr,
The New York Times, Aug 9 2006
- 34 -
Risks of Privacy
(ZIP code, date of birth, gender) is enough to uniquely identify 87% of the US population using public databases (Sweeney, 2001)
K-anonymity
Suppress or generalize attributes until each entry is identical to at least k-1 other entries
Federal Trade Commission in
US: Privacy policies should
“address the collection of data
itself and not just how the
data is used”, Dec 2010.
Data Protection Directive in EU
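The k-anonymity definition on this slide can be checked mechanically. A minimal sketch, assuming records are dictionaries and that the quasi-identifiers are ZIP code, birth year and gender; the field names, the sample rows and the `generalize_zip` helper are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def generalize_zip(row, digits=3):
    """Generalize a ZIP code by keeping its first `digits` characters and masking the rest."""
    out = dict(row)
    zip_code = row["zip"]
    out["zip"] = zip_code[:digits] + "*" * (len(zip_code) - digits)
    return out


rows = [
    {"zip": "12345", "birth_year": 1950, "gender": "F"},
    {"zip": "12367", "birth_year": 1950, "gender": "F"},
]
qi = ["zip", "birth_year", "gender"]
print(is_k_anonymous(rows, qi, 2))                               # each row is unique
print(is_k_anonymous([generalize_zip(r) for r in rows], qi, 2))  # rows now share "123**"
```

Generalizing until the check passes trades detail for privacy, and passing the check does not by itself rule out re-identification through other attributes.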
- 35 -
Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
•  Gender: 84%
•  Age (±10): 79%
•  Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
•  Partial name: 8.9%
•  Complete: 1.2%
More information:
•  A Survey of query log privacy-enhancing techniques
from a policy perspective [Cooper, ACM TWEB 2008]
Good anonymization is still an open problem
- 36 -
36
Sparsity
§  The Long Tail is always Sparse
§  Why is there a long tail?
§  When the crowd dominates
§  Empowering the tail
§  Example: Relations from Query Logs
- 38 -
The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
Long tail
Heavy tail
- 39 -
The Long Tail
Most measures in the Web follow a power law
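A small sketch of what a power law implies for the tail, assuming an idealized Zipf distribution in which the item at rank r has frequency proportional to 1 / r^s. The exponent, the catalog size and the head size are illustrative assumptions, not measurements from the talk.

```python
def zipf_frequencies(n_items, exponent=1.0):
    """Normalized frequencies for ranks 1..n_items under an idealized Zipf law."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def tail_share(frequencies, head_size):
    """Fraction of total volume contributed by items ranked below the top head_size."""
    return sum(frequencies[head_size:])


freqs = zipf_frequencies(100_000, exponent=1.0)
print(f"share of volume outside the top 100 items: {tail_share(freqs, 100):.2f}")
```

Under these assumptions the many rare items together carry more volume than the 100 most popular ones, which is the sense in which the tail matters.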
- 42 -
Heavy tail of user interests: one explanation
Many queries, each asked very few times, make up a large fraction of all queries
Movies watched, blogs read, words used, …
[Figure: people vs. interests, with "normal people" separated from "weirdos"]
- 43 -
Heavy tail of user interests: the reality
Many queries, each asked very few times, make up a large fraction of all queries
Applies to word usage, web page access, …
We are all partially eclectic
[Figure: people vs. interests]
Broder, Gabrilovich, Goel, Pang; WSDM 2009
- 44 -
Example: Click Distribution
User interaction is a power law! (Zipf’s principle of least effort)
- 45 -
When the crowd dominates
Kills the long tail
See the (now obsolete) “shwarzneger” example
- 46 -
Empowering the Tail
The Filter “Bubble”, Eli Pariser
•  Avoid the “poor get poorer” syndrome
Solutions:
•  Diversity
•  Novelty
•  Serendipity
Explore & Exploit
- 47 -
How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sparse
Aggregate users around the same intent, task, facet, …
Change granularity “ad hoc”
•  Middle-aged men
•  Fans of Messi
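Changing granularity "ad hoc" can be sketched as a back-off: use a user's own data when there is enough of it, otherwise fall back to an aggregate over users who share the same segment. The record layout, the segment names and the `min_shown` threshold are hypothetical and only illustrate the idea.

```python
def click_rate(records):
    """Clicks divided by impressions over a list of records; None if nothing was shown."""
    shown = sum(r["shown"] for r in records)
    clicked = sum(r["clicked"] for r in records)
    return clicked / shown if shown else None


def backoff_click_rate(records, user, segment, min_shown=100):
    """Estimate a click rate at user level if data suffices, else at segment level."""
    own = [r for r in records if r["user"] == user]
    if sum(r["shown"] for r in own) >= min_shown:
        return "user", click_rate(own)
    group = [r for r in records if r["segment"] == segment]
    return "segment", click_rate(group)


records = [
    {"user": "u1", "segment": "fans_of_messi", "shown": 5, "clicked": 1},
    {"user": "u2", "segment": "fans_of_messi", "shown": 200, "clicked": 30},
    {"user": "u3", "segment": "fans_of_messi", "shown": 95, "clicked": 9},
    {"user": "u4", "segment": "other", "shown": 50, "clicked": 0},
]
print(backoff_click_rate(records, "u1", "fans_of_messi"))  # too sparse: uses the segment
print(backoff_click_rate(records, "u2", "fans_of_messi"))  # enough data: uses the user
```

The choice of segment is where the "right way" to aggregate comes in; a poorly chosen group adds noise rather than signal.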
- 48 -
48
Example: Mining Geo/time Data
•  Optimal Touristic Paths from Flickr
•  Good for tourists and locals
De Choudhury et al, HT 2010
- 49 -
Aggregating in the Long Tail
•  The long tail is important not only for e-commerce, but because we are all there
•  Personalization vs. Contextualization
User interaction is another long tail
[Figure: people vs. interests]
- 69 -
69
Epilogue
l The Web is scientifically young
l The Web is intellectually diverse
l The technology mirrors the economic, legal and
sociological reality
l  Data must be interesting! (Gerhard Weikum)
l  Problem driven
l  Plenty of challenges
- 70 -
70
Mirror of Society
- 71 -
71
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006
Contact: rbaeza@acm.org
Thanks to many people at Yahoo! Labs
ASIST 2012 Book of the Year Award
Questions?

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Keynote baezayates

  • 1. Big Data in The Web
       Ricardo Baeza-Yates, Yahoo! Labs, Barcelona & Santiago de Chile (6/28/13)
       Agenda
       • Big Data
       • Asking the Right Questions
       • Wisdom of Crowds in the Web
       • The Long Tail
       • Issues and Examples
       • Concluding Remarks
  • 2. Big Data
       §  Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time
       §  Large volume and growth
           §  Petabytes to exabytes
           §  Growth is estimated at 3 exabytes per day
       §  Structured vs. non-structured data
       §  Diversity
           §  Types, formats, complexity, topics, etc.
       §  Best Public Data Example: The Web
           §  Content: text, multimedia
           §  Structure: graphs
           §  Usage: real-time streams
       Big Data
       §  Focus on analytics
       §  Many storage technologies:
           §  DBs, DWs, distributed file systems, …
       §  Many processing technologies:
           §  Cloud computing, map-reduce (Hadoop), …
           §  Data mining, clustering, classification, …
           §  Machine learning, A/B testing, NLP, …
           §  Simulation
       §  Several technology providers
       §  Initial best practices (see TDWI report, 2011)
       §  Main challenges: scalability and online processing
  • 3. Big Data: The Five V’s
       Characteristic | Data Issue | Computing Issue
       Volume | Scale, Redundancy | Scalability
       Variety | Heterogeneity, Complexity | Adaptability, Extensibility
       Veracity | Completeness, Bias, Sparsity, Noise, Spam | Reliability, Trust
       Velocity | Real time | Online
       Value | Usefulness, Privacy | Business dependent
       Asking the Right Questions
       §  Problem Driven
           §  What data do we need? How much?
           §  How do we collect it? How do we store and transfer it?
       §  Understanding the Data
           §  How sparse is the data? How much noise?
           §  Is there redundancy? Are there biases?
           §  Is there spam? Any outliers?
       §  Analyzing the Data
           §  Any privacy issues? Do we need to anonymize?
           §  How well do our algorithms scale?
           §  Can we visualize the results?
  • 4. Too Much Data Available
       §  The Web is a database!
       §  Data does not imply information
       §  Many analyses for the sake of it (data driven)
       §  Analyzing data is not CS per se
       §  Publish in the right forum!
       §  Big Data or Right Data?
       The Different Facets of the Web
       [Figure]
  • 5. The Structure of the Web
       [Figure]
       Big Data in the Web
       [Diagram labels: Metadata, RDF, Wikipedia, ODP, Flickr, Text, Anchors + links, Y! Answers, Logs (Clicks + Queries), Explicit, Implicit, Wordnet, UGC, Private, Scale, Blogs, Groups, Quality?]
  • 6. What Is in the Web? How Good Is It?
       [Chart labels: Quantity, Quality, User-generated, Traditional publishing]
       What Else Is in the Web?
       [Figure]
  • 7. Noise and Spam
       §  Noise may come from many places:
           §  Instruments that measure
           §  How we interpret the data (example later)
       §  Spam is everywhere
       Web Spam
       Deceiving text, links, clicks… due to an economic incentive
       Depending on the goal and the data, spam is easier to generate
       Depending on the type & target data, spam is easier to fight
       Disincentives for spammers?
       •  Social
       •  Economic
       Web Spam is NOT Mail Spam
  • 8. Content and Metadata Trends
       [Figure, from Ramakrishnan and Tomkins 2007]
  • 9. Web Data Trends
       •  User Generated Content
           – Massive (quality vs. quantity)
           – Social Networks
           – Real time (people + physical sensors)
       •  Impact
           – Fragmentation of ownership
           – Fragmentation of access (longer heavy tail)
           – Fragmentation of right to access
       •  Viability
           – Business model based on advertising
       The Wisdom of Crowds
       •  James Surowiecki, a New Yorker columnist, published this book in 2004
           – “Under the right circumstances, groups are remarkably intelligent”
       •  Importance of diversity, independence and decentralization
       “large groups of people are smarter than an elite few, no matter how brilliant—they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”.
       Aggregating data
  • 10. Web Data Mining
       •  Content: text & multimedia mining
       •  Structure: link analysis, graph mining
       •  Usage: log analysis, query mining
       •  Relate all of the above
           – Web characterization
           – Particular applications
       Flickr: Clustering Pictures
       [Figure]
  • 11. Popularity
       [Figure]
       Flickr: Geo-tagged Pictures
       [Figure]
  • 12. “Crowd Sourcing”
       Web-based “peer production” has produced a number of successful products and communities
       •  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
       Can this form of production be harnessed for other ends?
       •  Existing successes are hard to replicate at will
       Amazon Mechanical Turk (AMT)
       •  Like outsourcing, but in a micro-distributed fashion
       •  Thousands of “turkers” working on hundreds of “HITs” (tasks)
       •  Rates are typically a few cents per task
       •  The quality of their work has been evaluated positively (e.g., in IR)
       The Wisdom of (Large) Crowds
       – Crucial for Search Ranking
       – Text: Web Writers & Editors
           • not only for the Web!
       – Links: Web Publishers
       – Tags: Web Taggers
       – Queries: All Web Users!
           • Queries and actions (or no action!)
       The crowd implicitly knows the experts!
  • 13. Scalability
       §  How to scale?
           §  In the best case, doubling the data doubles the time
       §  Time complexity vs. result quality trade-off
           §  Example: entity detection in linear time at almost state-of-the-art quality
           §  That implies that there exists a text size n* beyond which the linear algorithm will produce more correct entities
       §  Distributed parallel processing
           §  Map-reduce does not always work
           §  Parallelism is problem dependent
       §  Online processing needs a different approach
       Redundancy and Bias
       §  Is there any dependency in the data?
       §  Is there any duplication?
           §  Lexical duplication in the Web is around 25%
           §  Semantic duplication is larger
       §  Are there any biases?
           §  Example 1: clicks in search engines
               §  Bias to the ranking and the interface
               §  There is a ranking bias in the Web content
           §  Example 2: tag recommendation
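The n* claim on the Scalability slide can be read as a fixed-time-budget argument. The sketch below is one possible reading, not the speaker's derivation: it assumes the linear algorithm costs `c_lin * n`, the state-of-the-art one costs `c_slow * n ** exponent` with `exponent > 1`, and that the number of correct entities is proportional to quality times text processed. All names and constants are hypothetical.

```python
def linear_text_in_same_time(n, c_lin, c_slow, exponent):
    """Text size the linear algorithm can process in the time the slower one needs for n."""
    return c_slow * n ** exponent / c_lin


def crossover_size(c_lin, c_slow, exponent, q_lin, q_slow):
    """Text size n* above which the linear algorithm finds more correct entities.

    Assumed model (illustrative only): the slower algorithm handles n units of
    text with quality q_slow; in the same time the linear one handles
    c_slow * n ** exponent / c_lin units with quality q_lin. Setting
    q_lin * c_slow * n ** exponent / c_lin = q_slow * n and solving for n
    gives the expression returned here.
    """
    if exponent <= 1:
        raise ValueError("the slower algorithm must be superlinear")
    return (q_slow * c_lin / (q_lin * c_slow)) ** (1.0 / (exponent - 1.0))
```

Under these assumptions, any quality gap is eventually outweighed by the extra text the linear algorithm can cover, which is why a crossover size exists at all.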
  • 14. We can suggest tags: nice but ....
       [Figure]
       Privacy Example: AOL Query Logs Release Incident
       No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”.
       Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”
       Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.
       Source: “A Face Is Exposed for AOL Searcher No. 4417749”, by Michael Barbaro and Tom Zeller Jr., The New York Times, Aug 9 2006
  • 15. Risks of Privacy
       (ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public DBs (Sweeney, 2001)
       K-anonymity: suppress or generalize attributes until each entry is identical to at least k-1 other entries
       Federal Trade Commission in US: privacy policies should “address the collection of data itself and not just how the data is used”, Dec 2010
       Data Protection Directive in EU
       Risks of Privacy: Query Logs
       Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
       •  Gender: 84%
       •  Age (±10): 79%
       •  Location (ZIP3): 35%
       Vanity Queries: [Jones et al, CIKM 2008]
       •  Partial name: 8.9%
       •  Complete: 1.2%
       More information:
       •  A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]
       Good anonymization is still an open problem
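The k-anonymity definition above can be made concrete with a small check. This is a minimal sketch, assuming records are plain dictionaries and that ZIP code, birth date, and gender are the quasi-identifiers; the records, field names, and the particular generalization (ZIP prefix, birth year) are made up for illustration and are not taken from the talk.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize(record):
    """Hypothetical generalization: keep a 3-digit ZIP prefix and only the birth year."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth"] = record["birth"][:4]
    return out


# Made-up records, for illustration only.
RECORDS = [
    {"zip": "30047", "birth": "1944-03-12", "gender": "F", "query": "dog food"},
    {"zip": "30048", "birth": "1944-11-02", "gender": "F", "query": "gardening"},
    {"zip": "30301", "birth": "1980-01-20", "gender": "M", "query": "used cars"},
    {"zip": "30309", "birth": "1980-07-04", "gender": "M", "query": "flights"},
]
QI = ("zip", "birth", "gender")
GENERALIZED = [generalize(r) for r in RECORDS]
```

Here every raw record is unique on the quasi-identifiers, while the generalized records fall into two groups of two. Note that the check ignores the `query` field itself, which is exactly where query logs leak identity, so passing it is not a privacy guarantee.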
  • 16. Sparsity
       §  The Long Tail is always Sparse
       §  Why is there a long tail?
       §  When the crowd dominates
       §  Empowering the tail
       §  Example: Relations from Query Logs
       The Wisdom of Crowds
       – Popularity
       – Diversity
       – Quality
       – Coverage
       [Figure labels: Long tail, Heavy tail]
  • 17. The Long Tail
       Most measures in the Web follow a power law
       People Interests (one explanation)
       Heavy tail of user interests
       Many queries, each asked very few times, make up a large fraction of all queries
       Movies watched, blogs read, words used, …
       [Figure labels: Normal people, Weirdos]
  • 18. People Interests (the reality)
       Heavy tail of user interests
       Many queries, each asked very few times, make up a large fraction of all queries
       Applies to word usage, web page access, …
       We are all partially eclectic
       Broder, Gabrilovich, Goel, Pang; WSDM 2009
       Example: Click Distribution
       User interaction is a power law! (Zipf’s principle of minimal effort)
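The heavy-tail statement above (many rare queries adding up to a large share of traffic) can be illustrated on synthetic data. This is a rough sketch: the Zipf-like generator, the query names, and the threshold are assumptions for illustration, not measurements from any real query log.

```python
from collections import Counter


def tail_share(query_log, max_count):
    """Fraction of all query instances that come from queries seen at most max_count times."""
    counts = Counter(query_log)
    total = sum(counts.values())
    tail = sum(c for c in counts.values() if c <= max_count)
    return tail / total if total else 0.0


def zipf_log(num_queries, top_count):
    """Synthetic log: the query of rank r appears about top_count / r times (at least once)."""
    log = []
    for rank in range(1, num_queries + 1):
        log.extend([f"q{rank}"] * max(1, top_count // rank))
    return log
```

For example, `tail_share(zipf_log(1000, 100), 1)` gives the share of this synthetic traffic made up of queries asked exactly once. In this toy setting the singletons account for more than half of all query instances, even though the head query alone is asked 100 times.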
  • 19. When the Crowd Dominates
       Kills the long tail
       See the (now obsolete) “shwarzneger” example
       Empowering the Tail
       The Filter “Bubble”, Eli Pariser
       •  Avoid the Poor get Poorer Syndrome
       Solutions:
       •  Diversity
       •  Novelty
       •  Serendipity
       Explore & Exploit
  • 20. How to Circumvent Sparsity?
       Wisdom of “ad-hoc” crowds?
       Aggregate data in the “right way”
       When data is sparse:
       Aggregate users around the same intent, task, facet, …
       Change granularity “ad hoc”
       •  Middle-aged men
       •  Fans of Messi
       Example: Mining Geo/time Data
       •  Optimal Touristic Paths from Flickr
       •  Good for tourists and locals
       De Choudhury et al, HT 2010
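The idea of changing granularity to get around sparsity can be sketched as pooling events by group before counting. This is an illustrative sketch only: the users, the segment mapping, and the minimum-support threshold are hypothetical, and real segments would come from intent, task, or facet signals rather than a hand-written table.

```python
from collections import Counter, defaultdict


def aggregate(events, group_of):
    """Pool (user, item) events by group; returns {group: Counter(item -> count)}."""
    pooled = defaultdict(Counter)
    for user, item in events:
        pooled[group_of(user)][item] += 1
    return pooled


def supported(pooled, min_support):
    """Keep only (group, item) pairs observed at least min_support times."""
    return {
        (group, item): n
        for group, items in pooled.items()
        for item, n in items.items()
        if n >= min_support
    }


# Hypothetical users and segments, for illustration only.
SEGMENT = {"u1": "fans of Messi", "u2": "fans of Messi", "u3": "fans of Messi", "u4": "other"}
EVENTS = [
    ("u1", "match highlights"),
    ("u2", "match highlights"),
    ("u3", "match highlights"),
    ("u4", "weather"),
]

per_user = aggregate(EVENTS, lambda user: user)   # finest granularity: one group per user
per_segment = aggregate(EVENTS, SEGMENT.get)      # coarser granularity: ad hoc segments
```

At user level no (user, item) pair reaches a support of 3, so nothing can be said reliably. At segment level the pooled count for the first segment does, which is the trade-off the slide points at: less personalization in exchange for enough evidence.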
  • 21. Aggregating in the Long Tail
       •  The long tail is important not only for e-commerce, but because we are all there
       •  Personalization vs. Contextualization
       User interaction is another long tail
       [Figure label: People Interests]
       Epilogue
       §  The Web is scientifically young
       §  The Web is intellectually diverse
       §  The technology mirrors the economic, legal and sociological reality
       §  Data must be interesting! (Gerhard Weikum)
       §  Problem driven
       §  Plenty of challenges
  • 22. Mirror of Society
       [Figure]
       Exports/Imports vs. Domain Links
       [Figure, from Baeza-Yates & Castillo, WWW 2006]
  • 23. Questions?
       Contact: rbaeza@acm.org
       Thanks to many people at Yahoo! Labs
       ASIST 2012 Book of the Year Award