SlideShare ist ein Scribd-Unternehmen logo
1 von 26
at io n
In fo rm          l
          ri ev a
     R et
            en ges
   C h a ll           Bruno Pedro
                        March 2010
Bruno Pedro
A n e x p e r i e n c e d We b d e v e l o p e r a n d
entrepreneur. Has extensive background in
large scale projects and technical writing.

http://tarpipe.com/user/bpedro
What is tarpipe?




       User
What is tarpipe?
3 Challenges
• Real-Time Retrieval
• Understanding Context
• Inferring Identify
Real-Time Retrieval




             http://www.flickr.com/photos/josephrobertson/127758523/
WordPress


                     source: wordpress.com




• Average ~10K posts/hour
• ~3 new posts every second
twitter


                      source: mashable.com




• Average ~1.1M tweets/hour
• ~300 new tweets every second
Challenge
• ~300 reads/second
• 160 X 300 = 48 KB/second = 4 GB/day
  (approximate calculation)




• How to process all this information?
Strategy
• Read and store immediately:
 • High performance write storage
 • No locks allowed
 • Prepare for lots of reading errors
Strategy
• Process later:
 • Regular expressions
 • Term extraction
 • Machine learning
Context




     For me context is the key - from
     that comes the understanding of
     everything. — Kenneth Noland
source: joelonsoftware.com




Dogs?
source: Google Reader Play




Dogs with unmatched title?
source: Google Buzz




Still doesn’t make a lot of sense...
This is the
worst case
scenario
Challenge
• Find context from associated content:
 • Pictures
 • Comments
 • Location information
 • Timelines
 • Authors
Strategy
• Associate content through common
  identifiers
• Establish timeline of different pieces
• Group pieces by same author
• Present in a comprehensible fashion
Identity




           source: abc Australia
Many Identifiers
• E-mail: user@example.com
• facebook: @User Name
• flickr: user or User Name (?)
• Google Buzz: @user@example.com
• twitter: @user
 ...
Addressable
• http://facebook.com/user
• http://flickr.com/user
• http://www.google.com/profiles/user
• http://twitter.com/user
  ...
How to make sense?
Challenge
• Parse every message, tweet or post
• Find possible user identifiers
• Substitute for meaningful information:
 • A link to the original profile
 • Equivalent identity on destination
Strategy
• Decentralized processing:
 • Browser based (plugin)
 • Extract identities from page
 • Process
 • Replace with meaningful information
Food for thought
• PubsubHubbub
  http://code.google.com/p/pubsubhubbub/

• Activity Streams
  http://activitystrea.ms/
• Web Finger
  http://webfinger.org/
tarpipe streamlines your     tarpipe is one of the most   Today I had a chance to
updates to various social    curious experiments in       spend time experimenting
web sites, creating simple   social media that I've       with tarpipe and I have to
or complex workflows to       seen lately. The service     say that I am intrigued by
update several buckets in    has the potential to be      the concept and impressed
one fell swoop.              the answer to the lament     by the implementation.
                             I first talked about in The
Adam Pash                    looming crisis: Personal     Jeff Barr
lifehacker                   syndication overload.        Amazon.com

                             Rafe Needleman
                             CNET news




thank you                                                 share your life

Weitere ähnliche Inhalte

Ähnlich wie Information Retrieval Challenges

Triple your blog post frequency
Triple your blog post frequencyTriple your blog post frequency
Triple your blog post frequencyAndraz Tori
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchainjasonhaddix
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPChristian Morbidoni
 
Blogging for a better classroom
Blogging for a better classroomBlogging for a better classroom
Blogging for a better classroomVicki Davis
 
Technical Communication for Unity Developers
Technical Communication for Unity DevelopersTechnical Communication for Unity Developers
Technical Communication for Unity DevelopersUnity Technologies
 
Real-World Challenges of Real-Time Social Analytics
Real-World Challenges of Real-Time Social AnalyticsReal-World Challenges of Real-Time Social Analytics
Real-World Challenges of Real-Time Social AnalyticsAttensity
 
Write a better FM
Write a better FMWrite a better FM
Write a better FMRich Bowen
 
Energizing PowerPoint
Energizing PowerPointEnergizing PowerPoint
Energizing PowerPointLaDonna Coy
 
Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...
Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...
Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...Anselm Hook
 
Nguyen phuong truong anh a story of bug bounty hunter
Nguyen phuong truong anh   a story of bug bounty hunterNguyen phuong truong anh   a story of bug bounty hunter
Nguyen phuong truong anh a story of bug bounty hunterSecurity Bootcamp
 
The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...
The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...
The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...sewilkie
 
Social Media Academy 2016 Presentation Slides
Social Media Academy 2016 Presentation SlidesSocial Media Academy 2016 Presentation Slides
Social Media Academy 2016 Presentation SlidesHarvardComms
 
Code Quality, Standards and Best Practices, Discuss
Code Quality, Standards and Best Practices, DiscussCode Quality, Standards and Best Practices, Discuss
Code Quality, Standards and Best Practices, DiscussJapheth Thomson
 
Keith J. Jones, Ph.D. - Crash Course malware analysis
Keith J. Jones, Ph.D. - Crash Course malware analysisKeith J. Jones, Ph.D. - Crash Course malware analysis
Keith J. Jones, Ph.D. - Crash Course malware analysisKeith Jones, PhD
 
项亮 推荐系统实践 从入门到精通
项亮 推荐系统实践 从入门到精通 项亮 推荐系统实践 从入门到精通
项亮 推荐系统实践 从入门到精通 topgeek
 

Ähnlich wie Information Retrieval Challenges (20)

From OSINT to Phishing presentation
From OSINT to Phishing presentationFrom OSINT to Phishing presentation
From OSINT to Phishing presentation
 
Doonish
DoonishDoonish
Doonish
 
Triple your blog post frequency
Triple your blog post frequencyTriple your blog post frequency
Triple your blog post frequency
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLP
 
Blogging for a better classroom
Blogging for a better classroomBlogging for a better classroom
Blogging for a better classroom
 
Fighting Spam at Flickr
Fighting Spam at FlickrFighting Spam at Flickr
Fighting Spam at Flickr
 
Technical Communication for Unity Developers
Technical Communication for Unity DevelopersTechnical Communication for Unity Developers
Technical Communication for Unity Developers
 
Real-World Challenges of Real-Time Social Analytics
Real-World Challenges of Real-Time Social AnalyticsReal-World Challenges of Real-Time Social Analytics
Real-World Challenges of Real-Time Social Analytics
 
Write a better FM
Write a better FMWrite a better FM
Write a better FM
 
Energizing PowerPoint
Energizing PowerPointEnergizing PowerPoint
Energizing PowerPoint
 
Ideation,demos
Ideation,demosIdeation,demos
Ideation,demos
 
Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...
Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...
Ubiquitous Angels; ambient sensor networks to crowd source crisis response an...
 
Nguyen phuong truong anh a story of bug bounty hunter
Nguyen phuong truong anh   a story of bug bounty hunterNguyen phuong truong anh   a story of bug bounty hunter
Nguyen phuong truong anh a story of bug bounty hunter
 
The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...
The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...
The Art of APPlication: Using Apps to Engage Students as Collaborators, Creat...
 
Social Media Academy 2016 Presentation Slides
Social Media Academy 2016 Presentation SlidesSocial Media Academy 2016 Presentation Slides
Social Media Academy 2016 Presentation Slides
 
Code Quality, Standards and Best Practices, Discuss
Code Quality, Standards and Best Practices, DiscussCode Quality, Standards and Best Practices, Discuss
Code Quality, Standards and Best Practices, Discuss
 
Keith J. Jones, Ph.D. - Crash Course malware analysis
Keith J. Jones, Ph.D. - Crash Course malware analysisKeith J. Jones, Ph.D. - Crash Course malware analysis
Keith J. Jones, Ph.D. - Crash Course malware analysis
 
项亮 推荐系统实践 从入门到精通
项亮 推荐系统实践 从入门到精通 项亮 推荐系统实践 从入门到精通
项亮 推荐系统实践 从入门到精通
 

Mehr von Bruno Pedro

What are Web APIs
What are Web APIsWhat are Web APIs
What are Web APIsBruno Pedro
 
Growing your business with an API
Growing your business with an APIGrowing your business with an API
Growing your business with an APIBruno Pedro
 
Product growth with an API
Product growth with an APIProduct growth with an API
Product growth with an APIBruno Pedro
 
How to grow your business with an API
How to grow your business with an APIHow to grow your business with an API
How to grow your business with an APIBruno Pedro
 
APIs Love to Chat
APIs Love to ChatAPIs Love to Chat
APIs Love to ChatBruno Pedro
 
How to Automate API Testing
How to Automate API TestingHow to Automate API Testing
How to Automate API TestingBruno Pedro
 
Asynchronous Microservices in nodejs
Asynchronous Microservices in nodejsAsynchronous Microservices in nodejs
Asynchronous Microservices in nodejsBruno Pedro
 
How to Automate API Discovery
How to Automate API DiscoveryHow to Automate API Discovery
How to Automate API DiscoveryBruno Pedro
 
Api Design & The Paris Subway
Api Design & The Paris SubwayApi Design & The Paris Subway
Api Design & The Paris SubwayBruno Pedro
 
The importance of /me
The importance of /meThe importance of /me
The importance of /meBruno Pedro
 
Maintainable consumers
Maintainable consumersMaintainable consumers
Maintainable consumersBruno Pedro
 
API Code Generation
API Code GenerationAPI Code Generation
API Code GenerationBruno Pedro
 
Bridging the Gap Between APIs and Customers
Bridging the Gap Between APIs and CustomersBridging the Gap Between APIs and Customers
Bridging the Gap Between APIs and CustomersBruno Pedro
 
Who's using your API?
Who's using your API?Who's using your API?
Who's using your API?Bruno Pedro
 
Is OAuth Really Secure?
Is OAuth Really Secure?Is OAuth Really Secure?
Is OAuth Really Secure?Bruno Pedro
 
tarpipe WordPress plugin demo
tarpipe WordPress plugin demotarpipe WordPress plugin demo
tarpipe WordPress plugin demoBruno Pedro
 
Everything OAuth
Everything OAuthEverything OAuth
Everything OAuthBruno Pedro
 
Activity Streams And Contexts
Activity Streams And ContextsActivity Streams And Contexts
Activity Streams And ContextsBruno Pedro
 

Mehr von Bruno Pedro (20)

What are Web APIs
What are Web APIsWhat are Web APIs
What are Web APIs
 
Growing your business with an API
Growing your business with an APIGrowing your business with an API
Growing your business with an API
 
Product growth with an API
Product growth with an APIProduct growth with an API
Product growth with an API
 
How to grow your business with an API
How to grow your business with an APIHow to grow your business with an API
How to grow your business with an API
 
APIs Love to Chat
APIs Love to ChatAPIs Love to Chat
APIs Love to Chat
 
How to Automate API Testing
How to Automate API TestingHow to Automate API Testing
How to Automate API Testing
 
Asynchronous Microservices in nodejs
Asynchronous Microservices in nodejsAsynchronous Microservices in nodejs
Asynchronous Microservices in nodejs
 
How to Automate API Discovery
How to Automate API DiscoveryHow to Automate API Discovery
How to Automate API Discovery
 
Api Design & The Paris Subway
Api Design & The Paris SubwayApi Design & The Paris Subway
Api Design & The Paris Subway
 
The importance of /me
The importance of /meThe importance of /me
The importance of /me
 
Maintainable consumers
Maintainable consumersMaintainable consumers
Maintainable consumers
 
API Code Generation
API Code GenerationAPI Code Generation
API Code Generation
 
Bridging the Gap Between APIs and Customers
Bridging the Gap Between APIs and CustomersBridging the Gap Between APIs and Customers
Bridging the Gap Between APIs and Customers
 
Who's using your API?
Who's using your API?Who's using your API?
Who's using your API?
 
node-fs
node-fsnode-fs
node-fs
 
Is OAuth Really Secure?
Is OAuth Really Secure?Is OAuth Really Secure?
Is OAuth Really Secure?
 
tarpipe WordPress plugin demo
tarpipe WordPress plugin demotarpipe WordPress plugin demo
tarpipe WordPress plugin demo
 
OAuth checklist
OAuth checklistOAuth checklist
OAuth checklist
 
Everything OAuth
Everything OAuthEverything OAuth
Everything OAuth
 
Activity Streams And Contexts
Activity Streams And ContextsActivity Streams And Contexts
Activity Streams And Contexts
 

Kürzlich hochgeladen

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 

Kürzlich hochgeladen (20)

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 

Information Retrieval Challenges

  • 1. at io n In fo rm l ri ev a R et en ges C h a ll Bruno Pedro March 2010
  • 2. Bruno Pedro A n e x p e r i e n c e d We b d e v e l o p e r a n d entrepreneur. Has extensive background in large scale projects and technical writing. http://tarpipe.com/user/bpedro
  • 5. 3 Challenges • Real-Time Retrieval • Understanding Context • Inferring Identify
  • 6. Real-Time Retrieval http://www.flickr.com/photos/josephrobertson/127758523/
  • 7. WordPress source: wordpress.com • Average ~10K posts/hour • ~3 new posts every second
  • 8. twitter source: mashable.com • Average ~1.1M tweets/hour • ~300 new tweets every second
  • 9. Challenge • ~300 reads/second • 160 X 300 = 48 KB/second = 4 GB/day (approximate calculation) • How to process all this information?
  • 10. Strategy • Read and store immediately: • High performance write storage • No locks allowed • Prepare for lots of reading errors
  • 11. Strategy • Process later: • Regular expressions • Term extraction • Machine learning
  • 12. Context For me context is the key - from that comes the understanding of everything. — Kenneth Noland
  • 14. source: Google Reader Play Dogs with unmatched title?
  • 15. source: Google Buzz Still doesn’t make a lot of sense...
  • 16. This is the worst case scenario
  • 17. Challenge • Find context from associated content: • Pictures • Comments • Location information • Timelines • Authors
  • 18. Strategy • Associate content through common identifiers • Establish timeline of different pieces • Group pieces by same author • Present in a comprehensible fashion
  • 19. Identity source: abc Australia
  • 20. Many Identifiers • E-mail: user@example.com • facebook: @User Name • flickr: user or User Name (?) • Google Buzz: @user@example.com • twitter: @user ...
  • 21. Addressable • http://facebook.com/user • http://flickr.com/user • http://www.google.com/profiles/user • http://twitter.com/user ...
  • 22. How to make sense?
  • 23. Challenge • Parse every message, tweet or post • Find possible user identifiers • Substitute for meaningful information: • A link to the original profile • Equivalent identity on destination
  • 24. Strategy • Decentralized processing: • Browser based (plugin) • Extract identities from page • Process • Replace with meaningful information
  • 25. Food for thought • PubsubHubbub http://code.google.com/p/pubsubhubbub/ • Activity Streams http://activitystrea.ms/ • Web Finger http://webfinger.org/
  • 26. tarpipe streamlines your tarpipe is one of the most Today I had a chance to updates to various social curious experiments in spend time experimenting web sites, creating simple social media that I've with tarpipe and I have to or complex workflows to seen lately. The service say that I am intrigued by update several buckets in has the potential to be the concept and impressed one fell swoop. the answer to the lament by the implementation. I first talked about in The Adam Pash looming crisis: Personal Jeff Barr lifehacker syndication overload. Amazon.com Rafe Needleman CNET news thank you share your life

Hinweis der Redaktion

  1. - Google Chrome plugin - Get identity information from Google Social Graph API or other means