PrivatePond: Outsourced Management of Web Corpuses
Daniel Fabbri, Arnab Nandi, Kristen LeFevre, H.V. Jagadish (University of Michigan)
Outsourcing Data to the Cloud
  - Increase in cloud computing
  - Outsource document management to service providers
  - Search and retrieve documents from the cloud
  - Leverage existing search infrastructure
  - High-quality search results
Outsourcing Challenge: Confidentiality
  - Documents may contain private information
  - Neither the service provider nor the public should have access to the contents
  - How can we balance confidentiality and search quality?
  [Figure: web and intranet search engines]
PrivatePond
  - Create and store a corpus of confidential hyperlinked documents
  - Search confidential documents using an unmodified search engine
  - Balance privacy and searchability with a secure indexable representation
  [Figure: web and intranet search engines]
PrivatePond Design Goals
  - User experience: document confidentiality, search quality, transparency
  - Search system: minimal overhead, leverage existing search infrastructure
  - Previous work requires modification to the search engine [Song 2000, Bawa 2003, Zerr 2008]
Outsourcing Architecture
  - Outsource the original corpus
  - Does not maintain confidentiality
  [Figure: user search with query Q; service with (unmodified) search engine holding D; ranked result document(s) D]
Outsourcing Architecture
  - Outsource encrypted documents
  - Local proxy encrypts and decrypts
  - Local proxy performs the searches
  - High search overhead
  [Figure: user search with query Q; local proxy; service with (unmodified) search engine holding E(D); ranked result document(s) D]
PrivatePond Architecture
  - Secure indexable representation: attached to the encrypted document, indexable, searchable
  [Figure: user search with query Q; local proxy issuing Q'; service with (unmodified) search engine holding the secure indexable representation and E(D); ranked result document(s) D]
Outsourcing Search: Practical Tradeoffs
  - Outsource original corpus: searchable, not confidential
  - Outsource encrypted corpus: confidential, not easily searched
  [Figure: axes Search Quality and Confidentiality; labeled points: Outsource Original Corpus, Outsource Encrypted Corpus, Indexable Representation]
Sample Indexable Representation
  - First, consider encrypting each word in a document
  - Maintain links between indexable representations
  - Vulnerable to attacks:
    - Language structure (e.g., <noun> <verb> <noun>)
    - Frequency of words (e.g., "twinkle" is most frequent) [Kumar 2007]
  - Example: the document "Twinkle, twinkle little star" has the indexable representation "AAA AAA BBB CCC"
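To make the frequency attack concrete, here is a minimal sketch of word-by-word tokenization. It assumes a keyed hash (HMAC) as a stand-in for deterministic encryption; the key, the function names, and the 12-character truncation are all hypothetical and not taken from the slides.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret held by the local proxy (illustration only).
SECRET_KEY = b"demo-key-not-for-production"


def encrypt_token(word: str, key: bytes = SECRET_KEY) -> str:
    """Map a word to an opaque token; equal words always give equal tokens."""
    normalized = word.lower().strip(".,;:!?")
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def encrypt_words_in_order(text: str) -> list[str]:
    """Replace each word with its token, keeping word order and repetition."""
    return [encrypt_token(word) for word in text.split()]


tokens = encrypt_words_in_order("Twinkle, twinkle little star")
counts = Counter(tokens)

# The token sequence has the same shape as "AAA AAA BBB CCC":
# the repeated word is visible without knowing the key.
print(tokens)
print(counts.most_common(1))
```

Because the mapping is deterministic, both word order and repetition survive, which is exactly what the language-structure and word-frequency attacks rely on.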
Sample Indexable Representation
  - Second, represent documents as an encrypted set-of-words
  - Prevents attacks on a single indexable representation
  - Vulnerable to attacks that aggregate word frequencies across all indexable representations in the corpus
  [Figure: Doc 1, Doc 2, Doc 3; sample indexable representation AAA BBB CCC; corpus of indexable representations; aggregate document frequency]
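A small sketch of why the set-of-words form still leaks information at the corpus level. The three-document corpus below is hypothetical, and the tokens are written as AAA, BBB, CCC only to mirror the slide's notation.

```python
from collections import Counter


def set_of_words(tokens):
    """Drop order and repetition: keep each token once, in sorted order."""
    return sorted(set(tokens))


def document_frequency(corpus):
    """Count, for each token, how many representations contain it."""
    df = Counter()
    for representation in corpus:
        df.update(set(representation))
    return df


# Hypothetical corpus of already-encrypted token sequences.
raw_corpus = [
    ["AAA", "AAA", "BBB", "CCC"],
    ["BBB", "CCC"],
    ["CCC"],
]

corpus = [set_of_words(doc) for doc in raw_corpus]
df = document_frequency(corpus)

# Within one representation every token appears once, so term frequency
# is hidden; across the corpus the document frequencies still differ.
print(corpus[0])
print(df)
```

An attacker who knows typical document frequencies in the underlying language could try to match them against this histogram, which is the aggregate attack the slide refers to.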
Sample Indexable Representation
  - Third, set-of-words representation + padding (BW = 3)
  [Figure: corpus of indexable representations with tokens AAA BBB CCC BBB CCC CCC; aggregate document frequency]
PrivatePond Indexable Representation
  - Set-of-words representation + padding (BW = 3)
  [Figure: corpus of indexable representations AAA BBB CCC, AAA BBB CCC, AAA BBB CCC; aggregate document frequency]
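The slides do not define BW or the padding procedure, so the sketch below is only one possible reading, not the authors' algorithm. It assumes BW is a bin width: tokens are ranked by document frequency, grouped into bins of BW tokens, and each token is added to extra documents until it matches the highest document frequency in its bin. The function name and the choice of which documents receive padding are hypothetical.

```python
from collections import Counter


def pad_corpus(corpus, bin_width):
    """Return padded copies of the token sets in `corpus`.

    Assumed scheme (not from the slides): within each bin of `bin_width`
    tokens, pad every token up to the bin's highest document frequency.
    """
    if bin_width < 1:
        raise ValueError("bin_width must be at least 1")

    df = Counter(token for doc in corpus for token in set(doc))
    # Rank tokens from most to least frequent; break ties by token name.
    ranked = sorted(df, key=lambda token: (-df[token], token))
    padded = [set(doc) for doc in corpus]

    for start in range(0, len(ranked), bin_width):
        bin_tokens = ranked[start:start + bin_width]
        target = df[bin_tokens[0]]  # highest frequency in this bin
        for token in bin_tokens:
            missing = target - df[token]
            for doc in padded:
                if missing == 0:
                    break
                if token not in doc:
                    doc.add(token)  # false-positive entry added as padding
                    missing -= 1
    return padded


# Hypothetical corpus of encrypted set-of-words representations.
corpus = [{"AAA", "BBB", "CCC"}, {"BBB", "CCC"}, {"CCC"}]

padded = pad_corpus(corpus, bin_width=3)
print(padded)
```

Under this reading, a bin width of 3 flattens the document-frequency histogram of the three tokens, while a bin width of 1 leaves the corpus unchanged. The added entries are what later show up as false positives in search results.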
PrivatePond Indexable Representation: Impact on Search Quality
  - Lose proximity-based search
  - Lose term frequency
  - Padding of tokens introduces false positives
  - What is the effect of the indexable representation on search quality?
Evaluation
  - Data: sample of Simple Wikipedia (small corpus) and full Simple Wikipedia (large corpus)
  - Query workload of 10K queries
  - Evaluation performed with MySQL
Ranking Models
  - TFIDF (as implemented in MySQL FULLTEXT)
  - PageRank
  - Combination of ranking models
  - Measure the change in search quality due to the indexable representation
Search Quality Metrics
  - Gold list: ranked results from the search engine over the original corpus
  - Pond list: ranked results from the search engine over the indexable representation
Example: Search Quality Metrics
  - N: consider documents ranked from 1 to N
  - P(N) = |gold list ∩ pond list| / N, with both lists cut off at rank N
  - P(3) = 2/3
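The P(N) metric can be sketched as a short function. The document identifiers and the two ranked lists below are hypothetical, chosen so that the result reproduces the P(3) = 2/3 value from the slide.

```python
def precision_at_n(gold_list, pond_list, n):
    """Fraction of the top-n gold documents that also appear in the top-n pond list."""
    if n <= 0:
        raise ValueError("n must be positive")
    gold_top = set(gold_list[:n])
    pond_top = set(pond_list[:n])
    return len(gold_top & pond_top) / n


# Hypothetical rankings: gold from the original corpus,
# pond from the indexable representation.
gold = ["d1", "d2", "d3", "d4"]
pond = ["d1", "d4", "d2", "d3"]

p3 = precision_at_n(gold, pond, 3)
print(p3)  # two of the top three gold documents are in the pond top three
```

Note that this metric ignores the order within the top N; it only measures overlap between the two cut-off lists.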
Two additional metrics (included in the paper):

  • 26. Rank Perturbation
  • 28. PageRank is unaffected by the set-of-words representation
  • 30. Padding in documents with high PageRank or low document frequency
  • 32. Conclusion: present the PrivatePond architecture; outsourcing search; goal of balancing searchability and confidentiality; leverages existing search engine infrastructure. Future work: alternative indexable representations
  • 33. More info at www.eecs.umich.edu/db

Editor's notes

  1. Consider a small company’s intranet. Offload management responsibilities.
  2. Secure Boolean search on encrypted documents / secure inverted indexes for document retrieval. Transparency: seamless interaction for the user. Query run time.
  3. Traditional search architecture: a query returns a ranked list of documents.
  4. Download each encrypted document to search.
  5. So not confidential?
  6. One example to strike a balance between searchability and confidentiality.
  7. Impact on search quality: lose proximity-based search; lose term frequency; padding of tokens introduces false positives.
  8. Given a ranking model, examine the change in search quality; we do not determine the best ranking model. N: the N highest-ranked documents.
  9. Meaning of N
  10. BW = 1
  11. Varying confidentiality and search quality characteristics