TEXT INDEXING WITH ACCUMULO
Efficient searching in a big data world

Tomer Kishoni
March 21, 2012
Agenda
•  Problem Statement


•  Term-Based Inverted Index


•  Term-Based Inverted Index and Accumulo


•  Document Partitioned Index


•  Document Partitioned Index and Accumulo
Problem
•  How can we efficiently search for information in a big data
 world?
  •  Processing time
  •  Network bandwidth


•  How can we leverage Accumulo’s feature set to create
 efficient search patterns?
Focus on Indexing
•  Indexing your data is a great place to start


•  Let’s focus on:
   •  Term-based inverted index
    •  Great for single term search


  •  Document partitioned index
    •  Great for multiple term search
Example Dataset
Document ID              Column         Value
Learning Python          Author         Lutz
Learning Python          Summary        Extensive book on …
Programming Pearls       Author         Bentley
Programming Pearls       Summary        Classic techniques to …
Computational Geometry   Author         Martin
Computational Geometry   Summary        Want to know how to …


•  Dataset of books
   •  Author
   •  Book summary


•  Reference the data using the document id
Term-Based Inverted Index
Value                     Column           Document ID
Lutz                      Author           Learning Python
Extensive book on …       Summary          Learning Python
Bentley                   Author           Programming Pearls
Classic techniques to …   Summary          Programming Pearls
Martin                    Author           Computational Geometry
Want to know how to …     Summary          Computational Geometry


•  Reference the document id using the value


•  Can split up unstructured text to search for specific terms
Term-Based Index and Accumulo
•  Accumulo partitions data primarily on the row id
   •  Lexicographic sorting
   •  Sorting provides a much friendlier way to search data
•  Accumulo provides multidimensional storage
   •  Row id → term
   •  Column family → column name
   •  Column qualifier → document id


•  Can normalize the data if needed
   •  E.g., lower case terms
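The mapping above can be sketched with plain Java collections. This is a minimal illustration rather than Accumulo client code: the class name TermIndexSketch, the tab-separated entry format, and the tokenizing rule (lower-case, then split on non-alphanumeric characters) are assumptions made for the example.

```java
import java.util.Locale;
import java.util.TreeSet;

// Illustrative model only: a TreeSet stands in for the sorted index table.
final class TermIndexSketch {

    // Builds one entry per term, formatted "term<TAB>column<TAB>documentId",
    // mirroring row id / column family / column qualifier.
    // The TreeSet keeps entries in lexicographic order.
    static TreeSet<String> index(String[][] records) {
        TreeSet<String> entries = new TreeSet<>();
        for (String[] record : records) {
            String docId = record[0];
            String column = record[1];
            String value = record[2];
            // Normalize to lower case, then split into terms.
            for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    entries.add(term + "\t" + column + "\t" + docId);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Sample records in (document id, column, value) form.
        String[][] records = {
            {"Learning Python", "Author", "Lutz"},
            {"Learning Python", "Summary", "Extensive book on"},
            {"Programming Pearls", "Author", "Bentley"},
            {"Programming Pearls", "Summary", "Classic techniques to"},
        };
        for (String entry : index(records)) {
            System.out.println(entry);
        }
    }
}
```

The tab separator is chosen because it sorts below every letter and digit, so all entries for one term stay adjacent in the sorted set.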
Term-Based Index and Accumulo
Row ID       Column Family   Column Qualifier
bentley      Author          Programming Pearls
book         Summary         Learning Python
classic      Summary         Programming Pearls
extensive    Summary         Learning Python
how          Summary         Computational Geometry
know         Summary         Computational Geometry
lutz         Author          Learning Python
martin       Author          Computational Geometry
on           Summary         Learning Python
techniques   Summary         Programming Pearls
to           Summary         Computational Geometry
to           Summary         Programming Pearls
want         Summary         Computational Geometry
Term-Based Index and Accumulo
•  Utilize Accumulo’s Scanners to search for terms
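// Imports needed by this snippet
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// Create the scanner object
Scanner indexScanner = ...

// Restrict the scan to the row for the term we want to search
indexScanner.setRange(new Range("book"));
// Only read entries from the Summary column family
indexScanner.fetchColumnFamily(new Text("Summary"));

// Each returned key carries a matching document id in its column qualifier
for (Entry<Key, Value> entry : indexScanner) {
  Text docId = entry.getKey().getColumnQualifier();
  ...
}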
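Because keys are sorted, all entries for a term sit next to each other, so a single-term lookup amounts to a seek followed by a short sequential read. A rough model of that behaviour, with a TreeSet standing in for the index table (the class TermScanSketch and the tab-separated entry format are assumptions for illustration, not part of Accumulo):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative model only: entries are "term<TAB>column<TAB>documentId".
final class TermScanSketch {

    // Returns the document ids indexed under one term and one column.
    static List<String> lookup(TreeSet<String> index, String term, String column) {
        String prefix = term + "\t" + column + "\t";
        List<String> docIds = new ArrayList<>();
        // Seek to the first entry at or after the prefix, then read
        // sequentially until an entry no longer shares the prefix.
        for (String entry : index.tailSet(prefix, true)) {
            if (!entry.startsWith(prefix)) {
                break;
            }
            docIds.add(entry.substring(prefix.length()));
        }
        return docIds;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(Arrays.asList(
            "bentley\tAuthor\tProgramming Pearls",
            "book\tSummary\tLearning Python",
            "classic\tSummary\tProgramming Pearls",
            "lutz\tAuthor\tLearning Python"));
        System.out.println(lookup(index, "book", "Summary"));
    }
}
```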
Term-Based Index and Accumulo
•  Can make this even better using locality groups
   •  Data partitioned by certain column families
   •  Don’t need to skip over unnecessary columns
   •  Scan data sequentially

Row ID              Column Family        Column Qualifier
bentley             Author               Programming Pearls
lutz                Author               Learning Python
martin              Author               Computational Geometry
book                Summary              Learning Python
classic             Summary              Programming Pearls
extensive           Summary              Learning Python
…                   …                    …
Problems with Term-Based Indexing
•  Term-based indexes are great for single term queries


•  Inefficient at multi-term search
    •  The terms of a single document could be split over multiple tablets
       being served by multiple tablet servers
    •  Need to do set operations on the client
     •  Inefficient use of computer resources and network bandwidth
Problems with Term-Based Indexing
•  Inefficient at multi-term search

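Search: code book  →  result after client-side intersection: doc1

Tablet holding "book" returns: doc1, doc2
 Row       CF        CQ
 book      summary   doc1
 book      summary   doc2
 classic   summary   doc3

Tablet holding "code" returns: doc1
 Row       CF        CQ
 code      summary   doc1
 left      summary   doc2
 up        summary   doc3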


 •  Wasteful to bring doc2 back
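The cost of intersecting on the client can be made concrete with a small sketch. It assumes each tablet has already returned the document ids for its term; the class ClientSideIntersectionSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative model only: each inner list is what one tablet sent back
// for one search term.
final class ClientSideIntersectionSketch {

    // Keeps only the document ids present in every per-term result list.
    static Set<String> intersect(List<List<String>> postingLists) {
        Set<String> result = new TreeSet<>();
        boolean first = true;
        for (List<String> postings : postingLists) {
            if (first) {
                result.addAll(postings);
                first = false;
            } else {
                result.retainAll(postings);
            }
        }
        return result;
    }

    // Counts every entry that had to cross the network to reach the client.
    static int entriesTransferred(List<List<String>> postingLists) {
        int count = 0;
        for (List<String> postings : postingLists) {
            count += postings.size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> results = Arrays.asList(
            Arrays.asList("doc1", "doc2"), // documents containing "book"
            Arrays.asList("doc1"));        // documents containing "code"
        System.out.println("matches: " + intersect(results));
        System.out.println("entries transferred: " + entriesTransferred(results));
    }
}
```

In this example three entries are transferred to produce one match; the gap grows with the number of documents that contain only some of the terms.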
Document Partitioned Index
•  Distributing the index by the document rather than the
 term

•  All terms for a document are binned together


•  Since all the terms are binned together we can perform
 set operations on the servers
Document Partitioned Index and Accumulo
•  Accumulo stores all data on the same tablet if the key has
 the same row id
  •  Allows us to easily bin a document’s terms


•  Accumulo iterators allow us to perform server-side
 processing
  •  Allows us to easily perform set operations
  •  IntersectingIterator
Document Partitioned Index and Accumulo
Row ID       Column Family        Column Qualifier
bin1         Author=bentley       Programming Pearls
bin1         Author=lutz          Learning Python
bin1         Summary=book         Learning Python
bin1         Summary=classic      Programming Pearls
bin1         Summary=extensive    Learning Python
bin1         Summary=on           Learning Python
bin1         Summary=techniques   Programming Pearls
bin1         Summary=to           Programming Pearls
bin2         Author=martin        Computational Geometry
bin2         Summary=to           Computational Geometry
bin2         Summary=want         Computational Geometry
bin2         Summary=how          Computational Geometry
bin2         Summary=know         Computational Geometry
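One way to produce rows like these is to derive the bin from the document id, so every term of a document lands in the same row. The sketch below assumes hash-based bin assignment, which the slides do not specify; the class DocumentPartitionSketch, the tab-separated entry format, and the tokenizing rule are likewise assumptions.

```java
import java.util.Locale;
import java.util.TreeSet;

// Illustrative model only: entries are "bin<TAB>column=term<TAB>documentId".
final class DocumentPartitionSketch {

    // Assumption: the bin is chosen by hashing the document id.
    static String binFor(String docId, int numBins) {
        return "bin" + (Math.floorMod(docId.hashCode(), numBins) + 1);
    }

    // Builds the index; all terms of one document share that document's bin.
    static TreeSet<String> index(String[][] records, int numBins) {
        TreeSet<String> entries = new TreeSet<>();
        for (String[] record : records) {
            String docId = record[0];
            String column = record[1];
            String value = record[2];
            String bin = binFor(docId, numBins);
            for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    entries.add(bin + "\t" + column + "=" + term + "\t" + docId);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Sample records in (document id, column, value) form.
        String[][] records = {
            {"Learning Python", "Author", "Lutz"},
            {"Learning Python", "Summary", "Extensive book on"},
            {"Computational Geometry", "Author", "Martin"},
        };
        for (String entry : index(records, 2)) {
            System.out.println(entry);
        }
    }
}
```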
Multi-Term Search with Document Partitioned Indexes and Accumulo
•  Each tablet server returns only fully qualified documents, i.e. documents
 that contain every search term

Search: code book  →  result: doc1

Tablet holding bin1 returns: doc1
 Row    CF                CQ
 bin1   summary=book      doc1
 bin1   summary=code      doc1

Tablet holding bin2 returns: <none>
 Row    CF                CQ
 bin2   summary=book      doc2
 bin2   summary=classic   doc3
 bin2   summary=left      doc2
 bin2   summary=up        doc3
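The per-bin intersection shown in the diagram can be modelled in plain Java. This is a sketch of the idea only: it uses simple set intersection rather than the merge that Accumulo's IntersectingIterator performs, and the class ServerSideIntersectionSketch and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative model only: a bin maps column family -> document ids.
final class ServerSideIntersectionSketch {

    // Builds one bin from alternating (columnFamily, documentId) arguments.
    static Map<String, Set<String>> bin(String... familyAndDocId) {
        Map<String, Set<String>> bin = new TreeMap<>();
        for (int i = 0; i + 1 < familyAndDocId.length; i += 2) {
            bin.computeIfAbsent(familyAndDocId[i], k -> new TreeSet<>())
               .add(familyAndDocId[i + 1]);
        }
        return bin;
    }

    // "Server side": returns only documents listed under every term in this bin.
    static Set<String> intersectWithinBin(Map<String, Set<String>> bin, List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = bin.containsKey(term)
                ? bin.get(term) : Collections.<String>emptySet();
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    // "Client side": simply collects what each bin returned.
    static List<String> search(List<Map<String, Set<String>>> bins, List<String> terms) {
        List<String> hits = new ArrayList<>();
        for (Map<String, Set<String>> bin : bins) {
            hits.addAll(intersectWithinBin(bin, terms));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, Set<String>>> bins = new ArrayList<>();
        bins.add(bin("summary=book", "doc1", "summary=code", "doc1"));
        bins.add(bin("summary=book", "doc2", "summary=classic", "doc3",
                     "summary=left", "doc2", "summary=up", "doc3"));
        System.out.println(search(bins, Arrays.asList("summary=code", "summary=book")));
    }
}
```

Only the matching document id leaves each bin, so doc2 never crosses the network.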
Document Partitioned Index and Accumulo with IntersectingIterators
•  IntersectingIterators will check the column families for the
 specified terms
// Create the scanner object
BatchScanner indexScanner = ...

// Create the term array
Text[] terms = {new Text("summary=code"),
                new Text("summary=book")};

// Configure the intersecting iterator at priority 20
// (IteratorSetting API, Accumulo 1.4 and later; it replaces the
// deprecated setScanIterators and option-setting calls)
IteratorSetting ii = new IteratorSetting(20, "ii",
        IntersectingIterator.class);

// Tell the iterator which column families (terms) to intersect
IntersectingIterator.setColumnFamilies(ii, terms);

// Attach the iterator to the scanner
indexScanner.addScanIterator(ii);
Document Partitioned Index and Accumulo with IntersectingIterators
•  For a basic document partitioned index we want to scan
 the entire index table
// Set the range to scan everything
indexScanner.setRanges(Collections.singleton(new Range()));

// Only documents that contain every term are returned
for (Entry<Key, Value> entry : indexScanner) {
  Text docId = entry.getKey().getColumnQualifier();
  ...
}
Document Partitioned Index and Accumulo (Bonus)
•  Bin id can include space, time, etc.
   •  Use the dynamic schema of Accumulo to your advantage
   •  Instead of:
      •  bin1, bin2, bin3
   •  Try out:
      •  2012Q4_book_1, 2012Q4_article_1, 2010Q1_tv_2
      •  This includes time and categories
      •  Set the BatchScanner’s ranges accordingly


•  Avoid using two scanners to query the index table and
 then the record table
  •  Store both the index and record data in the same table
  •  Need to correctly format the data and use the
   FamilyIntersectingIterator
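Structured bin ids pay off because sorted row ids turn a time or category filter into a prefix range. A minimal sketch of that selection, using a TreeSet of bin ids in place of the table (the class BinRangeSketch and its method name are hypothetical):

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative model only: the sorted set stands in for the table's row ids.
final class BinRangeSketch {

    // Selects every bin id that starts with the given prefix by taking the
    // half-open range [prefix, prefix + largest char).
    static SortedSet<String> binsWithPrefix(TreeSet<String> binIds, String prefix) {
        return binIds.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        TreeSet<String> binIds = new TreeSet<>(Arrays.asList(
            "2012Q4_book_1", "2012Q4_article_1", "2010Q1_tv_2"));
        // Everything from 2012 Q4, regardless of category.
        System.out.println(binsWithPrefix(binIds, "2012Q4_"));
        // Only books from 2012 Q4.
        System.out.println(binsWithPrefix(binIds, "2012Q4_book_"));
    }
}
```

Putting the time component first makes time-window queries a single contiguous range; putting the category first would favour category queries instead.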
Summary
•  Term-based inverted index
    •  Take the value from the record table and make it the row id in the
       index table
    •  Great at single term queries
    •  Bad at multi-term queries
     •  Network bandwidth
     •  Resources


•  Document Partitioned Index
   •  Distributing the index by the document will ensure that all terms for
      a record are served by a single Tablet Server
   •  Leverage Iterators to do all the work server-side
   •  Great at multi-term queries
