TEXT INDEXING WITH ACCUMULO
Efficient searching in a big data world

Tomer Kishoni
March 21, 2012
Agenda
•  Problem Statement


•  Term-Based Inverted Index


•  Term-Based Inverted Index and Accumulo


•  Document Partitioned Index


•  Document Partitioned Index and Accumulo
Problem
•  How can we efficiently search for information in a big data
 world?
  •  Processing time
  •  Network bandwidth


•  How can we leverage Accumulo’s feature set to create
 efficient search patterns?
Focus on Indexing
•  Indexing your data is a great place to start


•  Let’s focus on:
   •  Term-based inverted index
    •  Great for single term search


  •  Document partitioned index
    •  Great for multiple term search
Example Dataset
Document ID              Column         Value
Learning Python          Author         Lutz
Learning Python          Summary        Extensive book on …
Programming Pearls       Author         Bentley
Programming Pearls       Summary        Classic techniques to …
Computational Geometry   Author         Martin
Computational Geometry   Summary        Want to know how to …


•  Dataset of books
   •  Author
   •  Book summary


•  Reference the data using the document id
Term-Based Inverted Index
Value                     Column           Document ID
Lutz                      Author           Learning Python
Extensive book on …       Summary          Learning Python
Bentley                   Author           Programming Pearls
Classic techniques to …   Summary          Programming Pearls
Martin                    Author           Computational Geometry
Want to know how to …     Summary          Computational Geometry


•  Reference the document id using the value


•  Can split up unstructured text to search for specific terms
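A minimal plain-Java sketch of the idea above, not Accumulo code: it splits each summary into terms and maps every term back to the document ids that contain it. The class name, the `buildIndex` helper, and the tokenization rule (lower-case, split on non-letters) are assumptions made for illustration, and only the visible words of the truncated summaries are used.

```java
import java.util.*;

public class InvertedIndexSketch {
    // term -> document ids; a TreeMap keeps terms in sorted order
    static SortedMap<String, SortedSet<String>> buildIndex(Map<String, String> summaries) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : summaries.entrySet()) {
            // Assumed normalization: lower-case the text, then split on runs of non-letters
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (term.isEmpty()) continue; // split() can yield a leading empty string
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> summaries = new LinkedHashMap<>();
        summaries.put("Learning Python", "Extensive book on");
        summaries.put("Programming Pearls", "Classic techniques to");
        summaries.put("Computational Geometry", "Want to know how to");

        SortedMap<String, SortedSet<String>> index = buildIndex(summaries);
        System.out.println(index.get("book")); // [Learning Python]
        System.out.println(index.get("to"));   // [Computational Geometry, Programming Pearls]
    }
}
```

Looking up a term is now a single map access instead of a pass over every document.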
Term-Based Index and Accumulo
•  Accumulo partitions data primarily on the row id
   •  Lexicographic sorting
   •  Sorting provides a much friendlier way to search data
•  Accumulo provides multidimensional storage
   •  Row id → term
   •  Column family → column name
   •  Column qualifier → document id


•  Can normalize the data if needed
   •  E.g., lower case terms
Term-Based Index and Accumulo
Row ID       Column Family   Column Qualifier
bentley      Author          Programming Pearls
book         Summary         Learning Python
classic      Summary         Programming Pearls
extensive    Summary         Learning Python
how          Summary         Computational Geometry
know         Summary         Computational Geometry
lutz         Author          Learning Python
martin       Author          Computational Geometry
on           Summary         Learning Python
techniques   Summary         Programming Pearls
to           Summary         Computational Geometry
to           Summary         Programming Pearls
want         Summary         Computational Geometry
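The table above can be reproduced with a small plain-Java sketch. `IndexKey` is a hypothetical stand-in for an Accumulo key (row id, column family, column qualifier) and `indexEntries` is an illustrative helper; neither is part of the Accumulo API. Sorting uses `String.compareTo`, which matches byte-wise lexicographic order only for ASCII text like this example.

```java
import java.util.*;

public class TermIndexKeys {
    // Simplified stand-in for an Accumulo key: row id, column family, column qualifier
    static final class IndexKey implements Comparable<IndexKey> {
        final String row, family, qualifier;

        IndexKey(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }

        // Order by row first, then family, then qualifier
        @Override public int compareTo(IndexKey o) {
            int c = row.compareTo(o.row);
            if (c == 0) c = family.compareTo(o.family);
            if (c == 0) c = qualifier.compareTo(o.qualifier);
            return c;
        }

        @Override public String toString() {
            return row + " | " + family + " | " + qualifier;
        }
    }

    // One entry per term: row = lower-cased term, family = column name, qualifier = document id
    static SortedSet<IndexKey> indexEntries(String docId, String column, String value) {
        SortedSet<IndexKey> entries = new TreeSet<>();
        for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!term.isEmpty()) entries.add(new IndexKey(term, column, docId));
        }
        return entries;
    }

    public static void main(String[] args) {
        SortedSet<IndexKey> table = new TreeSet<>();
        table.addAll(indexEntries("Programming Pearls", "Author", "Bentley"));
        table.addAll(indexEntries("Programming Pearls", "Summary", "Classic techniques to"));
        table.addAll(indexEntries("Learning Python", "Author", "Lutz"));
        table.addAll(indexEntries("Learning Python", "Summary", "Extensive book on"));
        table.forEach(System.out::println); // prints entries in sorted key order
    }
}
```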
Term-Based Index and Accumulo
•  Utilize Accumulo’s Scanners to search for terms
// Create the scanner object
Scanner indexScanner = ...

// Restrict the scan to the row for the term we want to search
indexScanner.setRange(new Range("book"));
indexScanner.fetchColumnFamily(new Text("Summary"));

// Get the index results
for(Entry<Key, Value> entry : indexScanner) {
  Text docId = entry.getKey().getColumnQualifier();
  ...
}
Term-Based Index and Accumulo
•  Can make this even better using locality groups
   •  Data partitioned by certain column families
   •  Don’t need to skip over unnecessary columns
   •  Scan data sequentially

Row ID              Column Family        Column Qualifier
bentley             Author               Programming Pearls
lutz                Author               Learning Python
martin              Author               Computational Geometry
book                Summary              Learning Python
classic             Summary              Programming Pearls
extensive           Summary              Learning Python
…                   …                    …
Problems with Term-Based Indexing
•  Term-based indexes are great for single term queries


•  Inefficient at multi-term search
    •  The terms of a single document could be split over multiple tablets
       being served by multiple tablet servers
    •  Need to do set operations on the client
     •  Inefficient use of computer resources and network bandwidth
Problems with Term-Based Indexing
•  Inefficient at multi-term search

                 Search: code book  →  doc1

 Returned: doc1, doc2                        Returned: doc1

 Row       CF       CQ                       Row    CF      CQ
 book      summary  doc1                     code   summary doc1
 book      summary  doc2                     left   summary doc2
 classic   summary  doc3                     up     summary doc3


 •  Wasteful to bring doc2 back
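The cost of the client-side set operation can be sketched in plain Java. This is an illustration under assumptions, not Accumulo code: the index is a simple in-memory map from term to document ids, and `search` and `idsTransferred` are hypothetical helpers. Every per-term result list crosses the network in full before the client can intersect them.

```java
import java.util.*;

public class ClientSideIntersection {
    // Intersect the per-term document id lists on the client
    static Set<String> search(Map<String, Set<String>> index, List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = index.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.<String>emptySet() : result;
    }

    // Number of document ids sent to the client across all per-term scans
    static int idsTransferred(Map<String, Set<String>> index, List<String> terms) {
        int n = 0;
        for (String term : terms) {
            n += index.getOrDefault(term, Collections.<String>emptySet()).size();
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("book", new TreeSet<>(Arrays.asList("doc1", "doc2")));
        index.put("code", new TreeSet<>(Arrays.asList("doc1")));

        List<String> query = Arrays.asList("code", "book");
        System.out.println(search(index, query));         // [doc1]
        System.out.println(idsTransferred(index, query)); // 3
    }
}
```

Three ids are transferred to produce one hit; on a large index the gap between transferred and matching ids grows with the size of the most common term's list.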
Document Partitioned Index
•  Distributing the index by the document rather than the
 term

•  All terms for a document are binned together


•  Since all the terms are binned together we can perform
 set operations on the servers
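A rough plain-Java sketch of binning, assuming documents are assigned to bins by hashing the document id; the slides do not prescribe a binning strategy, so `binFor` and the bin count are illustrative choices. All terms of one document end up under the same bin id.

```java
import java.util.*;

public class DocumentPartitionedIndex {
    // Pick a bin for a document; hashing the id is an assumed strategy
    static String binFor(String docId, int numBins) {
        return "bin" + Math.floorMod(docId.hashCode(), numBins);
    }

    // bin id -> ("column=term" -> document ids); every entry of a document shares one bin
    static SortedMap<String, SortedMap<String, SortedSet<String>>> build(
            Map<String, Map<String, String>> records, int numBins) {
        SortedMap<String, SortedMap<String, SortedSet<String>>> index = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> rec : records.entrySet()) {
            String docId = rec.getKey();
            SortedMap<String, SortedSet<String>> bin =
                    index.computeIfAbsent(binFor(docId, numBins), b -> new TreeMap<>());
            for (Map.Entry<String, String> col : rec.getValue().entrySet()) {
                for (String term : col.getValue().toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                    if (term.isEmpty()) continue;
                    bin.computeIfAbsent(col.getKey() + "=" + term, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> records = new LinkedHashMap<>();
        Map<String, String> python = new LinkedHashMap<>();
        python.put("Author", "Lutz");
        python.put("Summary", "Extensive book on");
        records.put("Learning Python", python);
        Map<String, String> pearls = new LinkedHashMap<>();
        pearls.put("Author", "Bentley");
        pearls.put("Summary", "Classic techniques to");
        records.put("Programming Pearls", pearls);

        // Print each bin with its "column=term" entries
        build(records, 2).forEach((bin, entries) -> System.out.println(bin + " -> " + entries));
    }
}
```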
Document Partitioned Index and Accumulo
•  Accumulo never splits a row across tablets, so all keys with
 the same row id are stored on the same tablet
  •  Allows us to easily bin a document’s terms


•  Accumulo iterators allow us to perform server-side
 processing
  •  Allows us to easily perform set operations
  •  IntersectingIterator
Document Partitioned Index and Accumulo
Row ID       Column Family        Column Qualifier
bin1         Author=bentley       Programming Pearls
bin1         Author=lutz          Learning Python
bin1         Summary=book         Learning Python
bin1         Summary=classic      Programming Pearls
bin1         Summary=extensive    Learning Python
bin1         Summary=on           Learning Python
bin1         Summary=techniques   Programming Pearls
bin1         Summary=to           Programming Pearls
bin2         Author=martin        Computational Geometry
bin2         Summary=to           Computational Geometry
bin2         Summary=want         Computational Geometry
bin2         Summary=how          Computational Geometry
bin2         Summary=know         Computational Geometry
Multi-Term Search with Document Partitioned Indexes and Accumulo
•  Tablet server only returns fully qualified documents (those matching every search term)

                 Search: code book  →  doc1

 Returned: doc1                              Returned: <none>

 Row    CF               CQ                  Row    CF               CQ
 bin1   summary=book     doc1                bin2   summary=book     doc2
 bin1   summary=code     doc1                bin2   summary=classic  doc3
                                             bin2   summary=left     doc2
                                             bin2   summary=up       doc3
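What happens inside each bin can be approximated in plain Java. This sketch only mimics the effect of a server-side intersection and is not the IntersectingIterator implementation; `intersectBin` and `search` are hypothetical names, and the data mirrors the two bins shown above.

```java
import java.util.*;

public class PerBinIntersection {
    // Runs per bin: intersect the document ids stored under each requested "column=term"
    static SortedSet<String> intersectBin(Map<String, Set<String>> bin, List<String> terms) {
        SortedSet<String> result = null;
        for (String term : terms) {
            Set<String> docs = bin.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
            if (result.isEmpty()) break; // no document in this bin can match
        }
        return result == null ? new TreeSet<String>() : result;
    }

    // The client only concatenates per-bin results; it performs no set operations
    static List<String> search(Map<String, Map<String, Set<String>>> bins, List<String> terms) {
        List<String> hits = new ArrayList<>();
        for (Map<String, Set<String>> bin : bins.values()) {
            hits.addAll(intersectBin(bin, terms));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> bin1 = new TreeMap<>();
        bin1.put("summary=book", new TreeSet<>(Arrays.asList("doc1")));
        bin1.put("summary=code", new TreeSet<>(Arrays.asList("doc1")));

        Map<String, Set<String>> bin2 = new TreeMap<>();
        bin2.put("summary=book", new TreeSet<>(Arrays.asList("doc2")));
        bin2.put("summary=classic", new TreeSet<>(Arrays.asList("doc3")));
        bin2.put("summary=left", new TreeSet<>(Arrays.asList("doc2")));
        bin2.put("summary=up", new TreeSet<>(Arrays.asList("doc3")));

        Map<String, Map<String, Set<String>>> bins = new TreeMap<>();
        bins.put("bin1", bin1);
        bins.put("bin2", bin2);

        System.out.println(search(bins, Arrays.asList("summary=code", "summary=book"))); // [doc1]
    }
}
```

Because a document's terms never span bins, each bin can decide on its own whether a document matches, and doc2 never leaves bin2.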
Document Partitioned Index and Accumulo with IntersectingIterators
•  IntersectingIterators will check the column families for the
 specified terms
// Create the scanner object
BatchScanner indexScanner = ...

// Create the term array
Text[] terms = {new Text("summary=code"),
               new Text("summary=book")};

// Set the intersecting iterator
indexScanner.setScanIterators(20,
       IntersectingIterator.class.getName(), "ii");

// Set the iterator options
indexScanner.setScanIteratorOptions("ii",
       IntersectingIterator.columnFamiliesOptionName,
       IntersectingIterator.encodeColumns(terms));
Document Partitioned Index and Accumulo with IntersectingIterators
•  For a basic document partitioned index we want to scan
 the entire index table
// Set the range to scan everything
indexScanner.setRanges(Collections.singleton(new Range()));

// Only fully qualified documents will return
for(Entry<Key, Value> entry : indexScanner) {
  Text docId = entry.getKey().getColumnQualifier();
  ...
}
Document Partitioned Index and Accumulo (Bonus)
•  Bin id can include space, time, etc.
   •  Use the dynamic schema of Accumulo to your advantage
   •  Instead of:
      •  bin1, bin2, bin3
   •  Try out:
      •  2012Q4_book_1, 2012Q4_article_1, 2010Q1_tv_2
      •  This includes time and categories
      •  Set the BatchScanner’s ranges accordingly


•  Avoid using two scanners to query the index table and
 then the record table
  •  Store both the index and record data in the same table
  •  Need to correctly format the data and use the
   FamilyIntersectingIterator
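The structured bin ids can be illustrated with a plain-Java sketch in which a sorted map stands in for the index table. The `binId` format follows the examples above, while `prefixScan` is a hypothetical helper showing how a prefix on sorted row ids narrows the scan to one time period; with Accumulo the equivalent step would be choosing the BatchScanner's ranges.

```java
import java.util.*;

public class BinRangeSketch {
    // Compose a bin id from a time period, a category, and a shard number
    static String binId(int year, int quarter, String category, int shard) {
        return year + "Q" + quarter + "_" + category + "_" + shard;
    }

    // Rows whose id starts with the prefix, found via the sorted order of row ids.
    // prefix + '\uffff' serves as an exclusive upper bound for ordinary text row ids.
    static <V> SortedMap<String, V> prefixScan(SortedMap<String, V> table, String prefix) {
        return table.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedMap<String, String> table = new TreeMap<>();
        table.put(binId(2012, 4, "book", 1), "entries for books, 2012 Q4");
        table.put(binId(2012, 4, "article", 1), "entries for articles, 2012 Q4");
        table.put(binId(2010, 1, "tv", 2), "entries for tv, 2010 Q1");

        // Only bins from 2012 Q4 are visited
        System.out.println(prefixScan(table, "2012Q4_").keySet()); // [2012Q4_article_1, 2012Q4_book_1]
    }
}
```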
Summary
•  Term-based inverted index
    •  Take the value from the record table and make it the row id in the
       index table
    •  Great at single term queries
    •  Bad at multi-term queries
     •  Network bandwidth
     •  Resources


•  Document Partitioned Index
   •  Distributing the index by the document will ensure that all terms for
      a record are served by a single Tablet Server
   •  Leverage Iterators to do all the work server-side
   •  Great at multi-term queries
