Text Indexing in Accumulo
1. TEXT INDEXING WITH ACCUMULO
Efficient searching in a big data world
Tomer Kishoni
March 21, 2012
2. Agenda
• Problem Statement
• Term-Based Inverted Index
• Term-Based Inverted Index and Accumulo
• Document Partitioned Index
• Document Partitioned Index and Accumulo
3. Problem
• How can we efficiently search for information in a big data world?
  • Processing time
  • Network bandwidth
• How can we leverage Accumulo’s feature set to create efficient search patterns?
4. Focus on Indexing
• Indexing your data is a great place to start
• Let’s focus on:
  • Term-based inverted index
    • Great for single term search
  • Document partitioned index
    • Great for multiple term search
5. Example Dataset
Document ID | Column | Value
Learning Python | Author | Lutz
Learning Python | Summary | Extensive book on …
Programming Pearls | Author | Bentley
Programming Pearls | Summary | Classic techniques to …
Computational Geometry | Author | Martin
Computational Geometry | Summary | Want to know how to …
• Dataset of books
  • Author
  • Book summary
• Reference the data using the document id
6. Term-Based Inverted Index
Value | Column | Document ID
Lutz | Author | Learning Python
Extensive book on … | Summary | Learning Python
Bentley | Author | Programming Pearls
Classic techniques to … | Summary | Programming Pearls
Martin | Author | Computational Geometry
Want to know how to … | Summary | Computational Geometry
• Reference the document id using the value
• Can split up unstructured text to search for specific terms
7. Term-Based Index and Accumulo
• Accumulo partitions data primarily on the row id
• Lexicographic sorting
• Sorting provides a much friendlier way to search data
• Accumulo provides multidimensional storage
  • Row id: term
  • Column family: column name
  • Column qualifier: document id
• Can normalize the data if needed
  • E.g., lower case terms
8. Term-Based Index and Accumulo
Row ID | Column Family | Column Qualifier
bentley | Author | Programming Pearls
book | Summary | Learning Python
classic | Summary | Programming Pearls
extensive | Summary | Learning Python
how | Summary | Computational Geometry
know | Summary | Computational Geometry
lutz | Author | Learning Python
martin | Author | Computational Geometry
on | Summary | Learning Python
techniques | Summary | Programming Pearls
to | Summary | Computational Geometry
to | Summary | Programming Pearls
want | Summary | Computational Geometry
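The index table above can be reproduced with a small in-memory model. This is only a sketch that uses a sorted set in place of an Accumulo table: the class name `TermIndexSketch`, the NUL separator between key parts, and the word-splitting rule are assumptions made for illustration, and the summaries are the truncated strings shown on the example dataset slide.

```java
import java.util.Locale;
import java.util.TreeSet;

class TermIndexSketch {
    // Separator between row id, column family and column qualifier (assumption).
    // NUL sorts before every printable character, so keys order by term first.
    static final char SEP = '\u0000';

    // Adds one index key per distinct lower-cased term found in the value.
    static void index(TreeSet<String> table, String docId, String column, String value) {
        for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                table.add(term + SEP + column + SEP + docId);
            }
        }
    }

    // Builds the index for the three example books.
    static TreeSet<String> buildExample() {
        TreeSet<String> table = new TreeSet<>();
        index(table, "Learning Python", "Author", "Lutz");
        index(table, "Learning Python", "Summary", "Extensive book on");
        index(table, "Programming Pearls", "Author", "Bentley");
        index(table, "Programming Pearls", "Summary", "Classic techniques to");
        index(table, "Computational Geometry", "Author", "Martin");
        index(table, "Computational Geometry", "Summary", "Want to know how to");
        return table;
    }

    public static void main(String[] args) {
        // Prints the keys in sorted order, one tab-separated row per line.
        for (String key : buildExample()) {
            System.out.println(key.replace(SEP, '\t'));
        }
    }
}
```

Because the set is sorted, the rows come out in the same lexicographic order as the table, and the repeated "to" in the last summary collapses into a single key.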
9. Term-Based Index and Accumulo
• Utilize Accumulo’s Scanners to search for terms
// Create the scanner object
Scanner indexScanner = ...
// Restrict the scan to the row for the term we want to search
indexScanner.setRange(new Range("book"));
// Only read index entries that came from the Summary column
indexScanner.fetchColumnFamily(new Text("Summary"));
// Each column qualifier in the results is a matching document id
for (Entry<Key, Value> entry : indexScanner) {
    Text docId = entry.getKey().getColumnQualifier();
    ...
}
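To see why sorted keys make this lookup cheap, the scan can be modeled as a range query on a sorted set. This is a sketch under assumptions, not Accumulo code: `TermScanSketch`, the NUL-separated key layout, and the sample entries are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class TermScanSketch {
    // Separator between term, column family and document id (assumption).
    static final char SEP = '\u0000';

    // Returns the document ids stored under one term (row) and one column family.
    static List<String> lookup(TreeSet<String> table, String term, String family) {
        String start = term + SEP + family + SEP;            // first possible key
        String end = term + SEP + family + (char) (SEP + 1); // just past the last key
        List<String> docIds = new ArrayList<>();
        // subSet seeks directly to the range; keys outside it are never visited.
        for (String key : table.subSet(start, end)) {
            docIds.add(key.substring(start.length()));
        }
        return docIds;
    }

    // A few entries taken from the term-based index table.
    static TreeSet<String> exampleTable() {
        return new TreeSet<>(Arrays.asList(
            "bentley" + SEP + "Author" + SEP + "Programming Pearls",
            "book" + SEP + "Summary" + SEP + "Learning Python",
            "classic" + SEP + "Summary" + SEP + "Programming Pearls",
            "to" + SEP + "Summary" + SEP + "Computational Geometry",
            "to" + SEP + "Summary" + SEP + "Programming Pearls"));
    }

    public static void main(String[] args) {
        System.out.println(lookup(exampleTable(), "to", "Summary"));
    }
}
```

The lookup touches only the keys for the requested term and family, which is the property the scanner relies on when its range is set to a single row.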
10. Term-Based Index and Accumulo
• Can make this even better using locality groups
• Data partitioned by certain column families
• Don’t need to skip over unnecessary columns
• Scan data sequentially
Row ID | Column Family | Column Qualifier
bentley | Author | Programming Pearls
lutz | Author | Learning Python
martin | Author | Computational Geometry
book | Summary | Learning Python
classic | Summary | Programming Pearls
extensive | Summary | Learning Python
… | … | …
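The effect of grouping by column family can be illustrated with a toy model that keeps one sorted set per family. This is a sketch only: `LocalityGroupSketch` and its method names are hypothetical, and it models the storage layout rather than how Accumulo implements locality groups. In Accumulo itself the groups are configured per table, for example through `TableOperations.setLocalityGroups`.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class LocalityGroupSketch {
    // Column family -> sorted "term<TAB>docId" entries; each family models one group.
    final Map<String, TreeSet<String>> groups = new TreeMap<>();

    void put(String term, String family, String docId) {
        groups.computeIfAbsent(family, f -> new TreeSet<>()).add(term + "\t" + docId);
    }

    // Entries read by a full scan of one family when families are stored separately.
    int readWithGroups(String family) {
        TreeSet<String> group = groups.get(family);
        return group == null ? 0 : group.size();
    }

    // Entries the same scan has to pass over when all families are interleaved.
    int readWithoutGroups() {
        int total = 0;
        for (TreeSet<String> group : groups.values()) {
            total += group.size();
        }
        return total;
    }

    // The six rows shown in the table above.
    static LocalityGroupSketch example() {
        LocalityGroupSketch table = new LocalityGroupSketch();
        table.put("bentley", "Author", "Programming Pearls");
        table.put("lutz", "Author", "Learning Python");
        table.put("martin", "Author", "Computational Geometry");
        table.put("book", "Summary", "Learning Python");
        table.put("classic", "Summary", "Programming Pearls");
        table.put("extensive", "Summary", "Learning Python");
        return table;
    }

    public static void main(String[] args) {
        LocalityGroupSketch table = example();
        System.out.println("Author scan with groups: " + table.readWithGroups("Author"));
        System.out.println("Author scan without groups: " + table.readWithoutGroups());
    }
}
```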
11. Problems with Term-Based Indexing
• Term-based indexes are great for single term queries
• Inefficient at multi-term search
  • The terms of a single document could be split over multiple tablets being served by multiple tablet servers
  • Need to do set operations on the client
  • Inefficient use of computer resources and network bandwidth
12. Problems with Term-Based Indexing
• Inefficient at multi-term search
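Search: code book
Result after intersecting on the client: doc1
Tablet 1 returns doc1, doc2 (matches for "book"):
Row | CF | CQ
book | summary | doc1
book | summary | doc2
classic | summary | doc3
Tablet 2 returns doc1 (matches for "code"):
Row | CF | CQ
code | summary | doc1
left | summary | doc2
up | summary | doc3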
• Wasteful to bring doc2 back
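The cost of the client-side approach can be made concrete with a short sketch. The class `ClientIntersectSketch` and its methods are hypothetical; the two result lists stand in for the document ids that the scans for "book" and "code" would return.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ClientIntersectSketch {
    // Document ids present in every per-term result list.
    static Set<String> intersect(List<List<String>> perTermResults) {
        Set<String> result = null;
        for (List<String> docIds : perTermResults) {
            if (result == null) {
                result = new LinkedHashSet<>(docIds);
            } else {
                result.retainAll(docIds);
            }
        }
        if (result == null) {
            return new LinkedHashSet<>();
        }
        return result;
    }

    // Total ids shipped to the client before any intersection happens.
    static int transferred(List<List<String>> perTermResults) {
        int count = 0;
        for (List<String> docIds : perTermResults) {
            count += docIds.size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> results = Arrays.asList(
            Arrays.asList("doc1", "doc2"), // ids returned by the scan for "book"
            Arrays.asList("doc1"));        // ids returned by the scan for "code"
        System.out.println(intersect(results) + " kept, "
            + transferred(results) + " ids transferred");
    }
}
```

Three ids cross the network to produce one match. With common terms and large tables that gap grows, which is the inefficiency this slide points at.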
13. Document Partitioned Index
• Distribute the index by document rather than by term
• All terms for a document are binned together
• Since all the terms are binned together, we can perform set operations on the servers
14. Document Partitioned Index and Accumulo
• Accumulo stores all entries that share a row id on the same tablet
  • Allows us to easily bin a document’s terms
• Accumulo iterators allow us to perform server-side processing
  • Allows us to easily perform set operations
  • IntersectingIterator
16. Multi-Term Search with Document Partitioned Indexes and Accumulo
• Tablet server only returns fully qualified documents
Search: code book
Result: doc1
Tablet holding bin1 returns doc1:
Row | CF | CQ
bin1 | summary=book | doc1
bin1 | summary=code | doc1
Tablet holding bin2 returns <none>:
Row | CF | CQ
bin2 | summary=book | doc2
bin2 | summary=classic | doc3
bin2 | summary=left | doc2
bin2 | summary=up | doc3
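The same search can be modeled with each bin doing its own intersection before anything is returned. This is an in-memory sketch, not Accumulo code: `DocPartitionSketch` and its methods are hypothetical, and the nested maps mirror the row, column family, and column qualifier layout of the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class DocPartitionSketch {
    // Bin (row id) -> term (column family) -> sorted document ids (column qualifiers).
    final Map<String, Map<String, TreeSet<String>>> bins = new TreeMap<>();

    void put(String bin, String term, String docId) {
        bins.computeIfAbsent(bin, b -> new TreeMap<>())
            .computeIfAbsent(term, t -> new TreeSet<>())
            .add(docId);
    }

    // What one bin returns: only the documents that contain every term.
    List<String> searchBin(String bin, String... terms) {
        Map<String, TreeSet<String>> postings = bins.getOrDefault(bin, new TreeMap<>());
        TreeSet<String> hits = null;
        for (String term : terms) {
            TreeSet<String> docIds = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docIds);
            } else {
                hits.retainAll(docIds);
            }
        }
        List<String> result = new ArrayList<>();
        if (hits != null) {
            result.addAll(hits);
        }
        return result;
    }

    // The client only concatenates the per-bin answers.
    List<String> search(String... terms) {
        List<String> all = new ArrayList<>();
        for (String bin : bins.keySet()) {
            all.addAll(searchBin(bin, terms));
        }
        return all;
    }

    // The six entries from the two bins shown above.
    static DocPartitionSketch example() {
        DocPartitionSketch index = new DocPartitionSketch();
        index.put("bin1", "summary=book", "doc1");
        index.put("bin1", "summary=code", "doc1");
        index.put("bin2", "summary=book", "doc2");
        index.put("bin2", "summary=classic", "doc3");
        index.put("bin2", "summary=left", "doc2");
        index.put("bin2", "summary=up", "doc3");
        return index;
    }

    public static void main(String[] args) {
        System.out.println(example().search("summary=code", "summary=book"));
    }
}
```

In this model bin2 contributes nothing to the result, so doc2 never leaves the bin that holds it.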
17. Document Partitioned Index and Accumulo with IntersectingIterators
• IntersectingIterators will check the column families for the
specified terms
// Create the scanner object
BatchScanner indexScanner = ...
// Create the term array
Text[] terms = {new Text("summary=code"),
new Text("summary=book")};
// Set the intersecting iterator
indexScanner.setScanIterators(20,
IntersectingIterator.class.getName(), "ii");
// Set the iterator options
indexScanner.setScanIteratorOption("ii",
IntersectingIterator.columnFamiliesOptionName,
IntersectingIterator.encodeColumns(terms));
18. Document Partitioned Index and Accumulo with IntersectingIterators
• For a basic document partitioned index we want to scan
the entire index table
// Set the range to scan everything
indexScanner.setRanges(Collections.singleton(new Range()));
// Only fully qualified documents will return
for (Entry<Key, Value> entry : indexScanner) {
    Text docId = entry.getKey().getColumnQualifier();
    ...
}
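Within a bin, the document ids for each term are stored in sorted order as column qualifiers, so an intersection can be computed in a single pass over the lists. The sketch below shows that general sorted-merge idea for two terms. It is an illustration only and is not taken from the IntersectingIterator source; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedIntersectSketch {
    // Intersects two lists of document ids that are already in sorted order.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                out.add(a.get(i)); // both terms occur in this document
                i++;
                j++;
            } else if (cmp < 0) {
                i++;               // advance the list that is behind
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = Arrays.asList("doc1", "doc4", "doc7");
        List<String> book = Arrays.asList("doc1", "doc2", "doc7");
        System.out.println(intersect(code, book));
    }
}
```

Each list is read once and nothing is buffered beyond the matches, which is why this kind of work is a good fit for running on the tablet server.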
19. Document Partitioned Index and Accumulo (Bonus)
• Bin id can include space, time, etc.
  • Use the dynamic schema of Accumulo to your advantage
  • Instead of:
    • bin1, bin2, bin3
  • Try out:
    • 2012Q4_book_1, 2012Q4_article_1, 2010Q1_tv_2
    • This includes time and categories
  • Set the BatchScanner’s ranges accordingly
• Avoid using two scanners to query the index table and then the record table
  • Store both the index and record data in the same table
  • Need to correctly format the data and use the FamilyIntersectingIterator
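Encoding time and category in the bin id means a query can be limited to the relevant bins with a prefix range instead of scanning the whole index. The sketch below models that with a sorted set of bin ids. `BinRangeSketch` and `binsWithPrefix` are hypothetical names, and the upper bound assumes bin ids never contain the character U+FFFF.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class BinRangeSketch {
    // Bin ids starting with the prefix, found with one range lookup on the sorted set.
    // Assumes no bin id contains U+FFFF, which is used as the exclusive upper bound.
    static SortedSet<String> binsWithPrefix(TreeSet<String> binIds, String prefix) {
        return binIds.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeSet<String> binIds = new TreeSet<>(Arrays.asList(
            "2012Q4_book_1", "2012Q4_article_1", "2010Q1_tv_2"));
        // Everything indexed for the fourth quarter of 2012.
        System.out.println(binsWithPrefix(binIds, "2012Q4_"));
        // Only the book bins for that quarter.
        System.out.println(binsWithPrefix(binIds, "2012Q4_book_"));
    }
}
```

Putting the time component first keeps bins for the same period adjacent in sort order, so one contiguous range covers them.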
20. Summary
• Term-based inverted index
  • Take the value from the record table and make it the row id in the index table
  • Great at single term queries
  • Bad at multi-term queries
    • Network bandwidth
    • Resources
• Document Partitioned Index
  • Distributing the index by the document will ensure that all terms for a record are served by a single Tablet Server
  • Leverage Iterators to do all the work server-side
  • Great at multi-term queries