Text Indexing in Accumulo

TEXT INDEXING WITH ACCUMULO
Efficient searching in a big data world

Tomer Kishoni
March 21, 2012

Agenda
•  Problem Statement

•  Term-Based Inverted Index

•  Term-Based Inverted Index and Accumulo

•  Document Partitioned Index

•  Document Partitioned Index and Accumulo

Problem
•  How can we efficiently search for information in a big data
world?
•  Processing time
•  Network bandwidth

•  How can we leverage Accumulo’s feature set to create
efficient search patterns?

Focus on Indexing
•  Indexing your data is a great place to start

•  Let’s focus on:
•  Term-based inverted index
•  Great for single term search

•  Document partitioned index
•  Great for multiple term search

Example Dataset
Document ID             Column    Value
Learning Python         Author    Lutz
Learning Python         Summary   Extensive book on …
Programming Pearls      Author    Bentley
Programming Pearls      Summary   Classic techniques to …
Computational Geometry  Author    Martin
Computational Geometry  Summary   Want to know how to …

•  Dataset of books
•  Author
•  Book summary

•  Reference the data using the document id

Term-Based Inverted Index
Value                    Column    Document ID
Lutz                     Author    Learning Python
Extensive book on …      Summary   Learning Python
Bentley                  Author    Programming Pearls
Classic techniques to …  Summary   Programming Pearls
Martin                   Author    Computational Geometry
Want to know how to …    Summary   Computational Geometry

•  Reference the document id using the value

•  Can split up unstructured text to search for specific terms
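
Splitting a value into terms can be sketched in plain Java as below. This is a minimal illustration, not code from the talk: the class `TermIndexSketch`, its `Entry` type, and the tokenizing rule (split on non-alphanumeric characters, then lower-case) are assumptions, and only the visible words of the elided summary are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TermIndexSketch {
    /** One index entry: the term, the column it came from, and the owning document. */
    static final class Entry {
        final String term;
        final String column;
        final String docId;

        Entry(String term, String column, String docId) {
            this.term = term;
            this.column = column;
            this.docId = docId;
        }

        @Override
        public String toString() {
            return term + " | " + column + " | " + docId;
        }
    }

    /** Splits a value on non-alphanumeric characters and lower-cases each term. */
    static List<Entry> indexValue(String docId, String column, String value) {
        List<Entry> entries = new ArrayList<>();
        for (String token : value.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {   // a leading separator yields one empty token
                entries.add(new Entry(token.toLowerCase(Locale.ROOT), column, docId));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        for (Entry e : indexValue("Programming Pearls", "Summary", "Classic techniques to")) {
            System.out.println(e);
        }
    }
}
```

Each emitted entry corresponds to one row of the index: the term points back to the document id it came from.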

Term-Based Index and Accumulo
•  Accumulo partitions data primarily on the row id
•  Lexicographic sorting
•  Sorting provides a much friendlier way to search data
•  Accumulo provides multidimensional storage
•  Row id → term
•  Column family → column name
•  Column qualifier → document id

•  Can normalize the data if needed
•  E.g., lower case terms

Row ID      Column Family  Column Qualifier
bentley     Author         Programming Pearls
book        Summary        Learning Python
classic     Summary        Programming Pearls
extensive   Summary        Learning Python
how         Summary        Computational Geometry
know        Summary        Computational Geometry
lutz        Author         Learning Python
martin      Author         Computational Geometry
on          Summary        Learning Python
techniques  Summary        Programming Pearls
to          Summary        Computational Geometry
to          Summary        Programming Pearls
want        Summary        Computational Geometry

•  Utilize Accumulo’s Scanners to search for terms
// Create the scanner object
Scanner indexScanner = ...

// Restrict the scan to the row of the term we want to search
indexScanner.setRange(new Range("book"));
indexScanner.fetchColumnFamily(new Text("Summary"));

// Get the index results
for (Entry<Key, Value> entry : indexScanner) {
    // The column qualifier holds the document id
    Text docId = entry.getKey().getColumnQualifier();
    ...
}
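
Why sorted keys make this lookup cheap can be modelled without Accumulo. The sketch below is only an in-memory stand-in: a `TreeSet` of strings plays the role of the sorted table, and the class `SortedScanSketch`, the NUL-separated key encoding, and the helper names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class SortedScanSketch {
    // Model a key as "row\0family\0qualifier". NUL sorts before every other
    // character, so string order matches (row, family, qualifier) order as
    // long as the parts themselves contain no NUL.
    static String key(String row, String family, String qualifier) {
        return row + '\0' + family + '\0' + qualifier;
    }

    /** A few rows from the example index table, kept sorted by the TreeSet. */
    static TreeSet<String> sampleIndex() {
        TreeSet<String> index = new TreeSet<>();
        index.add(key("bentley", "Author", "Programming Pearls"));
        index.add(key("book", "Summary", "Learning Python"));
        index.add(key("classic", "Summary", "Programming Pearls"));
        index.add(key("lutz", "Author", "Learning Python"));
        index.add(key("to", "Summary", "Computational Geometry"));
        index.add(key("to", "Summary", "Programming Pearls"));
        return index;
    }

    /** Returns the qualifiers (document ids) stored under one row and column family. */
    static List<String> lookup(TreeSet<String> index, String row, String family) {
        String start = key(row, family, "");           // smallest key with this prefix
        String end = row + '\0' + family + '\u0001';   // first string past the prefix
        List<String> docIds = new ArrayList<>();
        for (String k : index.subSet(start, true, end, false)) {
            docIds.add(k.substring(start.length()));
        }
        return docIds;
    }

    public static void main(String[] args) {
        System.out.println(lookup(sampleIndex(), "to", "Summary"));
    }
}
```

Because all keys for a term are adjacent in sort order, the lookup touches only the matching entries instead of walking the whole set.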

•  Can make this even better using locality groups
•  Data partitioned by certain column families
•  Don’t need to skip over unnecessary columns
•  Scan data sequentially

bentley    Author   Programming Pearls
lutz       Author   Learning Python
martin     Author   Computational Geometry
book       Summary  Learning Python
classic    Summary  Programming Pearls
extensive  Summary  Learning Python
…          …        …
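
The benefit of grouping by column family can be illustrated with a small count of entries read. This is a simplified in-memory model and does not reflect how Accumulo lays out its files; the class `LocalityGroupSketch` and its helpers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LocalityGroupSketch {
    /** Index entries as {row, family, qualifier}, sorted by row in a single run. */
    static List<String[]> sampleEntries() {
        return Arrays.asList(
            new String[] {"bentley", "Author", "Programming Pearls"},
            new String[] {"book", "Summary", "Learning Python"},
            new String[] {"classic", "Summary", "Programming Pearls"},
            new String[] {"extensive", "Summary", "Learning Python"},
            new String[] {"lutz", "Author", "Learning Python"},
            new String[] {"martin", "Author", "Computational Geometry"});
    }

    /** Splits the entries into one sorted run per column family. */
    static Map<String, List<String[]>> groupByFamily(List<String[]> entries) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] e : entries) {
            groups.computeIfAbsent(e[1], f -> new ArrayList<>()).add(e);
        }
        return groups;
    }

    /** Scans a run for one family, collects matching rows, and returns the entries read. */
    static int scan(List<String[]> run, String family, List<String> rowsOut) {
        int read = 0;
        for (String[] e : run) {
            read++;                          // every entry in the run is read
            if (e[1].equals(family)) {
                rowsOut.add(e[0]);           // non-matching families are skipped
            }
        }
        return read;
    }

    public static void main(String[] args) {
        List<String[]> all = sampleEntries();
        List<String> rows = new ArrayList<>();
        System.out.println("single run, entries read: " + scan(all, "Author", rows));
        rows.clear();
        List<String[]> authorRun = groupByFamily(all).get("Author");
        System.out.println("grouped run, entries read: " + scan(authorRun, "Author", rows));
    }
}
```

Both scans find the same rows; the grouped scan simply never reads the Summary entries.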

Problems with Term-Based Indexing
•  Term-based indexes are great for single term queries

•  Inefficient at multi-term search
•  The terms of a single document could be split over multiple tablets
being served by multiple tablet servers
•  Need to do set operations on the client
•  Inefficient use of computer resources and network bandwidth

Problems with Term-Based Indexing
•  Inefficient at multi-term search

Search: "code book"  →  Result: doc1

Returned: doc1, doc2             Returned: doc1

Row      CF       CQ             Row   CF       CQ
book     summary  doc1           code  summary  doc1
book     summary  doc2           left  summary  doc2
classic  summary  doc3           up    summary  doc3

•  Wasteful to bring doc2 back
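
The cost of client-side set operations can be made concrete with the example above. The sketch below models the term index as an in-memory map and counts how many document ids cross the network before the client intersects them; the class `ClientSideIntersectSketch` and its `Result` type are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ClientSideIntersectSketch {
    static final class Result {
        final Set<String> docs;      // documents that contain every term
        final int idsTransferred;    // document ids shipped to the client

        Result(Set<String> docs, int idsTransferred) {
            this.docs = docs;
            this.idsTransferred = idsTransferred;
        }
    }

    /** Term -> document ids, following the two index tablets in the example. */
    static Map<String, Set<String>> sampleIndex() {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("book", new TreeSet<>(Arrays.asList("doc1", "doc2")));
        index.put("classic", new TreeSet<>(Arrays.asList("doc3")));
        index.put("code", new TreeSet<>(Arrays.asList("doc1")));
        index.put("left", new TreeSet<>(Arrays.asList("doc2")));
        index.put("up", new TreeSet<>(Arrays.asList("doc3")));
        return index;
    }

    /** Fetches the full posting list of each term, then intersects on the client. */
    static Result search(Map<String, Set<String>> index, List<String> terms) {
        Set<String> docs = null;
        int transferred = 0;
        for (String term : terms) {
            Set<String> postings = index.getOrDefault(term, new TreeSet<String>());
            transferred += postings.size();      // everything comes back, match or not
            if (docs == null) {
                docs = new TreeSet<>(postings);
            } else {
                docs.retainAll(postings);
            }
        }
        return new Result(docs == null ? new TreeSet<String>() : docs, transferred);
    }

    public static void main(String[] args) {
        Result r = search(sampleIndex(), Arrays.asList("code", "book"));
        System.out.println(r.docs + " after transferring " + r.idsTransferred + " ids");
    }
}
```

The gap between ids transferred and ids kept grows with the size of each term's posting list, which is the waste the slide points at.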

Document Partitioned Index
•  Distributing the index by the document rather than the
term

•  All terms for a document are binned together

•  Since all the terms are binned together we can perform
set operations on the servers

Document Partitioned Index and Accumulo
•  Accumulo stores all data on the same tablet if the key has
the same row id
•  Allows us to easily bin a document’s terms

•  Accumulo iterators allow us to perform server-side
processing
•  Allows us to easily perform set operations
•  IntersectingIterator

Accumulo
Row ID  Column Family       Column Qualifier
bin1    Author=bentley      Programming Pearls
bin1    Author=lutz         Learning Python
bin1    Summary=book        Learning Python
bin1    Summary=classic     Programming Pearls
bin1    Summary=extensive   Learning Python
bin1    Summary=on          Learning Python
bin1    Summary=techniques  Programming Pearls
bin1    Summary=to          Programming Pearls
bin2    Author=martin       Computational Geometry
bin2    Summary=to          Computational Geometry
bin2    Summary=want        Computational Geometry
bin2    Summary=how         Computational Geometry
bin2    Summary=know        Computational Geometry
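
Producing entries in this layout can be sketched as follows. The talk does not say how documents are assigned to bins, so hashing the document id is an assumption here (and will not necessarily reproduce the bin1/bin2 split shown above); `DocPartitionSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DocPartitionSketch {
    /** Assumed bin assignment: hash the document id into one of numBins bins. */
    static String binFor(String docId, int numBins) {
        return "bin" + (Math.floorMod(docId.hashCode(), numBins) + 1);
    }

    /** Emits {row, family, qualifier} = {bin, "column=term", docId} for each term. */
    static List<String[]> entries(String docId, String column, String value, int numBins) {
        List<String[]> out = new ArrayList<>();
        String bin = binFor(docId, numBins);   // one bin per document, shared by all its terms
        for (String token : value.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {
                String family = column + "=" + token.toLowerCase(Locale.ROOT);
                out.add(new String[] {bin, family, docId});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] e : entries("Programming Pearls", "Summary", "Classic techniques to", 2)) {
            System.out.println(e[0] + "  " + e[1] + "  " + e[2]);
        }
    }
}
```

The property that matters is that every term of a document lands in the same row, whatever rule picks the bin.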

Multi-Term Search with Document Partitioned Indexes and Accumulo
•  Tablet server only returns fully qualified documents

Search: "code book"  →  Result: doc1

Returned: doc1                   Returned: <none>

Row   CF            CQ           Row   CF               CQ
bin1  summary=book  doc1         bin2  summary=book     doc2
bin1  summary=code  doc1         bin2  summary=classic  doc3
                                 bin2  summary=left     doc2
                                 bin2  summary=up       doc3
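
The per-bin intersection in this example can be modelled in plain Java. This is a simplified stand-in for server-side processing, not the IntersectingIterator's actual algorithm; the class `ServerSideIntersectSketch` and the nested-map representation of bins are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ServerSideIntersectSketch {
    /** bin -> (term -> document ids), following the two bins in the example. */
    static Map<String, Map<String, Set<String>>> sampleBins() {
        Map<String, Map<String, Set<String>>> bins = new TreeMap<>();
        add(bins, "bin1", "summary=book", "doc1");
        add(bins, "bin1", "summary=code", "doc1");
        add(bins, "bin2", "summary=book", "doc2");
        add(bins, "bin2", "summary=classic", "doc3");
        add(bins, "bin2", "summary=left", "doc2");
        add(bins, "bin2", "summary=up", "doc3");
        return bins;
    }

    static void add(Map<String, Map<String, Set<String>>> bins,
                    String bin, String term, String docId) {
        bins.computeIfAbsent(bin, b -> new TreeMap<>())
            .computeIfAbsent(term, t -> new TreeSet<>())
            .add(docId);
    }

    /** Runs inside one bin: keeps only documents that carry every term. */
    static Set<String> intersectInBin(Map<String, Set<String>> bin, List<String> terms) {
        Set<String> docs = null;
        for (String term : terms) {
            Set<String> postings = bin.getOrDefault(term, new TreeSet<String>());
            if (docs == null) {
                docs = new TreeSet<>(postings);
            } else {
                docs.retainAll(postings);
            }
        }
        return docs == null ? new TreeSet<String>() : docs;
    }

    /** Each bin answers independently; only the per-bin matches reach the client. */
    static Map<String, Set<String>> search(Map<String, Map<String, Set<String>>> bins,
                                           List<String> terms) {
        Map<String, Set<String>> perBin = new TreeMap<>();
        for (Map.Entry<String, Map<String, Set<String>>> e : bins.entrySet()) {
            perBin.put(e.getKey(), intersectInBin(e.getValue(), terms));
        }
        return perBin;
    }

    public static void main(String[] args) {
        System.out.println(search(sampleBins(), Arrays.asList("summary=code", "summary=book")));
    }
}
```

Since a document's terms never span bins, no cross-bin merge is needed and doc2 is filtered out before anything is sent back.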

Accumulo with IntersectingIterators
•  IntersectingIterators will check the column families for the
specified terms
// Create the scanner object
BatchScanner indexScanner = ...

// Create the term array
Text[] terms = {new Text("summary=code"),
                new Text("summary=book")};

// Set the intersecting iterator
indexScanner.setScanIterators(20,
    IntersectingIterator.class.getName(), "ii");

// Set the iterator options
indexScanner.setScanIteratorOptions("ii",
    IntersectingIterator.columnFamiliesOptionName,
    IntersectingIterator.encodeColumns(terms));

Accumulo with IntersectingIterators
•  For a basic document partitioned index we want to scan
the entire index table
// Set the range to scan everything
indexScanner.setRanges(Collections.singleton(new Range()));

// Only fully qualified documents will return
for (Entry<Key, Value> entry : indexScanner) {
    Text docId = entry.getKey().getColumnQualifier();
    ...
}

Accumulo (Bonus)
•  Bin id can include space, time, etc.
•  Use the dynamic schema of Accumulo to your advantage
•  Instead of:
•  bin1, bin2, bin3
•  Try out:
•  2012Q4_book_1, 2012Q4_article_1, 2010Q1_tv_2
•  This includes time and categories
•  Set the BatchScanner’s ranges accordingly
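
Composing such bin ids and narrowing a query to a subset of them can be sketched as below. The id format follows the examples on this slide, while the class `BinIdSketch`, its method names, and the prefix-selection rule are assumptions; in a real query each selected bin would still have to be turned into a scan range.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

final class BinIdSketch {
    /** Builds ids such as 2012Q4_book_1 from time, category, and shard number. */
    static String binId(int year, int quarter, String category, int shard) {
        return year + "Q" + quarter + "_" + category + "_" + shard;
    }

    /** Selects the known bin ids that start with the given prefix. */
    static SortedSet<String> binsWithPrefix(TreeSet<String> bins, String prefix) {
        // Ids sharing the prefix are contiguous in sorted order, so a sub-range suffices.
        return bins.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        TreeSet<String> bins = new TreeSet<>(Arrays.asList(
            binId(2012, 4, "book", 1),
            binId(2012, 4, "article", 1),
            binId(2010, 1, "tv", 2)));
        System.out.println(binsWithPrefix(bins, "2012Q4_"));
    }
}
```

Putting time first in the id means a time-bounded query maps to one contiguous span of rows, and adding the category narrows it further.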

•  Avoid using two scanners to query the index table and
then the record table
•  Store both the index and record data in the same table
•  Need to correctly format the data and use the
FamilyIntersectingIterator

Summary
•  Term-based inverted index
•  Take the value from the record table and make it the row id in the
index table
•  Great at single term queries
•  Bad at multi-term queries
•  Network bandwidth
•  Resources

•  Document Partitioned Index
•  Distributing the index by the document will ensure that all terms for
a record are served by a single Tablet Server
•  Leverage Iterators to do all the work server-side
•  Great at multi-term queries
