2. Motivation
• Big Data is more opaque than small data
– Spreadsheets choke
– BI tools can’t scale
– Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
– Faster “time to answer” on Big Data
– Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product
catalog (i.e. not a consumer search problem)
4. Classic “side system” approach
• Definition of KLUDGE: “a system and
especially a computer system made up of
poorly matched components” –Merriam-Webster
[Diagram: a Hadoop Cluster on one side, a Search system on the other, joined by some unspecified glue ("?????")]
5. Classic “search toolkit”
• Built around fulltext use case
• Inverted Indexes optimized for on-the-fly
ranking of results
– TF-IDF
– Okapi BM-25 (see the sketch after this list)
• Yet never able to fully realize Google-style search capability
• Issues:
– Phrase detection
– Pseudo synonymy
– Open loop architecture
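For reference, a minimal sketch (not from the talk) of the on-the-fly relevance scoring such toolkits are built around; the toy corpus and the k1/b parameter values are illustrative assumptions.

    import math
    from collections import Counter

    # Toy corpus standing in for the fulltext case these toolkits target.
    docs = {
        "d1": "hadoop cluster search index".split(),
        "d2": "search results ranked by relevance".split(),
        "d3": "big data on a hadoop cluster".split(),
    }
    N = len(docs)
    avgdl = sum(len(t) for t in docs.values()) / N
    df = Counter(term for toks in docs.values() for term in set(toks))

    def bm25(query, doc_id, k1=1.5, b=0.75):
        """Okapi BM25 score of one document for a bag-of-words query."""
        toks = docs[doc_id]
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        return score

    # Rank documents for a query on the fly.
    print(sorted(docs, key=lambda d: bm25("hadoop search", d), reverse=True))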
6. Big data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, semi-structured, and denormalized
– Log lines
– JSON records
– CSV files
– Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance-based (see the sketch after this list)
• Sometimes “needle in haystack” (support,
debugging)
• Sometimes “haystack in haystack” (summary
analytics, segmentation)
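To make the contrast concrete, a small hedged sketch of the ad-hoc query pattern meant here: scan denormalized records, filter, and rank explicitly by a field rather than by a relevance score. The file name and field names are invented.

    import json

    def slow_requests(path, min_ms=500, limit=10):
        """Needle-in-haystack ad-hoc query over newline-delimited JSON records."""
        rows = []
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                if rec.get("latency_ms", 0) >= min_ms:
                    rows.append(rec)
        # Explicit ORDER BY latency_ms DESC -- no TF-IDF / BM25 involved.
        rows.sort(key=lambda r: r["latency_ms"], reverse=True)
        return rows[:limit]

    print(slow_requests("requests.json"))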
8. Finer points of Dremel architecture
• MapReduce friendly
• In-Situ approach is DFS friendly
• Excels at aggregation. Not so much for needle-in-
haystack.
• Column storage format accelerates MapReduce
(less extraneous data pushed through)
• But in some regards still a “side system”
• Applications must explicitly store their data in a
columnar format
• “massive” is both a benefit and a hazard
– Complex (operationally and WRT query execution)
– Queries can execute quickly…on huge clusters
9. Crawled In-Situ Index Architecture
[Diagram: the Application writes data to HDFS as usual; a MapReduce crawl job then builds an in-situ SimpleSearch index next to the data inside Hadoop]
10. Benefits to crawled In-Situ index
• No changes to application data format
– CSV
– JSON
– SequenceFile
• Clear “separation of concerns” between data
and index
• Indexes become “disposable”: easily built,
easily thrown away
• There is no “side system” that needs to be
maintained
• Use the MapReduce “hammer” to pound a nail (sketched below)
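A rough sketch of what swinging that hammer could look like, written as a Hadoop Streaming style mapper in Python; the posting format, tokenization, and offset tracking are illustrative assumptions, not the actual indexer described in the talk.

    import os
    import sys

    # Hadoop Streaming style mapper: crawl CSV records where they already
    # live and emit "term \t file,offset" postings; the records themselves
    # are never copied into a separate search system.
    input_file = os.environ.get("mapreduce_map_input_file", "unknown")  # name varies by Hadoop version

    offset = 0                      # byte offset within this mapper's input split
    for line in sys.stdin:
        for field in line.rstrip("\n").split(","):
            token = field.strip().lower()
            if token:
                print("%s\t%s,%d" % (token, input_file, offset))
        offset += len(line)

A companion reducer would simply concatenate the posting lists per term; the resulting index sits next to the data in HDFS and, being cheap to rebuild, can be thrown away at will.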
11. Architect for Elasticity
[Diagram: application data lives in AWS S3; an Elastic MapReduce crawl on EC2 (m1.large instances) builds the index, reading S3 over HTTP via JetS3t]
Interesting: you don’t actually need to have Hadoop installed…
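A minimal sketch of why no Hadoop install is needed on the query side: the in-situ index is just an object reachable over HTTP. The sketch uses boto3 rather than the JetS3t library shown on the slide, and the bucket and key names are invented.

    import boto3

    # The crawl wrote the index as plain objects in S3; any HTTP client can
    # pull them back -- no Hadoop installation required on this side.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="example-crawl-bucket", Key="indexes/part-00000")

    index = {}
    for line in obj["Body"].read().decode("utf-8").splitlines():
        term, postings = line.split("\t", 1)
        index.setdefault(term, []).append(postings)

    print(index.get("athens", []))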
12. Declarative Crawl Indexing
{
  "filter": "column[4]==\"athens\""
}
[Diagram: the Application writes data plus a parse.json instruction file to HDFS; the MapReduce crawl reads parse.json and builds the in-situ SimpleSearch index]
• Indexer reads declarative instructions from in-situ file
• “pull” vs. traditional “push” indexing approach
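A hypothetical, fleshed-out version of that instruction file: only the "filter" field appears on the slide, the other fields are invented to illustrate the pull model.

    import json

    # Hypothetical parse.json written by the application next to its data;
    # only "filter" comes from the slide, the rest is made up.
    instructions = {
        "input": "/data/requests/*.csv",
        "format": "csv",
        "filter": 'column[4]=="athens"',
        "index_columns": [0, 4],
    }
    with open("parse.json", "w") as f:
        json.dump(instructions, f, indent=2)

    # The indexer later "pulls" its configuration from the in-situ file,
    # rather than having the application "push" documents at it.
    with open("parse.json") as f:
        spec = json.load(f)
    print(spec["filter"])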
13. Thin index
[Diagram: a MapReduce crawl over data in HDFS produces a small in-situ index; the data itself stays where it is and is not copied into the index]
• Index size is small because data is a holistic
part of the system
– Data does not need to be “put into” the search system and replicated in the index
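One way to read “thin”: postings hold only coordinates back into the data (file, byte offset, length), never a copy of the record. A minimal sketch with a made-up posting layout and a hypothetical CSV file:

    from collections import defaultdict

    def build_index(path):
        """Thin in-situ index: each posting is a (file, offset, length) pointer."""
        index = defaultdict(list)
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                for token in line.decode("utf-8").lower().split(","):
                    token = token.strip()
                    if token:
                        index[token].append((path, offset, len(line)))
                offset += len(line)
        return index

    def fetch(posting):
        """Follow a posting back to the untouched source record."""
        path, offset, length = posting
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    idx = build_index("events.csv")              # hypothetical data file
    for posting in idx.get("athens", []):
        print(fetch(posting))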
14. Lazy data loading
[Diagram: at query time the execution runtime lazily pulls both data blocks and index blocks from HDFS through an LRU cache]
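A sketch of the lazy-pull idea under stated assumptions: a block of data or index is fetched from the DFS only the first time the execution runtime touches it, and recently used blocks are kept in an LRU cache.

    from functools import lru_cache

    @lru_cache(maxsize=64)
    def pull_block(path, block_no, block_size=4 * 1024 * 1024):
        """Lazily pull one block of data or index; cached after first use."""
        print("fetching %s block %d" % (path, block_no))   # happens once per block
        with open(path, "rb") as f:                        # stand-in for a DFS read
            f.seek(block_no * block_size)
            return f.read(block_size)

    # First call pulls the block; the repeat is served from the LRU cache.
    first = pull_block("events.csv", 0)
    again = pull_block("events.csv", 0)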