The Amino Analytical Framework - Leveraging Accumulo to the Fullest
1. Framework for Big Data Discovery and Analytics
© 2013 42six Solutions, All Rights Reserved, www.42six.com
2. Hadoop MapReduce
• We can look across all our data to
answer questions!
Problem Statement:
Developers can write MapReduce code to analyze data, but don’t know what to look
for; the analysts know what to look for, but don’t know how to write code.
Technology is not the problem. The problem is enabling the analyst to leverage
technology effectively and reuse it.
3. Typical Analyst Workflow:
• I have an entity I want to learn more about
• Everything is indexed by entities
• We can ask questions of Big Data, but they aren’t Big
Questions – we always start with an entity
We should be able to:
• Have a pattern and see entities that match that pattern
• We can ask complex questions of Big Data
4. Naïve Way:
Custom MapReduce job for each question
Amino Way:
Pre-compute features (micro-analytics), the building blocks of questions, and let
analysts mix those on the fly to ask complex questions
The Amino index executes analysts’ complex questions as a real-time scan, which
means less competition for cluster resources and better scalability.
Scales to billions of entities and features
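The contrast between the two approaches can be sketched in a few lines. This is only an illustration of the idea, not Amino's API: the entities, the three micro-analytics, and the `precompute` and `ask` helpers are all hypothetical, and plain Python sets stand in for the real index.

```python
# Hypothetical entities and micro-analytics, for illustration only.
ENTITIES = {
    "stevetouw": {"followers": 150, "tweets_per_day": 4},
    "sam": {"followers": 20, "tweets_per_day": 40},
    "alice": {"followers": 5000, "tweets_per_day": 2},
}


def handle_starts_with(handle, record):
    return ("handle_starts_with", handle[0])


def followers_bucket(handle, record):
    in_range = 10 <= record["followers"] <= 200
    return ("followers", "10-200" if in_range else "other")


def tweets_bucket(handle, record):
    in_range = record["tweets_per_day"] <= 6
    return ("tweets_per_day", "0-6" if in_range else "other")


FEATURES = [handle_starts_with, followers_bucket, tweets_bucket]


def precompute(entities, features):
    """Run every micro-analytic once and index entities by feature."""
    index = {}
    for handle, record in entities.items():
        for feature in features:
            index.setdefault(feature(handle, record), set()).add(handle)
    return index


def ask(index, *criteria):
    """Answer a question by intersecting pre-computed feature buckets."""
    matches = [index.get(c, set()) for c in criteria]
    return set.intersection(*matches) if matches else set()


INDEX = precompute(ENTITIES, FEATURES)
print(sorted(ask(INDEX, ("handle_starts_with", "s"), ("followers", "10-200"))))
```

The expensive pass over the data happens once in `precompute`; each new question is then only a lookup and an intersection.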
6. Amino Framework
Feature Creation API
• Abstracts the complexities of MapReduce
• Focus on logic of the feature/micro-analytic
• Write-once DataLoader for each data source
• Simple and powerful data joins
Amino Index
• AminoOutputFormat
• Bulk Ingest into Accumulo
Query API
• Iterators
8. Benefits
• Data Agnostic
• Not a black box
• Fully scalable
• Crowdsourced micro-analytics
• Inherent cross-datasource linked indexes
• Encourages sharing of knowledge, discovery
• Index built to support machine learning
• Security considered up front – index is in Accumulo
• Built on open source, for open source
9. Feature Creation
• Can join multiple datasets
• Keys are established in the DataLoader
Any external job can output this format, and it will be indexed properly during
indexing jobs.
Notice there’s no key – that’s on purpose!
10. Index Goals
Now that all our features are indexed, let’s let the analysts start building!
• Fast scans
• Highly dimensional scans
• Data compression
• Simple query structure
11. Accumulo Index 1: More Dimensions than Entities
          Row                                    CF             CQ          Value
Schema:   Shard Number:Data Source:Bucket Name   Bucket Value   Hash Salt   Compressed Bitmap
Example:  2:Twitter:handle                       stevetouw      0           010011010010011
JavaEWAH is a word-aligned compressed alternative to the Java BitSet class.
It does not achieve the best compression, but instead improves query
processing time.
Positions in the bit vector represent the features that the entity falls in –
a feature vector
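As a rough sketch of what one entry in this index holds, the snippet below builds the (Row, CF, CQ, Value) tuple for a single entity. It makes several assumptions: a Python integer stands in for the compressed JavaEWAH bitmap, the `FEATURE_POSITIONS` catalogue is invented for the example, and the function names are hypothetical rather than part of Amino.

```python
# Hypothetical catalogue: which bit position each feature occupies.
FEATURE_POSITIONS = {
    ("followers", "10-200"): 0,
    ("tweets_per_day", "0-6"): 1,
    ("handle_starts_with", "s"): 2,
    ("handle_starts_with", "a"): 3,
}


def feature_vector(features):
    """Set one bit per feature the entity falls in (uncompressed)."""
    bits = 0
    for feature in features:
        bits |= 1 << FEATURE_POSITIONS[feature]
    return bits


def index_entry(shard, source, bucket, bucket_value, salt, features):
    """Lay out one entity as (Row, CF, CQ, Value), following the slide."""
    row = f"{shard}:{source}:{bucket}"
    return (row, bucket_value, str(salt), feature_vector(features))


entry = index_entry(
    2, "Twitter", "handle", "stevetouw", 0,
    [("followers", "10-200"), ("handle_starts_with", "s")],
)
print(entry)  # ('2:Twitter:handle', 'stevetouw', '0', 5)
```

Bits 0 and 2 are set, so the value is binary 101. There is one such row per entity, which is why this layout suits data with more dimensions than entities.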
12. At Query Time…
A Bloom filter is built from the lexicographically first and last entity of
each dimension in the query:

Dimension                          First      Last
Number of followers: 10 - 200      aachimba   zzrka
Number of tweets per day: 0 - 6    aaabbb     zyrbb
Handle starts with letter: S       saarba     szaban    (smallest range)

The dimensions map to a query bit vector:
000001001111000101000011100101010011100101
Note: there is an index for every possible value within the range of a
ratio feature
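A minimal sketch of this query strategy follows, under stated assumptions: the five-entity `INDEX` is made up, Python integers replace compressed bitmaps, and "smallest range" is measured by counting the sorted keys between first and last, which is only one plausible way to compare ranges. None of the function names come from Amino.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index: handle -> feature bit vector.
# bit 0 = followers 10-200, bit 1 = tweets/day 0-6, bit 2 = starts with "s".
INDEX = {
    "aachimba": 0b001,
    "saarba": 0b111,
    "stevetouw": 0b111,
    "szaban": 0b100,
    "zzrka": 0b011,
}
HANDLES = sorted(INDEX)


def dimension_range(bit):
    """Lexicographically first and last entity that has this feature."""
    members = [h for h in HANDLES if (INDEX[h] >> bit) & 1]
    return members[0], members[-1]


def range_size(first, last):
    """Number of index keys that a scan from first to last would touch."""
    return bisect_right(HANDLES, last) - bisect_left(HANDLES, first)


def query(bits):
    """Scan only the smallest dimension range, then test every query bit."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    ranges = [dimension_range(bit) for bit in bits]
    first, last = min(ranges, key=lambda r: range_size(*r))
    lo, hi = bisect_left(HANDLES, first), bisect_right(HANDLES, last)
    return [h for h in HANDLES[lo:hi] if (INDEX[h] & mask) == mask]


print(query([0, 1, 2]))  # ['saarba', 'stevetouw']
```

Here the "starts with s" dimension spans only three keys, so the scan touches three rows no matter how many other dimensions the query adds.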
14. What is the Salt For?
          Row                                    CF             CQ          Value
Schema:   Shard Number:Data Source:Bucket Name   Bucket Value   Hash Salt   Compressed Bitmap
Example:  2:Twitter:handle                       stevetouw      0           0100110100100101

Collisions are possible (using a 32 bit vector). The salt is used to hash the
feature indexes, so you need as many matches in the previous step as you have
salts.
We have used 3 salts with 15 billion records and have had no collisions
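The role of the salt can be illustrated as follows. This is a sketch under assumptions, not Amino's hashing scheme: SHA-256 from the standard library stands in for whatever hash Amino uses, `WIDTH` is an arbitrary vector width, the feature strings are invented, and all function names are hypothetical. The point it shows is that a false match would need to collide under every salt at once.

```python
import hashlib

WIDTH = 32        # assumed vector width, chosen for illustration
SALTS = (0, 1, 2)  # three salts, as in the deployment described above


def position(feature, salt):
    """Hash a feature to a bit position; each salt gives a different layout."""
    digest = hashlib.sha256(f"{salt}:{feature}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % WIDTH


def to_mask(features, salt):
    mask = 0
    for feature in features:
        mask |= 1 << position(feature, salt)
    return mask


def salted_vectors(features):
    """One bit vector per salt for a single entity."""
    return {salt: to_mask(features, salt) for salt in SALTS}


def matches(entity_vectors, query_features):
    """An entity matches only if the query bits are set under every salt."""
    for salt in SALTS:
        mask = to_mask(query_features, salt)
        if (entity_vectors[salt] & mask) != mask:
            return False
    return True


vectors = salted_vectors(["followers:10-200", "handle_starts_with:s"])
print(matches(vectors, ["handle_starts_with:s"]))  # True
```

A feature the entity really has always matches under all salts. A feature it lacks can still collide under one salt, but it is much less likely to collide under all three.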
15. Benefits of this Index
• Tables are small, bit vector compression is good, only one row per
entity
• Works great if you have more dimensions than you have entities or
the ranges in your dimensions are good Bloom filters (like “handle
starts with letter …”)
• No matter how many dimensions, the query will always be as fast
as the smallest range
• All processing/boolean logic occurs on the nodes (thanks iterators),
fully scalable
• Represents a feature vector for your entities – great for machine
learning
16. Accumulo Index 2: More Entities than Dimensions
          Row          CF                                  CQ              Value
Schema:   shard:salt   Data Source#Bucket Name#FeatureId   Feature Value   Compressed Bitmap
Example:  2:0          Twitter#handle#123456               s               0100110100101001
123456 could map to feature “Handle starts with letter”
Indexes in the bit vector represent the entities that fall in that
feature
So handle stevetouw could map to index 73 (for salt 0)
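A small sketch of this inverted layout: each (Row, CF, CQ) key holds a bitmap whose set bits are entity positions. The assumptions are that a Python integer replaces the compressed bitmap, that entity positions other than 73 are invented, and that `add_entity` is a hypothetical helper, not an Amino call.

```python
# Entity -> bit position for salt 0. Position 73 is from the slide;
# the other positions are made up for the example.
ENTITY_POSITIONS = {"stevetouw": 73, "saarba": 5, "alice": 12}


def add_entity(index, shard, salt, source, bucket, feature_id, value, handle):
    """Set the entity's bit in the bitmap for one feature value."""
    key = (f"{shard}:{salt}", f"{source}#{bucket}#{feature_id}", value)
    index[key] = index.get(key, 0) | (1 << ENTITY_POSITIONS[handle])
    return key


index = {}
key_s = add_entity(index, 2, 0, "Twitter", "handle", 123456, "s", "stevetouw")
add_entity(index, 2, 0, "Twitter", "handle", 123456, "s", "saarba")
add_entity(index, 2, 0, "Twitter", "handle", 123456, "a", "alice")

print(key_s)                       # ('2:0', 'Twitter#handle#123456', 's')
print((index[key_s] >> 73) & 1)    # 1: stevetouw is in this feature
print((index[key_s] >> 12) & 1)    # 0: alice is not
```

The number of rows now grows with the number of feature values rather than the number of entities, which matches the "more entities than dimensions" case.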
17. That Same Query Again…
Number of followers: 10 – 200 (feature id: 444411)
Number of tweets per day: 0 – 6 (feature id: 555522)
Handle starts with letter: S (feature id: 123456)
The rows for each feature are ORed together, and the per-feature results are
then ANDed:

Row   CF                      CQ    Value
2:0   Twitter#handle#444411   10    0010111011100   (OR)
2:0   Twitter#handle#444411   11    0101010101101   (OR)
……
2:0   Twitter#handle#444411   200   0000001011000   (OR)
AND
2:0   Twitter#handle#555522   0     1111110001101   (OR)
2:0   Twitter#handle#555522   1     1010100000100   (OR)
……
2:0   Twitter#handle#555522   6     1111001010000   (OR)
AND
2:0   Twitter#handle#123456   S     1111110001101
Magic iterator that handles all the boolean logic
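The boolean logic such an iterator performs can be sketched with plain integers as bitmaps. This is an assumption-laden illustration: the four-entity bitmaps are invented (they are not the slide's values), the code runs client-side rather than inside an Accumulo iterator, and `or_range` and `and_all` are hypothetical names.

```python
from functools import reduce
from operator import and_, or_

# Hypothetical index: feature id -> {feature value -> entity bitmap}.
# Four entities occupy bit positions 0 to 3.
INDEX = {
    444411: {10: 0b0011, 50: 0b0100, 5000: 0b1000},  # number of followers
    555522: {0: 0b0001, 3: 0b0100, 40: 0b1010},      # tweets per day
    123456: {"S": 0b0111, "A": 0b1000},              # handle starts with
}


def or_range(feature_id, wanted):
    """OR the bitmaps of every feature value that satisfies the predicate."""
    bitmaps = (bm for value, bm in INDEX[feature_id].items() if wanted(value))
    return reduce(or_, bitmaps, 0)


def and_all(bitmaps):
    """AND the per-feature results into one bitmap of matching entities."""
    return reduce(and_, bitmaps)


result = and_all([
    or_range(444411, lambda v: 10 <= v <= 200),
    or_range(555522, lambda v: 0 <= v <= 6),
    or_range(123456, lambda v: v == "S"),
])
positions = [i for i in range(result.bit_length()) if (result >> i) & 1]
print(bin(result), positions)  # 0b101 [0, 2]
```

Because each step is a bitwise operation on bitmaps, other combinations such as OR across features or NOT fit the same structure.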
19. Convert Indexes to Entities
Row
CF
CQ
shard
Index Position#Data Source#Bucket Name#Salt
Value
Bucket Value
Example:
Row
CF
CQ
2
73#Twitter#handle#0
Value
stevetouw
The iterator scans the rows using a CF filter for the desired indexes.
The iterator ensures it sees the same CQ “# of salts” times before it sends
the resulting CQ back.
Again, this uses the power of iterators, pushing code to the data rather than
doing the salt set operation in the web tier.
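The salt check during reverse lookup can be sketched as a counting step. The assumptions here are significant: the `REVERSE` table is keyed by (index position, salt) as a simplification of the real table layout, all positions except 73 are invented, and the logic runs in plain Python rather than inside an iterator.

```python
from collections import Counter

# Hypothetical reverse lookup: (index position, salt) -> bucket value.
REVERSE = {
    (73, 0): "stevetouw", (18, 1): "stevetouw", (4, 2): "stevetouw",
    (9, 0): "saarba", (30, 1): "saarba", (7, 2): "saarba",
}


def resolve(hits, n_salts):
    """Return only entities that were matched under every salt.

    hits maps each salt to the set of index positions the query matched.
    """
    counts = Counter(
        REVERSE[(position, salt)]
        for salt, found in hits.items()
        for position in found
        if (position, salt) in REVERSE
    )
    return sorted(name for name, seen in counts.items() if seen == n_salts)


# saarba matched under salts 0 and 2 only, so it is treated as a collision.
hits = {0: {73, 9}, 1: {18}, 2: {4, 7}}
print(resolve(hits, n_salts=3))  # ['stevetouw']
```

Doing this count next to the data means only confirmed entities cross the network to the web tier.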
20. Benefits of this Index
• Tables are small, bit vector compression is good
• Works great if you have more entities than you have dimensions
(most likely scenario)
• Affords the ability to do full boolean logic in-iterator, rather than just
ANDs as in the previous index
• All processing/boolean logic occurs on the nodes (thanks iterators),
fully scalable
21. Conclusion
• Amino helps non-technical folk leverage MapReduce cleanly and without
hogging cluster resources
• Accumulo iterators are the reason for the index performance
• Amino is all about sharing and reuse: crowdsource the building blocks and
save analysts’ hypotheses. The more people touch Amino, the smarter
it becomes
• Open source (documentation needs help): https://github.com/aminocloud/amino