Designing analytics for big data
© DataThinks 2013-14

2. Know thy Problem
• Do you have a “Big Data” problem?
– Or do you have a big “data problem”?
3. For Big “Data Problems”
• Popular data sets (e.g., from Amazon, Kaggle, …)
– If it can be downloaded to your laptop,
– If it can be subjected to ad hoc analysis using R or Python,
– If it doesn’t change very often and doesn’t need to be continuously updated,
• Subsets of “Big Data” datasets
– Used to specify a “big data” algorithm exactly
• Run it on your laptop
• Iterate fast
• Domain Knowledge is essential to solving the problem
4. Some Big Data problems (1)
• Recommendations
5. Some Big Data problems (2)
• Financial Analysis
– Really Big Data if we want Real Time analysis
6. Some Big Data problems (3)
• Internet Infrastructure Security Monitoring
7. Other Big Data problems
• Network graph problems (Social Media data)
• Bioinformatics problems (Genomics data)
• Physics/engineering problems (Sensor data)
• …
8. A specific problem: Document Storage
• Website with thousands of pages
– Some pages identical to other pages
– Some pages nearly identical to other pages
• To save storage and enable smart indexing of the collection
– Want to save just one copy of the duplicate pages
– Want to save one copy of the nearly duplicate pages
• To keep large document collection index up to date
– Want to detect content changes quickly, possibly without reading old copies from slow storage
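The slides don't show code for this step, but a common way to detect changes without re-reading old copies is to keep a small content digest per page and compare digests on re-crawl. A minimal sketch (the `index` store, URL, and page text are illustrative, not from the deck):

```python
import hashlib

def digest(page_text):
    """Fixed-size fingerprint of a page's content."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

# Keep only the digest alongside each page's index entry.
index = {"/about": digest("Hello Bulgaria - about us")}

def has_changed(url, new_text):
    """Detect a content change without reading the old copy from slow storage."""
    return index.get(url) != digest(new_text)
```

Comparing two short hex strings replaces re-reading and diffing the full stored page.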
9. Document Storage (pg 2)
• Naïve algorithm
– For every page
• Compare to every other page
– Calculate the “diff” between them
– Find the minimum diff (min-diff)
– Build a graph with nodes as pages and min-diffs as edges
– Prune the graph to decide which nodes to store in entirety
– Store all other nodes as node-ref + min-diff
• Problems with this algorithm?
– Comparison takes O(n²) operations
– Need to keep the entire graph in memory before pruning
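The naïve algorithm can be sketched directly. This toy version uses `difflib` similarity ratios in place of a real diff (the deck doesn't specify the diff representation) and makes the O(n²) pair loop explicit:

```python
import difflib

def naive_min_diffs(pages):
    """For every page, compare to every other page: O(n^2) comparisons."""
    best = {}
    for i, a in enumerate(pages):
        for j, b in enumerate(pages):
            if i == j:
                continue
            sim = difflib.SequenceMatcher(None, a, b).ratio()  # 1.0 = identical
            if i not in best or sim > best[i][1]:
                best[i] = (j, sim)  # closest match so far (the "min-diff" edge)
    return best

pages = ["mary had a little lamb",
         "mary had a little lamb!",
         "yankee doodle went to town"]
closest = naive_min_diffs(pages)
```

Every page must see every other page, which is exactly why this approach fails to scale and why all the min-diff edges must sit in memory before pruning.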
10. Document Storage (pg 3)
• Locality-Sensitive Hashing (LSH) algorithm
– Place each page in zero or more buckets, independent of other pages
– Make storage/diff decisions within a bucket
• Features
– O(n) algorithm
– Can be parallelized
• Example pages and their bucket assignments (each x marks membership in one bucket):
– Mary had a little lamb x
– Little Jack Horner x
– Yankee Doodle went to Town x
– Jack and Jill went up the hill
– Hickory Dickory Dock x
– Mary Lamb's Little Pub x x
– Lil Jack Horner x
– Yankee Doodle was in Town x
– Jack and Jill were holding hands
– Boat of Hickory is Docked x
– Mary had a little lamb x
– Jack's Little Pub x
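A minimal sketch of the bucketing idea, using minhash signatures split into bands; the shingle size, hash choice, and band/row counts here are illustrative, not OpenLSH's actual parameters:

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """k-word shingles as the page's feature set."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(features, num_hashes):
    """One minimum per seeded hash function over the feature set."""
    return [min(int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
                for f in features)
            for seed in range(num_hashes)]

def lsh_buckets(docs, bands=5, rows=4):
    """Pages that agree on all rows of any band land in the same bucket."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    # Only buckets holding 2+ pages produce candidate pairs for the diff step
    return {k: v for k, v in buckets.items() if len(v) > 1}

docs = {"a": "mary had a little lamb its fleece was white as snow",
        "b": "mary had a little lamb its fleece was white as snow",
        "c": "jack and jill went up the hill to fetch a pail of water"}
buckets = lsh_buckets(docs)
```

Each page is bucketed from its own signature alone, with no reference to other pages, which is what makes the pass O(n) and embarrassingly parallel.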
11. LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and
the number of rows per band to balance false
positives/negatives.
– False positives mean examining more pairs that are not really similar: more processing resources, more time.
– False negatives mean similar pairs go unexamined, so not all similar results are found. But the job finishes faster!
12. LSH Tradeoff Example
• If we had fewer than 20 bands (and more rows per band),
– fewer pairs would be selected for comparison,
– the number of false positives would go down,
– but the number of false negatives would go up:
– performance would improve, but so would the error rate!
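The tradeoff can be made concrete with the standard LSH candidate-pair formula: with b bands of r rows, two pages of Jaccard similarity s collide in at least one band with probability 1 − (1 − sʳ)ᵇ. A quick sketch (the 20-band figure comes from the slide; the row splits and similarity values are illustrative):

```python
def candidate_prob(s, bands, rows):
    """Probability that two pages with Jaccard similarity s become a candidate pair."""
    return 1.0 - (1.0 - s ** rows) ** bands

# The same 100 minhashes split two ways
wide = candidate_prob(0.6, bands=20, rows=5)     # many short bands
narrow = candidate_prob(0.6, bands=10, rows=10)  # fewer, longer bands
```

Moving from 20 bands to 10 (with more rows per band) lowers the collision probability at every similarity level: fewer pairs examined, fewer false positives, more false negatives.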
13. Summary
• Mine the data and place members into hash buckets
• When you need to find a match, hash it; possible nearest neighbors will be in one of its buckets.
• Algorithm performance O(n)
• Our implementation is designed to run on a MapReduce architecture
– About 3 secs / document,
– As many processors as required
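The match-lookup step above can be sketched as a toy bucket index; the band keys here are arbitrary strings standing in for hashed band signatures, and the doc ids are made up:

```python
from collections import defaultdict

bucket_index = defaultdict(set)  # band key -> ids of docs hashed there

def index_doc(doc_id, band_keys):
    """Mining pass: record which buckets each document falls into."""
    for key in band_keys:
        bucket_index[key].add(doc_id)

def candidates(band_keys):
    """Query pass: the union of the query's buckets holds every possible match."""
    found = set()
    for key in band_keys:
        found |= bucket_index[key]
    return found

index_doc("d1", ["band0:17", "band1:42"])
index_doc("d2", ["band0:99", "band1:42"])
index_doc("d3", ["band0:55", "band1:07"])
```

Only the documents sharing a bucket with the query need a full comparison; everything else is skipped without ever being read.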
14. Initial OpenLSH successes
• We started OpenLSH to provide a framework for LSH
• Organize multiple stages of the LSH pipeline as
asynchronous elements
– Don’t need the previous stage complete to begin the next
– Make each stage as configurable as possible
• Demonstrate results
– Tweets from Twitter API to find “similar tweets”
15. Allow a focus on unique tweets by…
• …eliminating similar tweets:
• score: 1.0
– RT @googoo كُنْ بَسيطا ، تَلفَت الأنظَار إليكْ .. فِي عَالَمْ امتلأ تَعقيداً :$ ! : 255 (Arabic: “Be simple, and you will draw attention, in a world filled with complexity”)
– RT @googoo كُنْ بَسيطا ، تَلفَت الأنظَار إليكْ .. فِي عَالَمْ امتلأ تَعقيداً :$ ! : 255
• score: 0.75
– Italian Gold Omega Necklace 14K Pm me if interested Happy Shopping http://t.co/cgjdGpKvjK
– Italian Gold Omega Necklace 14K Pm me if interested Happy Shopping http://t.co/UZYbx1bT4K
• score: 0.448275862069
– NP on #Roots103 - 16 LOVING YOU: - Listen Now at http://t.co/0DK1u9SGyn or Download App - http://t.co/rdNJIvTzVH
– NP on #Talk105 - The Brukfoot Show 20120523 (ft. Mr. Vegas): Listen Now at http://t.co/0DK1u9SGyn or Download App - http://t.co/rdNJIvTzVH
• score: 0.375
– RT @JessicaMillaAg: Gaya Kamar Remaja Masa Kini (Indonesian: “Today’s Teen Bedroom Style”) - Smart Modern Style Teen Bedroom Design Ideas inspiration http://t.co/oD6tvUjFL2
– RT @Nabilah88_Jkt48: Gaya Kamar Remaja Masa Kini - Awesome Fun and cheerful Teen Bedroom Design Ideas inspiration http://t.co/QZRruK5q0I
16. More recent OpenLSH successes
• Apply OpenLSH to detect near-identical documents in Peerbelt, a passive, user-behavior-driven content prioritization and search engine
– Goal is to eliminate “similar documents” from search results
17. OpenLSH Results with “Hello Bulgaria” website
• Working with a 2,000-web-page collection,
– We obtain 10 buckets with 75 distinct “near duplicate” pages
– Some pages fall into multiple buckets,
– Diagramming distances between them…
18. About the Implementation
• Programming language: Python
• Operating Environment: Google App Engine
– Chosen because of minimal operational headaches
– Chosen for easy integration with Map/Reduce
– Can employ multiple machines when needed
• Being ported to
– Other Cloud Environments
– A variety of data sources, e.g., MongoDB, Cassandra, …
19. Using OpenLSH
• We’re looking for one or two more interesting use cases
– Application areas:
• Near de-duplication (covered with Peerbelt’s data)
• Stocks that move independent of the herd
• Filtering “unique stories”
• Contact us to discuss
• OpenLSH Source Repository:
– https://github.com/singhj/locality-sensitive-hashing
20. Know thy needs
• For Big “Data Problems”
– About the Data:
• Data Schema
– About the Algorithms:
• What they do
• For “Big Data” Problems
– About the Data
• Data Schema
• Storage layout
– About the Algorithms
• What they do
• How they work
– What if the temporary data structures don’t fit in memory?
– Parallelizable?
– Order: O(n)? O(n²)?
21. Thank you
• J Singh
– Principal, DataThinks
• j.singh@datathinks.org
• @singh_j
• http://www.slideshare.net/j_singh
• https://github.com/singhj/
• Adj. Prof, WPI
• DataThinks.org
– Focused on deep analytics and “big data” problems