Designing analytics for big data
© DataThinks 2013-14

2. Know thy Problem
• Do you have a “Big Data” problem?
– Or do you have a big “data problem”?
3. For Big “Data Problems”
• Popular data sets (e.g., from Amazon, Kaggle, …)
– If it can be downloaded to your laptop,
– If it can be subjected to ad hoc analysis using R or Python,
– If it doesn’t change very often and doesn’t need to be continuously updated,
• Subsets of “Big Data” datasets
– Used to specify a “big data” algorithm exactly
• Run it on your laptop
• Iterate fast
• Domain Knowledge is essential to solving the problem
4. Some Big Data problems (1)
• Recommendations
5. Some Big Data problems (2)
• Financial Analysis
– Really Big Data if we want Real Time analysis
6. Some Big Data problems (3)
• Internet Infrastructure Security Monitoring
7. Other Big Data problems
• Network graph problems (Social Media data)
• Bioinformatics problems (Genomics data)
• Physics/engineering problems (Sensor data)
• …
8. A specific problem: Document Storage
• Website with thousands of pages
– Some pages identical to other pages
– Some pages nearly identical to other pages
• To save storage and enable smart indexing of the collection
– Want to save just one copy of the duplicate pages
– Want to save one copy of the nearly duplicate pages
• To keep large document collection index up to date
– Want to detect content changes quickly, possibly without reading old copies from slow storage
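The slides don't show code for this step, but a common way to detect changes without re-reading old copies is to keep a small content digest per page and compare digests on re-crawl. A minimal sketch (the `index` store, URL, and page text are illustrative, not from the deck):

```python
import hashlib

def digest(page_text):
    """Fixed-size fingerprint of a page's content."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

# Keep only the digest alongside each page's index entry.
index = {"/about": digest("Hello Bulgaria - about us")}

def has_changed(url, new_text):
    """Detect a content change without reading the old copy from slow storage."""
    return index.get(url) != digest(new_text)
```

Comparing two short hex strings replaces re-reading and diffing the full stored page.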
9. Document Storage (pg 2)
• Naïve algorithm
– For every page
• Compare to every other page
– Calculate the “diff” between them
– Find the minimum diff (min-diff)
– Build a graph with nodes as pages and min-diffs as edges
– Prune the graph to decide which nodes to store in entirety
– Store all other nodes as node-ref + min-diff
• Problems with this algorithm?
– Comparison takes O(n²) operations
– Need to keep the entire graph in memory before pruning
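The naïve algorithm can be sketched directly. This toy version uses `difflib` similarity ratios in place of a real diff (the deck doesn't specify the diff representation) and makes the O(n²) pair loop explicit:

```python
import difflib

def naive_min_diffs(pages):
    """For every page, compare to every other page: O(n^2) comparisons."""
    best = {}
    for i, a in enumerate(pages):
        for j, b in enumerate(pages):
            if i == j:
                continue
            sim = difflib.SequenceMatcher(None, a, b).ratio()  # 1.0 = identical
            if i not in best or sim > best[i][1]:
                best[i] = (j, sim)  # closest match so far (the "min-diff" edge)
    return best

pages = ["mary had a little lamb",
         "mary had a little lamb!",
         "yankee doodle went to town"]
closest = naive_min_diffs(pages)
```

Every page must see every other page, which is exactly why this approach fails to scale and why all the min-diff edges must sit in memory before pruning.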
10. Document Storage (pg 3)
• Locality-Sensitive Hashing (LSH) algorithm
– Place each page in zero or more buckets, independent of other pages
– Make storage/diff decisions within a bucket
• Features
– O(n) algorithm
– Can be parallelized
• Example pages and their bucket assignments (each x marks membership in one bucket):
– Mary had a little lamb x
– Little Jack Horner x
– Yankee Doodle went to Town x
– Jack and Jill went up the hill
– Hickory Dickory Dock x
– Mary Lamb's Little Pub x x
– Lil Jack Horner x
– Yankee Doodle was in Town x
– Jack and Jill were holding hands
– Boat of Hickory is Docked x
– Mary had a little lamb x
– Jack's Little Pub x
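A minimal sketch of the bucketing idea, using minhash signatures split into bands; the shingle size, hash choice, and band/row counts here are illustrative, not OpenLSH's actual parameters:

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """k-word shingles as the page's feature set."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(features, num_hashes):
    """One minimum per seeded hash function over the feature set."""
    return [min(int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
                for f in features)
            for seed in range(num_hashes)]

def lsh_buckets(docs, bands=5, rows=4):
    """Pages that agree on all rows of any band land in the same bucket."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    # Only buckets holding 2+ pages produce candidate pairs for the diff step
    return {k: v for k, v in buckets.items() if len(v) > 1}

docs = {"a": "mary had a little lamb its fleece was white as snow",
        "b": "mary had a little lamb its fleece was white as snow",
        "c": "jack and jill went up the hill to fetch a pail of water"}
buckets = lsh_buckets(docs)
```

Each page is bucketed from its own signature alone, with no reference to other pages, which is what makes the pass O(n) and embarrassingly parallel.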
11. LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and
the number of rows per band to balance false
positives/negatives.
– False positives mean examining more pairs that are not really similar: more processing resources, more time.
– False negatives mean similar pairs go unexamined, so not all similar results are found. But the job finishes faster!
12. LSH Tradeoff Example
• If we had fewer than 20 bands (and more rows per band),
– fewer pairs would be selected for comparison,
– the number of false positives would go down,
– but the number of false negatives would go up:
– performance would improve, but so would the error rate!
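The tradeoff can be made concrete with the standard LSH candidate-pair formula: with b bands of r rows, two pages of Jaccard similarity s collide in at least one band with probability 1 − (1 − sʳ)ᵇ. A quick sketch (the 20-band figure comes from the slide; the row splits and similarity values are illustrative):

```python
def candidate_prob(s, bands, rows):
    """Probability that two pages with Jaccard similarity s become a candidate pair."""
    return 1.0 - (1.0 - s ** rows) ** bands

# The same 100 minhashes split two ways
wide = candidate_prob(0.6, bands=20, rows=5)     # many short bands
narrow = candidate_prob(0.6, bands=10, rows=10)  # fewer, longer bands
```

Moving from 20 bands to 10 (with more rows per band) lowers the collision probability at every similarity level: fewer pairs examined, fewer false positives, more false negatives.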
13. Summary
• Mine the data and place members into hash buckets
• When you need to find a match, hash it; possible nearest neighbors will be in one of its buckets.
• Algorithm performance O(n)
• Our implementation is designed to run on a MapReduce architecture
– About 3 secs / document,
– As many processors as required
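The match-lookup step above can be sketched as a toy bucket index; the band keys here are arbitrary strings standing in for hashed band signatures, and the doc ids are made up:

```python
from collections import defaultdict

bucket_index = defaultdict(set)  # band key -> ids of docs hashed there

def index_doc(doc_id, band_keys):
    """Mining pass: record which buckets each document falls into."""
    for key in band_keys:
        bucket_index[key].add(doc_id)

def candidates(band_keys):
    """Query pass: the union of the query's buckets holds every possible match."""
    found = set()
    for key in band_keys:
        found |= bucket_index[key]
    return found

index_doc("d1", ["band0:17", "band1:42"])
index_doc("d2", ["band0:99", "band1:42"])
index_doc("d3", ["band0:55", "band1:07"])
```

Only the documents sharing a bucket with the query need a full comparison; everything else is skipped without ever being read.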
14. Initial OpenLSH successes
• We started OpenLSH to provide a framework for LSH
• Organize multiple stages of the LSH pipeline as
asynchronous elements
– Don’t need the previous stage complete to begin the next
– Make each stage as configurable as possible
• Demonstrate results
– Tweets from Twitter API to find “similar tweets”
15. Allow a focus on unique tweets by…
• …eliminating similar tweets:
• score: 1.0
– RT @googoo كُنْ بَسيطا ، تَلفَت الأنظَار إليكْ .. فِي عَالَمْ امتلأ تَعقيداً :$ ! : 255 (Arabic: “Be simple, and you will draw attention, in a world filled with complexity”)
– RT @googoo كُنْ بَسيطا ، تَلفَت الأنظَار إليكْ .. فِي عَالَمْ امتلأ تَعقيداً :$ ! : 255
• score: 0.75
– Italian Gold Omega Necklace 14K Pm me if interested Happy Shopping http://t.co/cgjdGpKvjK
– Italian Gold Omega Necklace 14K Pm me if interested Happy Shopping http://t.co/UZYbx1bT4K
• score: 0.448275862069
– NP on #Roots103 - 16 LOVING YOU: - Listen Now at http://t.co/0DK1u9SGyn or Download App - http://t.co/rdNJIvTzVH
– NP on #Talk105 - The Brukfoot Show 20120523 (ft. Mr. Vegas): Listen Now at http://t.co/0DK1u9SGyn or Download App - http://t.co/rdNJIvTzVH
• score: 0.375
– RT @JessicaMillaAg: Gaya Kamar Remaja Masa Kini (Indonesian: “Today’s Teen Bedroom Style”) - Smart Modern Style Teen Bedroom Design Ideas inspiration http://t.co/oD6tvUjFL2
– RT @Nabilah88_Jkt48: Gaya Kamar Remaja Masa Kini - Awesome Fun and cheerful Teen Bedroom Design Ideas inspiration http://t.co/QZRruK5q0I
16. More recent OpenLSH successes
• Apply OpenLSH to detect near-identical documents in Peerbelt, a passive, user-behavior-driven content prioritization and search engine
– Goal is to eliminate “similar documents” from search results
17. OpenLSH Results with “Hello Bulgaria” website
• Working with a 2,000-web-page collection,
– We obtain 10 buckets with 75 distinct “near duplicate” pages
– Some pages fall into multiple buckets,
– Diagramming distances between them…
18. About the Implementation
• Programming language: Python
• Operating Environment: Google App Engine
– Chosen because of minimal operational headaches
– Chosen for easy integration with Map/Reduce
– Can employ multiple machines when needed
• Being ported to
– Other Cloud Environments
– A variety of data sources, e.g., MongoDB, Cassandra, …
19. Using OpenLSH
• We’re looking for one or two more interesting use cases
– Application areas:
• Near de-duplication (covered with Peerbelt’s data)
• Stocks that move independent of the herd
• Filtering “unique stories”
• Contact us to discuss
• OpenLSH Source Repository:
– https://github.com/singhj/locality-sensitive-hashing
20. Know thy needs
• For Big “Data Problems”
– About the Data:
• Data Schema
– About the Algorithms:
• What they do
• For “Big Data” Problems
– About the Data
• Data Schema
• Storage layout
– About the Algorithms
• What they do
• How they work
– What if the temporary data structures don’t fit in memory?
– Parallelizable?
– Order: O(n)? O(n²)?
21. Thank you
• J Singh
– Principal, DataThinks
• j.singh@datathinks.org
• @singh_j
• http://www.slideshare.net/j_singh
• https://github.com/singhj/
• Adj. Prof, WPI
• DataThinks.org
– Focused on deep analytics and “big data” problems