Designing Analytics for Big Data 
J Singh 
November 7, 2014
© DataThinks 2013-14 2 
Know thy Problem 
• Do you have a “Big Data” problem? 
– Or do you have a big “data problem”?
For Big “Data Problems” 
• Popular data sets (e.g., Amazon, Kaggle, …, data sets) 
– If it can be downloaded to your laptop, 
– If it can be subjected to ad hoc analysis using R or Python, 
– If it doesn’t change very often and doesn’t need to be continuously updated, 
• Subsets of “Big Data” datasets 
– Used to exactly specify a “big data” algorithm 
• Run it on your laptop 
• Iterate fast 
• Domain Knowledge is essential to solving the problem
Some Big Data problems (1) 
• Recommendations
Some Big Data problems (2) 
• Financial Analysis 
– Really Big Data if we want Real Time analysis
Some Big Data problems (3) 
• Internet Infrastructure Security Monitoring 
Other Big Data problems 
• Network graph problems (Social Media data) 
• Bioinformatics problems (Genomics data) 
• Physics/engineering problems (Sensor data) 
• … 
A specific problem: Document Storage 
• Website with thousands of pages 
– Some pages identical to other pages 
– Some pages nearly identical to other pages 
• To save storage and index the collection smartly 
– Want to save just one copy of the duplicate pages 
– Want to save one copy of the nearly duplicate pages 
• To keep a large document collection's index up to date 
– Want to detect content changes quickly, possibly without reading old copies from slow storage 
Document Storage (pg 2) 
• Naïve algorithm 
– For every page 
• Compare to every other page 
– Calculate the “diff” between them 
– Find the minimum diff (min-diff) 
– Build a graph with nodes as pages and min-diffs as edges 
– Prune the graph to decide which nodes to store in entirety 
– Store all other nodes as node-ref + min-diff 
• Problems with this algorithm? 
– Comparison takes O(n²) operations 
– Need to keep the entire graph in memory before pruning
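The comparison step of the naïve algorithm can be sketched in a few lines of Python. This is only an illustration: the `diff_size` measure (built on the standard library's `difflib.SequenceMatcher`), the function names, and the four toy pages are assumptions, not part of the original talk. The graph building and pruning steps are left out.

```python
from difflib import SequenceMatcher
from itertools import combinations

def diff_size(a, b):
    """Hypothetical "diff" measure: 0.0 for identical text, up to 1.0 for no overlap."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def naive_min_diffs(pages):
    """For every page, find the other page with the smallest diff.

    Every pair is compared once, so n pages cost n * (n - 1) / 2 comparisons.
    """
    best = {}
    for (name_a, text_a), (name_b, text_b) in combinations(pages.items(), 2):
        d = diff_size(text_a, text_b)
        # Record the pair in both directions if it improves on the current minimum.
        for src, dst in ((name_a, name_b), (name_b, name_a)):
            if src not in best or d < best[src][1]:
                best[src] = (dst, d)
    return best

# Toy pages, made up for illustration.
pages = {
    "p1": "Mary had a little lamb",
    "p2": "Mary had a little lamb",
    "p3": "Little Jack Horner",
    "p4": "Lil Jack Horner",
}
min_diffs = naive_min_diffs(pages)
comparisons = len(pages) * (len(pages) - 1) // 2
```

With four pages this is only six comparisons, but the count grows quadratically: a few thousand pages already need millions of diffs.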
Document Storage (pg 3) 
• Locality-Sensitive Hashing (LSH) algorithm 
– Place each page in zero or more buckets, independent of other pages 
– Make storage/diff decisions within a bucket 
• Features 
– O(n) algorithm 
– Can be parallelized 
Figure: Pages assigned to Buckets (each x marks one bucket the page falls into) 
Mary had a little lamb x 
Little Jack Horner x 
Yankee Doodle went to Town x 
Jack and Jill went up the hill 
Hickory Dickory Dock x 
Mary Lamb's Little Pub x x 
Lil Jack Horner x 
Yankee Doodle was in Town x 
Jack and Jill were holding hands 
Boat of Hickory is Docked x 
Mary had a little lamb x 
Jack's Little Pub x
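A minimal sketch of the bucketing idea, assuming minhash signatures over character shingles and a banding scheme. The parameter values (20 minhashes, 10 bands), the SHA-256 based hash family, and all function names are assumptions chosen for illustration; this is not the OpenLSH code.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 20               # assumed number of minhashes
BANDS = 10                    # assumed number of bands
ROWS = NUM_HASHES // BANDS    # rows per band

def shingles(text, k=3):
    """Set of character k-shingles of the lower-cased text."""
    t = text.lower()
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def seeded_hash(seed, token):
    """One member of a hypothetical hash family, selected by seed."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(text):
    """For each hash function, keep the minimum hash over the page's shingles."""
    sh = shingles(text)
    return [min(seeded_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]

def bucket_keys(text):
    """One bucket key per band: the band index plus that band's slice of the signature."""
    sig = minhash_signature(text)
    return [(band, tuple(sig[band * ROWS:(band + 1) * ROWS])) for band in range(BANDS)]

def build_buckets(pages):
    buckets = defaultdict(set)
    for name, text in pages.items():      # single pass; each page is hashed independently
        for key in bucket_keys(text):
            buckets[key].add(name)
    # Only buckets shared by two or more pages matter for storage/diff decisions.
    return {key: members for key, members in buckets.items() if len(members) > 1}

pages = {
    "a": "Mary had a little lamb",
    "b": "Mary had a little lamb",
    "c": "Hickory Dickory Dock",
}
shared = build_buckets(pages)
```

Because each page's bucket keys depend only on that page, the loop body can run on any number of workers, which is what makes the approach easy to parallelize. A page whose keys match no other page ends up in zero shared buckets.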
LSH Involves a Tradeoff 
• Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives and false negatives. 
– False positives → need to examine more pairs that are not really similar. More processing resources, more time. 
– False negatives → failed to examine pairs that were similar, didn’t find all similar results. But got done faster! 
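The balance between false positives and false negatives can be made quantitative with the standard analysis of minhash banding. The formula is not on the slide; it is the usual textbook result, and the symbols are introduced here: s is the Jaccard similarity of a pair, r the number of rows per band, b the number of bands, and n the total number of minhashes.

```latex
P(\text{candidate pair}) = 1 - \left(1 - s^{r}\right)^{b}, \qquad n = b \cdot r
```

As a function of s this is an S-shaped curve whose steep part sits near s ≈ (1/b)^(1/r). Pairs below that threshold that still share a bucket are the false positives; pairs above it that never share a bucket are the false negatives.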
LSH Tradeoff Example 
• If we had fewer than 20 bands (and more rows per band), 
– fewer pairs would be selected for comparison, 
– the number of false positives would go down, 
– but the number of false negatives would go up, 
– performance would go up, but so would the number of missed matches!
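The effect can be checked numerically. The sketch below assumes 100 minhashes, split either into 20 bands of 5 rows or into 10 bands of 10 rows; the slide mentions only the figure of 20 bands, so the other numbers are illustrative assumptions. It uses the standard banding probability 1 - (1 - s^r)^b for a pair with similarity s.

```python
def candidate_probability(s, rows, bands):
    """Probability that a pair with Jaccard similarity s shares at least one bucket."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Assumed configurations: both use 100 minhashes in total.
configs = {
    "20 bands x 5 rows": (5, 20),
    "10 bands x 10 rows": (10, 10),
}

table = {}
for label, (rows, bands) in configs.items():
    table[label] = {s: candidate_probability(s, rows, bands) for s in (0.3, 0.5, 0.8)}

for label, row in table.items():
    print(label, {s: round(p, 4) for s, p in row.items()})
```

With fewer bands, a dissimilar pair (s = 0.3) is far less likely to be selected, so fewer false positives. A genuinely similar pair (s = 0.8) is also selected less often, which is where the extra false negatives come from.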
Summary 
• Mine the data and place members into hash buckets 
• When you need to find a match, hash it; its possible nearest neighbors will be in one of the buckets. 
• Algorithm performance: O(n) 
• Our implementation is designed to run on a MapReduce architecture 
– About 3 secs / document, 
– As many processors as required
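How bucketing maps onto a map/reduce job can be sketched in plain Python. This simulates the map, shuffle, and reduce phases in one process and does not use the App Engine MapReduce API; the function names and the toy signatures are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def map_doc(doc_id, signature, rows):
    """Map step: emit one (bucket key, doc_id) pair per band of the signature."""
    for start in range(0, len(signature), rows):
        band = start // rows
        yield (band, tuple(signature[start:start + rows])), doc_id

def reduce_bucket(key, doc_ids):
    """Reduce step: every pair of documents sharing a bucket is a candidate pair."""
    yield from combinations(sorted(doc_ids), 2)

def run(signatures, rows):
    # Shuffle phase: group mapper output by bucket key.
    shuffled = defaultdict(list)
    for doc_id, sig in signatures.items():
        for key, value in map_doc(doc_id, sig, rows):
            shuffled[key].append(value)
    candidates = set()
    for key, doc_ids in shuffled.items():
        candidates.update(reduce_bucket(key, doc_ids))
    return candidates

# Toy minhash signatures (4 values each, 2 rows per band), invented for this example.
signatures = {
    "doc1": [3, 7, 1, 9],
    "doc2": [3, 7, 4, 4],
    "doc3": [8, 2, 5, 6],
}
candidates = run(signatures, rows=2)
```

Here `doc1` and `doc2` agree on their first band, so they land in the same bucket and come out as the only candidate pair. Mappers never look at more than one document, which is why the work spreads across as many processors as are available.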
Initial OpenLSH successes 
• We started OpenLSH to provide a framework for LSH 
• Organize multiple stages of the LSH pipeline as asynchronous elements 
– Don’t need the previous stage complete to begin the next 
– Make each stage as configurable as possible 
• Demonstrate results 
– Tweets from Twitter API to find “similar tweets”
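The idea that a stage can start before the previous one has finished can be illustrated with Python generators. This is a stand-in for truly asynchronous, queue-connected stages, not the OpenLSH design; the stage names, the shingle size `k`, and the sample tweets are assumptions.

```python
def read_stage(docs, log):
    """Stage 1: hand out documents one at a time."""
    for doc_id, text in docs:
        log.append(f"read {doc_id}")
        yield doc_id, text

def shingle_stage(stream, log, k=3):
    """Stage 2: turn each document into a set of character k-shingles (k is configurable)."""
    for doc_id, text in stream:
        log.append(f"shingle {doc_id}")
        yield doc_id, {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def count_stage(stream, log):
    """Stage 3: placeholder for later work, here just counting shingles."""
    for doc_id, sh in stream:
        log.append(f"count {doc_id}")
        yield doc_id, len(sh)

log = []
docs = [("t1", "hello world"), ("t2", "hello there")]
results = list(count_stage(shingle_stage(read_stage(docs, log), log), log))
```

The log shows `t1` passing through all three stages before `t2` is even read, so no stage waits for the previous one to process the whole input.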
Allow a focus on unique tweets by… 
• ...Eliminating Similar tweets: 
• score: 1.0 
– RT @googoo كُنْ بَسيطا ، تَلفَت الأنظَار إليكْ .. فِي عَالَمْ امتلأ تَعقيداً :$ ! : 255 
– RT @googoo كُنْ بَسيطا ، تَلفَت الأنظَار إليكْ .. فِي عَالَمْ امتلأ تَعقيداً :$ ! : 255 
• score: 0.75 
– Italian Gold Omega Necklace 14K Pm me if interested Happy Shopping http://t.co/cgjdGpKvjK 
– Italian Gold Omega Necklace 14K Pm me if interested Happy Shopping http://t.co/UZYbx1bT4K 
• score: 0.448275862069 
– NP on #Roots103 - 16 LOVING YOU: - Listen Now at http://t.co/0DK1u9SGyn or Download App - http://t.co/rdNJIvTzVH 
– NP on #Talk105 - The Brukfoot Show 20120523(ft. Mr. Vegas): Listen Now at http://t.co/0DK1u9SGyn or Download App - http://t.co/rdNJIvTzVH 
• score: 0.375 
– RT @JessicaMillaAg: Gaya Kamar Remaja Masa Kini - Smart Modern Style Teen Bedroom Design Ideas inspiration http://t.co/oD6tvUjFL2 
– RT @Nabilah88_Jkt48: Gaya Kamar Remaja Masa Kini - Awesome Fun and cheerful Teen Bedroom Design Ideas inspiration http://t.co/QZRruK5q0I
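The slide does not say how the similarity scores were computed. One common choice for this kind of comparison is Jaccard similarity over token sets, sketched below on the necklace tweets from the slide; the whitespace tokenization and the function name are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two texts over whitespace-separated tokens (assumed tokenization)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

tweet_1 = ("Italian Gold Omega Necklace 14K Pm me if interested "
           "Happy Shopping http://t.co/cgjdGpKvjK")
tweet_2 = ("Italian Gold Omega Necklace 14K Pm me if interested "
           "Happy Shopping http://t.co/UZYbx1bT4K")

score = jaccard(tweet_1, tweet_2)
print(round(score, 3))
```

The two tweets share 11 of 13 distinct tokens, so this word-level measure gives about 0.85 rather than the 0.75 shown on the slide. The original scores presumably come from a different tokenization or shingle size.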
More recent OpenLSH successes 
• Apply OpenLSH to detect near-identical documents in Peerbelt, a passive, user-behavior-driven content prioritization & search engine 
– Goal is to eliminate “similar documents” from search results 
OpenLSH Results with “Hello Bulgaria” website 
• Working with a 2,000-page website, 
– Obtain 10 buckets with 75 distinct “near duplicate” pages 
– Some pages fall into multiple buckets, 
– Diagramming distances between them… 
About the Implementation 
• Programming language: Python 
• Operating Environment: Google App Engine 
– Chosen because of minimal operational headaches 
– Chosen for easy integration with Map/Reduce 
– Can employ multiple machines when needed 
• Being ported to 
– Other Cloud Environments 
– A variety of data sources, e.g., MongoDB, Cassandra, …
Using OpenLSH 
• We’re looking for one or two more interesting use cases 
– Application areas: 
• Near de-duplication (covered with Peerbelt’s data) 
• Stocks that move independent of the herd 
• Filtering “unique stories” 
• Contact us to discuss 
• OpenLSH Source Repository: 
– https://github.com/singhj/locality-sensitive-hashing
Know thy needs 
• For Big “Data Problems” 
– About the Data: 
• Data Schema 
– About the Algorithms: 
• What they do 
• For “Big Data” Problems 
– About the Data 
• Data Schema 
• Storage layout 
– About the Algorithms 
• What they do 
• How they work 
– What if the temporary data structures don’t fit in memory? 
– Parallelizable? 
– Order: O(n)? O(n²)?
Thank you 
• J Singh 
– Principal, DataThinks 
• j.singh@datathinks.org 
• @singh_j 
• http://www.slideshare.net/j_singh 
• https://github.com/singhj/ 
• Adj. Prof, WPI 
• DataThinks.org 
– Focused on deep analytics, “big data” problems
