1. Pictures at an Exhibition
Ruby, Rails, NoSQL and Big Data
John Repko
John Repko -- Pikasoft LLC
2. Agenda
The Goal: Exploring Big Data with NoSQL and Ruby on Rails
Just Two Solutions – Here’s How We Get There
• Key-Value Data Stores
– Redis
– Riak
• Document Data Stores
– MongoDB
– Cassandra
• Graph Data Stores
– Neo4J
• MapReduce
– Through Hadoop
– Through Riak / MongoDB
– Through Elastic MapReduce
3. So How Did We Get to Big Data Anyway?
Source: https://thedailyload.files.wordpress.com/2010/12/william_perry.jpg
Source: http://www.startribune.com/sports/164830346.html
Big Data Is Not Just About “Big” Data … It’s About FAST Data!
(http://www.pikasoft.com/journal/2011/5/13/not-big-data-fast-data.html)
4. Why is Everyone Diving into Big Data?
There Are Big Data Breakthroughs Everywhere…
• Google Wins the Search Market
– Massively parallel web searches with results back in a tenth of a second
• Progressive’s Instant “Overnight” rate quotes
– Progressive creates an insurance quote for every car and truck in the US – every night
• “Watson” Wins on Jeopardy
– Beat the best Jeopardy players of all time
Source: https://newshour.s3.amazonaws.com/photos/2011/02/16/kayjay_1_blog_main_horizontal.jpg
5. Exploring Big Data
Big Data frequently provides solutions to a common set of problems
Source: http://www.slideshare.net/cloudera/20100806-cloudera-10-hadoopable-problems-webinar-4931616
These appear to be “10 Problems” but are really only “2 Problems”
6. Exploring Big Data
The variety of Big Data wins in the press falls into just two solution patterns
• Foresight
– We are presented a pattern – What has the outcome
been when we’ve seen similar patterns in the past?
• Hindsight
– We are presented an outcome -- What pattern of events
anticipated the outcome in the past?
You Don’t Need Dozens Of Solution Approaches For Big Data – Just Two
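The two patterns can be read as two lookups over the same history, going in opposite directions. Here is a minimal sketch, assuming the history is just a list of (pattern, outcome) records; the data and the names `HISTORY`, `tally_by`, `foresight` and `hindsight` are invented for illustration and are not from the talk.

```ruby
# Hypothetical history of observed (pattern, outcome) pairs.
# The records are made up purely for illustration.
HISTORY = [
  { pattern: "late payments", outcome: :default },
  { pattern: "late payments", outcome: :repaid },
  { pattern: "late payments", outcome: :default },
  { pattern: "steady income", outcome: :repaid },
  { pattern: "maxed credit",  outcome: :default }
].freeze

# Count how often each value of `field` appears in the given events.
def tally_by(events, field)
  events.each_with_object(Hash.new(0)) { |event, counts| counts[event[field]] += 1 }
end

# Foresight: we are handed a pattern; which outcomes followed it in the past?
def foresight(history, pattern)
  tally_by(history.select { |e| e[:pattern] == pattern }, :outcome)
end

# Hindsight: we are handed an outcome; which patterns preceded it in the past?
def hindsight(history, outcome)
  tally_by(history.select { |e| e[:outcome] == outcome }, :pattern)
end

p foresight(HISTORY, "late payments")
p hindsight(HISTORY, :default)
```

At Big Data scale the same two questions would presumably run as distributed jobs rather than in-memory filters, but the shape of the query stays the same.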
7. Exploring Big Data
In this light, let’s take a look at the “10 Hadoop-able Problems” of Big Data
Summary – 10 Common Hadoop-able Problems*
1. Modeling True Risk
• What past patterns led to success or default?
2. Customer Churn Analysis
• What do customer churn patterns predict about our products and markets?
3. Recommendation Engine
• We have search terms – what have the results been from similar searches in the past?
4. Ad Targeting
• We have profile information – what offers have led to sales for similar profiles in the past?
5. PoS Transaction Analysis
• We have your purchase history – what deals might we offer in the future?
Foresight Hindsight
8. Exploring Big Data
These two solution types apply generally to the Hadoop-able problems
Summary – 10 Common Hadoop-able Problems
6. Analyzing Data Logs to Forecast Events
• We have your logs – what patterns of events have anticipated failures before?
7. Threat Analysis
• We have a specific event – what results have we seen from similar threats in the past?
8. Trade Surveillance
• Does this parcel raise any alarms, based on our history of past parcel-tracking?
9. Search Quality
• We have a set of search terms – what have similar searches succeeded in finding in the past?
10. Data “Sandbox”
• We have your data, possibly unstructured data. What patterns in that data might we bring to your attention now?
Foresight Hindsight
9. The Big Data Platform Provides Rich Analytics Tools
Key Big Data Analytics Solution Patterns
1. Predictive Modeling
2. Data Visualization
3. Cluster Partitioning
4. Collaborative Filtering
5. Outlier Analysis
6. A/B Testing
7. Markov Chains
8. Bloom Filters
10. Exploring Big Data
With Just Two Standard Solution Models We Can
Solve Most Big Data Problems
The Key Is To Shape Big Data Into A Standard
Platform Onto Which We Can Apply These
Analytics Tools…
“It is not the technology that creates a competitive edge, but the
management process that exploits technology.”
~ Peter Keen, Shaping the Future (1991)
14. Redis
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
• Example:
– http://www.pikasoft.com/journal/2011/1/2/a-quick-redis-key-value-example-for-the-holidays.html
• Backing Articles:
– http://purevirtual.de/2010/04/url-shortener-with-redis-and-rails3/
• Code:
– http://www.pikasoft.com/journal/2011/1/2/a-quick-redis-key-value-example-for-the-holidays.html
The good news is, we've already got our base image, and adding a new Redis data store and
example app to it took only about an hour. As before, you can play with the URL shortener at "Redis
URL Shortener", and you can download and play with the code for the application at "Redis URL
Shortener Source Code".
Play with this online at:
http://jkr-blog.dyndns.org:3001/mini_urls
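For readers who just want the key-value idea behind a URL shortener, here is a minimal sketch. It is not the code from the linked post: it assumes a plain Ruby Hash wrapped in a hypothetical `TinyStore` class standing in for Redis, and the `MiniUrl` class and the key names are invented for illustration. With a real Redis connection, the three store methods would correspond to the INCR, SET and GET commands.

```ruby
# In-memory stand-in for a key-value store (assumption: a real app would
# talk to Redis; incr/set/get mimic the Redis INCR, SET and GET commands).
class TinyStore
  def initialize
    @data = {}
  end

  # Increment the integer stored at key (starting from 0) and return it.
  def incr(key)
    @data[key] = @data.fetch(key, 0) + 1
  end

  def set(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end
end

# Hypothetical shortener; key names such as "mini_url:next_id" are illustrative.
class MiniUrl
  def initialize(store)
    @store = store
  end

  # Take the next counter value, encode it in base 36, and store code -> url.
  def shorten(url)
    code = @store.incr("mini_url:next_id").to_s(36)
    @store.set("mini_url:#{code}", url)
    code
  end

  # Look the short code back up; returns nil for an unknown code.
  def expand(code)
    @store.get("mini_url:#{code}")
  end
end

shortener = MiniUrl.new(TinyStore.new)
code = shortener.shorten("http://www.pikasoft.com/journal/")
puts "#{code} -> #{shortener.expand(code)}"
```

The whole application is one counter and one lookup per short code, which is why a key-value store is a natural fit here.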
17. MongoDB
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
• Example:
– http://www.pikasoft.com/journal/2010/7/31/nosql-on-the-cloud-our-first-application.html
• Backing Articles:
– http://www.mongodb.org/display/DOCS/Building+for+Linux
• Code:
– http://www.pikasoft.com/journal/2010/8/16/why-our-little-nosql-app-matters.html
So let's sum up -- after a handful of posts and a small but still sorrowful amount of command-line and Rails code,
we've managed to accomplish the following "Hello World" tasks in NoSQL on the cloud:
• Created a cloud account
• Got our first app created, and saw it in a browser on the web
• Loaded up real development environments (Ruby/Rails we added, Java we got for free)
• Added a stronger app server (thin >> WEBrick) and a stronger web server (nginx >> almost anything)
• Added our first NoSQL data store (MongoDB) and mapping software to simulate ActiveRecord in NoSQL
• Created a little NoSQL app to show all this, and made it visible through a dynamic DNS address:
Rails Mongo Notes Example
Just to wrap the little app up: I updated John Nunemaker's MongoMapper demo app to work with Rails 3 and the
cloud, and if you like you can take a look at the code for it here: Rails Mongo Code.
Play with this online at:
http://jkr-code.dyndns.org:3000/notes
18. Cassandra
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
• Example:
– http://www.pikasoft.com/journal/2011/2/14/casi-casi-cassandra.html
• Backing Articles:
– http://www.25hoursaday.com/weblog/2008/05/23/SomeThoughtsOnTwittersAvailabilityProblems.aspx
• Code:
Here's what the code for that broadcast might look like:
# Tweeter
class Tweeter < ActiveRecord::Base
  has_many :followers
end

# Follower
class Follower < ActiveRecord::Base
  belongs_to :tweeter
end
All fine so far -- that's the twittery world we all live in. I can send out my breathless message of what
I had for breakfast, and then Twitter picks it up and broadcasts the message from me (and all the
messages from the other tweeters):
# Load every tweeter, then query again for each tweeter's followers
@tweeters = Tweeter.all
@tweeters.each do |tweeter|
  @followers = tweeter.followers
  @followers.each do |follower|
    tweeter.broadcast_to :recipient => follower
  end
end
So here we're going to do a query for each of the X tweeters, and for them we'll do another query for
each of their Y followers.
Code smell! Fail Whale!!!
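To make the smell concrete, here is a small sketch that counts queries for the nested loop versus a single grouped fetch. It assumes a hypothetical in-memory `CountingDb` in place of a real database; every name and the sample data are invented for illustration. The batched version is ordinary query batching, not the Cassandra design from the linked post (in ActiveRecord, eager loading with `includes` has a similar effect).

```ruby
# Tiny in-memory "database" that counts how many queries it receives.
# All names here are hypothetical and exist only for this illustration.
class CountingDb
  attr_reader :query_count

  def initialize(followers_by_tweeter)
    @followers_by_tweeter = followers_by_tweeter
    @query_count = 0
  end

  def all_tweeters
    @query_count += 1
    @followers_by_tweeter.keys
  end

  def followers_of(tweeter)
    @query_count += 1
    @followers_by_tweeter.fetch(tweeter, [])
  end

  def all_followers_grouped
    @query_count += 1
    @followers_by_tweeter
  end
end

FOLLOWERS = {
  "alice" => %w[bob carol],
  "bob"   => %w[alice],
  "carol" => %w[alice bob]
}.freeze

# Nested approach from the slide: one query for the tweeters, then one per tweeter.
def broadcast_nested(db)
  deliveries = []
  db.all_tweeters.each do |tweeter|
    db.followers_of(tweeter).each { |follower| deliveries << [tweeter, follower] }
  end
  deliveries
end

# Batched approach: fetch every follower list in a single grouped query.
def broadcast_batched(db)
  db.all_followers_grouped.flat_map do |tweeter, followers|
    followers.map { |follower| [tweeter, follower] }
  end
end

nested_db = CountingDb.new(FOLLOWERS)
batched_db = CountingDb.new(FOLLOWERS)
broadcast_nested(nested_db)
broadcast_batched(batched_db)
puts "nested: #{nested_db.query_count} queries, batched: #{batched_db.query_count} query"
```

Both versions produce the same deliveries; the nested one simply pays one extra round trip per tweeter, which is what hurts as X grows.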
20. Neo4J
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
• Example:
– http://www.pikasoft.com/journal/2011/1/21/graph-databases-and-star-wars.html
• Backing Articles:
– http://purevirtual.de/2010/04/url-shortener-with-redis-and-rails3/
• Code
Play with this online at:
Six Degrees of Kevin Bacon = http://jkr-blog.dyndns.org:9292/
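The "six degrees" question boils down to a shortest-path search over a graph of who has worked with whom. Here is a minimal sketch using breadth-first search over a plain Ruby Hash; it is not the Neo4J code behind the demo, and the `COSTARS` graph with its placeholder actor names is invented for illustration. A graph database would answer the same question with a traversal query over stored nodes and relationships.

```ruby
# Hypothetical co-star graph, invented for illustration: each key lists
# the people that person has appeared with.
COSTARS = {
  "Kevin Bacon" => ["Actor A", "Actor B"],
  "Actor A"     => ["Kevin Bacon", "Actor C"],
  "Actor B"     => ["Kevin Bacon"],
  "Actor C"     => ["Actor A", "Actor D"],
  "Actor D"     => ["Actor C"],
  "Actor E"     => []
}.freeze

# Breadth-first search: returns the shortest path from start to goal as an
# array of names, or nil when the two are not connected.
def shortest_path(graph, start, goal)
  previous = { start => nil } # remembers how each node was first reached
  queue = [start]
  until queue.empty?
    current = queue.shift
    if current == goal
      path = []
      while current
        path.unshift(current)
        current = previous[current]
      end
      return path
    end
    graph.fetch(current, []).each do |neighbor|
      next if previous.key?(neighbor) # already visited
      previous[neighbor] = current
      queue << neighbor
    end
  end
  nil
end

# Degrees of separation = number of hops on the shortest path.
def degrees(graph, start, goal)
  path = shortest_path(graph, start, goal)
  path && path.length - 1
end

puts shortest_path(COSTARS, "Actor D", "Kevin Bacon").join(" -> ")
```

Breadth-first search visits people in order of distance, so the first time it reaches the goal the path found is a shortest one.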
22. MapReduce via Hadoop, Thrift and AWS
• Example:
– http://www.pikasoft.com/journal/2011/1/9/nosql-next-up-hadoop-and-cloudera.html
• Backing Articles:
– http://www.joelonsoftware.com/items/2006/08/01.html
• Code:
[Diagram panels: Map, Reduce]
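As a rough picture of the two phases, here is a word count written as separate map, shuffle and reduce steps in plain Ruby. This is only a single-process sketch of the idea, not Hadoop code and not the code from the linked post; the function names are invented for illustration. In a real cluster the framework would run the map and reduce steps on many machines and handle the shuffle between them.

```ruby
# Map: turn one input line into a list of [word, 1] pairs.
def map_line(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Shuffle: group the emitted values by key, as the framework would do
# between the map and reduce phases.
def shuffle(pairs)
  pairs.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(word, count), groups|
    groups[word] << count
  end
end

# Reduce: collapse the grouped values for one key into a single total.
def reduce_counts(word, counts)
  [word, counts.sum]
end

# Run the three steps in order over a list of input lines.
def word_count(lines)
  pairs = lines.flat_map { |line| map_line(line) }
  shuffle(pairs).map { |word, counts| reduce_counts(word, counts) }.to_h
end

p word_count(["big data fast data", "Big Data"])
```

Because each map call looks at one line and each reduce call looks at one key, both steps can be spread across machines without the pieces needing to talk to each other.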
25. Summary
This Is Only The Beginning. With A
Standard Platform We’ll See Richer Big Data
Discoveries Become Routine
The Solution Tools (Slide 9) Become
Straightforward if We Run Them on a
Standard Architecture
“One man’s noise is another man’s data.”
~ Bill Stensrud - InstantEncore
26. Contacts
• John Repko: john.repko@pikasoft.com
http://pikasoft.s3.amazonaws.com/Pictures_at_an_Exhibition.pptx