We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.
Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
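As a rough illustration of the kind of grouping described above, the sketch below writes a `$group` pipeline stage as a Python dictionary and then computes the same count, average, minimum, and maximum per key in plain Python. The sample documents and the field names `state` and `amount` are assumptions invented for this example, and `group_stats` is a hypothetical helper that only mimics the result; it is not how MongoDB executes the pipeline.

```python
from collections import defaultdict

# Hypothetical sample documents; field names are assumptions for illustration.
orders = [
    {"state": "NJ", "amount": 20.0},
    {"state": "NJ", "amount": 40.0},
    {"state": "NY", "amount": 15.0},
]

# A $group stage in the shape an aggregation pipeline expects.
pipeline = [
    {"$group": {
        "_id": "$state",
        "count": {"$sum": 1},
        "avg_amount": {"$avg": "$amount"},
        "min_amount": {"$min": "$amount"},
        "max_amount": {"$max": "$amount"},
    }}
]

def group_stats(docs, key, field):
    """Plain-Python equivalent of the $group stage above (hypothetical helper)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc[field])
    return {
        k: {"count": len(v), "avg": sum(v) / len(v), "min": min(v), "max": max(v)}
        for k, v in groups.items()
    }

stats = group_stats(orders, "state", "amount")
print(stats["NJ"])
```

With a running server, the `pipeline` list would be passed to a collection's aggregate call; the plain-Python version is only here so the example runs on its own.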
For more information, visit our website at http://casertaconcepts.com/ or email us at info@casertaconcepts.com.
3. Agenda
• 7:00 Networking: Grab a slice of pizza and a drink...
• 7:15 Welcome: About the Meetup and about Caserta Concepts
Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit
• 7:30 Intro to NoSQL
Elliott Cordo, Principal Consultant, Caserta Concepts
• 7:50 MongoDB
Mike O’Brian, 10gen
• 8:10 - 9:00 More Networking: Tell us what you’re up to…
4. About BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Next BDW Meetup: June 10.
• Topic: TBD (What would you like to see?) Send ideas to joe@casertaconcepts.com
5. About Caserta Concepts
Founded in 2001
• President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
11. Soo.. No More SQL?
• Relational databases still have their place
• Flexible/General Purpose
• Rich Query Syntax
• Familiar
• However, there are some interesting alternatives for analytic databases
• Columnar/Key Value
• Document
• Graph
• P.S. Many NoSQL databases have SQL-like interfaces
Think Not Only SQL!
12. Why are we doing this?
Not all data is efficiently stored in a relational DB.
• Sparse Data
• Data with a lot of variation
• Relationships -> funny how relational databases are not great at relations
13. Scale and Performance
Performance:
• Relational databases have a lot of features and overhead that we don’t need in many cases, although we will miss some…
Scaling:
• Most relational databases scale vertically, which limits how large they can get. Federation and sharding are awkward manual processes.
• Most NoSQL databases scale horizontally on commodity hardware
Note: Graph database architecture lends itself to a single graph existing on one server. Several vendors have overcome this: Titan, InfiniteGraph.
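To make horizontal scaling a little more concrete, here is a minimal sketch of hash-based partitioning, where each row key is assigned to one of several servers. The node names and the `node_for` helper are hypothetical. Real systems such as Cassandra use more elaborate schemes (consistent hashing over token ranges) so that adding a node moves only part of the data; this simple modulo version does not have that property.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical commodity servers

def node_for(key: str, nodes=NODES) -> str:
    """Pick a node from a stable hash of the row key (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Every client computes the same placement without a central lookup table.
placement = {key: node_for(key) for key in ["bobby", "susie", "frank", "jillian"]}
print(placement)
```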
14. Object Impedance Mismatch
Relational databases rarely look the way our applications want
them to. So much time is spent assembling and disassembling
relational data.
GetSale
Select * From Sales_Header Join Sales_Detail Join
Sales_Tender Join User Join Order_Type Join
Tender_Type Join Product Join Channel Join
User_Account etc, etc
CreateSale
Insert into Sales_Header
Insert into Sales_Detail
Insert/Update User_Account
Insert into Sales_Tender
etc, etc
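The assembling step can be sketched with a small runnable example. The two tables below are a heavily simplified, hypothetical stand-in for the Sales_Header and Sales_Detail tables named on the slide (their real columns are not given), and `get_sale` is an illustrative helper. The point is only the contrast: the relational side needs a join plus reshaping code, while the document side already has the shape the application uses.

```python
import sqlite3

# Relational side: hypothetical, simplified tables in an in-memory database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales_header (sale_id INTEGER PRIMARY KEY, user_name TEXT, channel TEXT);
    CREATE TABLE sales_detail (sale_id INTEGER, product TEXT, qty INTEGER);
    INSERT INTO sales_header VALUES (1, 'Bobby', 'Web');
    INSERT INTO sales_detail VALUES (1, 'Chainsaw', 1), (1, 'Hair bows', 2);
""")

def get_sale(sale_id):
    """Assemble the nested object the application wants from joined rows."""
    rows = db.execute(
        """SELECT h.user_name, h.channel, d.product, d.qty
           FROM sales_header h JOIN sales_detail d ON d.sale_id = h.sale_id
           WHERE h.sale_id = ? ORDER BY d.product""",
        (sale_id,),
    ).fetchall()
    if not rows:
        return None
    return {
        "user": rows[0][0],
        "channel": rows[0][1],
        "items": [{"product": p, "qty": q} for _, _, p, q in rows],
    }

# Document side: the same sale stored in the shape the application uses.
sale_document = {
    "user": "Bobby",
    "channel": "Web",
    "items": [{"product": "Chainsaw", "qty": 1}, {"product": "Hair bows", "qty": 2}],
}

print(get_sale(1) == sale_document)
```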
15. But what will we sacrifice?
• NoSQL DBs have fairly simple query languages, with limited
support for the following:
• Joins
• Aggregation
• Secondary indexes
Why? NoSQL databases were born to be high performance
• Data is stored as it is to be used (tuned to a query) rather than modeled around entities, so a sophisticated query language is not needed.
16. So what about NoSQL as the Data Warehouse?
• NoSQL databases are generally not as flexible as relational databases for ad-hoc questions.
• Secondary indexes provide some flexibility, but the lack of joins requires denormalization
• Materialized views: Joins and aggregates can be implemented via Map Reduce, even using our animal friends.
• However, materializing the world has its drawbacks!
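The idea of building an aggregate as a materialized view with Map Reduce can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps, not a Hadoop job; the sample records and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records; fields are assumptions for illustration.
sales = [
    {"state": "NJ", "channel": "Web", "amount": 10.0},
    {"state": "NJ", "channel": "In-Store", "amount": 5.0},
    {"state": "NY", "channel": "Web", "amount": 7.5},
]

def map_sale(record):
    """Map: emit (key, value) pairs, here keyed by state."""
    yield record["state"], record["amount"]

def reduce_amounts(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)

def run_map_reduce(records):
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_sale(record):
            shuffled[key].append(value)  # shuffle: group values by key
    return dict(reduce_amounts(k, v) for k, v in shuffled.items())

# The result is stored and queried directly, like a materialized view.
sales_by_state = run_map_reduce(sales)
print(sales_by_state)
```

The drawback mentioned on the slide shows up quickly: a second question, such as totals by channel, needs its own map function and its own stored result.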
17. NoSQL can be a good fit for certain analytic applications
• High-volume, low-latency analytic environments
• Queries are largely known and can be precomputed in-stream (via the application itself or Storm) or in batch using Map Reduce
• Cassandra also has counter functions, which can be helpful in pre-computing aggregates.
• Sweet spot is very high volumes with relatively static analytic requirements.
[Figure: RDBMS and NoSQL compared along two axes, Volume and Query Flexibility]
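Pre-computing a known query in-stream can be sketched as incrementing a counter per key as each event arrives. In the sketch below a Python `Counter` stands in for counters kept in the database; the event fields and the `record_event` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Events as they might arrive in-stream; fields are hypothetical.
events = [
    {"channel": "Web", "ts": "2013-05-01T10:00:05"},
    {"channel": "Web", "ts": "2013-05-01T10:00:40"},
    {"channel": "In-Store", "ts": "2013-05-01T10:01:10"},
]

counts = Counter()  # stands in for counters stored in the database

def record_event(event):
    """Increment the pre-computed aggregate for (channel, minute)."""
    minute = datetime.fromisoformat(event["ts"]).strftime("%Y-%m-%d %H:%M")
    counts[(event["channel"], minute)] += 1

for e in events:
    record_event(e)

# A known query becomes a single lookup instead of a scan.
print(counts[("Web", "2013-05-01 10:00")])
```

This works because the question (events per channel per minute) is fixed in advance, which is the "relatively static analytic requirements" condition from the slide.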
18. Columnar
• Platforms: Cassandra, HBase
• Column families are the equivalent of a table in an RDBMS
• Primary unit of storage is a column; columns are stored contiguously
Skinny Rows: Most like a relational database, except columns are optional and not stored if omitted.
Wide Rows: Rows can be billions of columns wide; used for time series, relationships, secondary indexes.
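The difference between skinny and wide rows can be modeled with ordinary Python structures. This is a toy in-memory model and not a client for Cassandra or HBase; the row keys, values, and helper names are hypothetical.

```python
from bisect import insort

# One column family: row key -> sorted list of (column name, value) pairs.
# In a wide row the column names carry data themselves, here a timestamp.
readings = {}

def insert_reading(row_key, timestamp, value):
    """Add a column to the row, keeping columns ordered by name."""
    insort(readings.setdefault(row_key, []), (timestamp, value))

def slice_columns(row_key, start, end):
    """Return columns whose names fall in [start, end], like a column slice."""
    return [(ts, v) for ts, v in readings.get(row_key, []) if start <= ts <= end]

insert_reading("sensor-1", "2013-05-01T10:02", 21.5)
insert_reading("sensor-1", "2013-05-01T10:00", 21.0)
insert_reading("sensor-1", "2013-05-01T10:01", 21.2)

# Skinny rows: a few optional named columns; omitted ones are simply absent.
users = {
    "bobby": {"email": "bobby@db-lover.com", "state": "NJ"},
    "susie": {"email": "Susie@sql-enthusiast.com"},
}

print(slice_columns("sensor-1", "2013-05-01T10:00", "2013-05-01T10:01"))
```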
19. Document
• Platforms: MongoDB, CouchDB
• Collections are the equivalent of a table in an RDBMS
• Primary unit of storage is a document
{ "User": "Bobby",
"Email": "bobby@db-lover.com",
"Channel": "Web",
"State": "NJ" }
{ "User": "Susie",
"Email": "Susie@sql-enthusiast.com",
"PreferredCategories": [
{ "Category": "Fashion",
"CategoryAdded": "2012-01-01" },
{ "Category": "Outdoor Equipment",
"CategoryAdded": "2013-01-01" } ],
"Channel": "In-Store" }
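A short sketch shows how an application might work with documents like these, where fields vary from one document to the next. It uses only Python's `json` module rather than a database driver, and `users_with_category` is a hypothetical helper.

```python
import json

raw_documents = [
    '{"User": "Bobby", "Email": "bobby@db-lover.com", "Channel": "Web", "State": "NJ"}',
    '''{"User": "Susie", "Email": "Susie@sql-enthusiast.com",
        "PreferredCategories": [
            {"Category": "Fashion", "CategoryAdded": "2012-01-01"},
            {"Category": "Outdoor Equipment", "CategoryAdded": "2013-01-01"}],
        "Channel": "In-Store"}''',
]
users = [json.loads(doc) for doc in raw_documents]

def users_with_category(docs, category):
    """Find users whose optional PreferredCategories list names the category."""
    return [
        d["User"] for d in docs
        if any(c["Category"] == category for c in d.get("PreferredCategories", []))
    ]

print(users_with_category(users, "Fashion"))
```

Bobby has no PreferredCategories field and Susie has no State field; neither needs a placeholder value, which is the flexibility a fixed table schema lacks.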
20. Graph
• Platforms: Neo4j, Titan
• Relationships are front and center! Relationships can have properties of their own.
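[Figure: example graph. People (Bobby, Jillian, Frank, Susie) and products (Hair bows, Chainsaw) are connected by Friends, Likes, Likes Profile, and Purchased relationships. Relationship properties shown: Purchased (Date: 2013-02-14, Channel: In-Store), Purchased (Date: 2013-01-31), Likes Profile (Date: 2013-01-01). Recommendation: Maybe Jillian wants a Chainsaw too!]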
Gremlin query language:
• Find all of Frank’s outgoing relationships
• Find all Products related to Jillian
• Find shortest path from Frank to Susie
• Cool collaborative filtering functions too!
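The shortest-path query can be illustrated with a breadth-first search in plain Python. This is not Gremlin and does not use a graph database; it only shows what the traversal computes. The Friends edges below are an assumption made for the example, since the exact edges of the slide's diagram are not recoverable from the text.

```python
from collections import deque

# Undirected "Friends" edges. These edges are an assumption for
# illustration; they are not taken from the slide's diagram.
friends = {
    "Frank": ["Bobby"],
    "Bobby": ["Frank", "Jillian"],
    "Jillian": ["Bobby", "Susie"],
    "Susie": ["Jillian"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; the first path to reach the goal is a shortest one."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no route between the two people

print(shortest_path(friends, "Frank", "Susie"))
```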
21. Our Use Case: High Volume Sensor Analytics
• Ingestion and analytics of Sensor Data
• 6 to 12 BILLION records being ingested daily (average 140k records per second at peak load)!
• Ingested data must be stored to disk and highly available
• Pre-defined aggregates and event monitors must be near real-time
• Ad-hoc query capabilities required on historical data
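The per-second figure follows from simple arithmetic on the daily volumes. The sketch below assumes the daily volume is spread evenly over 24 hours, which is a simplification; real traffic is likely to be burstier.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def records_per_second(records_per_day: float) -> float:
    """Average ingest rate if the daily volume were spread evenly."""
    return records_per_day / SECONDS_PER_DAY

low = records_per_second(6e9)
high = records_per_second(12e9)
print(f"{low:,.0f} to {high:,.0f} records per second")
```

The upper end, about 139k records per second for 12 billion records a day, is in line with the roughly 140k quoted on the slide.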
22. How do we hope to accomplish this?
[Figure: architecture diagram with the components Sensor Data, Kafka, Storm Cluster, Cassandra Cluster, Hadoop Cluster, Low Latency Analytics, and d3.js Analytics, and the data labels Atomic data, Aggregates, and Event Monitors]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs atomic data
and derived data needed for analytics
• Real-time analytics are produced from the aggregated data.
• Higher latency ad-hoc analytics are done in Hadoop
using Pig and Hive
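The flow described in these bullets can be sketched in a single process to show how one ETL step fans a record out to atomic storage, aggregates, and event monitors. Everything here is a stand-in: the deque plays the role of the messaging system, the lists and counter play the role of the data stores, and the record fields and threshold are hypothetical. It does not use Kafka, Storm, Cassandra, or Hadoop.

```python
from collections import Counter, deque

# Stand-ins for the real components; every name here is hypothetical.
ingest_queue = deque([        # plays the role of the messaging system
    {"sensor": "s1", "value": 70},
    {"sensor": "s1", "value": 105},
    {"sensor": "s2", "value": 64},
])
atomic_store = []             # raw records, kept for later ad-hoc queries
aggregates = Counter()        # pre-defined aggregates (records per sensor)
alerts = []                   # output of the event monitors
THRESHOLD = 100               # assumed alert threshold

def process(record):
    """Real-time ETL step: store the record, update aggregates, check monitors."""
    atomic_store.append(record)
    aggregates[record["sensor"]] += 1
    if record["value"] > THRESHOLD:
        alerts.append(record)

while ingest_queue:
    process(ingest_queue.popleft())

print(dict(aggregates), alerts)
```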
23. Parting Thought
Polyglot Persistence – “where any decent sized
enterprise will have a variety of different data storage
technologies for different kinds of data. There will still
be large amounts of it managed in relational stores,
but increasingly we'll be first asking how we want to
manipulate the data and only then figuring out what
technology is the best bet for it.”
-- Martin Fowler