Big Data Warehousing Meetup: Intro to NoSQL databases

Sponsored By:
Big Data Warehousing Meetup
Today’s Topic: Introduction to
NoSQL with 10Gen

WELCOME!
Joe Caserta
Founder & President, Caserta Concepts

Agenda
7:00 Networking
Grab a slice of pizza and a drink...
7:15 Welcome
Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit
About the Meetup and about Caserta Concepts
7:30 Intro to NoSQL
Elliott Cordo, Principal Consultant, Caserta Concepts
7:50 MongoDB
Mike O’Brian, 10Gen
8:10 - 9:00 More Networking
Tell us what you’re up to…

About BDW Meetup
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for
like-minded data nerds
• Opportunities to collaborate on exciting
projects
• Next BDW Meetup: June 10.
• Topic: TBD (What would you like to see?)
Send ideas to joe@casertaconcepts.com

About Caserta Concepts
Founded in 2001
• President: Joe Caserta, industry thought leader,
consultant, educator and co-author, The Data
Warehouse ETL Toolkit (Wiley, 2004)
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education

Client Portfolio
Finance
& Insurance
Retail/eCommerce
& Manufacturing
Education
& Services

Expertise & Offerings
Strategic Roadmap/
Assessment/Consulting
Database
BI/Visualization/
Analytics
Master Data Management
Big Data
Analytics
Storm

Opportunities
Does this word cloud excite you?
Speak with us about our open positions: jobs@casertaconcepts.com

Contacts
Joe Caserta
President & Founder, Caserta Concepts
P: (855) 755-2246 x227
E: joe@casertaconcepts.com
Dana Canavan
Director, Sales & Marketing
P: (855) 755-2246 x226
E: dana@casertaconcepts.com
Elliott Cordo
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com

ANALYZING DATA: INTRO TO NOSQL
Elliott Cordo

So... No More SQL?
• Relational databases still have their place
• Flexible/General Purpose
• Rich Query Syntax
• Familiar
• However, there are some interesting alternatives for
analytic databases
• Columnar/Key Value
• Document
• Graph
• P.S. Many NoSQL databases have SQL-like interfaces
Think Not Only SQL!

Why are we doing this?
Not all data is efficiently stored in a relational DB.
• Sparse Data
• Data with a lot of variation
• Relationships -> funny how relational databases are not
great at relations

Scale and Performance
Performance:
• Relational databases have a lot of features and overhead that we
don’t need in many cases, although we will miss some of them…
Scaling:
• Most relational databases scale vertically, which limits
how large they can get. Federation and sharding are
awkward manual processes.
• Most NoSQL databases scale horizontally on commodity hardware
Note: Graph database architecture lends itself to a single graph
existing on one server. Several vendors have overcome this:
Titan, InfiniteGraph.

Object Impedance Mismatch
Relational databases rarely look the way our applications want
them to. So much time is spent assembling and disassembling
relational data.
GetSale
Select * From Sales_Header Join Sales_Detail Join
Sales_Tender Join User Join Order_Type Join
Tender_Type Join Product Join Channel Join
User_Account etc., etc.
CreateSale
Insert into Sales_Header
Insert into Sales_Detail
Insert/Update User_Account
Insert into Sales_Tender
etc., etc.
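As a rough illustration of the assembly work described above, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema, the column names, and the get_sale helper are hypothetical and far simpler than the joins listed on the slide; the point is only that the application object has to be rebuilt from several tables.

```python
import sqlite3

# Hypothetical, heavily simplified schema; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_header (sale_id INTEGER PRIMARY KEY, user_name TEXT, channel TEXT);
    CREATE TABLE sales_detail (sale_id INTEGER, product TEXT, quantity INTEGER);
    INSERT INTO sales_header VALUES (1, 'Bobby', 'Web');
    INSERT INTO sales_detail VALUES (1, 'Chainsaw', 1), (1, 'Hair bows', 2);
""")

def get_sale(sale_id):
    """Reassemble one sale object from its relational pieces."""
    header = conn.execute(
        "SELECT user_name, channel FROM sales_header WHERE sale_id = ?", (sale_id,)
    ).fetchone()
    details = conn.execute(
        "SELECT product, quantity FROM sales_detail WHERE sale_id = ? ORDER BY product",
        (sale_id,),
    ).fetchall()
    # This nested shape is what the application actually wants to work with.
    return {
        "sale_id": sale_id,
        "user": header[0],
        "channel": header[1],
        "items": [{"product": p, "quantity": q} for p, q in details],
    }

sale = get_sale(1)
```

A document store could hold the nested result as a single record, which is the trade-off the following slides discuss.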

But what will we sacrifice?
• NoSQL DBs have fairly simple query languages, with limited
support for the following:
• Joins
• Aggregation
• Secondary indexes
Why? - NoSQL databases were born to be high
performance
• Data is stored as it is to be used (tuned to a query) rather
than modeled around entities. So a sophisticated query
language is not needed.

So what about NoSQL as the Data
Warehouse?
• NoSQL databases are generally not as flexible as relational
databases for ad-hoc questions.
• Secondary indexes provide some flexibility, but the lack of joins
requires denormalization
• Materialized views: joins and aggregates can be implemented
via Map Reduce, even using our animal friends Pig and Hive.
• However, materializing the world has its drawbacks!
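To make the Map Reduce idea concrete, here is a minimal in-memory sketch of an aggregate computed in map, shuffle, and reduce steps. It is plain Python, not Hadoop, Pig, or Hive; the sales records and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sales records; field names are illustrative.
sales = [
    {"state": "NJ", "amount": 20.0},
    {"state": "NY", "amount": 5.0},
    {"state": "NJ", "amount": 7.5},
]

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for record in records:
        yield record["state"], record["amount"]

def shuffle(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's values into one aggregate."""
    return {key: sum(values) for key, values in grouped.items()}

# The result is the "materialized view": total sales per state.
sales_by_state = reduce_phase(shuffle(map_phase(sales)))
```

The output would be written back to the database so that the query becomes a key lookup. Every new question needs another job like this, which is the drawback of materializing everything.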

NoSQL can be a good fit for certain
analytic applications
• High volumes/Low Latency analytic
environments
• Queries are largely known and can be
precomputed in-stream (via the application itself or
Storm) or in batch using Map Reduce
• Cassandra also has counter functions which
can be helpful in pre-computing aggregates.
• Sweet spot is very high volumes with relatively
static analytic requirements.
[Chart: RDBMS vs. NoSQL, compared by Volume and Query Flexibility]
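A minimal sketch of pre-computing an aggregate in-stream, assuming the query ("readings per sensor per minute") is known in advance. It uses an in-memory Counter as a stand-in for a counter column; it does not use the Cassandra or Storm APIs, and the event fields are hypothetical.

```python
from collections import Counter

# Hypothetical event stream; keys and field names are illustrative.
events = [
    {"sensor": "s1", "minute": "2013-02-14T10:00"},
    {"sensor": "s1", "minute": "2013-02-14T10:00"},
    {"sensor": "s2", "minute": "2013-02-14T10:00"},
    {"sensor": "s1", "minute": "2013-02-14T10:01"},
]

counts = Counter()

def on_event(event):
    """Increment the pre-computed aggregate for this event's (sensor, minute) bucket."""
    counts[(event["sensor"], event["minute"])] += 1

for event in events:
    on_event(event)

def readings_per_minute(sensor, minute):
    """Answer the known query with a single key lookup, no scan or join."""
    return counts[(sensor, minute)]
```

Because the aggregate is maintained as data arrives, read latency stays low at any volume, but a question that was not anticipated has no bucket to read from.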

Columnar
• Platforms: Cassandra, HBase
• Column families are the equivalent of a table in an RDBMS
• Primary unit of storage is a column; columns are stored
contiguously
Skinny Rows: Most like a relational database, except that
columns are optional and not stored if omitted.
Wide Rows: Rows can be billions of columns wide; used
for time series, relationships, secondary indexes.
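A minimal sketch of the skinny-row and wide-row ideas, modeling a column family as a dictionary of row key to columns. This is an in-memory illustration only, not the Cassandra or HBase API; the sensor IDs, timestamps, and helper names are hypothetical.

```python
# A column family modeled as: row key -> {column name: value}.
# Skinny rows: a small set of named columns; omitted columns are simply not stored.
users = {
    "bobby": {"email": "bobby@db-lover.com", "state": "NJ"},
    "susie": {"email": "Susie@sql-enthusiast.com"},  # no "state" column stored
}

# Wide row: one row per sensor, one column per timestamp (a time series).
readings = {}

def record_reading(sensor_id, timestamp, value):
    """Add a column named by the timestamp to the sensor's row."""
    readings.setdefault(sensor_id, {})[timestamp] = value

def slice_columns(sensor_id, start, end):
    """Return columns whose names fall in [start, end], in column-name order."""
    row = readings.get(sensor_id, {})
    return [(ts, row[ts]) for ts in sorted(row) if start <= ts <= end]

record_reading("sensor-1", "2013-02-14T10:02", 21.5)
record_reading("sensor-1", "2013-02-14T10:00", 21.0)
record_reading("sensor-1", "2013-02-14T10:01", 21.2)
window = slice_columns("sensor-1", "2013-02-14T10:00", "2013-02-14T10:01")
```

The sketch sorts column names on every read for simplicity; a real columnar store would keep them ordered on disk so that a range slice is a contiguous read.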

Document
• Platforms: MongoDB, CouchDB
• Collections are the equivalent of a table in an RDBMS
• Primary unit of storage is a document
{ "User": "Bobby",
  "Email": "bobby@db-lover.com",
  "Channel": "Web",
  "State": "NJ" }
{ "User": "Susie",
  "Email": "Susie@sql-enthusiast.com",
  "PreferredCategories": [
    { "Category": "Fashion",
      "CategoryAdded": "2012-01-01" },
    { "Category": "Outdoor Equipment",
      "CategoryAdded": "2013-01-01" } ],
  "Channel": "In-Store" }
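A minimal sketch of how documents with different shapes can live in one collection and still be queried. It uses only Python's json module and a list as a stand-in for a collection; the find and users_with_category helpers are hypothetical and are not the MongoDB or CouchDB API.

```python
import json

# The two example documents from the slide, written as valid JSON.
collection = [json.loads(text) for text in (
    '{"User": "Bobby", "Email": "bobby@db-lover.com", "Channel": "Web", "State": "NJ"}',
    '{"User": "Susie", "Email": "Susie@sql-enthusiast.com", "Channel": "In-Store",'
    ' "PreferredCategories": [{"Category": "Fashion", "CategoryAdded": "2012-01-01"},'
    ' {"Category": "Outdoor Equipment", "CategoryAdded": "2013-01-01"}]}',
)]

def find(docs, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

def users_with_category(docs, category):
    """Match on a field nested inside an optional array of sub-documents."""
    return [
        d["User"] for d in docs
        if any(c["Category"] == category for c in d.get("PreferredCategories", []))
    ]
```

For example, find(collection, Channel="Web") matches only Bobby's document, and users_with_category(collection, "Fashion") matches only Susie, even though Bobby's document has no PreferredCategories field at all.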

Graph
• Platforms: Neo4j, Titan
• Relationships are front and center! Relationships can have properties
of their own.
[Figure: example graph.
People: Bobby, Jillian, Frank, Susie. Products: Hair bows, Chainsaw.
Relationships: Friends; Likes; Likes Profile (Date: 2013-01-01);
Purchased (Date: 2013-02-14, Channel: In-Store); Purchased (Date: 2013-01-31).
Recommendation: Maybe Jillian wants a Chainsaw too!]
Gremlin query language:
• Find all of Frank’s outgoing relationships
• Find all Products related to Jillian
• Find shortest path from Frank to Susie
• Cool collaborative filtering functions too!
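A minimal sketch of two of the traversals listed above (outgoing relationships and shortest path), written in plain Python rather than Gremlin. The edge list is an assumption made for illustration: it reuses names from the slide, but the exact edges in the original figure are not reproduced here.

```python
from collections import deque

# Hypothetical edges for illustration; each edge is
# (source, relationship, target, properties).
edges = [
    ("Frank", "Friends", "Jillian", {}),
    ("Jillian", "Friends", "Bobby", {}),
    ("Bobby", "Friends", "Susie", {}),
    ("Frank", "Purchased", "Chainsaw", {"Date": "2013-02-14", "Channel": "In-Store"}),
    ("Jillian", "Likes", "Hair bows", {}),
]

def outgoing(node):
    """All relationships that start at the given node."""
    return [(rel, dst, props) for src, rel, dst, props in edges if src == node]

def shortest_path(start, goal):
    """Breadth-first search, treating every relationship as undirected."""
    neighbors = {}
    for src, _, dst, _ in edges:
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists
```

Note that the Purchased edge carries its own properties (date and channel), which is the "relationships can have properties" point from the slide. A graph database stores these adjacency links natively, so traversals like this avoid the repeated self-joins a relational schema would need.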

Our Use Case: High Volume Sensor
Analytics
• Ingestion and analytics of Sensor Data
• 6 to 12 BILLION records being ingested daily (average
140k records per second at peak load)!
• Ingested data must be stored to disk and highly available
• Pre-defined aggregates and event monitors must be near
real-time
• Ad-hoc query capabilities required on historical data
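As a quick sanity check on the ingestion figures above, the daily volumes can be converted to per-second rates. This sketch assumes records arrive evenly over a 24-hour day, which is a simplification; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_rate(records_per_day):
    """Average records per second if ingestion were spread evenly over a day."""
    return records_per_day / SECONDS_PER_DAY

low = average_rate(6_000_000_000)    # roughly 69,000 records per second
high = average_rate(12_000_000_000)  # roughly 139,000 records per second
```

The upper figure is consistent with the quoted 140k records per second at the 12 billion per day end of the range.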

How do we hope to accomplish this?
[Architecture diagram. Components: Sensor Data, Kafka, Storm Cluster,
Cassandra Cluster, Hadoop Cluster, Low Latency Analytics, d3.js Analytics.
Data outputs: Atomic data, Aggregates, Event Monitors.]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs atomic data
and derived data needed for analytics
• Real-time analytics are produced from the aggregated
data.
• Higher latency ad-hoc analytics are done in Hadoop
using Pig and Hive
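A minimal sketch of the real-time ETL step described above: each incoming message is stored as atomic data, folded into an aggregate, and checked by an event monitor. This is plain Python with in-memory stand-ins; it does not use the Kafka, Storm, or Cassandra APIs, and the threshold, field names, and process function are assumptions for illustration.

```python
from collections import defaultdict

THRESHOLD = 100.0  # hypothetical alert threshold

atomic_store = []  # stands in for atomic data
aggregates = defaultdict(lambda: {"count": 0, "total": 0.0})  # stands in for aggregates
alerts = []  # stands in for event monitors

def process(message):
    """One real-time ETL step: store the record, update its aggregate, check monitors."""
    atomic_store.append(message)
    bucket = aggregates[message["sensor"]]
    bucket["count"] += 1
    bucket["total"] += message["value"]
    if message["value"] > THRESHOLD:
        alerts.append((message["sensor"], message["value"]))

# A list stands in for the message queue feeding the stream processor.
incoming = [
    {"sensor": "s1", "value": 40.0},
    {"sensor": "s1", "value": 120.0},
    {"sensor": "s2", "value": 10.0},
]
for message in incoming:
    process(message)
```

In the design on the slide, the atomic records would also be available to Hadoop for higher-latency ad-hoc analysis, while the aggregates serve the low-latency dashboards.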

Parting Thought
Polyglot Persistence – “where any decent sized
enterprise will have a variety of different data storage
technologies for different kinds of data. There will still
be large amounts of it managed in relational stores,
but increasingly we'll be first asking how we want to
manipulate the data and only then figuring out what
technology is the best bet for it.”
-- Martin Fowler
