This document discusses scaling MongoDB with sharding through a case study. It provides an introduction to CIGNEX Datamatics, an IT services company, and their MongoDB and Big Data practices. It then outlines a use case of scaling MongoDB to support 7 million users with 8 devices each through sharding, and benchmarking the performance of different shard key choices.
Webinar: Scaling MongoDB through Sharding - A Case Study with CIGNEX Datamatics
1. Scaling MongoDB with Sharding – A Case Study
Presented by Yash Badiani and Rahul Nair
CIGNEX Datamatics Confidential www.cignex.com
2. About CIGNEX Datamatics
A subsidiary of Datamatics Global Services Limited
3. Introduction of Datamatics (DGSL)
• Mission
– Experts in improving Enterprise productivity through Process Engineering & Information Management Solutions
• Key Highlights
– Founded in 1975
– Publicly listed in India
– Annual consolidated revenue of US$100 Million
– Fortune 500 clients
– 4,400+ employees across 22 offices in 9 countries
• Strategic Alliances
4. What Does CIGNEX Datamatics Do?
Since 2000, making Open Source work for the enterprise through adoption and integration to:
• Address business goals
• Increase business velocity
• Lower the cost of doing business
• Reduce TCO
• Gain competitive advantage
Solution areas: Portal Solutions, Content Solutions, Big Data Solutions
400+ implementations worldwide across industries
5. Where We Can Help You
SOLUTIONS
• Portals, User eXperience Platform
– Areas: Intranet, Extranet, Social, Mobile, Collaboration, EAI, SOA
– Technologies: Liferay, Drupal, JBoss, ZK, HTML5, MuleSoft
• Enterprise Content Management Solutions
– Areas: WCM, DM, RM, CMS, DAM, E-Commerce, E-learning, ERP, Imaging
– Technologies: Alfresco, Adobe CQ, Drupal, Magento, JBoss, Moodle, Ephesoft, Liferay
• Big Data: Making Data Work
– Areas: Analytics, DW - BI, Mobile, Social, Log Processing and Analysis, Web / Enterprise / Real-time Search
– Technologies: Hadoop, MongoDB, Neo4j, Flume, Hive, Solr, Pentaho, Jaspersoft
SERVICES
• UI, Development, Integration, Customization, Migration, Testing, Training, Support (24*7)
• Managed Cloud Services - Develop, Deploy, Manage
• VAR/Annual Product Subscription - Liferay, Alfresco, Cloudera Hadoop, MongoDB
• Extended Development Center – Center of Excellence
6. About the Presenters
• Yash Badiani is the Big Data Practice Lead at CIGNEX Datamatics and focuses on Big Data technologies including MongoDB & Hadoop. He has worked extensively on large Data Warehousing & Business Intelligence projects with tools such as Business Objects, Microsoft SQL Server, MicroStrategy, and IBM Cognos.
• Gaurav Khambhala works at CIGNEX Datamatics as Technical Lead. He is the senior member of the PHP Practice at CIGNEX Datamatics and is involved in various technology initiatives like Big Data, where he focuses on integration of PHP with NoSQL sources like MongoDB. He has wide industry experience in software development & management in Open Source technologies such as Drupal & Moodle.
7. Agenda
• CIGNEX Datamatics – Introduction & Offerings
• Use Case & Database Requirements
• Challenges with Traditional Databases
• Why MongoDB?
• Solution
– Approach
– Architecture and Hardware Sizing
• Scaling with Sharding
– Sharding Basics
– Sharding – Choosing the RIGHT Shard Key
– Benchmarking with Results
• Key Takeaways
8. Big Data Practice At CIGNEX Datamatics
Brief Snapshot
• ~40 employee Big Data Practice focused on Hadoop, MongoDB, Neo4j, Solr
• Professionals formally trained / certified from Cloudera and 10gen
• Expertise in Hadoop Eco-System (HBase, Pig, Hive, Flume, Sqoop, Oozie, Zookeeper)
Technology Partnership
• Strong partnerships:
– System Integration partners with Cloudera for CDH
– Global partner with 10gen for MongoDB – multiple webinars on different solutions
9. Our Offerings – Big Data
• Consulting
– Business Analysis
– Technology Evaluation
– Architecture
– Design Framework
– Cluster sizing
– Deployment planning
– Proof-of-Concept
• Implementation
– UI Development
– Application Integration
– Customization
– Migration
– Testing
– Performance Tuning
– Health Check
• Support & Training
– DBA Support
– Application Support
– Enhancements
– 24*7 Production Support (Tier 1/2/3)
– Trainings
– Performance Benchmarking
10. Use Case
• End Users: 7 Million Users, spread across geography
• Devices: 8 devices / user, Home/Office/Anywhere
• App. Layer (Load Balancer): receives high volume of concurrent CRUD requests; routes request traffic to DB cluster
• Data Storage (Database): mongoDB cluster
– Sharding
– Replication with Automatic Failover
– Indexes
11. Database Requirements
• Flexibility in Schema
• High Performance
• Agility in Development & Deployment
• Availability
• Enterprise Level Support
12. Limitations of RDBMS
• Support limited to terabytes
• Manage only Structured Data
• RDBMS doesn’t scale inherently
• Feature rich but slow performance
• Complex to Shard/Partition due to maintenance of schema
• Limitations in scaling High volume of concurrent CRUD
• Specialized Hardware - Vertical Scaling: expensive and difficult to scale
RDBMS can’t manage all dimensions of data with speed & at lower cost.
13. Why MongoDB?
• Flexibility in Schema (Schema free)
– Easy integration
– Ease of schema design
– Document oriented storage
• High Performance (Indexes & Sharding)
– Concurrent CRUD
– Fast Updates
– Write distribution with Sharding
• Agility in Development & Deployment (Driver Support)
– Programming Language drivers
– Shorter Dev cycle
– Faster deployment
• Availability (Replication)
– Automatic failover
– Redundancy
– 100% uptime
• Enterprise Level Support (Strong Community)
– Global Coverage
– 24x7 Support
– Ease of maintenance
14. Solution: Approach
• Schema
– Schema Design
– Collections and Field Definitions
• Database Size
– Document Size
– Total expected data size
• Concurrent Load
– Frequency of CRUD operations
– Read/Write ratio
• Availability
– Automatic Failover
– Replication and Backup
• Indexing
– Working Set
– Access Patterns
• Sharding
– Horizontal Scaling
– Query Performance
• Hardware Sizing
– Cluster sizing
– RAM and Disk storage
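Sizing sketch (not from the slides): the Database Size and Hardware Sizing steps can be roughed out with simple arithmetic. The user and device counts come from the use case and the shard count from the architecture slide. The one-document-per-device model, the 1 KB average document size, and the 25% working-set fraction are assumptions chosen only for illustration.

```python
GIB = 1024 ** 3

def size_cluster(users, devices_per_user, avg_doc_bytes, shards, hot_fraction):
    """Return rough cluster-wide and per-shard size estimates in bytes."""
    documents = users * devices_per_user          # ASSUMPTION: one document per device
    total_bytes = documents * avg_doc_bytes
    per_shard_bytes = total_bytes / shards        # assumes an even spread across shards
    return {
        "documents": documents,
        "total_bytes": total_bytes,
        "per_shard_bytes": per_shard_bytes,
        "working_set_per_shard_bytes": per_shard_bytes * hot_fraction,
    }

estimate = size_cluster(
    users=7_000_000,        # from the use case
    devices_per_user=8,     # from the use case
    avg_doc_bytes=1024,     # ASSUMPTION: hypothetical average document size
    shards=4,               # from the architecture slide
    hot_fraction=0.25,      # ASSUMPTION: hypothetical share of frequently accessed data
)

print(f"documents: {estimate['documents']:,}")
for name in ("total_bytes", "per_shard_bytes", "working_set_per_shard_bytes"):
    print(f"{name}: {estimate[name] / GIB:.1f} GiB")
```

Index size and replication overhead are left out, so real sizing would need more headroom than these figures suggest.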
15. Solution: Architecture
[Architecture diagram]
• App Tier: Load Balancer in front of multiple App Servers, each running a mongos
• Config Servers: mongod instances
• Data Tier: Shard 1, Shard 2, Shard 3, Shard 4; each shard has a mongod Primary, a mongod Secondary, and a mongod Arbiter
• Replica Set for non-sharded collections: mongod Primary, mongod Secondary, mongod Arbiter
• Routed requests from mongos to shards; requests for non-sharded collections routed to the Replica Set
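Configuration sketch (not from the slides): each shard in the diagram is a primary, a secondary, and an arbiter. The Python below only builds the documents that MongoDB's `replSetInitiate` and `addShard` admin commands accept; it does not connect to a server. The replica set name, hostnames, ports, and the `priority` hint are hypothetical.

```python
def replica_set_config(name, primary_host, secondary_host, arbiter_host):
    """Build a replica set configuration with two data members and one arbiter."""
    return {
        "_id": name,
        "members": [
            # A higher priority only makes this member the preferred primary.
            {"_id": 0, "host": primary_host, "priority": 2},
            {"_id": 1, "host": secondary_host},
            # The arbiter votes in elections but stores no data.
            {"_id": 2, "host": arbiter_host, "arbiterOnly": True},
        ],
    }

def add_shard_command(name, data_hosts):
    """Build the addShard command that registers a replica set as a shard."""
    return {"addShard": f"{name}/{','.join(data_hosts)}"}

# Hypothetical hosts for "Shard 1" in the diagram.
config = replica_set_config(
    "shard1", "s1-a.example:27018", "s1-b.example:27018", "s1-arb.example:27018"
)
command = add_shard_command("shard1", ["s1-a.example:27018", "s1-b.example:27018"])
print(config)
print(command)
```

With a driver such as PyMongo, these dictionaries would typically be sent with `client.admin.command(...)`: the configuration to one member of the new replica set, the `addShard` command to a mongos.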
16. Sharding – What is it?
• Distributes a single logical database system across clusters
• Allows a collection to be partitioned across a number of mongod instances (shards)
• Advantages:
– Increases write capacity
– Ability to support larger working sets
– Raises limits of data size beyond a single node
17. Sharding - Features
• Range-based Data Partitioning
• Automatic Data volume distribution
• Transparent query routing
• Horizontal capacity
– Additional write capacity through distribution
– Right shard key allows expansion of working set
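Routing sketch (not from the slides): range-based partitioning and transparent query routing can be pictured as a sorted list of split points with one shard assigned to each key range (chunk). The class below is a simplified stand-in for what a router such as mongos does, not its actual implementation. The split points and shard names are hypothetical.

```python
from bisect import bisect_left, bisect_right

class ChunkRouter:
    """Maps shard-key values to shards using sorted split points.

    Chunk i covers keys in [split_points[i-1], split_points[i]):
    the lower bound is inclusive and the upper bound exclusive.
    """

    def __init__(self, split_points, shard_for_chunk):
        if len(shard_for_chunk) != len(split_points) + 1:
            raise ValueError("need exactly one shard per chunk")
        self.split_points = list(split_points)
        self.shard_for_chunk = list(shard_for_chunk)

    def route(self, key):
        """Return the single shard that owns this key (a targeted operation)."""
        return self.shard_for_chunk[bisect_right(self.split_points, key)]

    def route_range(self, low, high):
        """Return the shards owning any key with low <= key < high."""
        first = bisect_right(self.split_points, low)
        last = bisect_left(self.split_points, high)
        return set(self.shard_for_chunk[first:last + 1])

# Hypothetical layout: four chunks, one per shard.
router = ChunkRouter([1000, 2000, 3000], ["shard1", "shard2", "shard3", "shard4"])
print(router.route(999))              # key below the first split point
print(router.route(1000))             # a split point belongs to the upper chunk
print(router.route_range(500, 1500))  # a range spanning two chunks
```

A query that includes the shard key resolves to one shard or a few, while a query without it would have to be sent to every shard.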
18. Sharding – When to use?
• Storage Drive: Your data set approaches or exceeds the storage capacity of a single node in your system
• RAM Working Set: The size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system
• Storage Drive: Your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention
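Checklist sketch (not from the slides): the three conditions can be written as a small check. The 80% headroom threshold used for "approaches" is an assumption, and the sample figures are hypothetical.

```python
GIB = 1024 ** 3

def sharding_signals(data_bytes, disk_bytes, working_set_bytes, ram_bytes,
                     peak_writes_per_sec, max_writes_per_sec, headroom=0.8):
    """Return which of the three conditions hold for a single node.

    headroom is an ASSUMPTION: a resource counts as 'approaching' its
    limit once usage reaches this fraction of capacity.
    """
    signals = []
    if data_bytes >= headroom * disk_bytes:
        signals.append("storage")
    if working_set_bytes >= headroom * ram_bytes:
        signals.append("ram")
    if peak_writes_per_sec >= headroom * max_writes_per_sec:
        signals.append("writes")
    return signals

# Hypothetical node: plenty of disk, but the working set is close to 8 GiB of RAM.
signals = sharding_signals(
    data_bytes=100 * GIB, disk_bytes=1000 * GIB,
    working_set_bytes=7 * GIB, ram_bytes=8 * GIB,
    peak_writes_per_sec=400, max_writes_per_sec=6000,
)
print(signals)
```

The "writes" signal covers only raw throughput. The slide's further condition, that all other approaches have not reduced contention, still calls for judgment.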
19. Shard Keys
• Shard Keys:
– Exist in every document in a collection; MongoDB uses them to distribute documents among the shards
– Like indexes, they can be either a single field or a compound key
• The ideal shard key:
– Easily divisible, which makes it easy for MongoDB to distribute content among the shards
– Higher “randomness”
– Targeted queries
– May need to be computed
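Key sketch (not from the slides): "may need to be computed" can mean deriving a key field before insert. The example derives a YearMonth value from a date and pairs it with UserId, the compound key named as Approach 2 in this deck. The field spellings, the integer encoding of YearMonth, and the namespace are assumptions. The last function only builds the document for MongoDB's `shardCollection` admin command and does not run it.

```python
from datetime import date

def year_month(day):
    """Encode a date as a coarse, ascending integer such as 201304."""
    return day.year * 100 + day.month

def with_shard_key(doc, created_on):
    """Return a copy of the document with the computed YearMonth field added."""
    keyed = dict(doc)
    keyed["YearMonth"] = year_month(created_on)
    return keyed

def shard_collection_command(namespace):
    """Build a shardCollection command using a compound key."""
    return {"shardCollection": namespace, "key": {"YearMonth": 1, "UserId": 1}}

# Hypothetical document and namespace.
doc = with_shard_key({"UserId": 4211337, "device": "tablet"}, date(2013, 4, 9))
command = shard_collection_command("appdb.devices")
print(doc)
print(command)
```

Because the key exists in every document, queries that include YearMonth and UserId can be targeted to a single shard.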
20. Choosing Right Shard Key
Different approaches for Shard Keys
• Approach 1: Random Key – UserId
• Approach 2: Coarsely ascending key + Random Key – YearMonth + UserId
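Locality sketch (not from the slides): one plausible way to compare the two approaches is to count how many key ranges (chunks) the most recent month of writes touches. This is a toy model with equal-sized chunks and synthetic data. It does not reproduce the benchmark or MongoDB's real chunk splitting and balancing. The months, document counts, and chunk size are hypothetical.

```python
import random
from bisect import bisect_right

def chunks_touched(all_keys, recent_keys, docs_per_chunk):
    """Count chunks hit by recent_keys when all_keys is cut into equal-sized chunks."""
    ordered = sorted(all_keys)
    split_points = ordered[docs_per_chunk::docs_per_chunk]
    touched = {bisect_right(split_points, key) for key in recent_keys}
    return len(touched), len(split_points) + 1

def simulate(months, docs_per_month, docs_per_chunk, users=7_000_000, seed=42):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    docs = [(month, rng.randrange(users)) for month in months for _ in range(docs_per_month)]
    latest = months[-1]
    recent = [d for d in docs if d[0] == latest]
    return {
        # Approach 1: key is UserId only.
        "approach1": chunks_touched([u for _, u in docs], [u for _, u in recent], docs_per_chunk),
        # Approach 2: key is (YearMonth, UserId); tuples sort by month first.
        "approach2": chunks_touched(docs, recent, docs_per_chunk),
    }

MONTHS = [201301 + i for i in range(12)]  # hypothetical year of data
result = simulate(MONTHS, docs_per_month=2000, docs_per_chunk=200)
for name, (touched, total) in result.items():
    print(f"{name}: latest month touches {touched} of {total} chunks")
```

In this model the UserId-only key scatters the latest writes across nearly the whole key range, while the compound key confines them to the chunks of the current month and still spreads them across several chunks. A smaller set of active ranges would fit more easily in RAM.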
22. Results - INSERTS
• Approach 1: Over 80 million documents inserted with a decreasing threshold over 10 million
• Approach 2: Over 225 million documents inserted at a stable rate of 6000 documents/sec
Benchmarks done on 8GB Test H/W Machines
23. Results - UPDATES
• Approach 1: Over 50 million documents updated at avg. 400 documents/sec
• Approach 2: Over 100 million documents updated at as high as 4000 documents/sec
Benchmarks done on 8GB Test H/W Machines
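Arithmetic sketch (not from the slides): the reported document counts and rates imply rough run durations. The slides do not state durations. The figures below assume each rate was sustained for the whole run, and since the counts are given as "over N million", they are approximate lower bounds.

```python
def hours_at_rate(documents, docs_per_sec):
    """Hours needed to process `documents` at a constant rate."""
    return documents / docs_per_sec / 3600

runs = {
    "inserts, Approach 2 (225M at 6000/sec)": hours_at_rate(225_000_000, 6000),
    "updates, Approach 1 (50M at avg. 400/sec)": hours_at_rate(50_000_000, 400),
    # 4000/sec is reported as a peak, so this is the shortest possible duration.
    "updates, Approach 2 (100M at up to 4000/sec)": hours_at_rate(100_000_000, 4000),
}

for label, hours in runs.items():
    print(f"{label}: about {hours:.1f} hours")
```

On these assumptions, Approach 2 updates twice as many documents in roughly a fifth of the time that Approach 1 needs.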
24. Results – INSERT, UPDATE
Approach 2
• Simultaneous INSERT: >6000 documents/second, >70 million records
• Simultaneous UPDATE: >6000 documents/second, >50 million records
Benchmarks done on 8GB Test H/W Machines
26. Key Takeaways
• Comprehensive approach on Performance Tuning
• Plan Early for Performance
• MongoDB scales & shines
• Sharding scales INSERTS/UPDATES vs. Non sharding
• Sharding with Approach 2 (Coarsely ascending Key + Random Key) provides sustained results & better utilization of the RAM
• Different set of server/s for NON-Sharded collections
• Indexes to be defined carefully
• Sharded collections to have minimal number of indexes
27. Thank You. Any Questions?
Making Open Source Work
For queries reach out to us at info@cignex.com