1. MongoDB's default (foreground) indexing causes performance problems: it blocks operations on each node while the index builds.
2. Background indexing in newer versions helps on the primary, but all secondaries still build the index simultaneously, slowing the whole replica set at once.
3. The presenter proposes a manual indexing method in which each secondary builds the index independently, distributing the load and preventing a replica-set-wide slowdown.
4. Introduction
Profile
Name: 窪田 博昭 Hiroaki Kubota
Company: Rakuten Inc.
Unit: ACT = Development Unit Architect Group
Mail: hiroaki.kubota@mail.rakuten.com
Hobby: Futsal, Golf
Recent: My physical power has gradually declined...
Twitter: crumbjp
GitHub: crumbjp
5. Introduction
Agenda
• Introduction
• Mongo’s characteristics
• How to take advantage of Mongo for our service
– Our new system “Cockatoo”
– MapReduce
• Structure & Performance
• Performance example (on EC2 large)
• Major problems...
– Indexing
– STALE
– Disk space
– PHP client
• Closing
7. Mongo’s characteristics
Mongo’s ... / Mongo has ... / Mongo is ...
READ performance is extremely good!
WRITE performance is so-so, and does not scale.
READING data immediately after it is WRITTEN works poorly.
Very high availability!
Still under development.
Maintenance tools are poor.
Some operations are awkward.
8. How to take advantage of Mongo
for Infoseek News
9. Our new system “Cockatoo”
(formerly called “Albatross”)
14. Cockatoo structure
[Architecture diagram: requests from the Internet hit the WEB servers, which get the page layout from the SessionDB/LayoutDB (MongoDB ReplSet) and get components from another MongoDB ReplSet, then call the API servers; the API servers use Memcache and retrieve data from the ContentsDB (MongoDB ReplSet).]
15. Cockatoo structure
[Same architecture diagram as slide 14, annotated:]
Mongo’s READ performance is enough to cope with the WEB page views.
But its WRITE performance is not enough.
16. Cockatoo structure
[Same architecture diagram as slide 14.]
17. Cockatoo structure
[Same architecture diagram, with ZooKeeper added on the API tier.]
18. Cockatoo structure
[Same architecture diagram, with ZooKeeper and Solr added on the API tier.]
19. Cockatoo structure
[Diagram of the publishing path: developers write HTML markup and set the page layout, API settings, and deployment in the LayoutDB (MongoDB ReplSet) via the CMS; batch servers set components and insert data through the API servers; static contents are set into the ContentsDB (MongoDB ReplSet).]
24. MapReduce
Our usage
We have never used MapReduce as a regular operation.
However, we have used it for some irregular cases:
• To find invalid articles that should be removed
because of someone’s mistake...
• To count the number of new articles posted per day.
• To count how many times an article has been updated.
• We have started considering using it regularly for
social data analysis before long...
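The "articles posted per day" analysis above is a classic MapReduce shape: emit a (day, 1) pair per article, then sum per key. A minimal sketch in plain Python over sample documents (the field names such as `posted_at` are assumptions for illustration, not the actual Cockatoo schema):

```python
from collections import defaultdict
from datetime import date

# Sample article documents; field names are hypothetical.
articles = [
    {"title": "a", "posted_at": date(2012, 1, 10)},
    {"title": "b", "posted_at": date(2012, 1, 10)},
    {"title": "c", "posted_at": date(2012, 1, 11)},
]

# Map step: emit one (day, 1) pair per article.
mapped = [(doc["posted_at"].isoformat(), 1) for doc in articles]

# Reduce step: sum the emitted values per key.
counts = defaultdict(int)
for day, n in mapped:
    counts[day] += n

print(dict(counts))  # {'2012-01-10': 2, '2012-01-11': 1}
```

The same map and reduce functions, written in JavaScript, would be handed to MongoDB's `mapReduce` command to run server-side.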
26. Structure
We are using very poor machines (virtual machines)!!
• Intel(R) Xeon(R) CPU X5650 2.67 GHz
1 core!!
• 4 GB memory
• 50 GB disk space (iSCSI)
• CentOS 5.5 64-bit
• mongodb 1.8.0
– ReplicaSet 5 nodes (+ 1 Arbiter)
– Oplog size 1.2 GB
– Average object size 1 KB
27. Structure
Researched environment
We’ve also researched the following environments...
• Virtual machine, 1 core
– 1 KB data, 6,000,000 documents
– 8 KB data, 200,000 documents
• Virtual machine, 3 cores
– 1 KB data, 6,000,000 documents
– 8 KB data, 200,000 documents
• EC2 large instance
– 2 KB data, 60,000,000 documents (100 GB)
28. Performance
I found a formula for making a rough estimate of QPS
(1–8 KB documents + 1 unique index)
C = number of CPU cores (Xeon 2.67 GHz)
DD = score of the ‘dd’ command (bytes/sec)
S = document size (bytes)
• GET qps = 4500 × C
• SET (fsync) bytes/s = 0.05 × DD ÷ S
• SET (nsync) qps = 4500, BUT...
there is a chance of STALE
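The rules of thumb above can be wrapped in a small helper. Note these are the presenter's rough estimates for this specific hardware class, not general benchmarks:

```python
def estimate_qps(cores, dd_bytes_per_sec, doc_size_bytes):
    """Rough throughput estimate from the slide's rules of thumb.

    cores:            number of CPU cores (Xeon 2.67 GHz class)
    dd_bytes_per_sec: sequential write score of the 'dd' command
    doc_size_bytes:   average document size (1-8 KB assumed)
    """
    get_qps = 4500 * cores
    set_fsync_bytes = 0.05 * dd_bytes_per_sec          # bytes/sec with fsync
    set_fsync_qps = set_fsync_bytes / doc_size_bytes   # documents/sec
    set_nsync_qps = 4500                               # but risks STALE secondaries
    return get_qps, set_fsync_qps, set_nsync_qps

# Example: 2 cores, dd scores 100 MB/s, 2 KB documents.
get_qps, fsync_qps, nsync_qps = estimate_qps(2, 100_000_000, 2048)
print(get_qps)  # 9000
```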
30. Performance example (on EC2 large)
Environment and amount of data
EC2 large instance
– 2 KB data, 60,000,000 documents (100 GB)
– 1 unique index
Data type
{
  shop: 'someone',
  item: 'something',
  description: 'item explanation sentences...'
}
37. Index problem
Online indexing is completely useless, even in the latest version (2.0.2)
Indexing is a locking operation by default.
The indexing operation can run in the background
on the primary. But...
It CANNOT run in the background on a secondary.
Moreover, all the secondaries’ index builds run
at the same time!!
As a result of the above...
All the slaves freeze! orz...
51. Index problem
According to mongodb.org, this problem will be fixed in 2.1.0.
But that is not formally released yet.
So I checked out the latest source code.
It certainly will be fixed!
Moreover, it sounds like indexing will run in the foreground
when the slave’s status isn’t SECONDARY
(does that mean RECOVERING?)
52. Index problem
Probable 2.1.X indexing
[Diagram: a batch issues the index build (save) on the Primary; the build replicates and all three Secondaries build the index at the same time while clients are connected to them.]
56. Index problem
Background indexing in 2.1.X
But I think it’s not enough.
I think it can still bring failure to our system when
all the secondaries slow down at the same time!!
So...
64. Index problem
But... I can easily guess it’s difficult to apply this to the current Oplog.
It would be great if I could run indexing
manually
on each secondary.
74. Index problem
Manual indexing
[Diagram of the proposal: the batch issues ensureIndex(manual, background) on the Primary, which completes its build; the Secondaries then build the index one at a time, so only one node is slowed down while the others keep serving clients. This needs complete background-operation support. Just in case: beware if the ReplSet has only one Secondary.]
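The rolling build proposed above can be modelled as a simple scheduler: index one secondary at a time, so the replica set never loses more than one reader. This is a hypothetical model of the idea, not MongoDB code:

```python
def rolling_index_build(secondaries):
    """Simulate the proposed manual indexing: build the index on one
    secondary at a time, tracking how many nodes stay available."""
    available_during_build = []
    build_order = []
    for node in secondaries:
        # While `node` is indexing, every other secondary still serves reads.
        serving = [s for s in secondaries if s != node]
        available_during_build.append(len(serving))
        build_order.append(node)  # build finished; move to the next node
    return build_order, min(available_during_build)

order, min_available = rolling_index_build(["sec1", "sec2", "sec3"])
print(order)          # ['sec1', 'sec2', 'sec3']
print(min_available)  # 2 -> never more than one secondary slowed at once
```

Contrast this with the 2.1.X behavior on the earlier slide, where the minimum would be zero because every secondary indexes simultaneously.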
80. Unknown log & the ReplSet going out of control
We often suffered from the Secondaries going out of control...
• Secondaries repeatedly flip status in a moment
between SECONDARY and RECOVERING
(1.8.0)
• Then we found a strange line in the log...
[rsSync] replSet error RS102 too stale to
catch up
81. What’s Stale?
stale [stéil] (level: essential vocabulary) powered by goo.ne.jp
• (of food or drink) not fresh (⇔ fresh);
• flat, (of coffee) having lost its aroma,
• (of bread) dried out, gone hard,
• (of air or a smell) stuffy,
• smelling bad
82. What’s Stale?
[Same dictionary entry as the previous slide.]
Apparently, it’s something very bad indeed...
90–95. Insert & Replication
[Animated diagram sequence: clients issue Insert A, Insert B, Insert C, and Update A against the Primary; each write goes into the Primary’s database and Oplog. The Secondary checks the Primary’s Oplog, finds the operations after its own last applied entry (Insert A), and syncs them, ending up with the same database and Oplog as the Primary.]
97–104. Stale
[Animated diagram sequence: the Primary takes Insert A, Insert B, Insert C, Update A, Update C, and Insert D while the Secondary has only applied Insert A. The Oplog is capped, so the oldest entries have been evicted from the Primary’s Oplog. When the Secondary checks the Primary’s Oplog, it cannot find its last applied operation (Insert A) — it cannot get information about Insert B, so it cannot sync!! This is called STALE, and the node goes into RECOVERING.]
105. Stale
We have to understand the importance of adjusting the oplog size.
We can specify the oplog size as a command-line
option,
but only the first time for a given dbpath
(which is also specified on the command line).
Also, we cannot change the oplog size
without clearing the dbpath.
Be careful!
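The failure mode in the slides above can be reproduced with a tiny ring buffer standing in for the capped oplog (a toy model, not MongoDB internals):

```python
from collections import deque

# A capped oplog keeps only the most recent N operations.
oplog = deque(maxlen=4)
for op in ["Insert A", "Insert B", "Insert C", "Update A", "Update C", "Insert D"]:
    oplog.append(op)

# The secondary last applied "Insert A"; to catch up it must find that
# entry in the primary's oplog. If it has been evicted, the node is stale.
secondary_last_applied = "Insert A"
stale = secondary_last_applied not in oplog
print(list(oplog))  # ['Insert C', 'Update A', 'Update C', 'Insert D']
print(stale)        # True -> RS102 "too stale to catch up"
```

A larger `maxlen` (i.e. a larger oplog) gives a lagging secondary more time before its last applied entry falls out of the window.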
107–115. InitialSync
[Animated diagram sequence: a new node starts up and goes into RECOVERING. It records the Primary’s last Oplog entry (Insert D), then clones the database from the Primary. Writes that arrive during cloning (Insert E, Update B) keep going to the Primary. When cloning completes, the new node checks the Primary’s Oplog from the recorded point, applies the operations it missed, and becomes SECONDARY.]
116. Additional information
From the source code (I’ve never actually tested these...):
A Secondary will try to sync from other Secondaries
when it cannot reach the Primary or
might be stale against the Primary.
There is a small chance that the sync problem does not occur, if another
Secondary has an older Oplog, or a larger Oplog space, than the Primary.
117–120. Sync from another secondary
[Animated diagram sequence: the middle Secondary cannot find its last applied operation (Insert A) in the Primary’s Oplog, but the other Secondary still has Insert A in its larger Oplog. So the stale node is able to sync from that Secondary instead of the Primary.]
124. Disk space
Data fragments sparsely across the DB files...
We met an unfavorable circumstance in our DBs.
It appeared on some of our collections
around 3 months after we launched the service:
db.ourcol.storageSize() = 16200727264 (15 GB)
db.ourcol.totalSize() = 16200809184
db.ourcol.totalIndexSize() = 81920
db.ourcol.dataSize() = 2032300 (2 MB)
What’s happening to them!!
125. Disk space
Data fragments sparsely across the DB files...
It seems to be caused by a specific pattern of operations
that inserts, updates, and deletes over and over.
Anyway, we have to shrink the used disk space regularly,
just like PostgreSQL’s vacuum.
But how?
126. Disk space
Shrinking the used disk space
MongoDB offers some functions for this case,
but we couldn’t use them in our case!
repairDatabase:
Only runnable on the Primary.
It takes a long time and BLOCKS all operations!!
compact:
Only runnable on a Secondary.
It zero-fills the blank space instead of shrinking the disk files,
so it cannot shrink...
127. Disk space
Our measures
For temporary collections:
Issue a drop command regularly.
For other collections:
1. Remove one secondary from the ReplSet.
2. Shut it down.
3. Remove all its DB files.
4. Rejoin it to the ReplSet.
5. Repeat these operations for each secondary, one after another.
6. Step down the Primary (change the Primary node).
7. Finally, do steps 1–4 on the prior Primary.
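A simple guard for deciding when this rolling resync is worth running could compare dataSize() to storageSize(); the 50% threshold is an arbitrary assumption, and the figures come from the fragmented collection on slide 124:

```python
def needs_resync(storage_size, data_size, max_waste_ratio=0.5):
    """Return True when too much of the allocated space is unused.

    storage_size: db.collection.storageSize() in bytes
    data_size:    db.collection.dataSize() in bytes
    max_waste_ratio: hypothetical tolerance for wasted space
    """
    if storage_size == 0:
        return False
    wasted = 1.0 - data_size / storage_size
    return wasted > max_waste_ratio

# Figures from the fragmented collection: 15 GB allocated, 2 MB of data.
print(needs_resync(16200727264, 2032300))  # True (almost all space wasted)
print(needs_resync(1000, 900))             # False
```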
137. PHP client
We tried 1.1.4 and 1.2.2
1.1.4:
There are some critical bugs around the connection pool.
We struggled to invalidate broken connections.
I think you should use 1.2.X instead of 1.1.X.
1.2.2:
The connection pool issues seem to be fixed.
But there are 2 critical bugs!
– Socket handle leak
– Useless sleep
However, this version is relatively stable,
as long as you fix these bugs.
141. Closing
What’s MongoDB?
It has very good READ performance.
We can use Mongo instead of memcached,
if we can accept the limited WRITE performance.
Die hard!
MongoDB keeps high availability even under severe stress.
Easy to use without deep consideration.
We can manage to do anything after getting started.
Let’s forget the awkward trivial things that used to bother us:
How to handle huge data?
How to put in a cache system?
How to keep availability?
And so on...
142. Closing
Keep in mind
Sharding is challenging...
It’s a last resort!
It’s hard to operate; in particular, maintaining the config servers.
[Mongos] is also difficult to keep alive.
I want a way to fail over Mongos.
Mongo is able to run in a poor environment, but...
You should set aside plenty of disk space.
Heavy writes are sensitive:
Adjust the oplog size carefully.
The indexing function is unfinished:
Indexes cannot be built online.