SlideShare a Scribd company logo
1 of 35
Download to read offline
Introduction toIntroduction to
Kyoto ProductsKyoto Products
Kyoto ProductsKyoto Products
● Kyoto Cabinet
● a lightweight database library
– a straightforward implementation of DBM
● Kyoto Tycoon
● a lightweight database server
– a persistent cache server based on Kyoto Cabinet
Kyoto CabinetKyoto Cabinet
­ lightweight database library ­
FeaturesFeatures
● straightforward implementation of DBM
● process­embedded database
● persistent associative array = key­value storage
● successor of QDBM, and sibling of Tokyo Cabinet
– e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB
● C++03 (with TR1) and C++0x porable
– supports Linux, FreeBSD, Mac OS X, Solaris, Windows
● high performance
● insert 1M records / 0.8 sec = 1,250,000 QPS
● search 1M records / 0.7 sec = 1,428,571 QPS
● high concurrency
● multi­thread safe
● read/write locking by records, user­land locking by CAS
● high scalability
● hash and B+tree structure = O(1) and O(log N)
● no actual limit size of a database file (up to 8 exa bytes)
● transaction
● write ahead logging, shadow paging 
● ACID properties
● various database types
● storage selection: on­memory, single file, multiple files in a directory
● algorithm selection: hash table, B+ tree
● utilities: iterator, prefix/regex matching, MapReduce, index database
● other language bindings
● C, Java, Python, Ruby, Perl, Lua, and so on
HashDB: File Hash DatabaseHashDB: File Hash Database
● static hashing
● O(1) time complexity
● jilted dynamic hashing for simplicity 
and performance
● separate chaining
● binary search tree
● balances by the second hash
● free block pool
● best fit allocation
● dynamic defragmentation
● combines mmap and pread/pwrite
● saves calling system calls
● tuning parameters
● bucket number, alignment
● compression
TreeDB: File Tree DatabaseTreeDB: File Tree Database
● B+ tree
● O(log N) time complexity
● custom comparison function
● page caching
● separate LRU lists
● mid­point insertion
● stands on HashDB
● stores pages in HashDB
● succeeds time and space 
efficiency
● cursor
● jump, step, step_back
● prefix/range matching
HashDB vs. TreeDBHashDB vs. TreeDB
time efficiency faster, O(1) slower, O(log N)
space efficiency larger, 16 bytes/record footprint smaller, 3 bytes/record footprint
concurrency higher, locked by the record lower, locked by the page
cursor
fragmentation more incident less incident
random access faster slower
sequential access slower faster
transaction faster slower
memory usage smaller larger
compression less efficient more efficient
tuning points number of buckets, size of mapped memory size of each page, capacity of the page cache
typical use cases
HashDB (hash table) TreeDB (B+ tree)
undefined order
jump by full matching, unable to step back
ordered by application logic
jump by range matching, able to step back
session management of web service
user account database
document database
access counter
cache of content management system
graph/text mining
job/message queue
sub index of relational database
dictionary of words
inverted index of full-text search
temporary storage of map-reduce
archive of many small files
Directory DatabasesDirectory Databases
● DirDB: directory hash database
● respective files in a directory of the file system
● suitable for large records
● strongly depends on the file system performance
– ReiserFS > EXT3 > EXT4 > EXT2 > XFS > NTFS
● ForestDB: directory tree database
● B+ tree stands on DirDB
● reduces the number of files and alleviates the load
● the most scalable approach among all database types
On­memory DatabasesOn­memory Databases
● ProtoDB: prototype database template
● DB wrapper for STL map
● any data structure compatible std::map are available
● ProtoHashDB: alias of ProtoDB<std::unordered_map>
● ProtoTreeDB: alias of ProtoDB<std::map>
● StashDB: stash database
● static hash table with packed serialization
● much less memory usage than std::map and std::unordered_map
● CacheDB: cache hash database
● static hash table with doubly­linked list
● constant memory usage
● LRU (least recent used) records are removed
● GrassDB: cache tree database
● B+ tree stands on CacheDB 
● much less memory usage than CacheDB
performance characteristicsperformance characteristics
class name persistence algorithm complexity sequence lock unit
volatile hash table O(1) undefined
volatile red black tree O(log N) lexical order
volatile hash table O(1) undefined
volatile hash table O(1) undefined
volatile B+ tree O(log N) custom order
persistent hash table O(1) undefined
persistent B+ tree O(log N) custom order
persistent undefined undefined undefined
persistent B+ tree O(log N) custom order
persistent plain text undefined stored order
ProtoHashDB whole (rwlock)
ProtoTreeDB whole (rwlock)
StashDB record (rwlock)
CacheDB record (mutex)
GrassDB page (rwlock)
HashDB record (rwlock)
TreeDB page (rwlock)
DirDB record (rwlock)
ForestDB page (rwlock)
TextDB record (rwlock)
Tuning ParametersTuning Parameters
● Assumption
● stores 100 million records whose average size is 64 bytes
● pursues time efficiency first, and space efficiency second
● the machine has 16GB RAM
● HashDB: suited for usual cases
● opts=TLINEAR, bnum=200M, msiz=12G
● TreeDB: suited for ordered access
● opts=TLINEAR, bnum=10M, msiz=10G, pccap=2G
● StashDB: suited for fast on­memory associative array
● bnum=100M
● CacheDB: suited for cache with LRU deletion 
● bnum=100M, capcnt=100M
● GrassDB: suited for memory saving on­memory associative array
● bnum=10M, psiz=2048, pccap=2G
● DirDB and ForestDB: not suited for small records
Class HierarchyClass Hierarchy
● DB = interface of record operations
● BasicDB = interface of storage operations, mix­in of utilities
– ProtoHashDB, ProtoTreeDB, CacheDB, HashDB, TreeDB, ... 
– PolyDB
● PolyDB: polymorphic database
● dynamic binding to concrete DB types
– "factory method" and "strategy" patterns
● the concrete type is determined when opening
– naming convention
● ProtoHashDB: "­", ProtoTreeDB: "+", CacheDB: "*", GrassDB: "%"
● HashDB: ".kch", TreeDB: ".kct", DirDB: ".kcd", ForestDB: ".kcf"
● TextDB: ".kcx" or ".txt"
C++C++
C++03 & TR1 libsC++03 & TR1 libs
CC JavaJava PythonPython RubyRuby PerlPerl LuaLua etc.etc.
applicationsapplications
ComponentsComponents
ProtoHashDB
ProtoHashDB
ProtoTreeDB
ProtoTreeDB
ProtoDBProtoDB
GrassDBGrassDB
CacheDBCacheDB
TreeDBTreeDB ForestDBForestDB
PolyDBPolyDB
HashDBHashDB DirDBDirDB
IO abstIO abst threading abstthreading abst DB abstDB abst
Operating Systems (POSIX, Win32)Operating Systems (POSIX, Win32)
StashDBStashDB
Abstraction of KVSAbstraction of KVS
● what is "key­value storage" ?
● each record consists of one key and one value
● atomicity is assured for only one record
● records are stored in persistent storage
● so, what?
● every operation can be abstracted by the "visitor" pattern
● the database accepts one visitor for a record at the same time
– lets it read/write the record arbitrary
– saves the computed value
● flexible and useful interface
● provides the "DB::accept" method realizing whatever operations
● "DB::set", "DB::get", "DB::remove", "DB::increment", etc. are built­in 
as wrappers of the "DB::accept"
Visitor InterfaceVisitor Interface
Comparison with Tokyo CabinetComparison with Tokyo Cabinet
● pros
● space efficiency: smaller size of DB file
– footprint/record: TC=22B ­> KC=16B
● parallelism: higher performance in multi­thread environment
– uses atomic operations such as CAS
● portability: non­POSIX platform support
– supports Win32
● usability: object­oriented design
– external cursor, generalization by the visitor pattern
– new database types: StashDB, GrassDB, ForestDB
● robustness: auto transaction and auto recovery
● cons
● time efficiency per thread: due to grained lock
● discarded features: fixed­length database, table database
● dependency on modern C++ implementation: jilted old environments
Example CodesExample Codes
#include <kcpolydb.h>
using namespace std;
using namespace kyotocabinet;
// main routine
int main(int argc, char** argv) {
// create the database object
PolyDB db;
// open the database
if (!db.open("casket.kch", PolyDB::OWRITER | PolyDB::OCREATE)) {
cerr << "open error: " << db.error().name() << endl;
}
// store records
if (!db.set("foo", "hop") ||
!db.set("bar", "step") ||
!db.set("baz", "jump")) {
cerr << "set error: " << db.error().name() << endl;
}
// retrieve a record
string* value = db.get("foo");
if (value) {
cout << *value << endl;
delete value;
} else {
cerr << "get error: " << db.error().name() << endl;
}
// traverse records
DB::Cursor* cur = db.cursor();
cur->jump();
pair<string, string>* rec;
while ((rec = cur->get_pair(true)) != NULL) {
cout << rec->first << ":" << rec->second << endl;
delete rec;
}
delete cur;
// close the database
if (!db.close()) {
cerr << "close error: " << db.error().name() << endl;
}
return 0;
}
#include <kcpolydb.h>
using namespace std;
using namespace kyotocabinet;
// main routine
int main(int argc, char** argv) {
// create the database object
PolyDB db;
// open the database
if (!db.open("casket.kch", PolyDB::OREADER)) {
cerr << "open error: " << db.error().name() << endl;
}
// define the visitor
class VisitorImpl : public DB::Visitor {
// call back function for an existing record
const char* visit_full(const char* kbuf, size_t ksiz,
const char* vbuf, size_t vsiz, size_t *sp) {
cout << string(kbuf, ksiz) << ":" << string(vbuf, vsiz) << endl;
return NOP;
}
// call back function for an empty record space
const char* visit_empty(const char* kbuf, size_t ksiz, size_t *sp) {
cerr << string(kbuf, ksiz) << " is missing" << endl;
return NOP;
}
} visitor;
// retrieve a record with visitor
if (!db.accept("foo", 3, &visitor, false) ||
!db.accept("dummy", 5, &visitor, false)) {
cerr << "accept error: " << db.error().name() << endl;
}
// traverse records with visitor
if (!db.iterate(&visitor, false)) {
cerr << "iterate error: " << db.error().name() << endl;
}
// close the database
if (!db.close()) {
cerr << "close error: " << db.error().name() << endl;
}
return 0;
}
Kyoto TycoonKyoto Tycoon
­ lightweight database server ­
FeaturesFeatures
● persistent cache server based on Kyoto Cabinet
● persistent associative array = key­value storage
● client/server model = multi processes can access one database
● auto expiration
● expiration time can be given to each record
– records are removed implicitly after the expiration time
● "GC cursor" eliminates expired regions gradually
● high concurrency and performance
● resolves "c10k" problem with epoll/kqueue
● 1M queries / 25 sec = 40,000 QPS or more
● high availability
● hot backup, update logging, asynchronous replication
● background snapshot for persistence of on­memory databases
● protocol based on HTTP
● available from most popular languages
● provides access library to encapsulate the protocol
● supports bulk operations also in an efficient binary protocol
● supports "pluggable server" and includes a memcached module
● various database schema
● using the polymorphic database API of Kyoto Cabinet
● almost all operations of Kyoto Cabinet are available
● supports "pluggable database"
● scripting extension
● embeds Lua processors to define arbitrary operations
● supports visitor, cursor, and transaction
● library of network utilities
● abstraction of socket, event notifier, and thread pool
● frameworks of TCP server, HTTP server, and RPC server
NetworkingNetworking
● abstraction of system­dependent mechanisms
● encapsulates threading mechanisms
– thread, mutex, rwlock, spinlock, condition variable, TLS, and atomic integer
– task queue with thread pool
● encapsulates Berkeley socket and Win32 socket
– client socket, server socket, and name resolver
● encapsulates event notifiers
– "epoll" on Linux, "kqueue" on BSD, "select" on other POSIX and Win32
– substitutes for "libevent" or "libev"
● TSV­RPC
● lightweight substitute for XML­RPC
● input data and output data are associative arrays of strings
– serialized as tab separated values
● RESTful interface
● GET for db::get, PUT for db::set, DELETE for db::remove
Example MessagesExample Messages
db::set over TSV-RPC
POST /rpc/set HTTP/1.1
Content-Length: 22
Content-Type: text/tab-separated-values
key japan
value tokyo
HTTP/1.1 200 OK
Content-Length: 0
db::get over TSV-RPC
POST /rpc/get HTTP/1.1
Content-Length: 10
Content-Type: text/tab-separated-values
key japan
HTTP/1.1 200 OK
Content-Length: 12
Content-Type: text/tab-separated-values
value tokyo
db::set over RESTful
PUT /japan HTTP/1.1
Content-Length: 5
Content-Type: application/octet-stream
tokyo
HTTP/1.1 201 Created
Content-Length: 0
db::get over RESTful
GET /japan HTTP/1.1
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: application/octet-stream
tokyo
Threaded TCP ServerThreaded TCP Server
Event Notifier
observed objects
notified objects
Event Notifier
deposit
wait
withdraw
observed events
notified events
Task Queue
add
queue
signal
wait
condition
variable
signal
wait
condition
variable
signalprocess worker thread
signalprocess worker thread
signalprocess worker thread
accept
listen
detect new
connection
register the
server socket
consume each
request
process each
request
undo
reuse or discard
the connection
register a request
from a client
Asynchronous ReplicationAsynchronous Replication
master server
(active)
database
update
logs
client
master server
(standby)
database
update
logs
slave servers
database
slave servers
database
slave servers
database
slave servers
database
one way replication
(master-slave)
two way replication
(dual-master)
load balancing
Update Monitoring by SignalsUpdate Monitoring by Signals
● named condition variables
● waits for a signal of a named condition before each operation
● sends a signal to a named condition after each operation
● task queue
● uses B+ tree and stores records whose keys are time stamps
– every records are sorted in the ascending order of time stamps
● writers store tasks and send signals, readers wait for signals and do
database
(job queue)
task producers
(front-end apps)
task consumers
(workers)
condition
variable
1: store a task 2: send a signal
3: detect a signal
4: retrieve a record
Scripting Extension with LuaScripting Extension with Lua
● prominent time efficiency and space efficiency
● suitable for embedding use
● extreme concurrency
● each native thread has respective Lua instance
● poor functionality, compensated for by C functions
● full set of the Lua binding of Kyoto Cabinet
● pack, unpack, split, codec, and bit operation utilities
Kyoto Tycoon server
Kyoto Cabinet Lua processor
database Lua script
clients
input
- function name
- associative array
output
- status code
- associative array
Local MapReduce in LuaLocal MapReduce in Lua
● for on­demand aggregate calculation
● the "map" and the "reduce" functions defined by users in Lua
● local = not distributed
● fast cache and sort implementation in C++, no overhead for networking
● space saving merge sort by compressed B+ trees
MapReduce
map
emit
reduce
key value
key value
key value
key value
scanning
cache
compressed
B+ tree DB
sort & flush
heap
merge & sort
Kyoto Cabinet (database, threading utilities)Kyoto Cabinet (database, threading utilities)
Operating Systems (POSIX, Win32)Operating Systems (POSIX, Win32)
database serverdatabase server Lua processorLua processordatabase clientdatabase client
TSV-RPC serverTSV-RPC server
threaded TCP serverthreaded TCP server
TSV-RPC clientTSV-RPC client
C++03 & TR1 libsC++03 & TR1 libs
applicationsapplications
ComponentsComponents
client socketclient socket
server socketserver socket event notifierevent notifier
HTTP serverHTTP serverHTTP clientHTTP client
networking utilitiesnetworking utilities
Pluggable ModulesPluggable Modules
HashDB TreeDB
DirDB ForestDB
CacheDB GrassDB
ProtoHashDB ProtoTreeDB
StashDB
PolyDB
pluggable database
shared
library
HTTP
server
pluggable
server
client
Kyoto Tycoon server process
plug-in
shared
library plug-in
update
logger
slave
server
Lua processor
Lua
script
plug-in
slave
agent
Comparison with Tokyo TyrantComparison with Tokyo Tyrant
● pros
● using Kyoto Cabinet
– inherits all pros of Kyoto Cabinet against Tokyo Cabinet
● auto expiration
– easy to use as a cache server
● multiple databases
– one server can manage multiple database instances 
● thorough layering
– database server on TSV­RPC server on HTTP server on TCP server
– easy to implement client libraries
● portability: non­POSIX platform support
– supports Win32 (work­in­progress)
● cons
● discarded features: fixed­length database, table database
Example CodesExample Codes
#include <ktremotedb.h>
using namespace std;
using namespace kyototycoon;
// main routine
int main(int argc, char** argv) {
// create the database object
RemoteDB db;
// open the database
if (!db.open()) {
cerr << "open error: " << db.error().name() << endl;
}
// store records
if (!db.set("foo", "hop") ||
!db.set("bar", "step") ||
!db.set("baz", "jump")) {
cerr << "set error: " << db.error().name() << endl;
}
// retrieve a record
string* value = db.get("foo");
if (value) {
cout << *value << endl;
delete value;
} else {
cerr << "get error: " << db.error().name() << endl;
}
// traverse records
RemoteDB::Cursor* cur = db.cursor();
cur->jump();
pair<string, string>* rec;
while ((rec = cur->get_pair(NULL, true)) != NULL) {
cout << rec->first << ":" << rec->second << endl;
delete rec;
}
delete cur;
// close the database
if (!db.close()) {
cerr << "close error: " << db.error().name() << endl;
}
return 0;
}
kt = __kyototycoon__
db = kt.db
-- retrieve the value of a record
function get(inmap, outmap)
local key = inmap.key
if not key then
return kt.RVEINVALID
end
local value, xt = db:get(key)
if value then
outmap.value = value
outmap.xt = xt
else
local err = db:error()
if err:code() == kt.Error.NOREC then
return kt.RVELOGIC
end
return kt.RVEINTERNAL
end
return kt.RVSUCCESS
end
-- list all records
function list(inmap, outmap)
local cur = db:cursor()
cur:jump()
while true do
local key, value, xt = cur:get(true)
if not key then break end
outmap.key = value
end
return kt.RVSUCCESS
end
Other Kyoto Products?Other Kyoto Products?
● now, planning...
● Kyoto Diaspora?
● inverted index using Kyoto Cabinet
● Kyoto Prominence?
● content management system using Kyoto Cabinet
maintainability is our paramount concern...
FAL LabsFAL Labs
http://fallabs.com/
mailto:info@fallabs.com
Kyoto Products lightweight databases: Kyoto Cabinet and Kyoto Tycoon

More Related Content

What's hot

Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)Ontico
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaFernando Rodriguez
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3Rob Skillington
 
Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1상욱 송
 
Choosing the right high availability strategy
Choosing the right high availability strategyChoosing the right high availability strategy
Choosing the right high availability strategyMariaDB plc
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase HBaseCon
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0HBaseCon
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Аліна Шепшелей
 
Openstack presentation
Openstack presentationOpenstack presentation
Openstack presentationbodepd
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and SparkJosef Adersberger
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoNathaniel Braun
 
NoSQL 동향
NoSQL 동향NoSQL 동향
NoSQL 동향NAVER D2
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 

What's hot (20)

Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and Scala
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3
 
Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1
 
Choosing the right high availability strategy
Choosing the right high availability strategyChoosing the right high availability strategy
Choosing the right high availability strategy
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0
 
Groovy.pptx
Groovy.pptxGroovy.pptx
Groovy.pptx
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
 
Openstack presentation
Openstack presentationOpenstack presentation
Openstack presentation
 
Scala+data
Scala+dataScala+data
Scala+data
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and Spark
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
NoSQL 동향
NoSQL 동향NoSQL 동향
NoSQL 동향
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 

Similar to Kyoto Products lightweight databases: Kyoto Cabinet and Kyoto Tycoon

BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Community
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconPeter Lawrey
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra ExplainedEric Evans
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InSage Weil
 
Start Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeStart Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeYung-Yu Chen
 
第一回MongoDBソースコードリーディング
第一回MongoDBソースコードリーディング第一回MongoDBソースコードリーディング
第一回MongoDBソースコードリーディングnobu_k
 
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course PROIDEA
 
Managing terabytes: When Postgres gets big
Managing terabytes: When Postgres gets bigManaging terabytes: When Postgres gets big
Managing terabytes: When Postgres gets bigSelena Deckelmann
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxXinliShang1
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionDatabricks
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataJihoon Son
 
Managing terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigManaging terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigSelena Deckelmann
 

Similar to Kyoto Products lightweight databases: Kyoto Cabinet and Kyoto Tycoon (20)

BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @Geecon
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Why learn Internals?
Why learn Internals?Why learn Internals?
Why learn Internals?
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Bluestore
BluestoreBluestore
Bluestore
 
Bluestore
BluestoreBluestore
Bluestore
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
Start Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeStart Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New Rope
 
第一回MongoDBソースコードリーディング
第一回MongoDBソースコードリーディング第一回MongoDBソースコードリーディング
第一回MongoDBソースコードリーディング
 
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
 
Managing terabytes: When Postgres gets big
Managing terabytes: When Postgres gets bigManaging terabytes: When Postgres gets big
Managing terabytes: When Postgres gets big
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
NoSQL with MySQL
NoSQL with MySQLNoSQL with MySQL
NoSQL with MySQL
 
Managing terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigManaging terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets big
 

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Kyoto Products lightweight databases: Kyoto Cabinet and Kyoto Tycoon

  • 2. Kyoto ProductsKyoto Products ● Kyoto Cabinet ● a lightweight database library – a straightforward implementation of DBM ● Kyoto Tycoon ● a lightweight database server – a persistent cache server based on Kyoto Cabinet
  • 4. FeaturesFeatures ● straightforward implementation of DBM ● process­embedded database ● persistent associative array = key­value storage ● successor of QDBM, and sibling of Tokyo Cabinet – e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB ● C++03 (with TR1) and C++0x porable – supports Linux, FreeBSD, Mac OS X, Solaris, Windows ● high performance ● insert 1M records / 0.8 sec = 1,250,000 QPS ● search 1M records / 0.7 sec = 1,428,571 QPS
  • 5. ● high concurrency ● multi­thread safe ● read/write locking by records, user­land locking by CAS ● high scalability ● hash and B+tree structure = O(1) and O(log N) ● no actual limit size of a database file (up to 8 exa bytes) ● transaction ● write ahead logging, shadow paging  ● ACID properties ● various database types ● storage selection: on­memory, single file, multiple files in a directory ● algorithm selection: hash table, B+ tree ● utilities: iterator, prefix/regex matching, MapReduce, index database ● other language bindings ● C, Java, Python, Ruby, Perl, Lua, and so on
  • 6. HashDB: File Hash DatabaseHashDB: File Hash Database ● static hashing ● O(1) time complexity ● jilted dynamic hashing for simplicity  and performance ● separate chaining ● binary search tree ● balances by the second hash ● free block pool ● best fit allocation ● dynamic defragmentation ● combines mmap and pread/pwrite ● saves calling system calls ● tuning parameters ● bucket number, alignment ● compression
  • 7. TreeDB: File Tree DatabaseTreeDB: File Tree Database ● B+ tree ● O(log N) time complexity ● custom comparison function ● page caching ● separate LRU lists ● mid­point insertion ● stands on HashDB ● stores pages in HashDB ● succeeds time and space  efficiency ● cursor ● jump, step, step_back ● prefix/range matching
  • 8. HashDB vs. TreeDBHashDB vs. TreeDB time efficiency faster, O(1) slower, O(log N) space efficiency larger, 16 bytes/record footprint smaller, 3 bytes/record footprint concurrency higher, locked by the record lower, locked by the page cursor fragmentation more incident less incident random access faster slower sequential access slower faster transaction faster slower memory usage smaller larger compression less efficient more efficient tuning points number of buckets, size of mapped memory size of each page, capacity of the page cache typical use cases HashDB (hash table) TreeDB (B+ tree) undefined order jump by full matching, unable to step back ordered by application logic jump by range matching, able to step back session management of web service user account database document database access counter cache of content management system graph/text mining job/message queue sub index of relational database dictionary of words inverted index of full-text search temporary storage of map-reduce archive of many small files
  • 9. Directory DatabasesDirectory Databases ● DirDB: directory hash database ● respective files in a directory of the file system ● suitable for large records ● strongly depends on the file system performance – ReiserFS > EXT3 > EXT4 > EXT2 > XFS > NTFS ● ForestDB: directory tree database ● B+ tree stands on DirDB ● reduces the number of files and alleviates the load ● the most scalable approach among all database types
  • 10. On­memory DatabasesOn­memory Databases ● ProtoDB: prototype database template ● DB wrapper for STL map ● any data structure compatible std::map are available ● ProtoHashDB: alias of ProtoDB<std::unordered_map> ● ProtoTreeDB: alias of ProtoDB<std::map> ● StashDB: stash database ● static hash table with packed serialization ● much less memory usage than std::map and std::unordered_map ● CacheDB: cache hash database ● static hash table with doubly­linked list ● constant memory usage ● LRU (least recent used) records are removed ● GrassDB: cache tree database ● B+ tree stands on CacheDB  ● much less memory usage than CacheDB
  • 11. performance characteristicsperformance characteristics class name persistence algorithm complexity sequence lock unit volatile hash table O(1) undefined volatile red black tree O(log N) lexical order volatile hash table O(1) undefined volatile hash table O(1) undefined volatile B+ tree O(log N) custom order persistent hash table O(1) undefined persistent B+ tree O(log N) custom order persistent undefined undefined undefined persistent B+ tree O(log N) custom order persistent plain text undefined stored order ProtoHashDB whole (rwlock) ProtoTreeDB whole (rwlock) StashDB record (rwlock) CacheDB record (mutex) GrassDB page (rwlock) HashDB record (rwlock) TreeDB page (rwlock) DirDB record (rwlock) ForestDB page (rwlock) TextDB record (rwlock)
  • 12. Tuning ParametersTuning Parameters ● Assumption ● stores 100 million records whose average size is 64 bytes ● pursues time efficiency first, and space efficiency second ● the machine has 16GB RAM ● HashDB: suited for usual cases ● opts=TLINEAR, bnum=200M, msiz=12G ● TreeDB: suited for ordered access ● opts=TLINEAR, bnum=10M, msiz=10G, pccap=2G ● StashDB: suited for fast on­memory associative array ● bnum=100M ● CacheDB: suited for cache with LRU deletion  ● bnum=100M, capcnt=100M ● GrassDB: suited for memory saving on­memory associative array ● bnum=10M, psiz=2048, pccap=2G ● DirDB and ForestDB: not suited for small records
  • 13. Class HierarchyClass Hierarchy ● DB = interface of record operations ● BasicDB = interface of storage operations, mix­in of utilities – ProtoHashDB, ProtoTreeDB, CacheDB, HashDB, TreeDB, ...  – PolyDB ● PolyDB: polymorphic database ● dynamic binding to concrete DB types – "factory method" and "strategy" patterns ● the concrete type is determined when opening – naming convention ● ProtoHashDB: "­", ProtoTreeDB: "+", CacheDB: "*", GrassDB: "%" ● HashDB: ".kch", TreeDB: ".kct", DirDB: ".kcd", ForestDB: ".kcf" ● TextDB: ".kcx" or ".txt"
  • 14. C++C++ C++03 & TR1 libsC++03 & TR1 libs CC JavaJava PythonPython RubyRuby PerlPerl LuaLua etc.etc. applicationsapplications ComponentsComponents ProtoHashDB ProtoHashDB ProtoTreeDB ProtoTreeDB ProtoDBProtoDB GrassDBGrassDB CacheDBCacheDB TreeDBTreeDB ForestDBForestDB PolyDBPolyDB HashDBHashDB DirDBDirDB IO abstIO abst threading abstthreading abst DB abstDB abst Operating Systems (POSIX, Win32)Operating Systems (POSIX, Win32) StashDBStashDB
  • 15. Abstraction of KVSAbstraction of KVS ● what is "key­value storage" ? ● each record consists of one key and one value ● atomicity is assured for only one record ● records are stored in persistent storage ● so, what? ● every operation can be abstracted by the "visitor" pattern ● the database accepts one visitor for a record at the same time – lets it read/write the record arbitrary – saves the computed value ● flexible and useful interface ● provides the "DB::accept" method realizing whatever operations ● "DB::set", "DB::get", "DB::remove", "DB::increment", etc. are built­in  as wrappers of the "DB::accept"
  • 17. Comparison with Tokyo CabinetComparison with Tokyo Cabinet ● pros ● space efficiency: smaller size of DB file – footprint/record: TC=22B ­> KC=16B ● parallelism: higher performance in multi­thread environment – uses atomic operations such as CAS ● portability: non­POSIX platform support – supports Win32 ● usability: object­oriented design – external cursor, generalization by the visitor pattern – new database types: StashDB, GrassDB, ForestDB ● robustness: auto transaction and auto recovery ● cons ● time efficiency per thread: due to grained lock ● discarded features: fixed­length database, table database ● dependency on modern C++ implementation: jilted old environments
  • 18. Example CodesExample Codes #include <kcpolydb.h> using namespace std; using namespace kyotocabinet; // main routine int main(int argc, char** argv) { // create the database object PolyDB db; // open the database if (!db.open("casket.kch", PolyDB::OWRITER | PolyDB::OCREATE)) { cerr << "open error: " << db.error().name() << endl; } // store records if (!db.set("foo", "hop") || !db.set("bar", "step") || !db.set("baz", "jump")) { cerr << "set error: " << db.error().name() << endl; } // retrieve a record string* value = db.get("foo"); if (value) { cout << *value << endl; delete value; } else { cerr << "get error: " << db.error().name() << endl; } // traverse records DB::Cursor* cur = db.cursor(); cur->jump(); pair<string, string>* rec; while ((rec = cur->get_pair(true)) != NULL) { cout << rec->first << ":" << rec->second << endl; delete rec; } delete cur; // close the database if (!db.close()) { cerr << "close error: " << db.error().name() << endl; } return 0; } #include <kcpolydb.h> using namespace std; using namespace kyotocabinet; // main routine int main(int argc, char** argv) { // create the database object PolyDB db; // open the database if (!db.open("casket.kch", PolyDB::OREADER)) { cerr << "open error: " << db.error().name() << endl; } // define the visitor class VisitorImpl : public DB::Visitor { // call back function for an existing record const char* visit_full(const char* kbuf, size_t ksiz, const char* vbuf, size_t vsiz, size_t *sp) { cout << string(kbuf, ksiz) << ":" << string(vbuf, vsiz) << endl; return NOP; } // call back function for an empty record space const char* visit_empty(const char* kbuf, size_t ksiz, size_t *sp) { cerr << string(kbuf, ksiz) << " is missing" << endl; return NOP; } } visitor; // retrieve a record with visitor if (!db.accept("foo", 3, &visitor, false) || !db.accept("dummy", 5, &visitor, false)) { cerr << "accept error: " << db.error().name() << endl; } // traverse records with visitor if (!db.iterate(&visitor, false)) { cerr << "iterate error: " << db.error().name() << endl; } // close the database if (!db.close()) { cerr << "close error: " << db.error().name() << endl; } return 0; }
  • 20. FeaturesFeatures ● persistent cache server based on Kyoto Cabinet ● persistent associative array = key­value storage ● client/server model = multi processes can access one database ● auto expiration ● expiration time can be given to each record – records are removed implicitly after the expiration time ● "GC cursor" eliminates expired regions gradually ● high concurrency and performance ● resolves "c10k" problem with epoll/kqueue ● 1M queries / 25 sec = 40,000 QPS or more ● high availability ● hot backup, update logging, asynchronous replication ● background snapshot for persistence of on­memory databases
  • 21. ● protocol based on HTTP ● available from most popular languages ● provides access library to encapsulate the protocol ● supports bulk operations also in an efficient binary protocol ● supports "pluggable server" and includes a memcached module ● various database schema ● using the polymorphic database API of Kyoto Cabinet ● almost all operations of Kyoto Cabinet are available ● supports "pluggable database" ● scripting extension ● embeds Lua processors to define arbitrary operations ● supports visitor, cursor, and transaction ● library of network utilities ● abstraction of socket, event notifier, and thread pool ● frameworks of TCP server, HTTP server, and RPC server
  • 22. NetworkingNetworking ● abstraction of system­dependent mechanisms ● encapsulates threading mechanisms – thread, mutex, rwlock, spinlock, condition variable, TLS, and atomic integer – task queue with thread pool ● encapsulates Berkeley socket and Win32 socket – client socket, server socket, and name resolver ● encapsulates event notifiers – "epoll" on Linux, "kqueue" on BSD, "select" on other POSIX and Win32 – substitutes for "libevent" or "libev" ● TSV­RPC ● lightweight substitute for XML­RPC ● input data and output data are associative arrays of strings – serialized as tab separated values ● RESTful interface ● GET for db::get, PUT for db::set, DELETE for db::remove
  • 23. Example MessagesExample Messages db::set over TSV-RPC POST /rpc/set HTTP/1.1 Content-Length: 22 Content-Type: text/tab-separated-values key japan value tokyo HTTP/1.1 200 OK Content-Length: 0 db::get over TSV-RPC POST /rpc/get HTTP/1.1 Content-Length: 10 Content-Type: text/tab-separated-values key japan HTTP/1.1 200 OK Content-Length: 12 Content-Type: text/tab-separated-values value tokyo db::set over RESTful PUT /japan HTTP/1.1 Content-Length: 5 Content-Type: application/octet-stream tokyo HTTP/1.1 201 Created Content-Length: 0 db::get over RESTful GET /japan HTTP/1.1 HTTP/1.1 200 OK Content-Length: 5 Content-Type: application/octet-stream tokyo
  • 24. Threaded TCP ServerThreaded TCP Server Event Notifier observed objects notified objects Event Notifier deposit wait withdraw observed events notified events Task Queue add queue signal wait condition variable signal wait condition variable signalprocess worker thread signalprocess worker thread signalprocess worker thread accept listen detect new connection register the server socket consume each request process each request undo reuse or discard the connection register a request from a client
  • 25. Asynchronous ReplicationAsynchronous Replication master server (active) database update logs client master server (standby) database update logs slave servers database slave servers database slave servers database slave servers database one way replication (master-slave) two way replication (dual-master) load balancing
  • 26. Update Monitoring by SignalsUpdate Monitoring by Signals ● named condition variables ● waits for a signal of a named condition before each operation ● sends a signal to a named condition after each operation ● task queue ● uses B+ tree and stores records whose keys are time stamps – every records are sorted in the ascending order of time stamps ● writers store tasks and send signals, readers wait for signals and do database (job queue) task producers (front-end apps) task consumers (workers) condition variable 1: store a task 2: send a signal 3: detect a signal 4: retrieve a record
  • 27. Scripting Extension with LuaScripting Extension with Lua ● prominent time efficiency and space efficiency ● suitable for embedding use ● extreme concurrency ● each native thread has respective Lua instance ● poor functionality, compensated for by C functions ● full set of the Lua binding of Kyoto Cabinet ● pack, unpack, split, codec, and bit operation utilities Kyoto Tycoon server Kyoto Cabinet Lua processor database Lua script clients input - function name - associative array output - status code - associative array
  • 28. Local MapReduce in LuaLocal MapReduce in Lua ● for on­demand aggregate calculation ● the "map" and the "reduce" functions defined by users in Lua ● local = not distributed ● fast cache and sort implementation in C++, no overhead for networking ● space saving merge sort by compressed B+ trees MapReduce map emit reduce key value key value key value key value scanning cache compressed B+ tree DB sort & flush heap merge & sort
  • 29. Kyoto Cabinet (database, threading utilities)Kyoto Cabinet (database, threading utilities) Operating Systems (POSIX, Win32)Operating Systems (POSIX, Win32) database serverdatabase server Lua processorLua processordatabase clientdatabase client TSV-RPC serverTSV-RPC server threaded TCP serverthreaded TCP server TSV-RPC clientTSV-RPC client C++03 & TR1 libsC++03 & TR1 libs applicationsapplications ComponentsComponents client socketclient socket server socketserver socket event notifierevent notifier HTTP serverHTTP serverHTTP clientHTTP client networking utilitiesnetworking utilities
  • 30. Pluggable ModulesPluggable Modules HashDB TreeDB DirDB ForestDB CacheDB GrassDB ProtoHashDB ProtoTreeDB StashDB PolyDB pluggable database shared library HTTP server pluggable server client Kyoto Tycoon server process plug-in shared library plug-in update logger slave server Lua processor Lua script plug-in slave agent
  • 31. Comparison with Tokyo TyrantComparison with Tokyo Tyrant ● pros ● using Kyoto Cabinet – inherits all pros of Kyoto Cabinet against Tokyo Cabinet ● auto expiration – easy to use as a cache server ● multiple databases – one server can manage multiple database instances  ● thorough layering – database server on TSV­RPC server on HTTP server on TCP server – easy to implement client libraries ● portability: non­POSIX platform support – supports Win32 (work­in­progress) ● cons ● discarded features: fixed­length database, table database
  • 32. Example CodesExample Codes #include <ktremotedb.h> using namespace std; using namespace kyototycoon; // main routine int main(int argc, char** argv) { // create the database object RemoteDB db; // open the database if (!db.open()) { cerr << "open error: " << db.error().name() << endl; } // store records if (!db.set("foo", "hop") || !db.set("bar", "step") || !db.set("baz", "jump")) { cerr << "set error: " << db.error().name() << endl; } // retrieve a record string* value = db.get("foo"); if (value) { cout << *value << endl; delete value; } else { cerr << "get error: " << db.error().name() << endl; } // traverse records RemoteDB::Cursor* cur = db.cursor(); cur->jump(); pair<string, string>* rec; while ((rec = cur->get_pair(NULL, true)) != NULL) { cout << rec->first << ":" << rec->second << endl; delete rec; } delete cur; // close the database if (!db.close()) { cerr << "close error: " << db.error().name() << endl; } return 0; } kt = __kyototycoon__ db = kt.db -- retrieve the value of a record function get(inmap, outmap) local key = inmap.key if not key then return kt.RVEINVALID end local value, xt = db:get(key) if value then outmap.value = value outmap.xt = xt else local err = db:error() if err:code() == kt.Error.NOREC then return kt.RVELOGIC end return kt.RVEINTERNAL end return kt.RVSUCCESS end -- list all records function list(inmap, outmap) local cur = db:cursor() cur:jump() while true do local key, value, xt = cur:get(true) if not key then break end outmap.key = value end return kt.RVSUCCESS end
  • 33. Other Kyoto Products?Other Kyoto Products? ● now, planning... ● Kyoto Diaspora? ● inverted index using Kyoto Cabinet ● Kyoto Prominence? ● content management system using Kyoto Cabinet