11. UCSC research grant
● “Petascale object storage”
  ● DOE: LANL, LLNL, Sandia
● Scalability
● Reliability
● Performance
  ● Raw IO bandwidth, metadata ops/sec
● HPC file system workloads
  ● Thousands of clients writing to same file, directory
12. Distributed metadata management
● Innovative design
  ● Subtree-based partitioning for locality, efficiency (sketch below)
  ● Dynamically adapt to current workload
  ● Embedded inodes
● Prototype simulator in Java (2004)
● First line of Ceph code
  ● Summer internship at LLNL
  ● High-security national lab environment
  ● Could write anything, as long as it was OSS
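For illustration, a minimal Python sketch of the subtree-partitioning idea: each metadata server (MDS) rank owns whole directory subtrees, and hot subtrees migrate between ranks as load shifts. All names and thresholds here are hypothetical, not Ceph's actual MDS balancer.

```python
HOT_THRESHOLD = 1000  # hypothetical ops count before a subtree is offloaded

class MetadataCluster:
    def __init__(self, num_mds):
        self.num_mds = num_mds
        self.subtree_owner = {"/": 0}   # path -> MDS rank owning that subtree
        self.load = {}                  # path -> recent op count

    def owner(self, path):
        # Longest-prefix match: a path is served by whichever rank
        # owns the deepest enclosing subtree (locality for free).
        node = path
        while node not in self.subtree_owner:
            node = node.rsplit("/", 1)[0] or "/"
        return self.subtree_owner[node]

    def record_op(self, path):
        self.load[path] = self.load.get(path, 0) + 1
        if self.load[path] > HOT_THRESHOLD:
            self.rebalance(path)

    def rebalance(self, path):
        # Delegate the hot subtree to the least-loaded rank; the whole
        # subtree moves as one unit, so related metadata stays together.
        busiest = lambda r: sum(n for p, n in self.load.items()
                                if self.owner(p) == r)
        self.subtree_owner[path] = min(range(self.num_mds), key=busiest)

mds = MetadataCluster(num_mds=4)
for _ in range(1001):
    mds.record_op("/home/alice")      # /home/alice becomes hot...
print(mds.owner("/home/alice/file")) # ...and is now served by another rank
```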
13. The rest of Ceph
● RADOS – distributed object storage cluster (2005)
● EBOFS – local object storage (2004/2006)
● CRUSH – hashing for the real world (2005); toy placement sketch below
● Paxos monitors – cluster consensus (2006)
→ emphasis on consistent, reliable storage
→ scale by pushing intelligence to the edges
→ a different but compelling architecture
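CRUSH's key property is that placement is computed from the cluster map itself, not looked up in a central table, so any client can find an object's OSDs on its own. A toy sketch of that idea using rendezvous (highest-random-weight) hashing; the real CRUSH algorithm adds weighted buckets and failure-domain rules.

```python
import hashlib

def score(obj, osd):
    # Deterministic pseudo-random score for an (object, OSD) pair.
    h = hashlib.md5(f"{obj}:{osd}".encode()).hexdigest()
    return int(h, 16)

def place(obj, osds, replicas=3):
    # Pick the 'replicas' OSDs with the highest score for this object.
    # Every client computes the same answer from the same OSD list.
    return sorted(osds, key=lambda o: score(obj, o), reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(10)]
print(place("rbd_data.1000", osds))   # same placement on every client
```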
15. Industry black hole
● Many large storage vendors
  ● Proprietary solutions that don't scale well
● Few open source alternatives (2006)
  ● Limited community and architecture (Lustre)
  ● Very limited scale, or
  ● No enterprise feature sets (snapshots, quotas)
● PhD grads all built interesting systems...
  ● ...and then went to work for NetApp, DDN, EMC, Veritas.
  ● They want you, not your project
16. A different path
● Change the world with open source
  ● Do what Linux did to Solaris, Irix, Ultrix, etc.
  ● What could go wrong?
● License
  ● GPL, BSD...
  ● LGPL: share changes, okay to link to proprietary code
● Avoid unsavory practices
  ● Dual licensing
  ● Copyright assignment
21. The kernel client
● ceph-fuse was limited, not very fast
● Build native Linux kernel implementation
● Began attending Linux file system developer events (LSF)
  ● Early words of encouragement from ex-Lustre devs
  ● Engage Linux fs developer community as a peer
● Initial merge attempts rejected by Linus
  ● Not sufficient evidence of user demand
  ● A few fans and would-be users chimed in...
● Eventually merged for v2.6.34 (early 2010)
22. Part of a larger ecosystem
● Ceph need not solve all problems as a monolithic stack
● Replaced ebofs object file system with btrfs
  ● Avoid reinventing the wheel
  ● Robust, well-supported, well-optimized
  ● Kernel-level cache management
  ● Same design goals
  ● Copy-on-write, checksumming, other goodness
● Contributed some early functionality
  ● Cloning files
  ● Async snapshots
23. Budding community
● #ceph on irc.oftc.net, ceph-devel@vger.kernel.org
● Many interested users
● A few developers
● Many fans
● Too unstable for any real deployments
● Still mostly focused on right architecture and technical solutions
24. Road to product
● DreamHost decides to build an S3-compatible object storage service with Ceph
● Stability
  ● Focus on core RADOS, RBD, radosgw
  ● Paying back some technical debt
● Build testing automation
● Code review!
● Expand engineering team
25. The reality
● Growing incoming commercial interest
  ● Early attempts from organizations large and small
  ● Difficult to engage with a web hosting company
  ● No means to support commercial deployments
● Project needed a company to back it
  ● Build and test a product
  ● Fund the engineering effort
  ● Support users
● Bryan built a framework to spin out of DreamHost
28. Do it right
● How do we build a strong open source company?
● How do we build a strong open source community?
● Models?
  ● Red Hat, Cloudera, MySQL, Canonical, …
● Initial funding from DreamHost, Mark Shuttleworth
29. Goals
● A stable Ceph release for production deployment
  ● DreamObjects
● Lay foundation for widespread adoption
  ● Platform support (Ubuntu, Red Hat, SUSE)
  ● Documentation
  ● Build and test infrastructure
● Build a sales and support organization
● Expand engineering organization
30. Branding
● Early decision to engage a professional agency
  ● MetaDesign
● Terms like
  ● “Brand core”
  ● “Design system”
● Project vs Company
  ● Shared / Separate / Shared core
  ● Inktank != Ceph
  ● Aspirational messaging: The Future of Storage
34. Traction
● Too many production deployments to count
  ● We don't know about most of them!
● Too many customers (for me) to count
● Growing partner list
● Lots of inbound
● Lots of press and buzz
35. Quality
● Increased adoption means increased demands on robust testing
  ● Across multiple platforms
  ● Including platforms we don't like
● Upgrades
  ● Rolling upgrades
  ● Inter-version compatibility
● Expanding user community + less noise about bugs = a good sign
36. Developer community
● Significant external contributors
● First-class feature contributions from contributors
● Non-Inktank participants in daily Inktank stand-ups
● External access to build/test lab infrastructure
● Common toolset
  ● Email (kernel.org)
  ● GitHub
  ● IRC (oftc.net)
  ● Linux distros
37. CDS: Ceph Developer Summit
● Community process for building the project roadmap
● 100% online
  ● Google hangouts
  ● Wikis
  ● Etherpad
● First was this spring, second is next week
● Great feedback, growing participation
● Indoctrinating our own developers into an open development model
39. Governance
How do we strengthen the project community?
● 2014 is the year
● Might formally acknowledge my role as BDL
● Recognized project leads
  ● RBD, RGW, RADOS, CephFS
● Formalize processes around CDS, community roadmap
● External foundation?
40. Technical roadmap
● How do we reach new use-cases and users?
● How do we better satisfy existing users?
● How do we ensure Ceph can succeed in enough markets for Inktank to thrive?
  ● Enough breadth to expand and grow the community
  ● Enough focus to do well
41. Tiering
● Client-side caches are great, but only buy so much
● Can we separate hot and cold data onto different storage devices?
  ● Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool
  ● Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding)
● How do you identify what is hot and cold? (see the sketch below)
● Common in enterprise solutions; not found in open source scale-out systems
→ key topic at CDS next week
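One hypothetical answer to the hot/cold question: track a per-object "temperature" that rises on access and decays over time, promoting objects above a threshold and demoting those below another. A minimal sketch under those assumptions; this is illustrative only, not Ceph's eventual cache-tiering policy.

```python
import time

HALF_LIFE = 3600.0   # seconds for temperature to halve with no access
HOT, COLD = 10.0, 0.5  # hypothetical promotion/demotion thresholds

class Temperature:
    def __init__(self):
        self.value, self.stamp = 0.0, time.time()

    def touch(self):
        now = time.time()
        # Exponentially decay the old temperature, then credit this access.
        self.value *= 0.5 ** ((now - self.stamp) / HALF_LIFE)
        self.value += 1.0
        self.stamp = now

temps = {}

def on_access(obj):
    t = temps.setdefault(obj, Temperature())
    t.touch()
    if t.value > HOT:
        print(f"promote {obj} to cache pool")

def scrub_cold():
    # Periodic pass: anything that has cooled below COLD gets demoted.
    now = time.time()
    for obj, t in temps.items():
        if t.value * 0.5 ** ((now - t.stamp) / HALF_LIFE) < COLD:
            print(f"demote {obj} to cold pool")
```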
42. Erasure coding
● Replication for redundancy is flexible and fast
● For larger clusters, it can be expensive (worked example below)

                    Storage overhead   Repair traffic   MTTDL (days)
    3x replication        3x                1x            2.3 E10
    RS (10, 4)            1.4x              10x           3.3 E13
    LRC (10, 6, 5)        1.6x              5x            1.2 E15

● Erasure-coded data is hard to modify, but ideal for cold or read-only objects
  ● Cold storage tiering
  ● Will be used directly by radosgw
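The arithmetic behind the RS row of the table, under the usual assumptions: RS(k, m) splits an object into k data chunks plus m coding chunks, and repairing one lost chunk requires reading k surviving chunks.

```python
def erasure_costs(k, m):
    storage_overhead = (k + m) / k   # bytes stored per byte of user data
    repair_traffic = k               # chunks read to rebuild one lost chunk
    return storage_overhead, repair_traffic

oh, rt = erasure_costs(10, 4)
print(f"RS(10,4): {oh:.1f}x storage, {rt}x repair traffic")
# -> RS(10,4): 1.4x storage, 10x repair traffic, matching the table.
# 3x replication stores 3 full copies (3x) but repairs by copying a single
# replica (1x). LRC adds local parity chunks to cut repair traffic.
```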
43. Multi-datacenter, geo-replication
● Ceph was originally designed for single-DC clusters
  ● Synchronous replication
  ● Strong consistency
● Growing demand
  ● Enterprise: disaster recovery
  ● ISPs: replicating data across sites for locality
● Two strategies:
  ● use-case specific: radosgw, RBD
  ● low-level capability in RADOS
44. RGW: Multi-site and async replication
● Multi-site, multi-cluster (illustrative layout below)
  ● Zones: radosgw sub-cluster(s) within a region
  ● Regions: east coast, west coast, etc.
  ● Can federate across same or multiple Ceph clusters
● Sync user and bucket metadata across regions
  ● Global bucket/user namespace, like S3
● Synchronize objects across zones
  ● Within the same region
  ● Across regions
● Admin control over which zones are master/slave
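A hypothetical two-region layout, just to make the region/zone vocabulary concrete; all names are illustrative, and the real configuration is managed with radosgw-admin rather than a literal data structure like this.

```python
# Regions contain zones; one master zone per region handles writes that
# the other zones replicate. Metadata syncs globally, data syncs per zone.
federation = {
    "master_region": "us-east",   # owns the global user/bucket namespace
    "regions": {
        "us-east": {
            "master_zone": "us-east-1",
            "zones": ["us-east-1", "us-east-2"],  # intra-region data sync
        },
        "us-west": {
            "master_zone": "us-west-1",
            "zones": ["us-west-1"],               # cross-region data sync
        },
    },
}
```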
45. RBD: simple DR via snapshots
● Simple backup capability
  ● Based on block device snapshots
  ● Efficiently mirror changes between consecutive snapshots across clusters (sketch below)
● Now supported/orchestrated by OpenStack
● Good for coarse synchronization (e.g., hours)
● Not real-time
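A sketch of the snapshot-mirroring loop using RBD's incremental diff subcommands (export-diff/import-diff are real; the pool, image, snapshot, and cluster names are hypothetical). Only the blocks that changed between the two snapshots cross the wire.

```python
import subprocess

POOL, IMAGE = "volumes", "vm-disk-1"
PREV_SNAP, NEW_SNAP = "backup-0900", "backup-1000"

# 1. Take the new snapshot on the primary cluster.
subprocess.run(
    ["rbd", "snap", "create", f"{POOL}/{IMAGE}@{NEW_SNAP}"], check=True)

# 2. Export only the blocks changed since the previous snapshot and pipe
#    them into the copy on the backup cluster ("-" means stdout/stdin).
export = subprocess.Popen(
    ["rbd", "export-diff", "--from-snap", PREV_SNAP,
     f"{POOL}/{IMAGE}@{NEW_SNAP}", "-"],
    stdout=subprocess.PIPE)
subprocess.run(
    ["rbd", "--cluster", "backup", "import-diff", "-", f"{POOL}/{IMAGE}"],
    stdin=export.stdout, check=True)
export.stdout.close()
export.wait()
```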
46. Async replication in RADOS
● One implementation to capture multiple use-cases
  ● RBD, CephFS, RGW, … RADOS
● A harder problem
  ● Scalable: 1000s of OSDs → 1000s of OSDs
  ● Point-in-time consistency
● Three challenges
  ● Infer a partial ordering of events in the cluster
  ● Maintain a stable timeline to stream from
    – either checkpoints or event stream
  ● Coordinated roll-forward at destination (sketch below)
    – do not apply any update until we know we have everything that happened before it
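A hypothetical sketch of the roll-forward rule: updates arrive out of order from many source shards, each tagged with a timestamp, and an update is applied only once every shard has streamed past that timestamp, so nothing that happened before it can still be missing. An illustration of the idea, not Ceph's implementation.

```python
import heapq

class RollForward:
    def __init__(self, shards):
        self.watermark = {s: 0 for s in shards}  # latest ts seen per shard
        self.pending = []                        # min-heap of (ts, update)

    def receive(self, shard, ts, update):
        self.watermark[shard] = max(self.watermark[shard], ts)
        heapq.heappush(self.pending, (ts, update))
        self.apply_ready()

    def apply_ready(self):
        # Safe point: every shard has reported at least this far, so all
        # updates at or before it are known to be complete.
        safe = min(self.watermark.values())
        while self.pending and self.pending[0][0] <= safe:
            ts, update = heapq.heappop(self.pending)
            print(f"apply @{ts}: {update}")

rf = RollForward(shards=["pg.0", "pg.1"])
rf.receive("pg.0", 5, "write A")  # held: pg.1 may still hold earlier events
rf.receive("pg.1", 7, "write B")  # now everything up to ts 5 is safe
```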
47. CephFS
→ This is where it all started – let's get there
● Today
  ● QA coverage and bug squashing continues
  ● NFS and CIFS now largely complete and robust
● Need
  ● Directory fragmentation
  ● Snapshots
  ● Multi-MDS
  ● QA investment
● Amazing community effort
49. Big data
When will we stop talking about MapReduce?
Why is “big data” built on such a lame storage model?
● Move computation to the data
● Evangelize RADOS classes
● librados case studies and proof points (example below)
● Build a general purpose compute and storage platform
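A minimal example with the real librados Python bindings: talk to the object store directly, no gateway or file system in between. The pool name and config path are assumptions for illustration.

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("data")            # hypothetical pool name
    ioctx.write_full("greeting", b"hello rados")  # store an object
    print(ioctx.read("greeting"))                 # -> b'hello rados'
    # RADOS object classes go a step further: code loaded into the OSDs
    # can read and transform objects in place, moving compute to the data.
    ioctx.close()
finally:
    cluster.shutdown()
```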
50. The enterprise
How do we pay for all our toys?
● Support legacy and transitional interfaces
  ● iSCSI, NFS, pNFS, CIFS
  ● VMware, Hyper-V
● Identify the beachhead use-cases
  ● Only takes one use-case to get in the door
  ● Earn others later
● Single platform – shared storage resource
● Bottom-up: earn the respect of engineers and admins
● Top-down: strong brand and compelling product
51. Why we can beat the old guard
● It is hard to compete with free and open source software
  ● Unbeatable value proposition
  ● Ultimately a more efficient development model
● It is hard to manufacture community
● Strong foundational architecture
● Native protocols, Linux kernel support
  ● Unencumbered by legacy protocols like NFS
  ● Move beyond the traditional client/server model
● Ongoing paradigm shift
  ● Software-defined infrastructure, data center