11. UCSC research grant
●
“Petascale object storage”
●
DOE: LANL, LLNL, Sandia
● Scalability
● Reliability
●
Performance
●
Raw IO bandwidth, metadata ops/sec
● HPC file system workloads
●
Thousands of clients writing to same file, directory
12. Distributed metadata management
●
Innovative design
●
Subtree-based partitioning for locality, efficiency
●
Dynamically adapt to current workload
●
Embedded inodes
●
Prototype simulator in Java (2004)
● First line of Ceph code
●
Summer internship at LLNL
●
High security national lab environment
●
Could write anything, as long as it was OSS
13. The rest of Ceph
●
RADOS – distributed object storage cluster (2005)
● EBOFS – local object storage (2004/2006)
●
CRUSH – hashing for the real world (2005)
● Paxos monitors – cluster consensus (2006)
→ emphasis on consistent, reliable storage
→ scale by pushing intelligence to the edges
→ a different but compelling architecture
15. Industry black hole
●
Many large storage vendors
●
Proprietary solutions that don't scale well
● Few open source alternatives (2006)
●
Very limited scale, or
●
Limited community and architecture (Lustre)
●
No enterprise feature sets (snapshots, quotas)
●
PhD grads all built interesting systems...
●
...and then went to work for NetApp, DDN, EMC, Veritas.
● They want you, not your project
16. A different path
●
Change the world with open source
●
Do what Linux did to Solaris, Irix, Ultrix, etc.
●
What could go wrong?
●
License
●
GPL, BSD...
●
LGPL: share changes, okay to link to proprietary code
●
Avoid unsavory practices
●
Dual licensing
●
Copyright assignment
21. The kernel client
●
ceph-fuse was limited, not very fast
● Build native Linux kernel implementation
●
Began attending Linux file system developer events (LSF)
●
Early words of encouragement from ex-Lustre devs
●
Engage Linux fs developer community as peer
●
Initial merge attempts rejected by Linus
●
Not sufficient evidence of user demand
●
A few fans and would-be users chimed in...
●
Eventually merged for v2.6.34 (early 2010)
22. Part of a larger ecosystem
●
Ceph need not solve all problems as monolithic stack
● Replaced EBOFS object file system with btrfs
●
Same design goals
●
Avoid reinventing the wheel
●
Robust, well-supported, well optimized
●
Kernel-level cache management
●
Copy-on-write, checksumming, other goodness
●
Contributed some early functionality
●
Cloning files
●
Async snapshots
23. Budding community
●
#ceph on irc.oftc.net, ceph-devel@vger.kernel.org
● Many interested users
●
A few developers
● Many fans
●
Too unstable for any real deployments
● Still mostly focused on right architecture and technical
solutions
24. Road to product
●
DreamHost decides to build an S3-compatible object
storage service with Ceph
● Stability
●
Focus on core RADOS, RBD, radosgw
● Paying back some technical debt
●
Build testing automation
●
Code review!
● Expand engineering team
25. The reality
●
Growing incoming commercial interest
●
Early attempts from organizations large and small
●
Difficult to engage with a web hosting company
●
No means to support commercial deployments
● Project needed a company to back it
●
Fund the engineering effort
●
Build and test a product
●
Support users
● Bryan built a framework to spin out of DreamHost
28. Do it right
●
How do we build a strong open source company?
● How do we build a strong open source community?
●
Models?
●
Red Hat, Cloudera, MySQL, Canonical, …
● Initial funding from DreamHost, Mark Shuttleworth
29. Goals
●
A stable Ceph release for production deployment
●
DreamObjects
● Lay foundation for widespread adoption
●
Platform support (Ubuntu, Red Hat, SUSE)
●
Documentation
●
Build and test infrastructure
●
Build a sales and support organization
● Expand engineering organization
30. Branding
●
Early decision to engage professional agency
●
MetaDesign
● Terms like
●
“Brand core”
●
“Design system”
● Project vs Company
●
Shared / Separate / Shared core
●
Inktank != Ceph
● Aspirational messaging: The Future of Storage
34. Traction
●
Too many production deployments to count
●
We don't know about most of them!
● Too many customers (for me) to count
● Growing partner list
●
Lots of inbound
●
Lots of press and buzz
35. Quality
●
Increased adoption means increased demand for robust
testing
●
Across multiple platforms
● Include platforms we don't like
●
Upgrades
●
Rolling upgrades
●
Inter-version compatibility
● Expanding user community + less noise about bugs = a
good sign
36. Developer community
●
Significant external contributors
● First-class feature contributions from contributors
●
Non-Inktank participants in daily Inktank stand-ups
● External access to build/test lab infrastructure
● Common toolset
●
Github
●
Email (kernel.org)
●
IRC (oftc.net)
● Linux distros
37. CDS: Ceph Developer Summit
●
Community process for building project roadmap
● 100% online
●
Google hangouts
●
Wikis
●
Etherpad
● First was this Spring, second is next week
● Great feedback, growing participation
● Indoctrinating our own developers to an open
development model
39. Governance
How do we strengthen the project community?
●
2014 is the year
● Might formally acknowledge my role as BDL
● Recognized project leads
●
RBD, RGW, RADOS, CephFS
● Formalize processes around CDS, community roadmap
● External foundation?
40. Technical roadmap
●
How do we reach new use-cases and users?
● How do we better satisfy existing users?
●
How do we ensure Ceph can succeed in enough markets
for Inktank to thrive?
● Enough breadth to expand and grow the community
● Enough focus to do well
41. Tiering
●
Client side caches are great, but only buy so much.
● Can we separate hot and cold data onto different storage
devices?
●
Cache pools: promote hot objects from an existing pool into a fast
(e.g., FusionIO) pool
●
Cold pools: demote cold data to a slow, archival pool (e.g.,
erasure coding)
●
How do you identify what is hot and cold?
● Common in enterprise solutions; not found in open source
scale-out systems
→ key topic at CDS next week
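One naive answer to the hot/cold question above is to count recent accesses per object and compare against a threshold. The sketch below is only an illustration of that idea, not how Ceph decides: `classify_objects`, `plan_moves`, and `hot_threshold` are hypothetical names, and a real system would also need to age counts over time.

```python
from collections import Counter


def classify_objects(access_log, hot_threshold):
    """Split object names into hot and cold sets by access count.

    access_log: iterable of object names, one entry per read or write.
    hot_threshold: minimum number of accesses for an object to count as hot
    (a hypothetical tuning knob).
    """
    counts = Counter(access_log)
    hot = {name for name, hits in counts.items() if hits >= hot_threshold}
    cold = set(counts) - hot
    return hot, cold


def plan_moves(hot, cold, cache_pool):
    """Decide which objects to promote into, or demote out of, the fast pool.

    cache_pool: set of object names currently held in the fast pool.
    Returns (promote, demote) as two sets of object names.
    """
    promote = hot - cache_pool   # hot objects not yet in the fast pool
    demote = cold & cache_pool   # cold objects still occupying the fast pool
    return promote, demote
```

For example, with the access log `["a", "a", "a", "b", "c", "c"]`, a threshold of 2, and a fast pool currently holding `{"b", "c"}`, the plan is to promote `a` and demote `b`.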
42. Erasure coding
●
Replication for redundancy is flexible and fast
● For larger clusters, it can be expensive
● Erasure coded data is hard to modify, but ideal for cold or
read-only objects
●
Cold storage tiering
●
Will be used directly by radosgw
Scheme          Storage overhead  Repair traffic  MTTDL (days)
3x replication  3x                1x              2.3 E10
RS (10, 4)      1.4x              10x             3.3 E13
LRC (10, 6, 5)  1.6x              5x              1.2 E15
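The storage-overhead figures above follow from simple arithmetic: raw bytes stored divided by bytes of user data. A minimal sketch, assuming RS (10, 4) means 10 data chunks plus 4 parity chunks and LRC (10, 6, 5) means 10 data chunks plus 6 parity chunks with locality 5 (the slide does not spell out the notation); the function names are illustrative:

```python
from fractions import Fraction


def replication_overhead(copies):
    """Raw bytes stored per byte of user data with n-way replication."""
    return Fraction(copies)


def erasure_overhead(data_chunks, parity_chunks):
    """Raw bytes stored per byte of user data for k data + m parity chunks."""
    return Fraction(data_chunks + parity_chunks, data_chunks)


if __name__ == "__main__":
    schemes = {
        "3x replication": replication_overhead(3),
        "RS (10, 4)": erasure_overhead(10, 4),
        # Assumption: 10 data chunks, 6 parity chunks (locality does not
        # change the amount of data stored).
        "LRC (10, 6, 5)": erasure_overhead(10, 6),
    }
    for name, overhead in schemes.items():
        print(f"{name}: {float(overhead):.1f}x")
```

Under those assumptions the computed ratios are 3, 14/10, and 16/10, matching the 3x, 1.4x, and 1.6x in the table. Repair traffic and MTTDL depend on failure models and are not derived here.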
43. Multi-datacenter, geo-replication
●
Ceph was originally designed for single DC clusters
●
Synchronous replication
●
Strong consistency
●
Growing demand
●
Enterprise: disaster recovery
●
ISPs: replicate data across sites for locality
●
Two strategies:
●
use-case specific: radosgw, RBD
●
low-level capability in RADOS
44. RGW: Multi-site and async replication
●
Multi-site, multi-cluster
●
Regions: east coast, west coast, etc.
●
Zones: radosgw sub-cluster(s) within a region
●
Can federate across same or multiple Ceph clusters
●
Sync user and bucket metadata across regions
●
Global bucket/user namespace, like S3
●
Synchronize objects across zones
●
Within the same region
●
Across regions
●
Admin control over which zones are master/slave
45. RBD: simple DR via snapshots
●
Simple backup capability
●
Based on block device snapshots
●
Efficiently mirror changes between consecutive snapshots across
clusters
● Now supported/orchestrated by OpenStack
● Good for coarse synchronization (e.g., hours)
●
Not real-time
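The idea of mirroring only what changed between consecutive snapshots can be illustrated with a toy model. This is a sketch under simplifying assumptions, not RBD's implementation: a snapshot is modeled as a dict from block index to bytes, and `snapshot_diff` and `apply_diff` are hypothetical helper names.

```python
def snapshot_diff(older, newer):
    """Return the blocks that changed between two snapshots.

    Snapshots are modeled as dicts mapping block index -> bytes.
    A value of None in the result marks a block that was discarded.
    """
    diff = {}
    for index, data in newer.items():
        if older.get(index) != data:
            diff[index] = data          # new or rewritten block
    for index in older:
        if index not in newer:
            diff[index] = None          # block no longer present
    return diff


def apply_diff(image, diff):
    """Apply a diff to a copy of the remote image and return the result."""
    result = dict(image)
    for index, data in diff.items():
        if data is None:
            result.pop(index, None)
        else:
            result[index] = data
    return result
```

Only the diff crosses the wire: if the remote cluster already holds the older snapshot, applying the diff reproduces the newer one. How often snapshots are taken sets the granularity of the mirror, which is why this suits coarse synchronization rather than real-time replication.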
46. Async replication in RADOS
●
One implementation to capture multiple use-cases
●
RBD, CephFS, RGW, … RADOS
● A harder problem
●
Scalable: 1000s of OSDs → 1000s of OSDs
●
Point-in-time consistency
● Three challenges
●
Infer a partial ordering of events in the cluster
●
Maintain a stable timeline to stream from
– either checkpoints or event stream
●
Coordinated roll-forward at destination
– do not apply any update until we know we have everything that
happened before it
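The roll-forward rule (apply nothing until everything before it has arrived) can be sketched in a few lines. This simplification assumes every update carries a single global sequence number starting at 1, which sidesteps the harder partial-ordering problem named above; `RollForwardBuffer` is a hypothetical name, not part of RADOS.

```python
class RollForwardBuffer:
    """Buffer replicated updates and apply them only in gap-free order."""

    def __init__(self):
        self.next_seq = 1    # lowest sequence number not yet applied
        self.pending = {}    # seq -> update that arrived ahead of its turn
        self.applied = []    # updates applied so far, in order

    def receive(self, seq, update):
        """Record an update, then apply every update whose predecessors exist."""
        if seq < self.next_seq:
            return  # duplicate of an update that was already applied
        self.pending[seq] = update
        # Drain the contiguous prefix; stop at the first missing sequence number.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

If updates 2 and 4 arrive before 1 and 3, nothing is applied until 1 shows up, at which point 1 and 2 are applied together; 4 waits until 3 arrives. The destination therefore always reflects a consistent prefix of the source's history.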
47. CephFS
→ This is where it all started – let's get there
●
Today
●
QA coverage and bug squashing continues
●
NFS and CIFS now largely complete and robust
●
Need
●
Multi-MDS
●
Directory fragmentation
●
Snapshots
●
QA investment
● Amazing community effort
49. Big data
When will we stop talking about MapReduce?
Why is “big data” built on such a lame storage model?
● Move computation to the data
● Evangelize RADOS classes
●
librados case studies and proof points
● Build a general purpose compute and storage platform
50. The enterprise
How do we pay for all our toys?
●
Support legacy and transitional interfaces
●
iSCSI, NFS, pNFS, CIFS
●
VMware, Hyper-V
●
Identify the beachhead use-cases
●
Only takes one use-case to get in the door
●
Earn others later
●
Single platform – shared storage resource
● Bottom-up: earn respect of engineers and admins
●
Top-down: strong brand and compelling product
51. Why we can beat the old guard
●
It is hard to compete with free and open source software
●
Unbeatable value proposition
●
Ultimately a more efficient development model
●
It is hard to manufacture community
● Strong foundational architecture
●
Native protocols, Linux kernel support
●
Unencumbered by legacy protocols like NFS
●
Move beyond traditional client/server model
●
Ongoing paradigm shift
●
Software defined infrastructure, data center