Architecting an Enterprise Storage
Platform Using Object Stores
© mekuria getinet / www.mekuriageti.net
Niraj Tolia
Chief Architect, Maginatics
@nirajtolia
These gray slides are equivalent to speaker notes
Normally invisible, they are provided for non-presentation settings
Hope they help
A Whirlwind Tour
This presentation provides an end-to-end overview of
MagFS and therefore might not be deep enough in
certain areas
Contact @nirajtolia for Comments, Questions, Flames
Awesome Questions == Awesome T-shirts
Hacker T-shirts were handed out for “awesome”
questions during the SNIA SDC talk.
If you asked one but didn’t get one, get in touch with us
and we will ship one.
If you missed the talk and still want a T-shirt, come to a
future talk or try MagFS out.
80% YoY Growth in
Unstructured Data
41% Growth in IaaS
Systems through 2016
Sources:
Gartner, IT Marketing Clock for Storage, Sep 2011
Gartner, Forecast Overview: Public Cloud Services, Worldwide, 2011-2016, Feb 2013
Data growth is impressive! Requires centralization for
protection, analysis and cost management.
Infrastructure-as-a-Service systems are rapidly growing.
Apart from leveraging new storage paradigms (object
storage) to deal with this data growth, workloads are
migrating and need to use cloud storage. Storage
systems also need to support elastic workloads
(capacity and scale).
MagFS – The File System for the Cloud
Consistent, Elastic, Secure, Mobile-Enabled
Layered on Object Stores
“Software-Defined”
To respond to the earlier trends, we built a system that, at
its core, is a distributed file system
It differs from legacy systems in a number of ways but
primarily with an end-to-end (E2E) security perspective, the
ability to both be elastic and support elastic workloads, by
elevating mobility to a first-class citizen, and by exploiting
object stores
Further, while “software-defined” is an oft-abused buzzword,
MagFS does fit the definition: software-only, packaged as
VMs, with a clean separation of data and control planes
No (Initial) Legacy
Support (NFS/CIFS)
Native Clients: Push
Intelligence to Edges
Strong Consistency w/
Full-Spectrum Caching
Three Early Decisions:
1. No legacy (NFS, CIFS) support on purpose: File systems
must evolve (e.g., dedup, caching, scaling). MagFS
transparently replaces legacy distributed file systems
though.
2. Client agents allow MagFS to push smarts to edges. No
significant IT pushback anymore. Common codebase
reduces development costs.
3. Enable data & metadata caching with strong consistency
File System Design Goals
Low Cost,
High Scale
Intelligent
Clients
Span Devices
and Networks
Support Rapid
Iteration
Design Goals:
1. Deliver scale at a cost-effective point
2. Make clients intelligent: modern computing
platforms have enough horsepower
3. Span server-grade hardware to mobile clients and
from fast to bandwidth-challenged networks
4. Rapidly iterate on our product and add new
features without disruption to users
In-Cloud
File System
NAS Replacement
and Consolidation
Enterprise File
Sharing
Use Cases
MagFS, a general purpose system, is used for many different
use cases. The majority are Tier 2/3 workloads (e.g., home
directories, media, nearline storage, etc.).
In-Cloud File System: Allow unmodified applications to Just
Work™ in the cloud. Provide a distributed file system for
environments where no filer can be racked in.
NAS: Serve as a more cost-effective filer and allow globally
distributed workforces to leverage our WAN optimization.
Enterprise File Sharing: Related to NAS, secure file sharing
that meets compliance and regulatory concerns as MagFS is
a product and not a service.
Object Storage
(public, on-premises, or hybrid)
Data
Metadata
Metadata Servers
Clients
10,000 Foot View
The previous slide presents a very high-level overview
of MagFS
Note the split data and metadata planes: MagFS does
not try to resolve scalability issues already tackled by
the object storage system and therefore will not
intercept data on the fast path
The metadata servers provide a single pane-of-glass for
admins, integrate with native AD or LDAP setups, and
also store encryption keys
Koukouvaya / flickr.com/photos/jackoughton/6535137981/
Heavy (Data) Lifting via Clients
Encryption
Inline Deduplication
Compression
Persistent Data Caching
Bulk Data Transfers
Push a lot of smarts to increasingly-powerful clients
Clients do heavy data lifting: Chunking for deduplication,
encryption, optional compression, on-disk caching, etc.
Available resources generally proportional to
workloads for different device types
Server doesn’t see data on read OR write path!
Cloud Object Storage
Scale Out, Low Cost
Handles Placement + Replication
Tolerates Failures
High Aggregate Performance
Object Storage has a number of very useful properties:
Cost, Commodity, Scale Out (aggregate performance,
fault tolerance, etc.)
We directly expose clients to the object store
Similar to clients, we also push functionality to the
object storage system: data placement and replication,
fault-tolerance, repairs, etc. as we do not want to
reinvent the wheel
Virtualized Metadata Servers
Enforce Strong Consistency
Enforce Authentication and Integrity
Runtime Performance Optimization
Share-level Deduplication
Data Scrubbing & Garbage Collection
The VM-based metadata servers are where consistency and
user authentication are enforced
They also allow clients to dynamically cache read and write
data, lock objects and byte ranges, etc.
Works with clients to prevent duplicated data transfers or
redundant data copies
Data is scrubbed and unused data deleted in the background
Architecture
We will now branch off into details about the client and
server architecture and how they interact with object
storage
Client
Architecture
MagFS supports different Linux, Windows, OS X,
Android, and iOS versions
Majority of code is shared across platforms with
platform-specific glue layers
The next few slides talk about desktop/server platforms
but the same structure applies to all.
Client Architecture
Application
Redirector
(e.g., FUSE)
File System
OS Glue
Data Manager
Metadata Transport
Layer
Local Remote
Userspace
Kernel
Deduplication Encryption Compression
Locking Leases
Traditional platforms have a thin in-kernel redirector (FUSE
on Linux. We ship the equivalent on Windows and OS X)
Modulo glue, the file system layer contains core functionality
Data manager used for local persistent data caching and
optimized remote object store fetches
Metadata transport layer manages the MagFS control plane
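For illustration only, a minimal Python sketch of that layering; the class and method names here are assumptions, not the actual MagFS client interfaces.
class DataManager:
    """Local persistent chunk cache plus direct object-store transfers."""
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
    def cache_get(self, name):
        ...  # look up an encrypted chunk in the on-disk cache
    def cache_put(self, name, encrypted_chunk):
        ...  # persist the encrypted chunk locally (no keys stored)
    def fetch(self, read_uri):
        ...  # HTTPS GET from the object store using a server-signed URI
    def queue_upload(self, write_uri, encrypted_chunk):
        ...  # asynchronous HTTPS PUT to the object store

class MetadataTransport:
    """Control-plane RPCs (opens, writes, leases) to the metadata server."""
    def call(self, op, **args):
        ...  # Thrift-over-HTTPS request/response in the real system

class FileSystemLayer:
    """Core shared logic: chunking, dedup, encryption, locking, leases."""
    def __init__(self, data_manager, metadata):
        self.data, self.meta = data_manager, metadata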
Data Manager
File System Layer
Simplified Write: Deduplication + Encryption
Write Request
Plaintext
Variable-Length
Chunking
Encrypted Text (E)
AES-256 (K)
Object Name (N)
SHA-256
Local Cache Remote Transfer
Encryption Key (K)
SHA-256
Very simple example! In reality, most operations are
not synchronous, are batched, and clients get ack early
Incoming data is broken up into smaller variable-length
chunks for deduplication
Per-chunk encryption used where the per-chunk key is
derived from a cryptographic hash of unencrypted data
Chunk name derived from hash of encrypted data
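A minimal sketch of that per-chunk scheme: key = SHA-256(plaintext), object name = SHA-256(ciphertext). The chunking, cipher mode (GCM), and fixed nonce below are illustrative assumptions; the fixed nonce is tolerable here only because identical plaintext always yields an identical key.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def encrypt_chunk(plaintext: bytes):
    key = hashlib.sha256(plaintext).digest()          # per-chunk key K = SHA-256(plaintext), 32 bytes -> AES-256
    nonce = b"\x00" * 12                              # fixed nonce: an assumption for this sketch
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    name = hashlib.sha256(ciphertext).hexdigest()     # object name N = SHA-256(ciphertext)
    return name, key, ciphertext                      # <N, K> go to the server; <N, E> to cache and cloud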
Data Manager
File System Layer
Simplified Write: Deduplication + Encryption
Write Request
Plaintext
Variable-Length
Chunking
Encrypted Text (E)
AES-256 (K)
Object Name (N)
SHA-256
<File, Offset, N, K>
Optional(<URI>)
Local Cache Remote Transfer
<N, E>
<URI, E>
No Encryption Keys
in the Cloud
No Encryption Keys
in Local Cache
Encryption Key (K)
SHA-256
<E>
Encrypted data (but not key) is written to local cache
Write request with offset, chunk name, and encryption
key is made to the server
If new chunk, a secure write URI is sent to the client
Data manager queues and writes chunk to the cloud
No encryption keys in local cache or object store
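Building on the two preceding sketches, a hedged outline of that write exchange; the RPC name, field names, and reply shape are assumptions, not the actual wire format.
def client_write(meta, data_manager, path, offset, plaintext):
    name, key, ciphertext = encrypt_chunk(plaintext)
    data_manager.cache_put(name, ciphertext)               # encrypted chunk cached locally; the key is not

    # <File, Offset, N, K> goes to the metadata server.
    reply = meta.call("write", file=path, offset=offset, chunk=name, key=key)

    # The server returns a signed write URI only if the chunk is new to the share.
    write_uri = reply.get("write_uri")
    if write_uri:
        data_manager.queue_upload(write_uri, ciphertext)    # asynchronous PUT of <URI, E>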
Data Manager
File System Layer
Simplified Read: Deduplication + Encryption
Read Request
<File, Offset, Range>
Local Cache Remote Transfer
<N, URI>
Encryption Key (K)
<N, K, URI>
Encrypted Text (E)
<E>
<URI>
<E>
<URI>
<E>
Plaintext
AES-256 (K)
Another very simple example. Does not include
metadata caching either.
Server responds to a read request with the chunk
name, decryption key, and secure read URI
A local cache miss causes an object storage fetch.
Encrypted chunk is decrypted using the server-provided
key and unencrypted data returned to the application.
All deduplication and encryption is always transparent
to the application.
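The matching read-side outline, again with illustrative names: the server's <N, K, URI> reply and the cache-miss path mirror the diagram above.
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def client_read(meta, data_manager, path, offset, length):
    reply = meta.call("read", file=path, offset=offset, length=length)
    name, key, read_uri = reply["chunk"], reply["key"], reply["read_uri"]

    ciphertext = data_manager.cache_get(name)
    if ciphertext is None:                                  # local cache miss
        ciphertext = data_manager.fetch(read_uri)           # direct object-store GET
        data_manager.cache_put(name, ciphertext)

    nonce = b"\x00" * 12                                    # must match the write-side convention
    return AESGCM(key).decrypt(nonce, ciphertext, None)     # plaintext returned to the application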
The Client in Real Life Does a Lot More!
• File and Directory Leases (data and metadata caching)
• Asynchronous Operations (including writes)
• Operation Compounding
• Runtime Optimizations (e.g., read ahead)
• Optimizing for High Bandwidth Delay Product (BDP)
• …
There is a separate discussion on leases later when we
talk about how clients and servers optimize
performance at runtime
Object Storage
(public, on-premises, or hybrid)
Data
Metadata
Metadata Servers
Clients
Communication Details
Thrift
(HTTPS)
REST
(HTTPS)
Important: Split Data and Metadata paths (always, not
optional). Clients directly access the object store. MagFS
does not need to scale the data plane.
Client technically speaks REST over HTTPS to the object
store but has no knowledge of the actual API (server-provided URIs)
The MagFS protocol uses Thrift over HTTPS (firewall and
proxy friendly). Enables efficient encoding and easy protocol
extension without breaking compatibility.
Server
Architecture
The next few slides cover how we virtualize file
namespaces, the distributed system deployment, a view
into internals, and a brief overview of leases
Metadata Server Internals
Metadata Storage Layer
Storage Core
Backups
Production Development
GC
Scrubbing
Quotas Dedup Leases Security
HA
MagFS
Ext. Sharing
Multi-Cloud Versioning Offline Mode
Cloud Abstraction Layer
Legend
The metadata server internals have been modularized to
provide both development and runtime agility
For example, adding support for a new object storage
system doesn’t impact the rest of the code
Runtime background operations (e.g., hot backups, garbage
collection, scrubbing) do not impact clients.
The file system protocol is separate from file system-agnostic
features (e.g., quotas, lease, and lock management)
Bootstrapping: Virtualized Namespaces
\\server.example.com\share
HOST FQDN FOLDER
Legacy
\\server.example.com\share
MagFS
Dynamic mapping to host:port
With both Windows UNC paths and NFS server/share
exports, the exported file system would be tied to a
DNS name.
Instead, MagFS virtualizes the access path. Nothing
changes with respect to applications but a virtualized
server:share combination can map to any host:port
This is extremely useful for High Availability Failover
and Disaster Recovery
Discovery Service
Metadata
Server
Metadata
Server (HA)
Metadata
Server
ZooKeeper
ZooKeeper ZooKeeper
Monitoring
Management
Console
Config +
Scheduler
Virtual Filer → Host:Port Mapping
MagFS is a distributed system. It has a number of
backend services: VM and Service Monitoring,
ZooKeeper for server registration and discovery, Admin
management console, job scheduler, AD integration, etc.
Shares are deployed in HA or non-HA configuration.
HA comes with automatic failover.
Clients use a discovery service to map namespace to
server
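As an illustration of that namespace-to-server mapping, a sketch using the kazoo ZooKeeper client; the znode path and value format are assumptions for the example, only the use of ZooKeeper for registration and discovery comes from the slides.
from kazoo.client import KazooClient  # pip install kazoo

def resolve_share(zk_hosts: str, share: str) -> str:
    """Map a virtualized server:share name to the host:port currently serving it."""
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        data, _stat = zk.get(f"/magfs/shares/{share}/active")  # hypothetical znode layout
        return data.decode()                                    # e.g. "10.0.0.12:7443"
    finally:
        zk.stop()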
One of the big challenges in any distributed file system
is the tradeoff between consistency and performance.
In a naïve strongly consistent system, every operation
needs to be centralized on a server. This is obviously
bad for performance.
The MagFS metadata server therefore hands leases out
to clients for data and metadata caching (including
caching writes and updates)
Leases: Performance and Strong Consistency
Lease Types: Read, Write, Handle
Valid File Leases (lease states): Read; Read + Handle;
Read + Write + Handle
Valid Directory Leases
Lease Types: READ allows client to cache reads locally,
WRITE allows local write caching, and HANDLE where
files can be closed and reopened locally
Valid Lease Type combinations are: READ, READ +
HANDLE, READ + WRITE + HANDLE. Others don’t
really apply (e.g., WRITE is exclusive, and READ +
HANDLE come for free if a WRITE lease is held)
MagFS also supports WRITE directory leases
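The valid file-lease combinations listed above fit in a few lines; this sketch encodes only the rules stated on the slide.
from enum import Flag, auto

class Lease(Flag):
    NONE = 0
    READ = auto()
    WRITE = auto()
    HANDLE = auto()

# Per the slide: READ, READ + HANDLE, and READ + WRITE + HANDLE are the valid file-lease states.
VALID_FILE_LEASES = {
    Lease.READ,
    Lease.READ | Lease.HANDLE,
    Lease.READ | Lease.WRITE | Lease.HANDLE,
}

def is_valid_file_lease(state: Lease) -> bool:
    return state in VALID_FILE_LEASES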
Cloud Storage
Interaction
While Maginatics does not provide an Object Storage
system itself, it works with a number of different
products. The next few slides will talk about the
challenges of interoperating with a large number of
systems as well as the technical challenges of layering a
file system on top of them.
Object Storage
(public, on-premises, or hybrid)
Today, MagFS supports a large number of object storage
systems: private and public Swift and Atmos
deployments, AWS S3, public and private S3 clones,
Azure, and others not mentioned here
We are seeing an increasing shift towards vendors
providing S3 and Swift API compatibility layers even if
they originally had their own REST-style protocols
Object Storage systems
are like snowflakes!
MagFS also works hard to address inter-object store
variance and hide the complexity from the end user.
MagFS uses very basic API calls (GET/PUT/DELETE
object/bucket and Signed URLs) and we discovered a
number of differences in vendor implementations
MagFS also optimizes data layout for different object
stores to obtain the best performance. For example,
data layout on S3, Atmos, and Swift differs to match the
underlying platform.
Object Store API Compatibility
Q: Has anyone come across a near 100%
Amazon S3 API compatible object storage
system?
A: It is hard to find a near-100% compatible
product…
-Vendor w/ S3 Compatible Product
Even vendors claiming to support the same API have
differences, bugs, or interpretation differences. For
example, most S3-compatible systems we have added
support for differ from one another (e.g., subsets of the
API supported, differing API interpretations, bugs, etc.).
Swift is similar. The same code cannot be used with both
a generic Swift setup and the public cloud providers that
are based on Swift. Swift authentication (Keystone,
TempAuth, etc.) also differs between vendors.
Object Storage
(public, on-premises, or hybrid)
Data
Metadata
Metadata Servers
Clients
Direct Client Access: Security Problem?
One of the challenges with providing clients direct
object store access is security. There is generally one
(or a few) master API key(s) that can delete or read
arbitrary data.
However, as different MagFS users have different access
rights to files, we should not provide the master key to
clients (even though the data is encrypted).
Further, a malicious client would be able to wipe all data
with the master key!
Request Signing
The solution to providing secure and time-limited data
access to clients is to use Request Signing, a feature
found in all mature object storage systems today.
The next few slides will walk through an example of
how Request Signing works for a write.
Server-Driven Request Signing
SignString = HTTP-Verb + "\n"
+ Content-MD5 + "\n"
+ Content-Type + "\n"
+ Date + "\n"
+ Resource + "\n"
+ ...
Client read or write requests are authorized by the
MagFS server, which shares the master key with the
object storage system
Signing is done by the metadata server, which creates a
request string with its fields in a pre-defined order
Server-Driven Request Signing
SignString = PUT + "\n"
+ Content-MD5 + "\n"
+ Content-Type + "\n"
+ Date + "\n"
+ Resource + "\n"
+ ...
The first component of the signature string is the HTTP
verb used. This would be GET for a read and generally
PUT for a write (some providers like Atmos use POST).
DELETEs are never performed by the client.
Server-Driven Request Signing
SignString = PUT + "\n"
+ 07BzhNET7exJ6qYjitX/AA== + "\n"
+ Content-Type + "\n"
+ Date + "\n"
+ Resource + "\n"
+ ...
The second component is a cryptographic hash of the
data. A number of object storage systems will reject
data whose cryptographic hash doesn’t match the
request. This is useful to protect against TCP errors that
the TCP checksum doesn’t catch, buggy clients, and
even malicious clients.
A common hash algorithm used at this step is MD5 but
some object storage systems are now supporting
stronger cryptographic algorithms
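For reference, the Content-MD5 value in this walkthrough is simply the base64-encoded MD5 digest of the request body, which can be computed as follows.
import base64, hashlib

def content_md5(body: bytes) -> str:
    # e.g. returns a value like "07BzhNET7exJ6qYjitX/AA==" for the matching body
    return base64.b64encode(hashlib.md5(body).digest()).decode()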
Server-Driven Request Signing
SignString = PUT + "\n"
+ 07BzhNET7exJ6qYjitX/AA== + "\n"
+ image/jpeg + "\n"
+ Date + "\n"
+ Resource + "\n"
+ ...
The next component is the content-type of the object.
We are using the JPEG type in this example but, in
MagFS, this would be “application/octet-stream” for all
our objects as they are encrypted binary data.
Server-Driven Request Signing
SignString = PUT + "\n"
+ 07BzhNET7exJ6qYjitX/AA== + "\n"
+ image/jpeg + "\n"
+ Tue, 11 Jun 2013 00:27:41 + "\n"
+ Resource + "\n"
+ ...
Following the content-type, we now add a timestamp
field. This is very useful because it puts a time limit on
this request to prevent replay attacks.
Most object stores place a reasonable time limit on
request validity (e.g., 15 minutes) but a number also
allow configurable values. MagFS supports both.
Server-Driven Request Signing
SignString = PUT + "\n"
+ 07BzhNET7exJ6qYjitX/AA== + "\n"
+ image/jpeg + "\n"
+ Tue, 11 Jun 2013 00:27:41 + "\n"
+ /container/example.jpeg + "\n"
+ ...
The final component in this example is the resource
name and this includes both the container name and
the object name within the container
More options are possible in signature strings and these
options differ from provider to provider
Server-Driven Request Signing
SignString = PUT + "\n"
+ 07BzhNET7exJ6qYjitX/AA== + "\n"
+ image/jpeg + "\n"
+ Tue, 11 Jun 2013 00:27:41 + "\n"
+ /container/example.jpeg + "\n"
+ ...
HMAC-SHA1( , SignString)
Following the construction of the signature string, a
keyed hash message authentication code (HMAC) is
generated using the signature string and the master key
This is a one-way transform and obtaining the HMAC
value does not leak information about the master key
Server-Driven Request Signing
SignString = PUT + "\n"
+ 07BzhNET7exJ6qYjitX/AA== + "\n"
+ image/jpeg + "\n"
+ Tue, 11 Jun 2013 00:27:41 + "\n"
+ /container/example.jpeg + "\n"
+ ...
Signature = Base64(HMAC-SHA1( , SignString))
A Base64 encoded representation (signature) of this
HMAC is sent to the client to prove that this request
was authorized by the server
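Collecting the walkthrough into one runnable sketch (AWS Signature Version 2 style, HMAC-SHA1 over the fields shown above); provider-specific headers and canonicalization rules are omitted.
import base64, hashlib, hmac

def sign_request(master_key: bytes, verb: str, content_md5: str,
                 content_type: str, date: str, resource: str) -> str:
    sign_string = "\n".join([verb, content_md5, content_type, date, resource])
    mac = hmac.new(master_key, sign_string.encode(), hashlib.sha1)
    return base64.b64encode(mac.digest()).decode()

# The metadata server computes this and hands only the signature
# (never the master key) to the client along with the pre-built request.
signature = sign_request(b"<object-store secret key>", "PUT",
                         "07BzhNET7exJ6qYjitX/AA==", "image/jpeg",
                         "Tue, 11 Jun 2013 00:27:41", "/container/example.jpeg")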
Object Storage
(public, on-premises, or hybrid)
Data
Metadata
Metadata Servers
Clients
Safe Direct Client Access via Request Signing
1. Read/Write Request
3. HTTP Request +
Signature +
Encrypted Data
2. HTTP Request + Signature
To summarize, read or write operations not serviced
from the local cache require server authorization
Using the server-provided request and signature, a
client can safely read and write data but only for the
specified object
The object store recalculates the signature based on
the request, compares it to the received signature, and
rejects the request in case of a mismatch (e.g., wrong
HTTP verb, stale/old request, swapped object names)
Dealing with Lost Client Writes
• Clients can lose connectivity or, in the worst case, be malicious
• Naïvely trusting client writes can “corrupt” w/ global dedup
• MagFS server scrubs all writes:
• Client acknowledges write
• Server verifies object existence (object store performed MD5 at PUT)
• Server can also read and verify object data (stronger SHA-256 check)
• The object will be available for deduplication only after scrubbing
MagFS exposes global deduplication and therefore needs to
handle buggy or malicious clients that might have claimed to
have written data but did not
The server therefore waits for a client to acknowledge the
write, checks the object store to verify that the object was
written (implies success for the cryptographic hash check),
and can optionally scrub the data using a stronger
cryptographic hash.
Modulo optimizations for the same client (really user), the
data is only used for deduplication after scrubbing.
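A hedged sketch of that scrub step; whether the existence check and the optional deep read use signed URIs or master credentials is an implementation detail not covered here, and the HTTP calls are illustrative.
import hashlib
import requests  # pip install requests

def scrub_write(head_uri: str, get_uri: str, expected_sha256_name: str,
                deep_check: bool = False) -> bool:
    """Return True only if the object is safe to expose for deduplication."""
    if requests.head(head_uri).status_code != 200:        # object never made it to the store
        return False
    if deep_check:                                         # optional stronger verification
        body = requests.get(get_uri).content
        if hashlib.sha256(body).hexdigest() != expected_sha256_name:
            return False
    return True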
Handling Object Store Eventual Consistency
• Treat objects as immutable (even if modifications are allowed)
• Use content-based names (generated using cryptographic hashes)
• Tombstone names after Garbage Collection
• Suffix generation number to content-based names in case of resurrection
Some object stores have eventually consistent
properties that can lead to interesting read-after-write
behaviors where what you read might not be the most
recent write.
To address this, we treat all objects as immutable, use
content-based names, and use a suffix-based method
to tombstone names so that they are never reused
AWS S3 supporting read-after-first-put consistency in
most regions also really helps with the above scheme
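A small sketch of that naming scheme: content-based names with a generation suffix, so a name tombstoned by garbage collection is never reused even if the same content is written again. The exact name format here is an assumption.
import hashlib

def object_name(ciphertext: bytes, generation: int = 0) -> str:
    return f"{hashlib.sha256(ciphertext).hexdigest()}-g{generation}"

def name_for_new_write(ciphertext: bytes, tombstoned_names: set) -> str:
    """Bump the generation if garbage collection has tombstoned earlier names."""
    gen = 0
    while object_name(ciphertext, gen) in tombstoned_names:
        gen += 1
    return object_name(ciphertext, gen)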
Security
Architecture
In theory, this is where we would discuss MagFS’s
security architecture. However, as you observed,
security is baked into the product at every level and has
been covered throughout the deck. We will therefore
only recap here.
Recap: On-Premises Security Model
• User authentication and permissions derived from native Active
Directory setup
• Encryption keys are never exposed to the cloud
• Data and metadata are always encrypted: At-Rest and In-Flight
Quick point about Active Directory (AD): The fact that
all our user permissions, group membership
information, and other authentication information is
derived from AD makes it very simple for admins, and
using MagFS does not change their workflows.
Slides (with speaker notes) at http://tolia.org
Try MagFS at http://maginatics.com
