www.twosigma.com
Smooth Storage
September 13, 2018
Proprietary and Confidential – Not for Redistribution
A storage system for managing structured time
series data at Two Sigma
Saurabh Goel
saurabh.goel@twosigma.com
Disclaimer
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer
to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon
for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates
(collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without
notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of
such information and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two
Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark
does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
Motivation
• Why have specialized storage for time series data?
  • Extremely common at Two Sigma
  • Time is one of the primary dimensions along which applications want to partition and filter data
  • Scale – in terms of both size and access
  • Optimizing for the target application workload and requirements
Smooth’s design emphasis
• Optimized for range queries and range updates executed in parallel per table
• File-system-like operations with database-like properties such as atomicity and an isolation model for concurrent access
• Centrally managed service at Two Sigma (TS)
• Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
• Storage efficiency is also a major concern given the overall size of data stored
[Figure: a spectrum running from file system to database, with Smooth placed between the two]
Target Application characteristics
• Parallel, time-partitioned jobs that move a lot of data
• Tend to be batch oriented; care more about throughput than latency
• New use cases are demanding better latency, smaller IO, and more query power
• Smooth is not a good fit for workloads that require very low latencies or issue large numbers of small reads and writes
Data Model
• Tables with schema; mandatory time column
• Rows ordered and indexed by time
• Not relational – duplicate timestamps/rows allowed; no notion of primary key
but users can enforce PK constraints in their applications
• Easy to update schema
• Can store wide sparse schemas efficiently
Write API
Updates a given time range atomically; the existing rows belonging to the range
are replaced by the given set of new rows
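// open a write over the half-open time range [10, 42)
WriteSession s = write(table, [10, 42));
s.addRow(<10, ..>);
s.addRow(<15, ..>);
// repeated timestamp is ok
s.addRow(<15, ..>);
// error: rows must be added in non-decreasing time order
s.addRow(<10, ..>);
// error: rows must lie within the given time range
s.addRow(<50, ..>);
// atomically replace the existing rows in [10, 42) with the rows added above
s.commit();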
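The rules shown in the write example can be sketched as a small in-memory session. This is an illustration only: `SketchWriteSession`, its constructor, and the use of bare `long` timestamps in place of full rows are assumptions, not Smooth's actual client API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a write session; not Smooth's real client API.
final class SketchWriteSession {
    private final long start;  // inclusive lower bound of the time range
    private final long end;    // exclusive upper bound of the time range
    private final List<Long> rowTimes = new ArrayList<>();
    private boolean committed = false;

    SketchWriteSession(long start, long end) {
        this.start = start;
        this.end = end;
    }

    // Accepts a row timestamp only if it lies in [start, end) and is not
    // earlier than the previously added row (equal timestamps are allowed).
    void addRow(long time) {
        if (committed) {
            throw new IllegalStateException("session already committed");
        }
        if (time < start || time >= end) {
            throw new IllegalArgumentException("row outside the time range");
        }
        if (!rowTimes.isEmpty() && time < rowTimes.get(rowTimes.size() - 1)) {
            throw new IllegalArgumentException("rows must be in non-decreasing order");
        }
        rowTimes.add(time);
    }

    // Returns the rows that would replace whatever the table held in [start, end).
    // An empty result models a delete of the range.
    List<Long> commit() {
        committed = true;
        return new ArrayList<>(rowTimes);
    }
}
```

Committing a session with no rows corresponds to the delete case: the range is replaced by nothing.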
• Set of write operations to a table forms a total order; internally each write
gets a unique, strictly monotonically increasing logical commit timestamp
• Distributed atomic writes are possible
• Delete is just a special case of update where no new rows are written
Read API
• Snapshot reads over a given time range
• Rows returned are based on the latest committed view of the table at the start of the read operation; the read remains isolated from concurrent writes
Iterator<Row> i = read(table, timeRange);
while (i.hasNext()) {
    doSomething(i.next());
}
Other Operations
• Some operations are not officially supported but are a natural fit for Smooth:
  • Distributed snapshot reads
  • Reads in the past, permanent snapshots
  • Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time
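A minimal sketch of how a read-modify-write could use OCC on the commit time. Everything here is hypothetical: `SketchOccTable`, `commitIfUnchanged`, and the single `long` value standing in for the rows of a time range are assumptions made for illustration, not Smooth's API.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical single-value "table" used only to illustrate OCC on the commit time.
final class SketchOccTable {
    private long commitTime = 0;  // logical commit timestamp of the latest write
    private long value = 0;       // stand-in for the rows of a time range

    synchronized long latestCommitTime() { return commitTime; }

    synchronized long read() { return value; }

    // Commits only if no other write has landed since expectedCommitTime was observed.
    synchronized boolean commitIfUnchanged(long expectedCommitTime, long newValue) {
        if (commitTime != expectedCommitTime) {
            return false;  // conflict: the caller has to re-read and retry
        }
        value = newValue;
        commitTime++;      // strictly increasing logical commit timestamp
        return true;
    }

    // Read-modify-write loop: take a snapshot, compute the update, commit
    // conditionally, and retry on conflict. Returns the number of attempts.
    int readModifyWrite(LongUnaryOperator update) {
        int attempts = 0;
        while (true) {
            attempts++;
            long seen;
            long current;
            synchronized (this) {  // consistent snapshot of value and commit time
                seen = commitTime;
                current = value;
            }
            if (commitIfUnchanged(seen, update.applyAsLong(current))) {
                return attempts;
            }
        }
    }
}
```

The key design point is that the writer never holds a lock while computing the update; a conflicting commit is detected only at commit time through the changed timestamp.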
Table Implementation
[Figure: shards 1 and 2 plotted against the time column and the commit time (c1, c2), with the overwritten time range marked; shards belong to the metadata layer, data files and their replicas to the data layer]
• Shard is the internal representation of an update operation; semantically immutable
• Data file contains the new set of ordered rows; immutable and indexed; potentially replicated
Read Algorithm
[Figure: shards 1 to 4 plotted against the time column and the commit time, with the time range to read and the start of the read marked]
• Reads are implemented by concatenating together visible subranges of overlapping shards; we call this the “read plan”
• The underlying data file per shard is ordered and indexed and can efficiently select rows belonging to visible subranges
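One way to compute such a read plan, as a sketch. It assumes each shard is described by a half-open time range and a commit time, and that for every point in time the visible shard is the covering shard with the highest commit time at or before the start of the read. The class names and the simple quadratic scan are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical shard metadata: the time range [start, end) replaced at commitTime.
final class SketchShard {
    final String id;
    final long start;
    final long end;
    final long commitTime;

    SketchShard(String id, long start, long end, long commitTime) {
        this.id = id;
        this.start = start;
        this.end = end;
        this.commitTime = commitTime;
    }
}

// One entry of the read plan: read [start, end) from the given shard's data file.
final class SketchSegment {
    final String shardId;
    final long start;
    final long end;

    SketchSegment(String shardId, long start, long end) {
        this.shardId = shardId;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return shardId + "[" + start + "," + end + ")";
    }
}

final class SketchReadPlanner {
    // Builds the plan for the query range [qStart, qEnd) as of the given snapshot commit time.
    static List<SketchSegment> plan(List<SketchShard> shards, long qStart, long qEnd, long snapshot) {
        List<SketchSegment> plan = new ArrayList<>();
        if (qStart >= qEnd) {
            return plan;
        }
        // Cut the query range at every boundary of a shard committed by the snapshot.
        TreeSet<Long> cuts = new TreeSet<>();
        cuts.add(qStart);
        cuts.add(qEnd);
        for (SketchShard s : shards) {
            if (s.commitTime > snapshot) continue;
            if (s.start > qStart && s.start < qEnd) cuts.add(s.start);
            if (s.end > qStart && s.end < qEnd) cuts.add(s.end);
        }
        Long prev = null;
        for (Long cut : cuts) {
            if (prev != null) {
                // Between two cuts, the covering shard with the latest commit wins.
                SketchShard top = null;
                for (SketchShard s : shards) {
                    boolean covers = s.commitTime <= snapshot && s.start <= prev && s.end >= cut;
                    if (covers && (top == null || s.commitTime > top.commitTime)) top = s;
                }
                if (top != null) {
                    SketchSegment last = plan.isEmpty() ? null : plan.get(plan.size() - 1);
                    if (last != null && last.shardId.equals(top.id) && last.end == prev) {
                        // Extend the previous segment instead of fragmenting the plan.
                        plan.set(plan.size() - 1, new SketchSegment(top.id, last.start, cut));
                    } else {
                        plan.add(new SketchSegment(top.id, prev, cut));
                    }
                }
            }
            prev = cut;
        }
        return plan;
    }
}
```

Because shards committed after the snapshot are ignored, the same function also models reads in the past: lowering `snapshot` reproduces an older view of the table.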
Data File format
The underlying data file is indexed using a simple two-level static B+ tree
A data file has one index block and individually compressed data blocks laid out contiguously
• Data block is the unit of read; variable sized and compressed; typically a small number of MBs; allows random access and parallelization
• LZ4 is currently used for most of the files; it has very low overhead but still gives about 2x compression on average; gzip has been used for some of the cold data files
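A sketch of how an index block could be used to pick the data blocks for a time range. The index layout is an assumption: one entry per data block holding the timestamp of the block's first row. `SketchBlockIndex` and `blocksFor` are hypothetical names.

```java
// Hypothetical index block: firstTimes[i] is the timestamp of the first row in data block i.
final class SketchBlockIndex {
    private final long[] firstTimes;  // sorted in non-decreasing order

    SketchBlockIndex(long[] firstTimes) {
        this.firstTimes = firstTimes.clone();
    }

    // Index of the first entry whose timestamp is >= t (firstTimes.length if none).
    private int lowerBound(long t) {
        int lo = 0;
        int hi = firstTimes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (firstTimes[mid] < t) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Returns {firstBlock, endBlockExclusive}: the blocks that may hold rows in [start, end).
    // The block before the first entry >= start is included because it can still
    // contain rows at or after start (timestamps may repeat across blocks).
    int[] blocksFor(long start, long end) {
        int endBlock = lowerBound(end);
        int firstBlock = Math.max(0, lowerBound(start) - 1);
        if (start >= end || firstBlock >= endBlock) {
            return new int[] {0, 0};  // nothing to read
        }
        return new int[] {firstBlock, endBlock};
    }
}
```

Since each selected block is compressed on its own, a reader can fetch and decompress the blocks in the returned interval independently and in parallel.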
Compaction
Problem: overwrites of random time ranges and small writes
• Excessive fragmentation of the read plan; leads to slow reads and to excessive seeks on the backend data stores, reducing overall serving capacity
• Metadata bloat; small shards/files mean larger metadata on Smooth and the object stores
• Garbage; data under hidden ranges can be garbage collected
Compaction Process
[Figure: shards 1 to 4 plotted against the time column and the commit time, with the new compacted shard and its commit point marked]
• Replaced shards are deleted after the new shard is committed
• Underlying data files are not immediately deleted, to support ongoing reads
• Only contiguous fragments can be combined together!
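A sketch of selecting compaction candidates under the contiguity rule. It assumes "contiguous" means adjacent in time within the read plan, and it uses a size threshold to decide which fragments count as small; the threshold, `SketchFragment`, and `SketchCompactor` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visible fragment of the read plan: time range [start, end) and its size.
final class SketchFragment {
    final long start;
    final long end;
    final long bytes;

    SketchFragment(long start, long end, long bytes) {
        this.start = start;
        this.end = end;
        this.bytes = bytes;
    }
}

final class SketchCompactor {
    // Scans the read plan in time order and returns the {start, end} ranges worth
    // rewriting as a single shard: runs of two or more small fragments whose
    // time ranges touch (previous end == next start).
    static List<long[]> candidates(List<SketchFragment> plan, long smallBytes) {
        List<long[]> out = new ArrayList<>();
        int runStart = -1;  // index of the first fragment in the current run, or -1
        for (int i = 0; i <= plan.size(); i++) {
            boolean extendsRun = i < plan.size()
                    && plan.get(i).bytes < smallBytes
                    && (runStart < 0 || plan.get(i - 1).end == plan.get(i).start);
            if (extendsRun) {
                if (runStart < 0) runStart = i;
                continue;
            }
            if (runStart >= 0 && i - runStart >= 2) {
                out.add(new long[] {plan.get(runStart).start, plan.get(i - 1).end});
            }
            // A small fragment that broke contiguity can still start the next run.
            runStart = (i < plan.size() && plan.get(i).bytes < smallBytes) ? i : -1;
        }
        return out;
    }
}
```

Each returned range would be written as one new shard; because that shard replaces its whole time range, it has to contain every visible row in the range, which is why only fragments that are adjacent in the plan can be merged.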
Comparing with LSM
Similar to Log Structured Merge (LSM) tree
• The Smooth implementation is log structured
• Immutable shards with embedded B-trees are similar to “sstables”
• Both have compaction processes aimed at similar objectives
• Differs in the details:
  • Each shard carries a “bulk delete” tombstone whose handling is deferred until compaction time
  • The read algorithm is different: no row-level comparison for the “next” operation
• Key-value stores can use similar ideas to optimize bulk deletes
Write Amplification
• Write amplification = actual bytes written to storage / bytes written by user
• Has not been an issue in practice: less than 10 on average
• If the write workload gets more challenging (i.e. a higher rate of small random writes):
  • Use leveled compaction, similar to traditional key-value based LSM storage engines
  • This means allowing non-contiguous shards to be combined; shards essentially get moved into data files
  • It would make the read algorithm more complex, since read plans from all levels need to be merged
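The write amplification ratio as a small worked example. The split of storage writes into the user's own bytes plus bytes rewritten by compaction is an assumption for illustration (replication, for instance, is ignored), and the numbers in the usage note are made up.

```java
// Hypothetical helper for the ratio: actual bytes written to storage / bytes written by user.
final class SketchWriteAmplification {
    // Assumes storage writes consist of the user's bytes plus bytes rewritten by compaction.
    static double of(long userBytes, long compactionRewriteBytes) {
        if (userBytes <= 0) {
            throw new IllegalArgumentException("userBytes must be positive");
        }
        return (double) (userBytes + compactionRewriteBytes) / userBytes;
    }
}
```

For example, 100 bytes written by the user that are later rewritten three times by compaction (300 bytes) give a write amplification of 4.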
System Architecture
• All Smooth metadata is stored on Microsoft SQL Server, which is replicated to backup servers in a remote data center
• Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
• Applications link with a Smooth client library in order to access Smooth
• Data files are stored in object stores
• Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
• Currently we use HDFS for warm data and CELFS for cold data; CELFS is an internal archival file system at TS
Virtues of Immutability
• A design principle we have been using is immutability, both physical (write-once data files) and semantic (shards)
• The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and the physical data movement process can operate in parallel with no interference and with minimal coordination
• Data files can be cached without worrying about consistency
This simple model has been central to keeping the system simple, robust and scalable.
Some Statistics
• Multiple PBs of unique compressed data
• Read peaks in excess of 100 GB/s (before decompressing)
• 100s of millions of files/shards
• 10s of millions of tables
• 10s of thousands of concurrent requests
Looking Forward
• Multi-datacenter and public cloud read scaling
  • CDN-like distributed caching layer that spans even sites that don’t store data
  • Encryption at rest may be important for cloud use cases
• More cost-efficient multi-DC replication and cold data storage
  • Data stores that use erasure coding
  • More efficient data encoding and compression
  • Data stores that can replicate data across data centers and support desirable failover semantics
• Performance
  • Performance consistency is a major concern; tail latencies are a major issue with HDFS
  • Issues with slow serialization and parsing of rows
• More challenging workloads
  • Interactive workloads are becoming common and are latency sensitive
  • Column filtering
  • Complex read queries
Complex queries
• It is common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column. The cardinality of this column is generally in the 10k to 20k range
• Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
• In reality each ticker has its own time range, and there are several variations of this query
• Looking at new kinds of indexing
• Moving away from a “thick” Smooth client
  • Enables quick iteration and bug fixes
  • Multi-language support
  • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
• Various other reliability, multi-tenancy, metadata scaling, security and operability improvements
Thank You!
Proprietary and Confidential – Not for Redistribution

Weitere ähnliche Inhalte

Was ist angesagt?

Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
Sanjay Padhi, Ph.D
 
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
HostedbyConfluent
 

Was ist angesagt? (20)

Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
Welcome to the Age of Big Data in Banking
Welcome to the Age of Big Data in Banking Welcome to the Age of Big Data in Banking
Welcome to the Age of Big Data in Banking
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
The art of implementing data lineage
The art of implementing data lineageThe art of implementing data lineage
The art of implementing data lineage
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
 
Blockchain Payment Systems
Blockchain Payment SystemsBlockchain Payment Systems
Blockchain Payment Systems
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Creating a Healthcare Data Fabric, and Providing a Single, Unified, and Curat...
Creating a Healthcare Data Fabric, and Providing a Single, Unified, and Curat...Creating a Healthcare Data Fabric, and Providing a Single, Unified, and Curat...
Creating a Healthcare Data Fabric, and Providing a Single, Unified, and Curat...
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Tiger graph 2021 corporate overview [read only]
Tiger graph 2021 corporate overview [read only]Tiger graph 2021 corporate overview [read only]
Tiger graph 2021 corporate overview [read only]
 
Neptune webinar AWS
Neptune webinar AWS Neptune webinar AWS
Neptune webinar AWS
 
Glossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data GovernanceGlossaries, Dictionaries, and Catalogs Result in Data Governance
Glossaries, Dictionaries, and Catalogs Result in Data Governance
 
Palantir Company Preso
Palantir Company PresoPalantir Company Preso
Palantir Company Preso
 
Leveraging Generative AI to Accelerate Graph Innovation for National Security...
Leveraging Generative AI to Accelerate Graph Innovation for National Security...Leveraging Generative AI to Accelerate Graph Innovation for National Security...
Leveraging Generative AI to Accelerate Graph Innovation for National Security...
 
Talend Data Quality
Talend Data QualityTalend Data Quality
Talend Data Quality
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Open banking [Evolution, Risks & Opportunities]
Open banking [Evolution, Risks & Opportunities]Open banking [Evolution, Risks & Opportunities]
Open banking [Evolution, Risks & Opportunities]
 
Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...
 
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
 
Banking-as-a-Service 2.0 - Executive Summary
Banking-as-a-Service 2.0 - Executive SummaryBanking-as-a-Service 2.0 - Executive Summary
Banking-as-a-Service 2.0 - Executive Summary
 

Ähnlich wie Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma

Ähnlich wie Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma (20)

Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Serverless Datalake Day with AWS
Serverless Datalake Day with AWSServerless Datalake Day with AWS
Serverless Datalake Day with AWS
 
Choosing the Right Database for My Workload: Purpose-Built Databases
Choosing the Right Database for My Workload: Purpose-Built Databases Choosing the Right Database for My Workload: Purpose-Built Databases
Choosing the Right Database for My Workload: Purpose-Built Databases
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Big Data@Scale
 Big Data@Scale Big Data@Scale
Big Data@Scale
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Achieve data democracy in data lake with data integration
Achieve data democracy in data lake with data integration Achieve data democracy in data lake with data integration
Achieve data democracy in data lake with data integration
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Using Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-PurposeUsing Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-Purpose
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Amazon Aurora: Database Week SF
Amazon Aurora: Database Week SFAmazon Aurora: Database Week SF
Amazon Aurora: Database Week SF
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
BI & Analytics
BI & AnalyticsBI & Analytics
BI & Analytics
 
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
 

Mehr von Two Sigma

Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
Two Sigma
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
Two Sigma
 

Mehr von Two Sigma (19)

The State of Open Data on School Bullying
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School Bullying
 
Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff Reback
 
BeakerX - Tiezheng Li
BeakerX - Tiezheng LiBeakerX - Tiezheng Li
BeakerX - Tiezheng Li
 
Engineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee JooEngineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee Joo
 
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonBringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
 
Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
 
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia YeResponsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
 
Identifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane AdamsIdentifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane Adams
 
Algorithmic Data Science = Theory + Practice
Algorithmic Data Science = Theory + PracticeAlgorithmic Data Science = Theory + Practice
Algorithmic Data Science = Theory + Practice
 
HUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For SparkHUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For Spark
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
 
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality Guarantees
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and Practice
 
Credit-Implied Volatility
Credit-Implied VolatilityCredit-Implied Volatility
Credit-Implied Volatility
 
Principles of REST API Design
Principles of REST API DesignPrinciples of REST API Design
Principles of REST API Design
 

Kürzlich hochgeladen

Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 

Kürzlich hochgeladen (20)

Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 

Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma

  • 1. www.twosigma.com Smooth Storage September 13, 2018Proprietary and Confidential – Not for Redistribution A storage system for managing structured time series data at Two Sigma Saurabh Goel saurabh.goel@twosigma.com
  • 2. Disclaimer This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
  • 3. Outline
      • Motivation and design emphasis
      • Data Model and API
      • Implementation of the data model
      • System Architecture
      • Looking Forward
  • 4. Motivation
      • Why have specialized storage for time series data?
          • Extremely common at Two Sigma
          • Time is one of the primary dimensions along which applications want to partition and filter data
          • Scale, in terms of both size and access
          • Optimizing for the target application workload and requirements
  • 5. Smooth’s design emphasis
      • Optimized for range queries and range updates executed in parallel per table
      • File-system-like operations, but with database-like properties such as atomicity and an isolation model for concurrent access
      • Centrally managed service at Two Sigma
      • Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
      • Storage efficiency is also a major concern given the overall size of data stored
      • [Diagram: a spectrum running from file system to database, with Smooth placed between the two]
  • 6. Target Application characteristics
      • Parallel, time-partitioned jobs that move a lot of data
      • Tend to be batch oriented; care more about throughput than latency
      • New use cases are demanding better latency, smaller IO, and more query power
      • Not good for workloads that require very low latencies or issue large numbers of small reads and writes
  • 8. Data Model
      • Tables with schema; mandatory time column
      • Rows ordered and indexed by time
      • Not relational: duplicate timestamps/rows are allowed; there is no notion of a primary key, but users can enforce PK constraints in their applications
      • Easy to update schema
      • Can store wide sparse schemas efficiently
  • 9. Write API
      • Updates a given time range atomically; the existing rows belonging to the range are replaced by the given set of new rows.
      • Example from the slide. Each comment states a rule for the call that follows it; the last two addRow calls are shown as violations of those rules:
          WriteSession s = write(table, [10, 42));
          s.addRow(<10, ..>);
          s.addRow(<15, ..>);
          // repeated timestamp is ok
          s.addRow(<15, ..>);
          // rows must be added in non-decreasing order
          s.addRow(<10, ..>);
          // rows must lie within the given time range
          s.addRow(<50, ..>);
          s.commit();
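The rules on the Write API slide can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `WriteSessionSketch`, its constructor, and the `commit(List<Long>)` signature are hypothetical and are not Smooth's client API. Only the time column is modeled, and the table is assumed to be a time-ordered list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a write session; not Smooth's real client API.
class WriteSessionSketch {
    private final long start; // inclusive lower bound of the time range
    private final long end;   // exclusive upper bound of the time range
    private final List<Long> pending = new ArrayList<>();
    private long lastTime = Long.MIN_VALUE;

    WriteSessionSketch(long start, long end) {
        if (start >= end) {
            throw new IllegalArgumentException("empty time range");
        }
        this.start = start;
        this.end = end;
    }

    // Buffers one row, represented here by its timestamp only.
    void addRow(long time) {
        if (time < start || time >= end) {
            throw new IllegalArgumentException("row outside [" + start + ", " + end + ")");
        }
        if (time < lastTime) {
            throw new IllegalArgumentException("rows must be added in non-decreasing order");
        }
        pending.add(time); // equal timestamps are accepted
        lastTime = time;
    }

    // Returns a new table in which every existing row in [start, end)
    // has been replaced by the buffered rows. The input must be time-ordered.
    List<Long> commit(List<Long> table) {
        List<Long> result = new ArrayList<>();
        for (long t : table) {
            if (t < start) result.add(t);
        }
        result.addAll(pending);
        for (long t : table) {
            if (t >= end) result.add(t);
        }
        return result;
    }
}
```

Committing a session with no rows empties the range, which matches the later slide's remark that a delete is a special case of an update.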
  • 10. Write API
      • The set of write operations to a table forms a total order; internally each write gets a unique, strictly monotonically increasing logical commit timestamp
      • Distributed atomic writes are possible
      • Delete is just a special case of update where no new rows are written
  • 11. Read API
      • Snapshot reads over a given time range:
          Iterator<Row> i = read(table, time range);
          while(i.hasNext()) { doSomething(i.next()); }
      • Rows returned are based on the latest committed view of the table at the start of the read operation. The read remains isolated from concurrent writes.
  • 12. Other Operations
      • Some operations that are not officially supported but are a natural fit for Smooth:
          • Distributed snapshot reads
          • Reads in the past, permanent snapshots
          • Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time
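The read-modify-write idea on the Other Operations slide (OCC on the commit time) might look roughly like the sketch below. All names here (`OccSketch`, `Table`, `Snapshot`, `commitIfUnchanged`) are hypothetical, and a single `long` stands in for the table contents; the slides do not describe how Smooth would expose this.

```java
// Hypothetical sketch of optimistic concurrency control keyed on commit time.
class OccSketch {
    // What a reader observes: the contents plus the commit time they belong to.
    static class Snapshot {
        final long commitTime;
        final long value;

        Snapshot(long commitTime, long value) {
            this.commitTime = commitTime;
            this.value = value;
        }
    }

    // Stand-in for a table; `value` represents the table contents.
    static class Table {
        private long lastCommit = 0;
        private long value = 0;

        synchronized Snapshot snapshot() {
            return new Snapshot(lastCommit, value);
        }

        // Commits only if no other write has committed since expectedCommit.
        synchronized boolean commitIfUnchanged(long expectedCommit, long newValue) {
            if (lastCommit != expectedCommit) {
                return false; // a concurrent write won; the caller must retry
            }
            value = newValue;
            lastCommit++; // strictly increasing logical commit timestamp
            return true;
        }
    }

    // Read-modify-write loop: re-read and retry until the conditional commit succeeds.
    static long addWithRetry(Table table, long delta) {
        while (true) {
            Snapshot seen = table.snapshot();
            long updated = seen.value + delta;
            if (table.commitIfUnchanged(seen.commitTime, updated)) {
                return updated;
            }
        }
    }
}
```

The conditional commit is the only point of coordination: a writer that loses the race does no harm and simply recomputes from a fresh snapshot.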
  • 14. Table Implementation
      • [Figure: two shards, Shard 1 and Shard 2, plotted against the time column and commit time axes, with commit times c1 and c2 and the overwritten time range marked; each shard points to a data file and its replica. Labels: data layer, metadata layer]
      • Shard is the internal representation of an update operation; semantically immutable
      • Data file contains the new set of ordered rows; immutable and indexed; potentially replicated
  • 15. Read Algorithm
      • [Figure: Shards 1 to 4 plotted against the time column and commit time axes, with the range being read and the start of the read marked]
      • Reads are implemented by concatenating together visible subranges of overlapping shards; we call this the “read plan”
      • The underlying data file per shard is ordered and indexed and can efficiently select rows belonging to visible subranges
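One way to picture the read plan described on the Read Algorithm slide is the brute-force sketch below. It assumes that, for every point in the requested range, the visible shard is the covering shard with the highest commit time not later than the read's snapshot; the class and method names are hypothetical, and the quadratic search is chosen for clarity rather than taken from Smooth.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of building a read plan from overlapping shards.
class ReadPlanSketch {
    static class Shard {
        final String name;
        final long start;  // inclusive
        final long end;    // exclusive
        final long commit; // logical commit timestamp

        Shard(String name, long start, long end, long commit) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.commit = commit;
        }
    }

    // A visible subrange [start, end) served by one shard.
    static class Fragment {
        final Shard shard;
        final long start;
        final long end;

        Fragment(Shard shard, long start, long end) {
            this.shard = shard;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return shard.name + "[" + start + "," + end + ")";
        }
    }

    // Plans a read of [qs, qe) as of commit time asOf.
    static List<Fragment> plan(List<Shard> shards, long qs, long qe, long asOf) {
        // Cut the query range at every boundary of a shard visible to this snapshot.
        TreeSet<Long> cuts = new TreeSet<>();
        cuts.add(qs);
        cuts.add(qe);
        for (Shard s : shards) {
            if (s.commit > asOf) continue; // committed after the read started
            if (s.start > qs && s.start < qe) cuts.add(s.start);
            if (s.end > qs && s.end < qe) cuts.add(s.end);
        }
        List<Fragment> out = new ArrayList<>();
        Long prev = null;
        for (Long cut : cuts) {
            if (prev != null) {
                // Newest shard that covers the whole elementary interval [prev, cut).
                Shard top = null;
                for (Shard s : shards) {
                    boolean covers = s.start <= prev && s.end >= cut;
                    if (s.commit <= asOf && covers && (top == null || s.commit > top.commit)) {
                        top = s;
                    }
                }
                if (top != null) {
                    Fragment last = out.isEmpty() ? null : out.get(out.size() - 1);
                    if (last != null && last.shard == top && last.end == prev) {
                        // Same shard continues: extend the previous fragment.
                        out.set(out.size() - 1, new Fragment(top, last.start, cut));
                    } else {
                        out.add(new Fragment(top, prev, cut));
                    }
                }
            }
            prev = cut;
        }
        return out;
    }
}
```

A shard that carries no rows still appears in the plan and hides older data in its range, which is how a delete would take effect in this model. Passing an older `asOf` value yields the plan for a read in the past.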
  • 16. Data File format
      • The underlying data file is indexed using a simple two-level static B+Tree
  • 17. Data File format
      • A data file has one index block and individually compressed data blocks laid out contiguously
      • Data block is the unit of read; variable sized and compressed; typically a small number of MBs; allows random access and parallelization
      • Currently lz4 is used for most of the files; it has very low overhead but still gives about 2x compression on average; gzip has been used for some of the cold data files
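The two-level layout (one index block over many data blocks) suggests a lookup along the following lines. The slides do not give the index format, so this sketch assumes the index block holds the first timestamp of each data block in ascending order; `BlockIndexSketch` and `blocksFor` are hypothetical names.

```java
// Hypothetical sketch: choosing which data blocks to read for a time range,
// assuming the index block stores the first timestamp of every data block.
class BlockIndexSketch {
    // Index of the first element >= key, or a.length if there is none.
    static int lowerBound(long[] a, long key) {
        int lo = 0;
        int hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Returns {from, toExclusive}: the data blocks that may hold rows in [qs, qe).
    static int[] blocksFor(long[] firstTimes, long qs, long qe) {
        // Duplicate timestamps may straddle a block boundary, so the block just
        // before the first block starting at or after qs must be read as well.
        int from = Math.max(0, lowerBound(firstTimes, qs) - 1);
        // Blocks whose first timestamp is >= qe cannot contain matching rows.
        int toExclusive = lowerBound(firstTimes, qe);
        return new int[] { from, toExclusive };
    }
}
```

Because each data block is compressed on its own, only the selected blocks need to be fetched and decompressed, and they can be read in parallel.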
  • 18. Compaction
      • Problem: overwrites of random time ranges and small writes
          • Excessive fragmentation of the read plan; leads to slow reads and to excessive seeks on the backend data stores, reducing overall serving capacity
          • Metadata bloat; small shards/files mean larger metadata on Smooth and on the object stores
          • Garbage; data under hidden ranges can be garbage collected
  • 19. Compaction Process
      • [Figure: Shards 1 to 4 plotted against the time column and commit time axes, with the new compacted shard and the point at which it is committed marked; the input shards are deleted after the new shard is committed]
      • Underlying data files are not immediately deleted, to support ongoing reads
      • Only contiguous fragments can be combined together!
  • 20. Comparing with LSM
      • Similar to a Log Structured Merge (LSM) tree
          • The Smooth implementation is log structured
          • Immutable shards with embedded B-trees are similar to “sstables”
          • Both have compaction processes aimed at similar objectives
      • Differ in details
          • Each shard carries with itself a “bulk delete” tombstone whose handling is deferred until compaction time
          • The read algorithm is different: no row-level comparison for the “next” operation
      • Key-value stores can use similar ideas to optimize bulk deletes
  • 21. Write Amplification
      • Write amplification = actual bytes written to storage / bytes written by user
      • Has not been an issue in practice: less than 10 on average
      • If the write workload gets more challenging (i.e. a higher rate of small random writes):
          • Use leveled compaction, similar to traditional key-value based LSM storage engines
          • By allowing non-contiguous shards to be combined; shards essentially get moved into data files
          • Would make our read algorithm more complex: need to merge read plans from all levels
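As a worked instance of the write amplification ratio defined on the slide: if users write 1 GB and compaction later rewrites that data twice, storage absorbs 3 GB in total and the ratio is 3. The helper below is only an illustration; its name and the byte counts are made up for the example.

```java
// Hypothetical helper for the ratio defined on the Write Amplification slide.
class WriteAmplificationSketch {
    // physicalBytes: everything written to storage, including compaction rewrites.
    // userBytes: bytes the application asked to write.
    static double writeAmplification(long userBytes, long physicalBytes) {
        if (userBytes <= 0) {
            throw new IllegalArgumentException("userBytes must be positive");
        }
        return (double) physicalBytes / userBytes;
    }

    public static void main(String[] args) {
        long user = 1_000_000_000L;       // 1 GB written by the application
        long physical = 3_000_000_000L;   // the original write plus two rewrites
        System.out.println(writeAmplification(user, physical));
    }
}
```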
  • 23. System Architecture
  • 24. System Architecture
      • All Smooth metadata is stored on Microsoft SQL Server, which gets replicated to backup servers in a remote data center
      • Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
      • Applications link with a Smooth client library in order to access Smooth
  • 25. System Architecture
      • Data files are stored in object stores
      • Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
      • Currently HDFS is used for warm data and CELFS for cold data; CELFS is an internal archival file system at Two Sigma
  • 26. Virtues of Immutability
      • A design principle we have been using is immutability, both physical (write-once data files) and semantic (shards)
      • The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and the physical data movement process can operate in parallel with no interference and with minimal coordination
      • Data files can be cached without worrying about consistency
      • This simple model has been central to keeping the system simple, robust and scalable.
  • 27. Some Statistics
      • Multiple PBs of unique compressed data
      • Read peaks in excess of 100 GB/s (before decompressing)
      • 100s of millions of files/shards
      • 10s of millions of tables
      • 10s of thousands of concurrent requests
  • 29. Looking Forward
      • Multi-datacenter and public cloud read scaling
          • CDN-like distributed caching layer that spans even to sites that don’t store data
          • Encryption at rest may be important for cloud use cases
      • More cost-efficient multi-datacenter replication and cold data storage
          • Data stores that use erasure coding
          • More efficient data encoding and compression
          • Data stores that can replicate data across data centers and support desirable failover semantics
  • 30. Looking Forward
      • Performance
          • Performance consistency is a major concern; tail latencies are a major issue with HDFS
          • Issues with slow serialization and parsing of rows
      • More challenging workloads
          • Interactive workloads are becoming common and are latency sensitive
          • Column filtering
          • Complex read queries
  • 31. Looking Forward
      • Complex queries
          • It is common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column. The cardinality of this column is generally in the 10k to 20k range
          • Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
          • In reality each ticker has its own time range, and there are several variations of this query
          • Looking at new kinds of indexing
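The example query on the complex-queries slide can be expressed as a filtered scan over time-ordered rows, as in the sketch below. The `Row` fields, class name, and method are hypothetical. The scan only shows the query's semantics; it still visits every row in the time range regardless of how few tickers are requested, which is presumably what the new kinds of indexing are meant to avoid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "subset of tickers over a time range" query.
class TickerQuerySketch {
    static class Row {
        final long time;
        final String ticker;
        final double price;

        Row(long time, String ticker, double price) {
            this.time = time;
            this.ticker = ticker;
            this.price = price;
        }
    }

    // rows must already be ordered by time, as rows in a table are.
    // Returns rows in [start, end) whose ticker is in the given set, in time order.
    static List<Row> query(List<Row> rows, Set<String> tickers, long start, long end) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.time >= end) {
                break; // input is time-ordered, so nothing later can match
            }
            if (r.time >= start && tickers.contains(r.ticker)) {
                out.add(r);
            }
        }
        return out;
    }
}
```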
  • 32. Looking Forward
      • Moving away from a “thick” Smooth client
          • Enables quick iteration and bug fixes
          • Multi-language support
          • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
      • Various other reliability, multi-tenancy, metadata scaling, security and operability improvements
  • 33. Thank You!

Editor's Notes

  1. A shard is semantically immutable, i.e. it always returns the same set of rows. The physical representation of the underlying data can change in format or storage location, or be replicated.
  2. The compaction process gets the read plan for the entire time range and finds areas with excessive fragmentation (many small fragments). It selects a contiguous segment of the read plan containing fragments to be fixed and rewrites them as a single new shard. The commit time of the new shard is the maximum of the commit times of the participating input shards; this ensures that the compaction process does not interfere with ongoing writes. The underlying data files for the deleted shards are not immediately removed, so that references from the read plans of ongoing reads remain valid.
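The selection step in note 2 can be sketched as follows. The note does not say how "excessive fragmentation" is measured, so this sketch assumes a simple byte threshold and picks the longest run of adjacent small fragments; the class names and the `smallBytes` parameter are hypothetical. The commit time rule (maximum over the inputs) is taken from the note.

```java
import java.util.List;

// Hypothetical sketch of choosing fragments to compact from a read plan.
class CompactionSketch {
    // One entry of a read plan: a visible subrange and the shard commit behind it.
    static class Fragment {
        final long start;  // inclusive
        final long end;    // exclusive
        final long commit; // commit time of the shard serving this fragment
        final long bytes;  // size of the visible data

        Fragment(long start, long end, long commit, long bytes) {
            this.start = start;
            this.end = end;
            this.commit = commit;
            this.bytes = bytes;
        }
    }

    static class Shard {
        final long start;
        final long end;
        final long commit;

        Shard(long start, long end, long commit) {
            this.start = start;
            this.end = end;
            this.commit = commit;
        }
    }

    // Returns the shard that would replace the longest run of adjacent fragments
    // smaller than smallBytes, or null if no run has at least two fragments.
    static Shard pickCompaction(List<Fragment> plan, long smallBytes) {
        int bestFrom = 0, bestLen = 0;
        int runFrom = 0, runLen = 0;
        for (int i = 0; i < plan.size(); i++) {
            Fragment f = plan.get(i);
            boolean small = f.bytes < smallBytes;
            // runLen > 0 implies fragment i - 1 is the tail of the current run.
            boolean adjacent = runLen > 0 && plan.get(i - 1).end == f.start;
            if (small && adjacent) {
                runLen++;
            } else if (small) {
                runFrom = i;
                runLen = 1;
            } else {
                runLen = 0;
            }
            if (runLen > bestLen) {
                bestFrom = runFrom;
                bestLen = runLen;
            }
        }
        if (bestLen < 2) {
            return null;
        }
        // The new shard takes the maximum commit time of its inputs.
        long commit = Long.MIN_VALUE;
        for (int i = bestFrom; i < bestFrom + bestLen; i++) {
            commit = Math.max(commit, plan.get(i).commit);
        }
        return new Shard(plan.get(bestFrom).start, plan.get(bestFrom + bestLen - 1).end, commit);
    }
}
```

Taking the maximum input commit time places the compacted shard above everything it replaces but below any write committed later, so a newer overlapping write still wins in the read plan.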