TritonSort
A Balanced Large-Scale
Sorting System
Alex Rasmussen, George Porter, Michael Conley,
Radhika Niranjan Mysore, Amin Vahdat (UCSD)
Harsha V. Madhyastha (UC Riverside)
Alexander Pucher (Vienna University of Technology)
The Rise of Big Data Workloads
•  Very high I/O and storage requirements
–  Large-scale web and social graph mining
–  Business analytics – “you may also like …”
–  Large-scale “data science”
•  Recent new approaches to the “data deluge”: data-intensive scalable computing (DISC) systems
–  MapReduce, Hadoop, Dryad, …
Performance via scalability
•  10,000+ node MapReduce clusters deployed
–  With impressive performance
•  Example: Yahoo! Hadoop Cluster Sort
–  3,452 nodes sorting 100TB in less than 3 hours
•  But…
–  Less than 3 MB/sec per node
–  Single disk: ~100 MB/sec
•  Not an isolated case
–  See “Efficiency Matters!”,
SIGOPS 2010
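The per-node figure follows from the numbers on this slide. A quick check, assuming decimal units (1 TB = 1,000,000 MB) and a runtime of exactly 3 hours, which the slide gives only as an upper bound:

```python
# Back-of-the-envelope check of the per-node sort rate quoted above.
# Assumptions (not from the slides): decimal units and a full 3-hour runtime.
NODES = 3452
DATA_MB = 100 * 1_000_000      # 100 TB expressed in MB
RUNTIME_S = 3 * 3600           # 3 hours in seconds
DISK_MBPS = 100                # "single disk: ~100 MB/sec"

per_node_mbps = DATA_MB / RUNTIME_S / NODES
disk_fraction = per_node_mbps / DISK_MBPS

print(f"{per_node_mbps:.2f} MB/s per node")
print(f"{disk_fraction:.1%} of a single disk's bandwidth")
```

Under these assumptions each node sustains well under 3 MB/sec, a few percent of what one of its disks could deliver.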
Overcoming Inefficiency With Brute Force
•  Just add more machines!
–  But expensive, power-hungry mega-datacenters!
•  What if we could go from 3 MBps per node to 30?
–  10x fewer machines accomplishing the same task
–  or 10x higher throughput
TritonSort Goals
•  Build a highly efficient DISC system that
improves per-node efficiency by an order of
magnitude vs. existing systems
–  Through balanced hardware and software
•  Secondary goals:
–  Completely “off-the-shelf” components
–  Focus on I/O-driven workloads (“Big Data”)
–  Problems that don’t come close to fitting in RAM
–  Initially sorting, but have since generalized
Outline
•  Define hardware and software balance
•  TritonSort design
– Highlighting tradeoffs to achieve balance
•  Evaluation with sorting as a case study
Building a “Balanced” System
•  Balanced hardware drives all resources as close to 100% as possible
–  Removing any resource slows us down
–  Limited by commodity configuration choices
•  Balanced software fully exploits hardware resources
Hardware Selection
•  Designed for I/O-heavy workloads
– Not just sorting
•  Static selection of resources:
– Network/disk balance
•  10 Gbps / 80 MBps ≈ 16 disks
– CPU/disk balance
•  2 disks / core = 8 cores
– CPU/memory
•  Originally ~1.5 GB/core… later 3 GB/core
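The balance ratios above can be worked through as plain arithmetic. This is a sketch that assumes decimal units (1 Gbps = 125 MBps) and takes the 24 GB of RAM per node from the resulting hardware platform:

```python
import math

# Figures from the slides; the unit conversion is an assumption.
NIC_GBPS = 10
DISK_MBPS = 80
DISKS_PER_CORE = 2
RAM_GB = 24

nic_mbps = NIC_GBPS * 1000 / 8                  # 10 Gbps = 1250 MBps
disks = math.ceil(nic_mbps / DISK_MBPS)         # 15.6 rounds up to 16 disks
cores = disks // DISKS_PER_CORE                 # 2 disks per core -> 8 cores
ram_per_core_gb = RAM_GB / cores                # 24 GB over 8 cores

print(disks, cores, ram_per_core_gb)
```

The result matches the "later 3 GB/core" figure: 16 disks, 8 cores, 3 GB per core.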
Resulting Hardware Platform
52 Nodes:
•  Xeon E5520, 8 cores
(16 with hyperthreading)
•  24 GB RAM
•  16 7200 RPM hard drives
•  10 Gbps NIC
•  Cisco Nexus 5020
10 Gbps switch
Software Architecture
•  Staged, pipeline-oriented dataflow system
•  Program expressed as digraph of stages
–  Data stored in buffers that move along edges
–  Stage’s work performed by worker threads
•  Platform for experimentation
–  Easily vary:
•  Stage implementation
•  Size and quantity of buffers
•  Worker threads per stage
•  CPU and memory allocation to each stage
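One way to picture a staged dataflow with worker threads is the sketch below. It is not TritonSort's code: the `Stage` class, the bounded queues, and the sentinel-based shutdown are illustrative assumptions, and it handles only a linear chain of stages rather than a general digraph.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of input (illustrative convention)

class Stage:
    """One pipeline stage: `workers` threads apply `fn` to each buffer."""

    def __init__(self, fn, workers=1, capacity=4):
        self.fn = fn
        self.workers = workers
        self.inbox = queue.Queue(maxsize=capacity)  # bounded, so slow stages push back
        self.downstream = None
        self._threads = []

    def start(self):
        for _ in range(self.workers):
            t = threading.Thread(target=self._run)
            t.start()
            self._threads.append(t)

    def _run(self):
        while True:
            buf = self.inbox.get()
            if buf is _DONE:
                return
            out = self.fn(buf)
            if self.downstream is not None:
                self.downstream.inbox.put(out)

    def finish(self):
        # One sentinel per worker, then wait for all of them to exit.
        for _ in self._threads:
            self.inbox.put(_DONE)
        for t in self._threads:
            t.join()

def run_pipeline(stages, buffers):
    for up, down in zip(stages, stages[1:]):
        up.downstream = down
    for s in stages:
        s.start()
    for b in buffers:
        stages[0].inbox.put(b)
    for s in stages:  # shut down front to back so no buffer is dropped
        s.finish()

# Usage: double every value, sort each buffer, collect the results.
results = []
run_pipeline(
    [Stage(lambda b: [x * 2 for x in b], workers=2),
     Stage(sorted, workers=2),
     Stage(results.append, workers=1)],
    [[3, 1, 2], [6, 5, 4]],
)
```

Worker count and queue capacity are per-stage constructor arguments here, which mirrors the "easily vary" knobs listed above.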
Why Sorting?
•  Easy to describe
•  Industrially applicable
•  Uses all cluster resources
Current TritonSort Architecture
•  External sort – two reads, two writes*
– Don’t read and write to disk at same time
•  Partition disks into input and output
•  Two phases
– Phase one: route tuples to appropriate
on-disk partition (called a “logical disk”) on
appropriate node
– Phase two: sort all logical disks in parallel
* A. Aggarwal and J. S. Vitter. The input/output complexity
of sorting and related problems. CACM, 1988.
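The two phases can be illustrated with a toy, in-memory version in which Python lists stand in for logical disks. The partition function (first key byte scaled to the number of logical disks) is an assumption chosen so that partitions are ordered; the slides do not specify the function used.

```python
from collections import defaultdict

def two_phase_sort(records, num_logical_disks):
    """Toy two-phase sort; assumes each record is a non-empty bytes object."""
    # Phase one: route each record to the logical disk that owns its key range.
    logical_disks = defaultdict(list)
    for rec in records:
        ld = rec[0] * num_logical_disks // 256  # illustrative range partitioning
        logical_disks[ld].append(rec)

    # Phase two: sort each logical disk on its own (in parallel in a real system),
    # then read the logical disks back in partition order.
    out = []
    for ld in sorted(logical_disks):
        out.extend(sorted(logical_disks[ld]))
    return out

sample = [b"pear", b"apple", b"zebra", b"mango", b"\x00first", b"\xfflast"]
print(two_phase_sort(sample, num_logical_disks=4))
```

Because the partition function preserves key order, concatenating the sorted logical disks yields a fully sorted output without a merge step.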
Architecture Phase One
[Diagram: Input Disks → Reader → Node Distributor → Sender]
[Diagram, continued: Receiver → LD Distributor → Coalescer → Writer → Output Disks (Disk 1 to Disk 8); annotation: linked list per partition]
Reader
•  100 MBps/disk * 8 disks = 800 MBps
•  No computation, entirely I/O and memory
operations
– Expect most time spent in iowait
– 8 reader workers, one per input disk
→ All reader workers co-scheduled on a single core
NodeDistributor
•  Appends tuples onto a buffer per
destination node
•  Memory scan + hash per tuple
•  300 MBps per worker
– Need three workers to keep up with readers
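A minimal sketch of the per-destination buffering described above, together with the worker-count arithmetic. The function names and the partition function (first key byte scaled to the node count) are hypothetical stand-ins; the slide says only that each tuple is scanned and hashed.

```python
import math
from collections import defaultdict

def workers_needed(upstream_mbps, per_worker_mbps):
    """Smallest number of workers whose combined rate covers the upstream rate."""
    return math.ceil(upstream_mbps / per_worker_mbps)

def distribute(tuples, num_nodes):
    """Append each tuple to the buffer of its destination node.

    Assumes tuples are non-empty bytes objects; the destination is derived
    from the first key byte, which is an illustrative choice.
    """
    buffers = defaultdict(bytearray)
    for t in tuples:
        node = t[0] * num_nodes // 256
        buffers[node] += t
    return buffers

print(workers_needed(800, 300))  # readers deliver 800 MBps, one worker handles 300
buffers = distribute([b"\x00aaa", b"\x80bbb", b"\x01ccc", b"\xffddd"], num_nodes=4)
```

With 800 MBps arriving from the readers and 300 MBps per worker, two workers fall short (600 MBps) and three suffice.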
Sender & Receiver
•  800 MBps (from Reader) is 6.4 Gbps
– All-to-all traffic
•  Must keep downstream disks busy
– Don’t let receive buffer get empty
– Implies strict socket send time bound
•  Multiplex all senders on one core
(single-threaded tight loop)
– Visit every socket every 20 µs
– Didn’t need epoll()/select()
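The single-threaded sender can be sketched as a loop over non-blocking sockets that never waits on any one peer. This is an illustration only: the function name and chunk size are assumptions, and the sketch does not enforce the 20 µs visit interval mentioned above.

```python
import socket

def send_round_robin(socks, pending, max_chunk=16384):
    """Visit every socket in turn, sending what each will accept right now.

    socks:   non-blocking, connected sockets, one per peer
    pending: bytearrays, where pending[i] holds data queued for socks[i]
    """
    while any(pending):
        for sock, buf in zip(socks, pending):
            if not buf:
                continue
            try:
                sent = sock.send(bytes(buf[:max_chunk]))
            except BlockingIOError:
                continue  # send buffer full: move on and retry on the next pass
            del buf[:sent]

# Usage: two local socket pairs stand in for connections to two peer nodes.
left_a, right_a = socket.socketpair()
left_b, right_b = socket.socketpair()
for s in (left_a, left_b):
    s.setblocking(False)

send_round_robin([left_a, left_b],
                 [bytearray(b"a" * 1000), bytearray(b"b" * 500)])
```

Polling each socket directly is what lets the loop skip `epoll()`/`select()`: a full socket simply costs one failed `send` before the loop moves to the next peer.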
Balancing at Scale
Logical Disk Distributor
[Diagram: incoming tuples t0, t1, t2 are hashed to logical disks 0, 1, …, N, e.g. H(t0) = 1 and H(t1) = N; buffers labeled 12.8 KB]
Logical Disk Distributor
•  Data non-uniform and bursty at
short timescales
– Big buffers + burstiness = head-of-line blocking
– Need to use all your memory all the time
•  Solution: Read incoming data into smallest
buffer possible, and form chains
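The chaining idea can be sketched with a shared pool of small fixed-size buffers, where each logical disk grows a chain only as data for it arrives. Class names, buffer sizes, and the assumption that a tuple always fits in one buffer are illustrative, not taken from TritonSort.

```python
from collections import defaultdict, deque

class BufferPool:
    """Fixed pool of small, equal-sized buffers shared by all logical disks."""

    def __init__(self, count, size):
        self.size = size
        self.free = deque(bytearray() for _ in range(count))

    def get(self):
        return self.free.popleft()  # raises IndexError when the pool is exhausted

    def put(self, buf):
        buf.clear()
        self.free.append(buf)

class ChainingDistributor:
    """Keeps one chain (list) of small buffers per logical disk."""

    def __init__(self, pool):
        self.pool = pool
        self.chains = defaultdict(list)

    def append(self, ld, tup):
        chain = self.chains[ld]
        # Take a fresh buffer only when the last one cannot hold this tuple.
        if not chain or len(chain[-1]) + len(tup) > self.pool.size:
            chain.append(self.pool.get())
        chain[-1] += tup

pool = BufferPool(count=4, size=200)
dist = ChainingDistributor(pool)
for _ in range(3):
    dist.append(0, b"x" * 100)   # a burst for logical disk 0
dist.append(1, b"y" * 100)
```

A burst toward one logical disk consumes only as many small buffers as it needs, so memory is not tied up in large, mostly empty per-partition buffers.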
Coalescer & Writer
•  Copies tuples from LDBuffer chains into a
single, sequential block of memory
•  Longer chains = larger write before seeking
= faster writes
– Also, more memory needed for LDBuffers
•  Buffer size limits maximum chain length
– How big should this buffer be?
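Coalescing amounts to copying a chain of small buffers into one contiguous block that can be written with a single sequential write. A minimal sketch, with a hypothetical function name:

```python
def coalesce(chain):
    """Copy a chain of small buffers into one contiguous block of memory."""
    block = bytearray(sum(len(buf) for buf in chain))
    offset = 0
    for buf in chain:
        block[offset:offset + len(buf)] = buf
        offset += len(buf)
    return block

block = coalesce([bytearray(b"ab"), bytearray(b"cde"), bytearray(b"f")])
print(bytes(block))
```

The block's size equals the total chain length, which is why the coalescer's buffer size caps how long a chain can usefully grow.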
Writer
Architecture Phase Two
[Diagram: Input Disks → Reader → Sorter → Writer → Output Disks]
Sort Benchmark Challenge
•  Started in 1980s by Jim Gray, now run by
a committee of volunteers
•  Annual competition with many categories
– GraySort: Sort 100 TB
•  “Indy” variant
– 10 byte key, 90 byte value
– Uniform key distribution
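The record format above is small enough to model directly. The sketch below generates random 100-byte records and sorts them by their 10-byte key; the helper names are hypothetical, `os.urandom` is used only as a convenient source of uniformly distributed keys, and the record count assumes decimal terabytes.

```python
import os

KEY_LEN, VALUE_LEN = 10, 90
RECORD_LEN = KEY_LEN + VALUE_LEN

def make_records(n):
    """Return n random 100-byte records (10-byte key followed by 90-byte value)."""
    return [os.urandom(RECORD_LEN) for _ in range(n)]

def sort_records(records):
    """Order records by key only, as the benchmark requires."""
    return sorted(records, key=lambda r: r[:KEY_LEN])

# Number of records in a 100 TB input, assuming 1 TB = 10**12 bytes.
records_in_100tb = 100 * 10**12 // RECORD_LEN

sorted_records = sort_records(make_records(1000))
print(records_in_100tb)
```

At 100 bytes per record, a 100 TB input holds one trillion records.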
How balanced are we?
Phase one:
Worker Type        Workers   Total Throughput (MBps)   % Over Bottleneck Stage
Reader             8         683                       13%
Node-Distributor   3         932                       55%
LD-Distributor     1         683                       13%
Coalescer          8         18,593                    30,000%
Writer             8         601                       0%

Phase two:
Worker Type        Workers   Total Throughput (MBps)   % Over Bottleneck Stage
Reader             8         740                       3.2%
Sorter             4         1089                      52%
Writer             8         717                       0%
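The last column can be recomputed from the throughput column, assuming "% over bottleneck stage" means a stage's throughput relative to the slowest stage in the same phase, minus one. That definition is an inference from the rows that match, not something the slide states.

```python
# Total throughput per stage in MBps, copied from the table above.
PHASE_ONE = {"Reader": 683, "Node-Distributor": 932, "LD-Distributor": 683,
             "Coalescer": 18593, "Writer": 601}
PHASE_TWO = {"Reader": 740, "Sorter": 1089, "Writer": 717}

def percent_over_bottleneck(throughputs):
    """Percent by which each stage exceeds the slowest stage of its phase."""
    slowest = min(throughputs.values())
    return {stage: 100 * (t / slowest - 1) for stage, t in throughputs.items()}

phase_one_pct = percent_over_bottleneck(PHASE_ONE)
phase_two_pct = percent_over_bottleneck(PHASE_TWO)

for stage, pct in {**phase_one_pct, **{f"{s} (phase two)": p
                                       for s, p in phase_two_pct.items()}}.items():
    print(f"{stage:28s} {pct:8.1f}%")
```

Under this definition the Writer is the bottleneck in both phases, and most rows agree with the table to within rounding. The Coalescer works out to roughly 2,990%, about a tenth of the figure printed on the slide, so that entry may be a typo.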
How balanced are we?
Resource Utilization
Phase       CPU    Memory   Network   Disk
Phase One   25%    100%     50%       82%
Phase Two   50%    100%     0%        100%
Scalability
Raw 100TB “Indy” Performance
[Figure: bar chart of performance per node (TB per minute)]
•  Previous record holder: 0.564 TB per minute on 195 nodes
•  TritonSort: 0.938 TB per minute on 52 nodes
•  6X improvement per node
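The 6X figure is a per-node comparison, which can be checked from the totals and node counts on this slide:

```python
def tb_per_minute_per_node(total_tb_per_minute, nodes):
    """Cluster-wide sort rate divided evenly across nodes."""
    return total_tb_per_minute / nodes

tritonsort = tb_per_minute_per_node(0.938, 52)
previous = tb_per_minute_per_node(0.564, 195)
speedup = tritonsort / previous

print(f"TritonSort: {tritonsort:.4f} TB/min/node")
print(f"Previous:   {previous:.4f} TB/min/node")
print(f"Ratio:      {speedup:.1f}x")
```

TritonSort's total rate is under twice the previous record's, but it reaches that rate with roughly a quarter of the nodes, which is where the sixfold per-node gain comes from.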
Impact of Faster Disks
•  7.2K RPM → 15K RPM drives
•  Smaller capacity means fewer LDs
•  Examined effect of disk speed and # LDs
•  Removing a bottleneck moves the bottleneck
somewhere else
Intermediate Disk   Logical Disks Per   Phase One Throughput   Phase One          Average Write
Speed (RPM)         Physical Disk       (MBps)                 Bottleneck Stage   Size (MB)
7200                315                 69.81                  Writer             12.6
7200                158                 77.89                  Writer             14.0
15000               158                 79.73                  LD Distributor     5.02
Impact of Increased RAM
•  Hypothesis that memory influences chain length,
and thus write speed
•  Doubling memory indeed increases chain length,
but the effect on performance was minimal
•  Increasing a non-bottleneck resource made it
faster, but not by much
RAM Per Node (GB)   Phase One Throughput (MBps)   Average Write Size (MB)
24                  73.53                         12.43
48                  76.43                         19.21
Future Work
•  Generalization
– We have a fast MapReduce implementation
– Considering other applications and
programming paradigms
•  Automatic Tuning
– Determine appropriate buffer size & count, #
workers per stage for reasonable performance
•  Different hardware
•  Different workloads
TritonSort – Questions?
•  Proof-of-concept balanced sorting system
•  6x improvement in per-node efficiency vs. previous record holder
•  Current top speed: 938 GB per minute
•  Future Work: Generalization, Automation
http://tritonsort.eng.ucsd.edu/

Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)

  • 1. TritonSort A Balanced Large-Scale Sorting System Alex Rasmussen, George Porter, Michael Conley, Radhika Niranjan Mysore, Amin Vahdat (UCSD) Harsha V. Madhyastha (UC Riverside) Alexander Pucher (Vienna University of Technology)
  • 2. The Rise of Big Data Workloads •  Very high I/O and storage requirements –  Large-scale web and social graph mining –  Business analytics – “you may also like …” –  Large-scale “data science” •  Recent new approaches to “data deluge”: data intensive scalable computing (DISC) systems –  MapReduce, Hadoop, Dryad, …
  • 3. Performance via scalability •  10,000+ node MapReduce clusters deployed –  With impressive performance •  Example: Yahoo! Hadoop Cluster Sort –  3,452 nodes sorting 100TB in less than 3 hours •  But… –  Less Than 3 MB/sec per node –  Single disk: ~100 MB/sec •  Not an isolated case –  See “Efficiency Matters!”, SIGOPS 2010
  • 4. Overcoming Inefficiency With Brute Force •  Just add more machines! –  But expensive, power-hungry mega-datacenters! •  What if we could go from 3 MBps per node to 30? –  10x fewer machines accomplishing the same task –  or 10x higher throughput
  • 5. TritonSort Goals •  Build a highly efficient DISC system that improves per-node efficiency by an order of magnitude vs. existing systems –  Through balanced hardware and software •  Secondary goals: –  Completely “off-the-shelf” components –  Focus on I/O-driven workloads (“Big Data”) –  Problems that don’t come close to fitting in RAM –  Initially sorting, but have since generalized
  • 6. Outline •  Define hardware and software balance •  TritonSort design – Highlighting tradeoffs to achieve balance •  Evaluation with sorting as a case study
  • 7. Building a “Balanced” System •  Balanced hardware drives all resources as close to 100% as possible –  Removing any resource slows us down –  Limited by commodity configuration choices •  Balanced software fully exploits hardware resources
  • 8. Hardware Selection •  Designed for I/O-heavy workloads – Not just sorting •  Static selection of resources: – Network/disk balance •  10 Gbps / 80 MBps ≈ 16 disks – CPU/disk balance •  16 disks at 2 disks / core = 8 cores – CPU/memory •  Originally ~1.5 GB/core… later 3 GB/core
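
The balance arithmetic on this slide can be checked in a few lines. This is only a sketch: the function and variable names are illustrative, and it assumes decimal units (1 Gbps = 1000 Mbps) and 8 bits per byte.

```python
def disks_to_match_network(nic_gbps, disk_mbytes_per_sec):
    """Number of disks whose combined bandwidth equals the NIC bandwidth."""
    nic_mbytes_per_sec = nic_gbps * 1000 / 8  # Gbps -> MBps, decimal units assumed
    return nic_mbytes_per_sec / disk_mbytes_per_sec

disks = disks_to_match_network(nic_gbps=10, disk_mbytes_per_sec=80)
disks_rounded = round(disks)   # the slide rounds this to 16 disks
cores = disks_rounded // 2     # 2 disks per core, as stated on the slide
print(disks, disks_rounded, cores)
```

The exact quotient is a little under 16, which is why the slide writes "≈ 16 disks".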
  • 9. Resulting Hardware Platform 52 Nodes: •  Xeon E5520, 8 cores (16 with hyperthreading) •  24 GB RAM •  16 7200 RPM hard drives •  10 Gbps NIC •  Cisco Nexus 5020 10 Gbps switch
  • 10. Software Architecture •  Staged, pipeline-oriented dataflow system •  Program expressed as digraph of stages –  Data stored in buffers that move along edges –  Stage’s work performed by worker threads •  Platform for experimentation –  Easily vary: •  Stage implementation •  Size and quantity of buffers •  Worker threads per stage •  CPU and memory allocation to each stage
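
A staged dataflow of this kind can be sketched with threads and queues. The code below is an illustration only, not TritonSort's implementation: `run_stage`, `run_pipeline`, and the `STOP` sentinel are hypothetical names, the graph is restricted to a linear chain, and the queues are unbounded, whereas the slides describe a tunable size and quantity of buffers.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the buffer stream

def run_stage(work, inbox, outbox, n_workers):
    """Start n_workers threads that apply `work` to each buffer from inbox."""
    def loop():
        while True:
            buf = inbox.get()
            if buf is STOP:
                inbox.put(STOP)  # let sibling workers see the sentinel too
                return
            outbox.put(work(buf))
    threads = [threading.Thread(target=loop) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads

def run_pipeline(stages, buffers):
    """stages: list of (work_fn, n_workers) forming a linear chain of stages."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    started = [run_stage(work, queues[i], queues[i + 1], n)
               for i, (work, n) in enumerate(stages)]
    for buf in buffers:
        queues[0].put(buf)
    queues[0].put(STOP)
    for i, threads in enumerate(started):
        for t in threads:
            t.join()
        queues[i + 1].put(STOP)  # upstream is done: signal the next stage
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)

# Two stages: uppercase each buffer (2 workers), then measure it (1 worker).
lengths = run_pipeline([(bytes.upper, 2), (len, 1)], [b"ab", b"cde", b"f"])
print(sorted(lengths))
```

Changing the worker count of a stage is a one-argument change here, which is the kind of experimentation the slide refers to. Output order is not deterministic once a stage has more than one worker.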
  • 11. Why Sorting? •  Easy to describe •  Industrially applicable •  Uses all cluster resources
  • 12. Current TritonSort Architecture •  External sort – two reads, two writes* – Don’t read and write to disk at the same time •  Partition disks into input and output •  Two phases – Phase one: route tuples to appropriate on-disk partition (called a “logical disk”) on appropriate node – Phase two: sort all logical disks in parallel
    * A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. CACM, 1988.
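
The two phases can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the authors' code: records are assumed to be 100 bytes with a 10-byte key, keys are assumed uniform so the first key byte can pick the partition, partition files stand in for logical disks, and phase two yields records instead of writing them to output disks. The names `phase_one` and `phase_two` are hypothetical.

```python
import os
import tempfile

KEY_LEN, RECORD_LEN = 10, 100  # assumed record layout: 10-byte key, 90-byte value

def phase_one(records, workdir, n_partitions):
    """Route each record to the partition file that owns its key range."""
    paths = [os.path.join(workdir, f"ld_{i:04d}") for i in range(n_partitions)]
    files = [open(p, "wb") for p in paths]
    try:
        for rec in records:
            # Assumption: uniform keys, so the first byte splits ranges evenly.
            files[rec[0] * n_partitions // 256].write(rec)
    finally:
        for f in files:
            f.close()
    return paths

def phase_two(paths):
    """Sort each partition in memory; partitions are visited in key order."""
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        recs = [data[i:i + RECORD_LEN] for i in range(0, len(data), RECORD_LEN)]
        recs.sort(key=lambda r: r[:KEY_LEN])
        yield from recs

records = [os.urandom(RECORD_LEN) for _ in range(1000)]
with tempfile.TemporaryDirectory() as workdir:
    out = list(phase_two(phase_one(records, workdir, n_partitions=8)))
print(out == sorted(records, key=lambda r: r[:KEY_LEN]))
```

Each record is read and written once per phase, which matches the "two reads, two writes" count on the slide. Because partitions cover disjoint, ordered key ranges, sorting them independently gives a globally sorted result.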
  • 13. Architecture Phase One: Input Disks → Reader → Node Distributor → Sender
  • 14. Architecture Phase One (continued): Receiver → LD Distributor → Coalescer → Writer → Output Disks (Disk 1 through Disk 8); linked list per partition
  • 15. Reader •  100 MBps/disk * 8 disks = 800 MBps •  No computation, entirely I/O and memory operations – Expect most time spent in iowait – 8 reader workers, one per input disk → All reader workers co-scheduled on a single core
  • 16. NodeDistributor •  Appends tuples onto a buffer per destination node •  Memory scan + hash per tuple •  300 MBps per worker – Need three workers to keep up with readers
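
The worker count follows from the rates on the Reader and NodeDistributor slides, and the per-tuple work can be sketched alongside it. The code is illustrative only: `node_for` is a hypothetical order-preserving stand-in for the hash, and the 10-byte key length is an assumption.

```python
import math

KEY_LEN = 10  # assumed key length

def workers_needed(upstream_mbps, per_worker_mbps):
    """Smallest number of workers whose combined rate covers the upstream rate."""
    return math.ceil(upstream_mbps / per_worker_mbps)

def node_for(key, n_nodes):
    """Hypothetical stand-in for the hash: map the key prefix to a node index."""
    return int.from_bytes(key[:2], "big") * n_nodes // 65536

def distribute(tuples, n_nodes):
    """Append each tuple to the buffer of its destination node."""
    buffers = [bytearray() for _ in range(n_nodes)]
    for t in tuples:
        buffers[node_for(t[:KEY_LEN], n_nodes)] += t
    return buffers

reader_mbps = 8 * 100                          # 8 input disks at 100 MBps each
n_workers = workers_needed(reader_mbps, 300)   # 800 / 300 rounds up to 3
print(n_workers)
```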
  • 17. Sender & Receiver •  800 MBps (from Reader) is 6.4 Gbps – All-to-all traffic •  Must keep downstream disks busy – Don’t let receive buffer get empty – Implies strict socket send time bound •  Multiplex all senders on one core (single-threaded tight loop) – Visit every socket every 20 µs – Didn’t need epoll()/select()
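
A single-threaded send loop of the kind described here can be sketched with non-blocking sockets. This is an illustration, not TritonSort's sender: `send_round_robin` is a hypothetical name, local socket pairs stand in for remote nodes, the receiving ends are drained inside the same loop only so the demo can finish, and no 20 µs timing bound is enforced.

```python
import socket

def send_round_robin(senders, payloads, receivers, chunk=65536):
    """Visit every socket in turn; never block waiting on any one peer."""
    pending = [memoryview(p) for p in payloads]
    received = [bytearray() for _ in receivers]
    total = sum(len(p) for p in payloads)
    while sum(len(r) for r in received) < total:
        for i, sock in enumerate(senders):
            if len(pending[i]) == 0:
                continue
            try:
                sent = sock.send(pending[i][:chunk])  # sends only what fits now
            except BlockingIOError:
                continue  # socket buffer full: move on to the next peer
            pending[i] = pending[i][sent:]
        # Stand-in for the remote receivers.
        for i, sock in enumerate(receivers):
            try:
                received[i] += sock.recv(chunk)
            except BlockingIOError:
                pass  # nothing to read yet
    return [bytes(r) for r in received]

pairs = [socket.socketpair() for _ in range(3)]
for a, b in pairs:
    a.setblocking(False)
    b.setblocking(False)
payloads = [bytes([i]) * 300_000 for i in range(3)]
out = send_round_robin([a for a, _ in pairs], payloads, [b for _, b in pairs])
for a, b in pairs:
    a.close()
    b.close()
print(out == payloads)
```

The loop polls each socket directly instead of waiting on `select()` or `epoll()`, trading a busy core for a bounded time between visits to any one socket.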
  • 19. Logical Disk Distributor [Figure: tuples t0, t1, t2 are hashed to logical disks 0, 1, …, N; H(t0) = 1, H(t1) = N; label: 12.8 KB]
  • 20. Logical Disk Distributor •  Data non-uniform and bursty at short timescales – Big buffers + burstiness = head-of-line blocking – Need to use all your memory all the time •  Solution: Read incoming data into smallest buffer possible, and form chains
  • 21. Coalescer & Writer •  Copies tuples from LDBuffer chains into a single, sequential block of memory •  Longer chains = larger write before seeking = faster writes – Also, more memory needed for LDBuffers •  Buffer size limits maximum chain length – How big should this buffer be?
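
The chain-and-coalesce idea from the last two slides can be sketched as a small data structure. This is an assumption-laden illustration: `ChainStore` and its methods are hypothetical names, buffers are allocated on demand where a real system would draw them from a fixed pool, and the 4-byte buffer size is chosen only to make the chains visible.

```python
from collections import defaultdict

class ChainStore:
    """Per-logical-disk chains of small fixed-size buffers."""

    def __init__(self, buf_size):
        self.buf_size = buf_size
        self.chains = defaultdict(list)  # logical disk id -> list of bytearrays

    def append(self, ld, data):
        """Copy data onto the chain for `ld`, adding small buffers as needed."""
        chain = self.chains[ld]
        view = memoryview(data)
        while len(view):
            if not chain or len(chain[-1]) == self.buf_size:
                chain.append(bytearray())
            room = self.buf_size - len(chain[-1])
            chain[-1] += view[:room]
            view = view[room:]

    def coalesce(self, ld):
        """Detach the chain for `ld` and copy it into one contiguous block."""
        chain = self.chains.pop(ld, [])
        block = bytearray(sum(len(b) for b in chain))
        offset = 0
        for buf in chain:
            block[offset:offset + len(buf)] = buf
            offset += len(buf)
        return bytes(block)

store = ChainStore(buf_size=4)
store.append(1, b"abcdef")
store.append(2, b"xy")
store.append(1, b"gh")
chain_one = [bytes(b) for b in store.chains[1]]
block_one = store.coalesce(1)  # one sequential block, ready for a single write
print(chain_one, block_one)
```

A longer chain yields a larger coalesced block, so the writer can issue a larger sequential write before it has to seek.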
  • 23. Architecture Phase Two: Input Disks → Reader → Sorter → Writer → Output Disks
  • 24. Sort Benchmark Challenge •  Started in 1980s by Jim Gray, now run by a committee of volunteers •  Annual competition with many categories – GraySort: Sort 100 TB •  “Indy” variant – 10 byte key, 90 byte value – Uniform key distribution
  • 25. How balanced are we?
      Phase      Worker Type       Workers  Total Throughput (MBps)  % Over Bottleneck Stage
      Phase One  Reader            8        683                      13%
      Phase One  Node-Distributor  3        932                      55%
      Phase One  LD-Distributor    1        683                      13%
      Phase One  Coalescer         8        18,593                   30,000%
      Phase One  Writer            8        601                      0%
      Phase Two  Reader            8        740                      3.2%
      Phase Two  Sorter            4        1089                     52%
      Phase Two  Writer            8        717                      0%
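
The last column appears to be each stage's total throughput relative to the slowest stage of its phase. The sketch below recomputes it from the throughputs on the slide under that assumption; `percent_over_bottleneck` is a hypothetical helper.

```python
def percent_over_bottleneck(throughputs):
    """Percent by which each stage's throughput exceeds the slowest stage's."""
    bottleneck = min(throughputs.values())
    return {stage: 100 * (mbps / bottleneck - 1)
            for stage, mbps in throughputs.items()}

phase_one = percent_over_bottleneck({
    "Reader": 683, "Node-Distributor": 932, "LD-Distributor": 683,
    "Coalescer": 18593, "Writer": 601,
})
phase_two = percent_over_bottleneck({"Reader": 740, "Sorter": 1089, "Writer": 717})

for stage, pct in {**phase_one, **{f"{k} (phase two)": v for k, v in phase_two.items()}}.items():
    print(f"{stage}: {pct:.1f}%")
```

Most rows agree with the slide to within rounding. On these throughputs the Coalescer works out to just under 3,000%, so the 30,000% shown on the slide may be a typesetting slip.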
  • 26. How balanced are we? Resource utilization by phase:
      Phase      CPU  Memory  Network  Disk
      Phase One  25%  100%    50%      82%
      Phase Two  50%  100%    0%       100%
  • 28. Raw 100TB “Indy” Performance [Bar chart: performance per node (TB per minute). Previous record holder: 0.564 TB per minute on 195 nodes. TritonSort: 0.938 TB per minute on 52 nodes. 6X per-node improvement.]
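
The 6X figure can be reproduced from the four numbers on the chart. A minimal sketch, with illustrative names and decimal units assumed (1 TB = 1,000,000 MB):

```python
def per_node(tb_per_min, nodes):
    """Cluster throughput divided evenly across its nodes, in TB per minute."""
    return tb_per_min / nodes

tritonsort = per_node(0.938, 52)
previous = per_node(0.564, 195)
speedup = tritonsort / previous

# Convert TritonSort's per-node rate to MB per second (decimal units assumed).
mbps_per_node = tritonsort * 1_000_000 / 60

print(f"{tritonsort:.4f} vs {previous:.4f} TB/min/node, {speedup:.1f}x")
print(f"{mbps_per_node:.0f} MBps per node")
```

The ratio comes out slightly above 6, consistent with the "6X" label.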
  • 29. Impact of Faster Disks •  7.2K RPM → 15K RPM drives •  Smaller capacity means fewer LDs •  Examined effect of disk speed and # LDs •  Removing a bottleneck moves the bottleneck somewhere else
      Intermediate Disk  Logical Disks Per  Phase One          Phase One         Average Write
      Speed (RPM)        Physical Disk      Throughput (MBps)  Bottleneck Stage  Size (MB)
      7200               315                69.81              Writer            12.6
      7200               158                77.89              Writer            14.0
      15000              158                79.73              LD Distributor    5.02
  • 30. Impact of Increased RAM •  Hypothesis that memory influences chain length, and thus write speed •  Doubling memory indeed increases chain length, but the effect on performance was minimal •  Increasing a non-bottleneck resource made it faster, but not by much
      RAM Per Node (GB)  Phase One Throughput (MBps)  Average Write Size (MB)
      24                 73.53                        12.43
      48                 76.43                        19.21
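
The "faster, but not by much" claim can be quantified from the two rows of the table. A small sketch with an illustrative helper name:

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100 * (after - before) / before

throughput_gain = percent_change(73.53, 76.43)  # phase one throughput, MBps
write_size_gain = percent_change(12.43, 19.21)  # average write size, MB

print(f"throughput: +{throughput_gain:.1f}%, write size: +{write_size_gain:.1f}%")
```

Doubling RAM grows the average write by more than half, while throughput moves by only a few percent.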
  • 31. Future Work •  Generalization – We have a fast MapReduce implementation – Considering other applications and programming paradigms •  Automatic Tuning – Determine appropriate buffer size & count, # workers per stage for reasonable performance •  Different hardware •  Different workloads
  • 32. TritonSort – Questions? •  Proof-of-concept balanced sorting system •  6x improvement in per-node efficiency vs. previous record holder •  Current top speed: 938 GB per minute •  Future Work: Generalization, Automation http://tritonsort.eng.ucsd.edu/