Even though there have been a large number of proposals to accelerate databases using specialized hardware, often the opinion of the community is pessimistic: the performance and energy efficiency benefits of specialization are seen to be outweighed by the limitations of the proposed solutions and the additional complexity of including specialized hardware, such as field programmable gate arrays (FPGAs), in servers. Recently, however, as an effect of stagnating CPU performance, server architectures started to incorporate various programmable hardware and the availability of such components brings opportunities to databases. In the light of a shifting hardware landscape and emerging analytics workloads, it is time to revisit our stance on hardware acceleration. In this talk we highlight several challenges that have traditionally hindered the deployment of hardware acceleration in databases and explain how they have been alleviated or removed altogether by recent research results and the changing hardware landscape. We also highlight a new set of questions that emerge around deep integration of heterogeneous programmable hardware in tomorrow’s databases.
DC MACHINE-Motoring and generation, Armature circuit equation
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Databases
1. The Glass Half Full
Using Programmable Hardware Accelerators in
Analytical Databases
Zsolt István
IMDEA Software Institute 1
2. IMDEA Software
Institute
• 16 Faculty in the areas of:
• Program Analysis and Verification
• Languages and Compilers
• Security and Privacy
• Theoretical Computer Science
• Distributed Systems and Databases
• ~10 Post-docs, ~25 PhD Students,
~10 Interns
• Located in UPM Montegancedo Campus,
Madrid
• We are hiring! https://software.imdea.org/
3. ▪ OLAP – Online Analytical Processing
▪ Large datasets – up to TBs
▪ Ad-hoc querying to extract insight, recurring
reporting – Possibly complex operations
▪ Read-mostly workloads, updates in batches
▪ OLTP – Online Transaction Processing
▪ Smaller datasets
▪ Queries known, relate to business actions
▪ Makes heavy use of indexes
▪ Reads and updates intermixed
3
Context: Analytical Databases
4. 4
Databases were a 25 Billion $ market in 2018…
Could we specialize machines to them?
https://www.statista.com/statistics/810188/worldwide-commercial-database-market-size/
5. ▪ Fully custom machine for databases
▪ Processors – special ISA microprocessors
▪ Memory – magnetic bubbles and CCDs
▪ Semiconductor technology and
general purpose CPUs took over
5
Database Computer – ’70s
“The first goal is to design it with the
capability of handling a very large on-line
database of 10^10 bytes or beyond since
special-purpose machines are not likely to
be cost-effective for small databases.”
Jayanta Banerjee, David K. Hsiao, Krishnamurthi Kannan: DBC - A Database Computer for Very Large Databases.
IEEE Trans. Computers 28(6): 414-429 (1979)
6. ▪ Based on VAX multi-
processor system
▪ By the time the software
and hardware were
developed, CPUs have
become much faster
▪ Couldn’t keep up with
Moore’s law
6
Gamma Machine – ’80s
David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna: GAMMA - A High
Performance Dataflow Database Machine. VLDB 1986: 228-237
11. ▪ Accelerator
▪ Amazon F1
▪ In data path
▪ Microsoft Catapult
▪ Co-processor
▪ Intel Xeon+FPGA
11
In the Cloud Today
Socket1 Socket2
CPU FPGA
Socket1
CPU FPGA
CPU
FPGA
Intel Xeon+FPGA Gen.1 Intel Xeon+FPGA Gen.2
13. ▪ 1) On the side acceleration introduces overhead
▪ Many related work offers no real speedup if we factor in data
movement, transformation, software overhead… 13
The Glass Half Empty…
0
20
40
60
80
100
120
Software With Acceleration
Query execution time
Compute Data Movement
2xAccel.
Data
14. ▪ 2) “All or nothing” behavior makes query planning difficult
▪ Example: fixed capacity hash table on FPGA
▪ Constant time access for reads and writes
▪ What happens if data doesn’t fit?
▪ Can’t always know the number of keys aprioi
14
The Glass Half Empty…
#
15. ▪ 3) Analytical databases becoming more optimized / not much
compute in core SQL
▪ X100 [CIDR05] showed that <10% of compute time spent on SQL
operators +,-,*,SUM,AVG in analytical queries
▪ Columnar stores often memory bound (10s of GB/s)
15
The Glass Half Empty…
16. ▪ On the side acceleration introduces overhead
▪ “All or nothing” behavior makes query planning difficult
▪ Analytical databases becoming more optimized / not much
compute in core SQL
16
The Glass Half Empty…
17. ▪ On the side acceleration introduces overhead
✓ Reduce data movement bottlenecks
17
The Glass Half Full…
18. ▪ IBEX: Database storage engine with processing offload
▪ Filter and pre-aggregate for analytic workloads
18
Processing in data path: Smart Flash
Database Server
IBEX
SSD
IBEX – An Intelligent Storage Engine with Support for Advanced SQL Off-loading. L. Woods, Z. Istvan
and G. Alonso, VLDB’14
→ Larger bandwidth, more IOPS
(Samsung YourSQL, MIT BlueDBM)
▪ Opportunity to extend SSDs/Flash
with complex offload
Samsung “smart” SSD
19. 19
Processing in data path: Distributed Processing
Workers(Compute)
Storage
+ Provisioning
+ Scalability
Caribou: Distributed
storage with processing
• Specialized HW nodes
• 10Gbps access
• 25W power cons.
Zsolt István, David Sidler, Gustavo Alonso: Caribou: Intelligent Distributed Storage. PVLDB 10(11), 2017.
20. 20
Smart Storage in Databases: Filter push-down
Intel Hyperscan library (Xeon E5-2680 v2)
2.8x
SELECT … FROM customer
WHERE age<35 AND purchases>2
AND address LIKE “%PO. Box 123%”
▪ Challenge: guarantee that filtering never slows down retrieval
▪ Algorithms can be re-imagined to become bandwidth-bound
instead of compute-bound
▪ Extend the state of the art: parameterization without re-programming [FCCM16]
▪ Many options: Regular expressions, comparisons, decompression, …
[FCCM16] Runtime Parameterizable Regular Expression Operators for Databases. Zs. Istvan, D. Sidler, G. Alonso. FCCM’16
21. ✓ Reduce data movement bottlenecks
▪ “All or nothing” behavior makes query planning difficult
✓ Hybrid processing
21
The Glass Half Full…
22. ▪ Group-by: Compute aggregate function over categories
▪ select avg(salary) from employees group by department
22
IBEX’s Hybrid Group-by
CPUIbex with SW-only Group-By
Projection Selection Group-by
Final
Group
s
Input table
Filtered
data
23. ▪ Group-by: Compute aggregate function over categories
▪ select avg(salary) from employees group by department
23
IBEX’s Hybrid Group-by
CPUIbex with HW-only Group-By
Projection Selection Group-by
Final
Group
s
Input table
Filtered
data
24. CPUIbex with HW-only Group-By
Projection Selection Group-by
Final
Group
s
Input table
Filtered
data
▪ Group-by: Compute aggregate function over categories
▪ select avg(salary) from employees group by department
▪ If number of groups does not fit on FPGA?
▪ Send partial aggregates – finalize in SW
▪ Worst case: same as no acceleration
▪ Best- case: All in HW!
24
IBEX’s Hybrid Group-by
CPUIbex with Hybrid Group-by
Input table Projection Selection Group-by Group-by
Final
Group
s
Filtered
data
Partial
Group
s
Challenge: How to split across accelerator and software?
25. ✓ Reduce data movement bottlenecks
✓ Hybrid Processing
▪ Analytical databases becoming more optimized / not much
compute in core SQL
✓ Emerging compute-intensive workloads
25
The Glass Half Full
26. ▪ Databases adopting new ways of analyzing the data
▪ SAP Hana, Oracle, SQL Server, etc.
▪ Specialized hardware can help both with model building [Kara18],
inference [Owaida18]
▪ Benefits for “classical” algorithms as well
[Kara18] Kara et al: ColumnML: Column-Store Machine Learning with On-The-Fly Data Transformation. PVLDB 12(4): 348-361 (2018)
[Owaida18] Owaida et al: Application Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles. FPL 2018: 295-300
26
The Rise of Machine Learning
27. CPUFPGA Co-processor
27
doppioDB: a hybrid database engine
Database
Engine
(MonetDB)
Hardware
Operator
Software
operator
Software
operator
Hardware
Operator
Hardware
Operator
▪ Goal: extend the capabilities of analytical databases
▪ FPGA works on the same data as software (cache-coherent access)
▪ Can combine SW and HW operators inside the same query
▪ Challenge: ensure high utilization of FPGA, use in many queries
DRAM (DB Tables)
No data copy,
transformation,
partitioning, etc.
Hardware
Operator
28. K-means – Algorithm
◼ Goal: partition unlabeled data into several
clusters, where the number of clusters is
the “k” in the k-means.
◼ Two steps in each iteration:
◼ Assignment: assign data points to
closet centroid according to distance
metric
◼ Centroid update: the centroids are re-
calculated by averaging all the data
points within each cluster
◼ Long process if the data set and number of
iterations are large
28
29. DRAM
(DB Tables)
Design – Execution Walk-Through
Receives K-Means parameters1
Fetch the initial centroids and
the data
2
3 Calculates the distance between
a data point and all the centroids
and assign it to closest centroid
4 Accumulates data points per cluster and
counts how many data points are assigned to
each cluster
Collect partial results from each pipeline5
Division for updating new centroid6
Writes back the final results7
1
2
3 4
56
7
Zhenhao He, David Sidler, Zsolt István, Gustavo Alonso: A Flexible K-Means Operator for Hybrid Databases. FPL 2018
29
30. 30
Uses of Parallelism
K is known /
Centroids
known
Need to determine K
(Elbow method)
▪ K-Means algorithm
▪ FPGA outperforms several cores of the CPU
▪ Can use parallelism in two ways – cover more queries
32. ✓ Reduce data movement bottlenecks
✓ Hybrid Processing
✓ Emerging compute-intensive workloads
32
The Glass Half Full…
Future Challenges…
▪ Managing Programmable Hardware accelerators
▪ Is this the job of the OS or does the DB has to take control?
▪ How to share programmable hardware across tenants
▪ Compilation/synthesis of hardware accelerators
▪ Can we derive accelerators from user queries?
▪ Intermediary DSL or building blocks we could use?
For more details, see: The Glass Half Full: Using Programmable Hardware Accelerators in Analytics. Z. István. IEEE Data Engineering
Bulletin, March 2019.