The talk will motivate why Apache Arrow and related projects (e.g. DataFusion) is a good choice for implementing modern analytic database systems. It reviews the major components in most databases and explains where Apache Arrow fits in, and explains additional integration benefits from using Arrow.
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
2021 04-20 apache arrow and its impact on the database industry.pptx
1. Andrew Lamb
Guest Lecture
CSC-132 Database Systems Implementation in Spring '21.
UCSD
April 20, 2021
Apache Arrow and its Impact
on the Database industry
2. IOx Team at InfluxData
Apache Arrow PMC Member
Query Optimizer / Architect @ Vertica
Technical Staff @ Oracle, on Database Server
Chief Architect @ DataRobot (ML Startup)
Chief Architect/VP Engineering @ Nutonian (ML Startup)
XLST JIT Compiler Team at DataPower (XML startup)
3. Goals + Outline
Goal: ⇒ Arrow is a good basis for a new (time series) Databases ❤
● Opinions and Perspectives of Databases
● Why and What of Arrow
● Arrow Examples, in Rust
Props to Wes McKinney who provided many of the arrow slides
4. Databases -- Trend Towards Specialization
Relational
Key-Value
Timeseries
Graph
Array / Scientific
Document
Stream
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st
International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1
Data Model Deployment
Embedded / Edge
Cloud
Single-Node
Hybrid
Ecosystem
Hadoop
Java
Json / Javascript
AWS
GCP
Azure
Apple Cloud
Use Case
Transactions
Analytics
Streaming
...
5. … and our new database is …
🎉
InfluxDB IOx - The Future Core of InfluxDB Built
with Rust and Arrow
6. And we are not alone
A proliferation of new databases…. And many are choosing to use some of the
Arrow ecosystem
⇒ lets them innovate at a higher level not the nuts and bolts
6
Arrow Parquet
● AWS Athena
● ClickHouse
● Snowflake
● Vertica
● DuckDb
● Greenplum
● DataBricks
● …
Arrow Flight
● Google BigQuery
● Snowflake
● Dremio
● DataBricks
● ...
Arrow Arrays
● Dremio
● TileDB
● SciDB
● ...
7. Analytic Systems (vs Transactional)
● Transactional (OLTP, Key-value stores, etc)
○ Workload is “lookup a record by id”, “update a record”, “keep data durable and consistent”
○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, SQLite etc etc
● Analytic (OLAP, “Big Data”, etc)
○ Workload: aggregate many rows to get historical view, bulk loads, rarely updated
○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc etc
⇒ Rest of the talk focused on Analytic Databases
8. So, you want to build a new database… ?
Databases need many features just to look like a database:
● Get Data In and Out
● Store Data and Catalog / Metadata
● Query Language + Query Engine
● Connect: Client API
…
Before you can invest in what makes your database special
9. Implementation timeline for a new Database system
Client
API
In memory
storage
In-Memory
filter + aggregation
Durability /
persistence
Metadata Catalog +
Management
Query
Language
Parser
Optimized /
Compressed
storage
Execution on
Compressed
Data
Joins!
Additional Client
Languages
Outer
Joins
Subquery
support
More advanced
analytics
Cost
based
optimizer
Out of core
algorithms
Storage
Rearrangement
Heuristic
Query
Planner
Arithmetic
expressions
Date / time
Expressions
Concurrency
Control
Data Model /
Type System
Distributed query
execution
Resource
Management
“Lets Build
a Database”
🤔
“Ok now this
is pretty
good”
😐
“Look mom!
I have a
database!”
😃
Online
recovery
11. Arrow Project Goals
“Build a better open source
foundation for data science”
🤔 How is this related to databases?
https://arrow.apache.org/
12. • Open source project begun in 2016, founded by
Wes McKinney
• Initially focused on applying systems / database
research to data science ecosystems (pandas)
13. Data scientists working with “small” data
have not experienced great pain
Small Data (< ~10GB)
Medium Data (~10 - ~100GB)
Big Data (> ~100GB-1TB)
Current Python/R
stack begins to “fail”
around this point
Users doing fine here
NYC R Conference 2018-04-20
14. We can do so much better through modern
systems techniques
Multi-core algorithms,
GPU acceleration,
Code generation (LLVM)
Lazy evaluation,
“query” optimization
Sophisticated memory
management,
Efficient access to huge
data sets
Interoperable memory
models, zero-copy
interchange between
system components
Note 1
Moore’s Law (and small
data) enabled us to get by
for a long time without
confronting some of these
challenges
Note 2
Most of these methods
have already been widely
employed in analytic
databases. Limited
“novel” research needed
NYC R Conference 2018-04-20
15. • Language agnostic, state-of-the-art library
for building fast data systems
• Arrow Memory Format
• Compute Kernels (not standardized)
• Parquet File Format
• Flight: Data RPC / Network Transfer format
• Query Engines (not standardized)
22. Arrow == toolkit for a modern analytic databases
match tool_needed {
File Format (persistence) => Parquet
Columnar memory representation => Arrow Arrays
Operations (e.g. add, multiply) => Compute Kernels
Network transfer => Arrow Flight IPC
_ => ... to be continued ...
}
24. Code Examples
Thesis: “When writing an analytic database, you will end up implementing the
Arrow feature set”
+
* Take performance comparisons with a large grain of salt
Compare Plain Rust and Rust using the Arrow library
30. Find Rows != “us-west”
let not_west_bitset: Vec<bool> =
string_vec
.iter()
.map(|s| s != "us-west")
.collect();
let num_not_west = not_west_bitset
.iter()
.filter(|&&v| v)
.count();
let not_west_bitset =
neq_utf8_scalar(
&array,
"us-west"
).unwrap();
let num_not_west = not_west_bitset
.iter()
.filter(|v| matches!(v, Some(true)))
.count();
> Found 6666667 not in west
~50ms
> Found 6666667 not in west
~120ms
+
31. Find Rows != “us-west” (with null handling)
let string_vec: Vec<Option<String>> = ...;
let not_west_bitset: Vec<bool> =
string_vec
.iter()
.map(|s| {
s.as_ref()
.map(|s| s != "us-west")
.unwrap_or(false)
})
.collect();
let num_not_west = not_west_bitset
.iter()
.filter(|&&v| v)
.count();
+
Same as previous
> Found 6666667 not in west
~50ms
32. Materialize rows for future processing
let not_west: Vec<String> = not_west_bitset
.iter()
.enumerate()
.filter_map(|(i, &v)| {
if v {
Some(string_vec[i].clone())
} else {
None
}
})
.collect();
let not_west = filter(
&array,
¬_west_bitset
).unwrap();
> Made array of 6666667 Strings not in west
~450 ms
> Made array of 6666667 Strings not in west
~50 ms
+
33. More efficient encoding (dictionary)
let vb = StringBuilder::new();
let kb = Int8Builder::new();
let mut builder =
StringDictionaryBuilder::new(vb,kb);
(0..NUM_TAGS)
.enumerate()
.for_each(|(i, _)| {
let location = match i % 3 {
0 => "us-east",
1 => "us-midwest",
2 => "us-west",
};
builder.append(location).unwrap();
});
let array = builder.finish();
> total size: 10000688 bytes
10MB
250 ms
+
dictionary
"us-east"
"us-midwest"
"us-west"
Location
0
1
2
0
1
2
0
1
2
[0]
[1]
[2]
[u8]
35. SIMD Implementation
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
feature = "simd"))]
fn simd_compare_op<T, F>(left: &PrimitiveArray<T>,
right: &PrimitiveArray<T>, op: F) -> Result<BooleanArray>
where
T: ArrowNumericType,
F: Fn(T::Simd, T::Simd) -> T::SimdMask,
{
// use / error checking elided
let null_bit_buffer = combine_option_bitmap(
left.data_ref(), right.data_ref(), len
)?;
let lanes = T::lanes();
let mut result = MutableBuffer::new(
left.len() * mem::size_of::<bool>()
);
let rem = len % lanes;
for i in (0..len - rem).step_by(lanes) {
let simd_left = T::load(left.value_slice(i, lanes));
let simd_right = T::load(right.value_slice(i, lanes));
let simd_result = op(simd_left, simd_right);
T::bitmask(&simd_result, |b| {
result.write(b).unwrap();
});
}
Source: arrow/src/compute/kernels/comparison.rs
if rem > 0 {
let simd_left = T::load(left.value_slice(len - rem, lanes));
let simd_right = T::load(right.value_slice(len - rem,
lanes));
let simd_result = op(simd_left, simd_right);
let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize;
T::bitmask(&simd_result, |b| {
result.write(&b[0..rem_buffer_size]).unwrap();
});
}
let data = ArrayData::new(
DataType::Boolean,
left.len(),
None,
null_bit_buffer,
0,
vec![result.freeze()],
vec![],
);
Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data)))
}
36. Other things needed in a database
Vec<Option<String>> to support nulls
Handle other data types with same code
Vectorized implementations of filter, aggregate, etc
Persist it to storage
Send data over the network
Ecosystem compatibility
...