There are many ways to model data for analytics (e.g., the star schema), but with MariaDB ColumnStore you no longer have to model and index data for specific queries. You do, however, have to keep in mind that the data is stored by column and distributed. Senior Software Engineer Assen Totin shares best practices for data modeling with MariaDB ColumnStore, noting what works really well and when, and what doesn’t.
2. WHY ARE WE HERE?
● If working with a transactional (row) storage is like driving a car (and almost as ubiquitous)...
● … then working with an analytical storage is like driving a trailer.
● Bottom line: change your driving attitude or you’re not even going to make it out of the parking lot!
3. QUICK SUMMARY
● The analytical workload.
● MariaDB ColumnStore brief.
● ColumnStore data modelling: preparing data for loading, preparing appropriate
schema, optimizing the queries and finding your way around them.
● Moving data to ColumnStore: usage scenarios.
● Q & A
5. THE ANALYTICAL WORKLOAD
● A relatively small set of functions is needed, compared to general-purpose scientific work.
● If needed, the business logic can be moved outside the data storage, so the storage can be reduced to its most basic store-and-retrieve functions.
● Data is mostly historic, hence time-sequenced; it is almost exclusively appended and rarely – if ever – updated. Data is almost never deleted.
● Large sets of data are retrieved as batches, often full columns or contiguous ranges of them.
7. COLUMNSTORE STORAGE
● Dedicated columnar storage.
● Data is organised in a hierarchical structure (unlike flat row-based storages).
8. COLUMNSTORE STORAGE
● Each database is a directory, each table is a directory inside it, and each column is a file inside that.
● Columns are split into multiple files (extents) of equal size (8M cells).
● Optional compression, defined per-table.
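As a sketch of the per-table compression setting (the `sales` table is hypothetical; in ColumnStore 1.x the compression mode is passed via the table comment, with compression=0 meaning none):

```sql
-- Hypothetical table; compression is selected per-table via the comment
-- (compression=0: none; compression=2: the default compression type).
CREATE TABLE sales (
  sale_id BIGINT,
  ts      DATETIME,
  amount  DECIMAL(10,2)
) ENGINE=ColumnStore COMMENT='compression=2';
```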
9. COLUMNSTORE STORAGE
● Data can be loaded (written) directly into extents.
● This completely bypasses the SQL layer, leaving it free to process queries.
● Once writing completes, the processing engine is notified that new data is available.
10. COLUMNSTORE STORAGE
● For each extent, some metadata (MIN and MAX values, etc.) is calculated and stored in memory.
● Queries use a divide-and-conquer strategy: eliminate all unnecessary extents and load only the ones needed.
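As a sketch of the elimination at work, assuming the data was loaded in trans_date order (names follow the bookstore example used later in the deck): each 8M-cell extent then covers a narrow date range, and a range predicate lets ColumnStore discard most extents from their in-memory MIN/MAX values alone:

```sql
-- Only extents whose [MIN, MAX] range for trans_date overlaps
-- January 2018 are read from disk; the rest are eliminated up front.
SELECT COUNT(*)
FROM transactions
WHERE trans_date BETWEEN '2018-01-01' AND '2018-01-31';
```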
12. COLUMNSTORE CLUSTER
● In ColumnStore, a module is a set of running processes.
● Two types of modules (nodes), User Module (UM) and Performance Module
(PM).
● User Module: provides client connectivity (speaks SQL) and has local storage
engines (InnoDB, MyISAM...). More UM = more concurrent connections and
HA. UM can be replicated.
● Performance Module: stores actual data. More PM = more data stored.
● For development purposes, one UM and one PM may run together on a single host.
13. COLUMNSTORE TABLES
● ColumnStore is a storage engine in MariaDB.
● To create a ColumnStore table, use
CREATE TABLE… ENGINE=ColumnStore
● Just as with any other MariaDB server, you can mix-and-match different storage
engines in one database.
● Just as with any other MariaDB server, you can do a cross-engine JOIN
between ColumnStore tables and tables in local storage engines on the UM.
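Both points can be sketched with the bookstore schema used in the EXPLAIN example (column names are illustrative): the big table goes to ColumnStore, while a small, frequently-updated lookup stays in a local engine on the UM, and the two can still be joined:

```sql
-- Big, append-heavy table: distributed ColumnStore storage.
CREATE TABLE transactions (
  book_id     INT,
  customer_id INT,
  trans_date  DATE,
  currency_id INT,
  price       DECIMAL(10,2)
) ENGINE=ColumnStore;

-- Small lookup table kept in a local engine on the UM.
CREATE TABLE currencies (
  currency_id INT PRIMARY KEY,
  code        CHAR(3)
) ENGINE=InnoDB;

-- Cross-engine JOIN (requires the cross-engine support user to be configured).
SELECT t.trans_date, t.price, c.code
FROM transactions t
JOIN currencies c ON c.currency_id = t.currency_id;
```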
14. COLUMNSTORE DATA DISTRIBUTION
● ColumnStore tables are always distributed (assuming more than one PM).
● ColumnStore distributes data across the PM nodes in round-robin fashion.
● When a new (empty) PM is added, it receives data until its size catches up with
other PM.
● Manual control over data distribution is possible when side-loading via the Bulk
Load API: cpimport modes 2 & 3.
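A sketch of the cpimport modes mentioned above (database, table, and file names are hypothetical; check the cpimport documentation for your version for the exact flags):

```shell
# Mode 1 (default): run on the UM; rows are distributed across all PMs.
cpimport bookstore transactions /data/transactions.csv

# Mode 2: run on the UM, but target a specific PM with a pre-split file.
cpimport -m 2 -P 1 bookstore transactions /data/transactions_pm1.csv

# Mode 3: run locally on a PM, loading only that PM's share of the data.
cpimport -m 3 bookstore transactions /data/transactions_local.csv
```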
17. NO INDICES, PLEASE!
● ColumnStore has no indices: with big data indices do not fit into memory, so
they become useless.
● This drastically reduces I/O; ColumnStore’s I/O requirements are significantly lower than those of a row storage (it works very well on spinning media).
● It also reduces the CPU load otherwise spent on maintaining indices.
● The filesystem is always in-sync: file-level backup in real-time is again possible
and natural.
● Direct injection of data into the storage (bypassing SQL layer) is now possible.
● Instead of indices, ColumnStore uses divide-and-conquer to only load what’s
needed to serve a query.
18. PREPARING DATA FOR LOAD
● ColumnStore will append the data in the order we send it, so it is up to us to
order it.
● In order for the divide-and-conquer approach to work best, data has to be
arranged in sequential fashion (because then the most extents can be
eliminated before actual data read from disk begins).
● Examine your data and identify columns with incremental (or time-based)
ordering.
● Examine your queries and find which of these columns is most often used as a predicate.
● Order the data by this column prior to loading it.
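Following the steps above, the pre-sort can be done on the OLTP side during export (a sketch; table and path names are hypothetical, and '|' matches cpimport's default field delimiter):

```sql
-- Export in trans_date order, so consecutive extents end up covering
-- non-overlapping date ranges and extent elimination works best.
SELECT book_id, customer_id, trans_date, price
FROM oltp_transactions
ORDER BY trans_date
INTO OUTFILE '/tmp/transactions_sorted.csv'
FIELDS TERMINATED BY '|';
```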
19. CLUSTERING THE SCHEMA
● ColumnStore follows the map/reduce approach: each PM does the same work
on its part of the data (map), then all results are aggregated by a UM (reduce).
● To distribute a JOIN (push-down to all PM) one needs to ensure that either
– each node has one of the sides in full, or
– both sides are partitioned by the same key.
● With automated data distribution, ColumnStore finds the smaller side of the
JOIN and redistributes it on-the-fly to facilitate a distributed JOIN. If the smaller
side is bigger than a threshold, the JOIN is pushed up to the UM (which
requires more RAM).
20. CLUSTERING THE SCHEMA
● The optimal ColumnStore schema will thus consist of a small number of big tables and a larger number of small tables, so that a JOIN between a big table and a small table can be distributed.
● This schema assumes a high degree of data normalisation: the big tables contain as many references as possible to small tables, from which the actual values are derived.
● This schema is usually referred to as a star schema: one big table (in the centre) linked to multiple small tables (around it).
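A minimal star-schema sketch under these guidelines (all names hypothetical; note that ColumnStore does not enforce foreign keys, so the references are by convention only):

```sql
-- Fact table: one row per sale event, appended on every load.
CREATE TABLE fact_sales (
  sale_ts    DATETIME,
  store_id   INT,            -- references dim_store.store_id
  product_id INT,            -- references dim_product.product_id
  amount     DECIMAL(10,2)
) ENGINE=ColumnStore;

-- Dimension tables: small, nearly immutable.
CREATE TABLE dim_store   (store_id   INT, city VARCHAR(64))  ENGINE=ColumnStore;
CREATE TABLE dim_product (product_id INT, name VARCHAR(128)) ENGINE=ColumnStore;
```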
22. CLUSTERING THE SCHEMA
● The big table in the centre is called a fact table, because it contains data (rows) related to events (facts) that occurred at different moments in time. These facts are usually tied to the technical or business activity the schema represents (e.g., each sale could be a fact, registered in one row; or each reading of a sensor value in an IoT system, etc.).
● The fact table is amended in each new data load (new rows = new events).
● New rows are appended to the end of the fact table.
● Generally, older (time-wise) facts precede the newer ones.
23. CLUSTERING THE SCHEMA
● The small tables linked to the fact table are called dimension tables, because they contain data describing the properties of the facts.
● Dimension tables consist of things like nomenclatures and other nearly-immutable data: e.g., the list of states and cities, the list of points of sale, etc.
● Dimension tables are rarely amended.
24. CLUSTERING THE SCHEMA
● Adding a second layer of links produces a more complicated design, sometimes called a snowflake schema.
● In a multi-tier (snowflake) schema, a table may be a dimension to one level and
a fact to another, e.g. the list of telco subscribers may be a fact (linked to
dimensions like the subscription plan), but also a dimension (to which the list of
phone calls links).
26. OPTIMIZING THE QUERIES
● An important prerequisite for properly designing a schema is to know how it is
going to be used.
● Ensure the queries and the star schema match each other.
● Always JOIN a fact table to a dimension table only. Never JOIN two fact tables!
● As each column is a separate set of files, the more columns are requested in the result set, the more data has to be read from disk; always specify the exact columns needed, and only those; never do SELECT *.
27. OPTIMIZING THE QUERIES
● Filter on sequential columns as much as possible.
● Filter on actual values, not on functions, because functions prevent extent
elimination and lead to full column scan; make extra separate columns if
needed, e.g. have a separate column year instead of YEAR(date).
● ORDER BY and LIMIT are run last and always on the UM, so can be expensive
(depending on amount of data).
● A JOIN with a table from a local storage engine (InnoDB, MyISAM...) is done by first fetching the local table from the UM. As this requires a loopback connection, it is often relatively slow – so consider its usage carefully.
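To illustrate the function-versus-value point (hypothetical names, with a sale_year column populated at load time): the first query below scans the full column, while the second can still eliminate extents:

```sql
-- Bad: YEAR() must be evaluated per row, so no extent can be skipped.
SELECT SUM(amount) FROM fact_sales WHERE YEAR(sale_ts) = 2018;

-- Better: a stored value lets the MIN/MAX metadata eliminate extents.
SELECT SUM(amount) FROM fact_sales WHERE sale_year = 2018;
```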
28. OPTIMIZING DIMENSIONS
● Keep dimensions small (up to 1M rows) as they will be redistributed on-the-fly
for each JOIN.
● Increase the distributed JOIN threshold for bigger dimensions (but carefully).
This is a cluster-wide tunable from Columnstore.xml.
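The tunable in question lives in Columnstore.xml on every node; a sketch of the relevant fragment (the element name PmMaxMemorySmallSide is the one used in ColumnStore 1.x releases; verify it against your version):

```xml
<!-- Columnstore.xml fragment: maximum size of the smaller JOIN side
     that may be redistributed to the PMs for a distributed JOIN. -->
<HashJoin>
  <PmMaxMemorySmallSide>1G</PmMaxMemorySmallSide>
</HashJoin>
```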
29. EXTENDING COLUMNSTORE ENGINE
● The ColumnStore engine might not always be the best choice (e.g., data type support, encoding support, etc.).
● Local storage engines on UM may supplement the ColumnStore engine via
cross-engine JOIN.
● Usually multiple UMs will be replicated, so tables in local storage engines are also replicated… but in some special cases you may not want to replicate them, effectively keeping different content for the same local table on different UMs; in that case, make sure each job runs only on the UM it is connected to (access to the ExeMgr process).
30. TRACING YOUR STEPS: EXPLAIN
● EXPLAIN works for ColumnStore, but is less useful (no indices)
SELECT t.customer_id, t.discount, t.discounted_price
FROM transactions t
JOIN books b ON b.book_id=t.book_id
WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
MariaDB [bookstore]> EXPLAIN SELECT t.customer_id, t.discount, t.discounted_price FROM transactions t JOIN books b ON
b.book_id=t.book_id WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
| 1 | SIMPLE | t | ALL | NULL | NULL | NULL | NULL | 2000 | Using where with pushed condition |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 2000 | Using where; |
| | | | | | | | | | Using join buffer (flat, BNL join) |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
31. TRACING YOUR STEPS: STATS
● Use SELECT calGetStats(), which provides statistics about the resources used on the User Module (UM) node, the PM nodes, and the network by the last query run.
● 582979 rows in set (3.373 sec)
MariaDB [bookstore]> SELECT calGetStats();
+---------------------------------+
| Query Stats: |
| MaxMemPct-3; |
| NumTempFiles-0; |
| TempFileSpace-0B; |
| ApproxPhyI/O-71674; |
| CacheI/O-47150; |
| BlocksTouched-47128; |
| PartitionBlocksEliminated-1413; |
| MsgBytesIn-37MB; |
| MsgBytesOut-63KB; |
| Mode-Distributed |
+---------------------------------+
32. TRACING YOUR STEPS: TRACE
● To trace a query, first enable tracing with SELECT calSetTrace(1), then run
the query, then get the trace with SELECT calGetTrace().
MariaDB [bookstore]> SELECT calGetTrace();
+-------------------------------------------------------------------------------------------------------------------+
| Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows |
| BPS PM b 301760 (book_id) 0 7 0 0.002 5001 |
| BPS PM t 301805 (book_id,customer_id,discount,discounted_price,trans_date) 0 17280 1413 0.308 582979 |
| HJS PM t-b 301805 - - - - ----- - |
| TNS UM - - - - - - 2.476 582979 |
+-------------------------------------------------------------------------------------------------------------------+
34. MOVING DATA TO COLUMNSTORE
● Scenario A: use the same schema as in the transactional database.
● Only use ColumnStore as long-term cold storage for large amounts of data.
● No OLAP as schema does not match requirements.
● No OLTP as data is too big.
● Copy selected parts of the data back to OLTP engine for processing.
35. MOVING DATA TO COLUMNSTORE
● Scenario B: use a dedicated star schema.
● Actively use ColumnStore as OLAP backend.
● Load the data from OLTP storage in batches: ETL with either LOAD DATA or
Bulk Load API (preferred: cpimport, shared library/JAR).
● Use any preferred front-end tool to drive the analytics (Tableau, Pentaho Mondrian, Microsoft SSAS, Apache Zeppelin…).
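A sketch of the LOAD DATA variant of the batch ETL in Scenario B, assuming a pre-sorted export file is already on the UM (names hypothetical; cpimport remains the preferred, faster path):

```sql
LOAD DATA INFILE '/tmp/transactions_sorted.csv'
INTO TABLE transactions
FIELDS TERMINATED BY '|';
```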