A short introduction to Vertica
1. A short introduction to Vertica
Tommi Siivola, Software Engineer
RedHat Software Developer Meetup 10.09.2014
2. Agenda
- Quick orientation
- Columns
- Projections
- Clustering
- Hybrid storage
- Special features
3. Quick orientation to Vertica
- Big data database product from HP
- For handling terabytes/petabytes of data
- Column-oriented
4. Quick orientation to Vertica
- What does that mean in practice?
– Vertica is a relational database
– Supports a subset of ANSI SQL-99 standard
– JDBC/ODBC drivers
– A command line client (vsql)
5. Quick orientation to Vertica
- Runs on major Linux distros (RHEL, SUSE, Debian, Ubuntu)
- Amazon AMI available for running Vertica in the cloud
- Up to 1 TB of data and a cluster of 3 nodes without a license
(the so-called "Community Edition" mode)
- Larger setups require a license from HP
6. Concepts: column-oriented
- Vertica stores data as columns, instead of storing each row as a unit
– Allows for efficient data compression
– Can skip unwanted columns when querying
– More efficient aggregate value calculations
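A minimal Python sketch of why a column layout compresses well and makes aggregates cheap. It is a toy model, not Vertica's storage engine: the rows and the `run_length_encode` helper are assumptions made for illustration, and run-length encoding is just one simple example of a column encoding.

```python
from itertools import groupby

# Toy rows in the date/value/id shape used elsewhere in this deck.
rows = [
    ("2014-03-15", 23.97, 4),
    ("2014-03-15", 24.51, 7),
    ("2014-03-15", 25.05, 6),
    ("2014-03-16", 26.13, 7),
    ("2014-03-16", 26.67, 4),
]

# Column-oriented layout: one list per column instead of one tuple per row.
dates, values, ids = (list(col) for col in zip(*rows))


def run_length_encode(column):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


# Repeated values sit next to each other in a column, so they encode compactly.
encoded_dates = run_length_encode(dates)
print(encoded_dates)  # [('2014-03-15', 3), ('2014-03-16', 2)]

# An aggregate only has to walk the single column it needs.
total_value = sum(values)
```

Five date strings shrink to two (value, count) pairs, and the sum never touches the date or id columns.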
9. Concepts: column-oriented
Skip unwanted columns:
date        value  id
2014-03-15  23.97  4
2014-03-15  24.51  7
2014-03-15  25.05  6
2014-03-15  25.59  7
2014-03-16  26.13  7
2014-03-16  26.67  4
2014-03-16  27.21  2
2014-03-16  27.75  3
2014-03-16  28.29  7
SELECT value, id FROM table
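The query above can be mimicked with a toy in-memory column store. This is only a sketch under assumptions: the `table` dictionary, the `select` helper and the `columns_read` bookkeeping are hypothetical names used to show that the date column is never touched.

```python
# Hypothetical in-memory column store: column name -> list of values.
table = {
    "date": ["2014-03-15", "2014-03-15", "2014-03-16"],
    "value": [23.97, 24.51, 26.13],
    "id": [4, 7, 7],
}

columns_read = []  # records which columns the "query" actually touched


def select(store, wanted):
    """Build result rows from the wanted columns only."""
    picked = []
    for name in wanted:
        columns_read.append(name)
        picked.append(store[name])
    return list(zip(*picked))


# Equivalent of: SELECT value, id FROM table
result = select(table, ["value", "id"])
print(result)        # [(23.97, 4), (24.51, 7), (26.13, 7)]
print(columns_read)  # ['value', 'id']
```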
10. Concepts: projections
- Data physically stored in projections
- Projections similar to materialized views
– Data is organized for querying at insert time
- Table has one or more projections
- Projection contains one or more columns
- Data can be duplicated in projections for query efficiency
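The projection idea can be sketched in Python as follows. This is a toy model, not Vertica's implementation: the `Projection` and `Table` classes and the `pick_projection` rule (first projection that covers the needed columns) are assumptions made for illustration.

```python
import bisect


class Projection:
    """Toy projection: a subset of a table's columns, kept sorted on insert."""

    def __init__(self, columns, sort_key):
        self.columns = columns
        self.sort_index = columns.index(sort_key)
        self.rows = []

    def insert(self, record):
        row = tuple(record[c] for c in self.columns)
        keys = [r[self.sort_index] for r in self.rows]
        # Pay the sorting cost at insert time so that reads are cheap later.
        self.rows.insert(bisect.bisect_right(keys, row[self.sort_index]), row)


class Table:
    """Toy table: every insert is written to all of its projections."""

    def __init__(self, projections):
        self.projections = projections

    def insert(self, record):
        for projection in self.projections:
            projection.insert(record)

    def pick_projection(self, needed):
        """Return the first projection that contains all needed columns."""
        for projection in self.projections:
            if set(needed) <= set(projection.columns):
                return projection
        raise LookupError("no projection covers columns %r" % (needed,))


by_id = Projection(["id", "value"], sort_key="id")
by_date = Projection(["date", "value", "id"], sort_key="date")
table = Table([by_id, by_date])

for record in [
    {"date": "2014-03-16", "value": 26.13, "id": 7},
    {"date": "2014-03-15", "value": 23.97, "id": 4},
    {"date": "2014-03-15", "value": 25.05, "id": 6},
]:
    table.insert(record)

print(table.pick_projection(["id", "value"]).rows)
# [(4, 23.97), (6, 25.05), (7, 26.13)]
```

The value and id columns are stored twice, once per projection, which is the duplication-for-query-efficiency trade-off named above.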
12. Concepts: clustering
- Parallel processing
– Data segments distributed across cluster nodes
– Performance can be increased by adding hardware
- Reliability (K-safety)
– Tolerates nodes going offline
- All nodes can respond to queries → queries can be load-balanced between nodes
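A rough Python sketch of segmentation with one redundant copy per row. Everything here is an assumption for illustration: the node names, the CRC32-based placement and the "buddy copy on the next node" rule are a simplified stand-in, not Vertica's actual segmentation or K-safety mechanism.

```python
import zlib

NODES = ["node1", "node2", "node3"]  # hypothetical three-node cluster


def owner_index(key):
    """Deterministically map a segmentation key to a node index."""
    return zlib.crc32(str(key).encode()) % len(NODES)


def place(rows, key_of):
    """Store each row on its primary node and on the next node as a buddy copy."""
    storage = {node: [] for node in NODES}
    for row in rows:
        primary = owner_index(key_of(row))
        buddy = (primary + 1) % len(NODES)
        storage[NODES[primary]].append(row)
        storage[NODES[buddy]].append(row)
    return storage


def readable_rows(storage, down):
    """Rows still reachable when the node named `down` is offline."""
    seen = set()
    for node, node_rows in storage.items():
        if node != down:
            seen.update(node_rows)
    return seen


rows = [
    ("2014-03-15", 23.97, 4),
    ("2014-03-15", 24.51, 7),
    ("2014-03-15", 25.05, 6),
    ("2014-03-16", 27.21, 2),
    ("2014-03-16", 27.75, 3),
]
storage = place(rows, key_of=lambda row: row[2])  # segment by the id column

# With one extra copy of every row, any single node can go offline.
for node in NODES:
    assert readable_rows(storage, down=node) == set(rows)
```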
15. Concepts: Hybrid storage
- Read-optimized storage (ROS)
– On disk
– Heavily encoded & compressed
- Write-optimized storage (WOS)
– In memory
– No encoding or compression
16. Concepts: Hybrid storage
- Inserted data is first accumulated in WOS
– Inserting to WOS is faster, due to the lack of compression and disk write overheads
- Background job moves data in batches from WOS to ROS
– Writing to ROS is more efficient in batches
– Querying is more efficient from ROS
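The WOS/ROS flow can be sketched in Python like this. It is a toy model under assumptions: `HybridStore`, its `moveout_threshold` parameter and the sorted lists standing in for encoded on-disk containers are all hypothetical, and the move happens synchronously here rather than in a background job.

```python
class HybridStore:
    """Toy hybrid store: an unsorted in-memory buffer plus sorted batches."""

    def __init__(self, moveout_threshold=3):
        self.wos = []  # write-optimized: append only, no sorting or encoding
        self.ros = []  # read-optimized: list of sorted batches ("containers")
        self.moveout_threshold = moveout_threshold

    def insert(self, value):
        self.wos.append(value)  # cheap: no disk write, no compression
        if len(self.wos) >= self.moveout_threshold:
            self.moveout()

    def moveout(self):
        """Move the whole WOS buffer into ROS as one sorted batch."""
        if self.wos:
            self.ros.append(sorted(self.wos))
            self.wos = []

    def scan(self):
        """A query has to see both the ROS batches and the pending WOS rows."""
        result = list(self.wos)
        for container in self.ros:
            result.extend(container)
        return sorted(result)


store = HybridStore(moveout_threshold=3)
for value in [5, 3, 9, 1, 7]:
    store.insert(value)

print(store.ros)     # [[3, 5, 9]]
print(store.wos)     # [1, 7]
print(store.scan())  # [1, 3, 5, 7, 9]
```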
17. Vertica feature: Pattern matching
- Example: finding sequences in web site log data
- Find all sequences where a user enters the site, browses and finally makes a purchase
- Difficult to express in SQL
- Vertica has an SQL extension for finding patterns
Patterns in data:
user  action
1     enter
1     browse
1     browse
1     purchase
2     enter
2     browse
3     enter
3     browse
3     purchase
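To make the goal concrete, here is a small Python sketch that finds the enter, browse, purchase sequence in the sample user/action data. It is not how Vertica evaluates patterns: the single-letter event codes, the regular expression and the `users_with_purchase_path` helper are assumptions made for illustration.

```python
import re
from itertools import groupby

# The user/action sample data from this slide, already ordered by user.
events = [
    (1, "enter"), (1, "browse"), (1, "browse"), (1, "purchase"),
    (2, "enter"), (2, "browse"),
    (3, "enter"), (3, "browse"), (3, "purchase"),
]

# Encode each action as one letter so a sequence becomes a string.
CODES = {"enter": "E", "browse": "B", "purchase": "P"}

# One enter, any number of browse events, then a purchase.
PATTERN = re.compile(r"EB*P")


def users_with_purchase_path(log):
    """Return the users whose event sequence contains enter, browse*, purchase."""
    matched = []
    # groupby relies on the log being grouped by user, as it is above.
    for user, group in groupby(log, key=lambda event: event[0]):
        encoded = "".join(CODES[action] for _, action in group)
        if PATTERN.search(encoded):
            matched.append(user)
    return matched


print(users_with_purchase_path(events))  # [1, 3]
```

User 2 enters and browses but never purchases, so only users 1 and 3 match.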
18. Vertica feature: Pattern matching
- Example: find sequences where user enters a site, browses and makes a purchase
SELECT uid, sid, ts, refurl, pageurl, action,
       event_name(), pattern_id(), match_id()
FROM clickstream_log
MATCH
  (PARTITION BY uid, sid ORDER BY ts
   DEFINE
     Entry    AS refurl NOT ILIKE '%site.com%' AND pageurl ILIKE '%site.com%',
     Onsite   AS pageurl ILIKE '%site.com%' AND action = 'V',
     Purchase AS pageurl ILIKE '%site.com%' AND action = 'P'
   PATTERN
     P AS (Entry Onsite* Purchase)
   ROWS MATCH FIRST EVENT);
19. Extending Vertica
- Custom SQL functions can be created with R, Java or C++
- R: scalar and transform functions
- Java: same as R, plus load functions
- C++: same as Java, plus aggregate and analytic functions
20. Find out more
- Free Vertica downloads (registration required)
– my.vertica.com
- Vertica documentation (no registration required)
– www.vertica.com/documentation
- C-Store research project (Vertica predecessor)
– db.csail.mit.edu/projects/cstore/
21. THANKS!
Tommi Siivola, Software Engineer
tommi.siivola@eficode.com
+358 (0)50 371 9308
eficode.fi