Alluxio Global Online Meetup
August 25, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Abner Ferreira, Simbiose Ventures
Caio Pavanelli, Simbiose Ventures
Bin Fan, Alluxio
Over the last few years, organizations have worked towards separating storage and compute for benefits in cost, data duplication, and data latency. The cloud resolves most of these issues but comes at the expense of needing a way to query data on remote storage. Alluxio and Presto are a powerful combination for addressing this compute problem, and they are part of the strategy Simbiose Ventures used to create StorageQuery, a platform for querying files in cloud storage with SQL.
This talk will focus on:
- How Alluxio fits StorageQuery's tech stack;
- Advantages of using Alluxio as a cache layer and its unified filesystem;
- Development of a new under file system for Backblaze B2 and fine-grained code documentation;
- ShannonDB remote storage mode.
3. Agenda
• Overview of our company (Simbiose)
• Overview of our product (StorageQuery)
• Alluxio usage on StorageQuery
• Purpose and advantages
• Technical Integration
• Learning Alluxio the hard (or easy) way
• Alluxio Debug Log (based on AspectJ)
• Alluxio usage on other Simbiose projects
5. Who are we?
• A team focused primarily on human growth and behavioral development
• We are currently working on the development of new solutions for data technologies
7. Get to know StorageQuery
Logical Queries
Query files directly from the
source with Presto. A suitable tool
for data lakes and log analysis.
Fully Managed
No need to worry about setting
up or managing servers. Your
only focus is querying files.
Plug and Play
Access keys are the only
requirement to make queries on
your object stores.
Access Control
Use fine-grained permissioning
options and get more security
and governance for your data.
A solution for making interactive queries on files from multiple buckets
without having to set up any infrastructure.
8. Use Cases
Data Lake
Store high volumes of raw data
and allow users to query it using
our tools or external solutions.
Log Analysis
Analyze machine data using SQL
or our API to obtain interesting
findings in your logs.
9. Made for Developers
Java Ruby Python
Compatible with
DevOps
● Users responsible for
infrastructure who can
benefit from log analysis.
● Analyze logs directly from
cloud storages without
losing time with setup and
management.
DataOps
● Users responsible for data
management who can
benefit from data lakes.
● Granular permissioning
options allow safe data
sharing among multiple
users.
Advantages
● Query your data using SQL
or through our API.
● Connection to the most
common analysis tools
using JDBC drivers.
● Automatic schema
detection.
Data Formats
● CSV
● Parquet
● Avro
● ORC
12. Use Cases
What it does
• Alluxio caches buckets from their original sources, accessed with the provided access token, in order to
speed up queries.
How we use it
• Queries are made with SQL using Presto alongside Alluxio. This allows files from different buckets to be
queried using join clauses, for instance.
Why we use it
• Alluxio gives us an already tested and working tool for caching cloud buckets, and lets us query
data with Presto without having to customize it for every object store.
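The caching role described above can be pictured as a read-through cache. The following is a minimal Java sketch of that general pattern, not Alluxio's implementation: the class name ReadThroughCacheSketch, the remoteFetch function, and the example paths are all hypothetical, and eviction, tiering, and concurrency are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache: the first read of a path goes to the
// remote object store, later reads of the same path are served locally.
class ReadThroughCacheSketch {
    private final Map<String, byte[]> cache = new HashMap<>();
    private final Function<String, byte[]> remoteFetch; // stand-in for an object store client
    private int remoteReads = 0;

    ReadThroughCacheSketch(Function<String, byte[]> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    byte[] read(String path) {
        byte[] cached = cache.get(path);
        if (cached != null) {
            return cached; // cache hit: no remote round trip
        }
        remoteReads++;
        byte[] data = remoteFetch.apply(path); // cache miss: fetch from the remote store
        cache.put(path, data);
        return data;
    }

    int remoteReads() {
        return remoteReads;
    }

    public static void main(String[] args) {
        // The lambda fakes a remote store by echoing the path as file content.
        ReadThroughCacheSketch cache =
            new ReadThroughCacheSketch(path -> path.getBytes(StandardCharsets.UTF_8));
        cache.read("bucket-a/events.csv");
        cache.read("bucket-a/events.csv");
        System.out.println("remote reads: " + cache.remoteReads());
    }
}
```

A query engine that reads the same file repeatedly pays the remote latency only on the first read, which is the effect the slide attributes to the cache layer.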
14. Improving Alluxio
Backblaze B2 source
• Our team created the official Backblaze B2 source (Under FS Storage)
• PR #11708, currently pending final corrections before it can be merged.
• Read and write support (not read only)
16. Learning Alluxio
Main Goals
• Deeply understand Alluxio’s architecture
• Help the Alluxio team optimize for the number of files and for
read and write latency
• Add new sources (such as Backblaze B2)
• Improve multi-tenancy support
17. Alluxio Debug Log
AspectJ
• Aspect-oriented programming extension for Java.
• Aims to increase modularity by allowing the separation of
cross-cutting concerns.
• Allows the implementation of additional behavior to existing
code without modifying the code itself.
18. Alluxio Debug Log
Overview
• Library that uses AspectJ to create a logging system for any specified
Alluxio flows and/or processes.
• For a specified flow, it prints every method called within a provided
range (start and finish). For example, whenever fs mount <args> is
executed, all methods triggered by this command are printed to the
console, identified as “FsMountFlow.”
• To enable this functionality, activate the Maven profile
“aspectj-logging” when building Alluxio.
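As a rough illustration of the idea behind the debug log (adding logging around existing calls without editing the called code), the sketch below uses the JDK's java.lang.reflect.Proxy rather than AspectJ. It is a swapped-in technique that only works on interfaces, not the library described above. MountService, FlowTraceSketch, and the example paths are hypothetical; only the flow label "FsMountFlow" comes from the slide.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a service whose calls we want to trace.
interface MountService {
    String mount(String alluxioPath, String ufsPath);
}

class FlowTraceSketch {
    // Returns a proxy that records an "enter" and an "exit" entry around
    // every interface call, leaving the target's own code untouched.
    static <T> T traced(T target, Class<T> iface, String flowName, List<String> trace) {
        InvocationHandler handler = (proxy, method, args) -> {
            trace.add(flowName + " enter " + method.getName());
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the target's original exception
            } finally {
                trace.add(flowName + " exit " + method.getName());
            }
        };
        Object proxy = Proxy.newProxyInstance(
            iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(proxy);
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        MountService plain = (alluxioPath, ufsPath) -> alluxioPath + " -> " + ufsPath;
        MountService service = traced(plain, MountService.class, "FsMountFlow", trace);
        System.out.println(service.mount("/data", "ufs://example-bucket"));
        trace.forEach(System.out::println);
    }
}
```

AspectJ goes further than this sketch: it can intercept methods on concrete classes and nested internal calls, which is what makes it possible to capture a whole flow rather than a single interface boundary.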
23. Alluxio Debug Log
FSMount - Future
## Step 1 - public void mount(final AlluxioURI alluxioPath, final AlluxioURI ufsPath, final MountContext context) -
**METHOD IN A LOOP FLOW**
**File**: alluxio/master/file/DefaultFileSystemMaster.java
**Javadocs**: NEW DOCS WILL APPEAR HERE
## Step 2 - private MountPResponse lambda$9(final MountPRequest arg0)
**File**: alluxio/master/file/FileSystemMasterClientServiceHandler.java
**Javadocs**: NEW DOCS WILL APPEAR HERE
## Step 3 - public void mount(final MountPRequest request, final StreamObserver<MountPResponse> responseObserver)
**File**: alluxio/master/file/FileSystemMasterClientServiceHandler.java
**Javadocs**: NEW DOCS WILL APPEAR HERE
25. ShannonDB
Overview
• ShannonDB is an OLAP database created by Simbiose
Ventures in 2015.
• Its main advantage lies in its ability to compress data
heavily, approaching the theoretical limit proposed by
Claude Elwood Shannon (1916 – 2001), the “father of
information theory.”
• ShannonDB is used extensively throughout Simbiose
Ventures’ infrastructure, because it allows data to be
stored at lower cost and queries to be made with very
little latency.
Claude Shannon
Claude Elwood Shannon was an
American mathematician,
electrical engineer, and
cryptographer noted for having
founded information theory.
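To give a sense of what a theoretical compression limit looks like, the sketch below computes the empirical Shannon entropy of a byte array, which lower-bounds the average code length for any lossless code that treats bytes as independent symbols. This assumes the limit mentioned above refers to Shannon's source coding bound; the class EntropySketch and its method names are hypothetical, and nothing here reflects how ShannonDB actually compresses data.

```java
import java.nio.charset.StandardCharsets;

class EntropySketch {
    // Empirical Shannon entropy of the byte distribution, in bits per byte.
    // The result lies between 0.0 (one repeated byte) and 8.0 (uniform bytes).
    static double bitsPerByte(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        double entropy = 0.0;
        for (long c : counts) {
            if (c == 0) {
                continue; // absent symbols contribute nothing
            }
            double p = (double) c / data.length;
            entropy -= p * (Math.log(p) / Math.log(2)); // -p * log2(p)
        }
        return entropy;
    }

    // Lower bound in bytes for a code that models each byte independently.
    static double minBytes(byte[] data) {
        return bitsPerByte(data) * data.length / 8.0;
    }

    public static void main(String[] args) {
        byte[] sample = "abababab".getBytes(StandardCharsets.UTF_8);
        System.out.println("bits per byte: " + bitsPerByte(sample));
        System.out.println("lower bound in bytes: " + minBytes(sample));
    }
}
```

Real compressors can beat this per-byte bound by modeling context (repeated patterns, column types), so the number is only a baseline for an independent-symbol model.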
26. ShannonDB
Alluxio usage on ShannonDB
• Storing files with user-provided data in Object Stores.
• Upon schema creation, the user is able to choose which of the
available object stores will be used for storing the files with data for
the new schema.
• Alluxio serves as a cache for files in Object Stores in order to
reduce query latency.