Minerva is a storage plugin for Drill that connects IPFS's decentralized storage with Drill's flexible query engine. Any data file stored on IPFS can be accessed from Drill's query interface as easily as a file stored on a local disk.
Visit https://github.com/bdchain/Minerva to learn more and try it out!
2. A Present-day Workflow
1. Pinpoint the real address of a dataset, typically an HTTP link;
2. Download the dataset in a client-server mode;
3. Configure a computation environment for big data analysis;
4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.
3. Problems with Public Dataset Analytics
Workflow:
1. Pinpoint the real address of a dataset;
2. Download the dataset;
3. Set up a computation environment powerful enough for big data analysis;
4. Prepare the data, e.g. converting file formats, implementing basic analysis algorithms.
Caveats:
• Links may expire over time due to temporary server failure or permanent website shutdown.
• A dataset might be polluted, with no way to verify that it is the dataset you actually need.
• A single website cannot host all the datasets.
4. Problems with Public Dataset Analytics
Workflow:
1. Locate the dataset, typically via an HTTP link;
2. Download the dataset in a client-server mode;
3. Set up a computation environment powerful enough for big data analysis;
4. Prepare the data, e.g. converting file formats, implementing basic analysis algorithms.
Caveats:
• Datasets are usually huge, demanding long download times;
• Client-server mode is not bandwidth-efficient;
• Data files are usually packaged and compressed into a single archive, so a user interested in only part of the dataset has to download all of it.
5. Problems with Public Dataset Analytics
Workflow:
1. Locate the dataset, typically via an HTTP link;
2. Download the dataset;
3. Configure a computation environment for big data analysis;
4. Prepare the data, e.g. converting file formats, implementing basic analysis algorithms.
Caveats:
• Large-scale data analytics requires expensive storage and computation resources;
• Maintenance and management overhead consumes enormous human resources.
6. Problems with Public Dataset Analytics
Workflow:
1. Locate the dataset, typically via an HTTP link;
2. Download the dataset;
3. Set up a computation environment powerful enough for big data analysis;
4. Preprocess the dataset (e.g. converting file formats) and develop data analysis algorithms.
Caveats:
• Datasets from different origins and different areas of research come in different formats and structures;
• Dataset users might not be proficient in programming;
• Repetitive work is inevitable when many users happen to process the same dataset.
7. IPFS1 to the Rescue
• Decentralization: no single point of failure
• Collaboration: sharing resources as well as reusing code in the community
• Fine-grained content addressing2: get exactly what you need
1: https://ipfs.io/
2: Datasets can be split into blocks, and only the blocks of interest need processing.
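The content-addressing idea above can be sketched in a few lines of Python. This is a toy model for illustration only: real IPFS CIDs are multibase/multihash-encoded, while plain SHA-256 hex stands in here, and the block size is arbitrary.

```python
import hashlib

def cid(block: bytes) -> str:
    """Toy content identifier: the address of a block is a hash of its bytes.
    (Real IPFS CIDs wrap a multihash; plain SHA-256 hex stands in here.)"""
    return hashlib.sha256(block).hexdigest()

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a dataset into fixed-size blocks so consumers can fetch
    only the blocks they are interested in."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

dataset = b"row1\nrow2\nrow3\nrow4\n" * 1000
blocks = split_into_blocks(dataset, block_size=4096)
addresses = [cid(b) for b in blocks]

# The same content always hashes to the same address, so a CID both
# locates a block and verifies that the right data was received.
assert cid(blocks[0]) == addresses[0]
assert cid(b"tampered") != addresses[0]
```

Because the address is derived from the content itself, a client that holds a block's CID can verify the block no matter which peer served it, which is what makes fetching "exactly what you need" from untrusted peers safe.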
8. Drill1 the Distributed Query Engine
• Compatibility: supporting standard SQL statements
• Flexibility: no metastore, no schema, non-relational data
• Scalability: enabling user defined functions
• Locality-awareness: pushing processing to nearby datastores
1: https://drill.apache.org/
9. Drill and IPFS Combined
Drill and IPFS collocation: a distributed network of nodes, each of which runs Drill and IPFS simultaneously.
[Architecture diagram: on each localhost, the Drill query engine (storage planner, reader/writer) runs alongside IPFS P2P network storage, which connects to peers on the network; Qri1 handles version and format management, and libp2p2 provides the networking layer.]
1: https://qri.io/
2: https://libp2p.io/
10. Query Explained: Read
A SQL statement that "reads" data; the table path carries the IPFS CID1 of the dataset being queried:
SELECT *
FROM ipfs.`/ipfs/QmAce…f2a/employee.json`
[Diagram: the SQL input enters the Drill query interface at the Foreman node.]
1: Content Identifier (CID): https://github.com/ipld/specs/blob/master/block-layer/CID.md
11. Query Explained: Read
IPFS object resolution, performed by the Foreman:
ipfs object links QmAce…f2a
The links are the CIDs of the objects (chunks) contained in the "top" object.
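The object-resolution step can be modeled as walking a tiny Merkle-DAG. This is a toy sketch, not the real IPFS implementation: real IPFS nodes are protobuf-encoded dag-pb objects, while a JSON dict and SHA-256 hex stand in here.

```python
import hashlib
import json

def toy_cid(data: bytes) -> str:
    """Stand-in for an IPFS CID: SHA-256 hex of the block's bytes."""
    return hashlib.sha256(data).hexdigest()

# Block store: CID -> raw bytes, a stand-in for the IPFS block store.
store = {}

def put_block(data: bytes) -> str:
    c = toy_cid(data)
    store[c] = data
    return c

# A large file is chunked; the "top" object holds only links (CIDs) to chunks.
chunks = [b"chunk-0 bytes", b"chunk-1 bytes", b"chunk-2 bytes"]
links = [put_block(c) for c in chunks]
top_cid = put_block(json.dumps({"Links": links}).encode())

def object_links(c: str) -> list:
    """Analogue of `ipfs object links <cid>`: list the child CIDs."""
    return json.loads(store[c])["Links"]

# The Foreman resolves the top object into the chunk CIDs it must schedule.
assert object_links(top_cid) == links
# Fetching one chunk of interest does not require the others.
assert store[links[1]] == b"chunk-1 bytes"
```

The point of the resolution step is exactly this: one top-level CID expands into the list of chunk CIDs, and each chunk can then be scheduled and fetched independently.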
12. Query Explained: Read
IPFS provider resolution via the DHT:
ipfs dht findprovs QmFHq…32T
The providers found (nodes A, B, C, D in the diagram) are Drillbits running IPFS that can provide the data pieces; the Foreman sends the Drill execution plan to these peer nodes.
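Provider resolution plus locality-aware scheduling can be sketched as follows. This is a toy model: the node names and the provider table are invented for illustration, and the real DHT is a distributed Kademlia-style structure, not a local dict.

```python
# Toy DHT: maps each chunk CID to the set of nodes that store ("provide") it.
dht_providers = {
    "cid-chunk-0": {"A", "B"},
    "cid-chunk-1": {"C"},
    "cid-chunk-2": {"B", "D"},
}

def findprovs(cid: str) -> set:
    """Analogue of `ipfs dht findprovs <cid>`: who can serve this chunk?"""
    return dht_providers[cid]

def plan_scan(chunk_cids: list) -> dict:
    """Locality-aware assignment: each chunk is scanned on a node that
    already stores it, so no chunk has to be transferred before processing.
    Picks the alphabetically first provider only for determinism."""
    return {cid: min(findprovs(cid)) for cid in chunk_cids}

assignment = plan_scan(["cid-chunk-0", "cid-chunk-1", "cid-chunk-2"])
# Every chunk is processed on one of its own providers.
assert all(node in findprovs(cid) for cid, node in assignment.items())
```

This mirrors the Foreman's job on this slide: resolve providers per chunk, then ship execution-plan fragments to nodes that already hold the data, rather than shipping data to the computation.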
14. Query Explained: Write
A SQL statement that "writes" data:
CREATE IPFSTABLE ipfs.`create` AS (
SELECT *
FROM ipfs.`/ipfs/QmAce…f2a/employee.json`
ORDER BY `id` DESC
)
[Diagram: the actual data operations happen on the nodes that store the data locally (A, B, C, D); the partial CIDs of the new data pieces are sent back to the Foreman, where they are reassembled into a single CID and returned to the user.]
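The reassembly step can be sketched as: each node hashes the result partition it wrote, and the Foreman builds a root object linking the partial CIDs. This is a toy model (JSON plus SHA-256 hex instead of a real dag-pb root), and per the "Problems To Be Solved" slide the actual reassembly step is not yet implemented in Minerva.

```python
import hashlib
import json

def toy_cid(data: bytes) -> str:
    """Stand-in for an IPFS CID: SHA-256 hex of the block's bytes."""
    return hashlib.sha256(data).hexdigest()

# Each node writes its partition of the query result locally and
# reports only the partial CID back to the Foreman.
partitions = {
    "A": b'{"id": 4}\n{"id": 3}\n',
    "B": b'{"id": 2}\n{"id": 1}\n',
}
partial_cids = {node: toy_cid(data) for node, data in partitions.items()}

# The Foreman reassembles the partial CIDs into one root object whose
# single CID is returned to the user.
root_obj = json.dumps({"Links": sorted(partial_cids.values())}).encode()
root_cid = toy_cid(root_obj)

# The root CID is deterministic: the same partial CIDs always reassemble
# into the same root, regardless of which node reports first.
assert root_cid == toy_cid(json.dumps(
    {"Links": sorted(toy_cid(d) for d in partitions.values())}).encode())
```

Note that only CIDs travel to the Foreman; the result data itself stays on the nodes that produced it, consistent with the locality principle of the read path.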
15. User Defined Functions
• Format conversion programs and common analysis
algorithms can be implemented in the form of User
Defined Functions (UDF) and distributed along with the
datasets.
• Drill can invoke these UDFs using their CIDs, in the same
way it locates a dataset on IPFS.
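Distributing code by CID can be modeled the same way as distributing data. This is a toy registry for illustration only: real Drill UDFs are Java classes packaged in jars, and the Python loader below is invented, not Minerva's mechanism.

```python
import hashlib

def toy_cid(data: bytes) -> str:
    """Stand-in for an IPFS CID: SHA-256 hex of the bytes."""
    return hashlib.sha256(data).hexdigest()

# A UDF is published as content: its source is stored under its own CID,
# so every node fetches byte-identical, verifiable code.
udf_source = b"def to_upper(s):\n    return s.upper()\n"
udf_cid = toy_cid(udf_source)
udf_store = {udf_cid: udf_source}

def load_udf(cid: str):
    """Fetch UDF source by CID, verify it, and return the callable."""
    src = udf_store[cid]
    assert toy_cid(src) == cid  # content addressing doubles as an integrity check
    namespace = {}
    exec(src.decode(), namespace)  # toy loader; real Drill loads Java jars
    return namespace["to_upper"]

to_upper = load_udf(udf_cid)
assert to_upper("drill") == "DRILL"
```

Addressing code by content means every node in the cluster runs exactly the same function bytes, which is what makes sharing analysis routines alongside datasets reproducible.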
18. Performance Evaluation
• A 6-node cluster on a cloud service provider, each node with 8 GB RAM and a 4-core CPU
• IPFS running in private-network mode
• Queried file size: 100 MB to 1 GB
• Queries: simple ones such as SELECT * and SELECT COUNT(*)
• Response time: 2 to 10 s
• Transactions per second: ~10
20. Possible Applications
• An easily deployed MPP cluster built with Minerva
• A decentralized data sharing system
• Big data analysis for other DApps running on IPFS
21. Problems To Be Solved
• Performance
  • DHT operations take too much time, especially over the Internet.
  • IPFS limits blocks to at most 4 MB, resulting in an enormous number of blocks for huge datasets.
• Write operations are incomplete
  • The last step, reassembling the partial CIDs, is not yet implemented.
• Stability
22. THANK YOU FOR YOUR TIME!
GitHub: github.com/bdchain/Minerva