Introduction to Accumulo
Mario Pastorelli
mario.pastorelli@teralytics.ch
March 7, 2016

History
To meet its need to analyze large amounts of data
on commodity hardware, Google developed three main
distributed systems:
GFS: distributed filesystem
MapReduce: distributed data processing
BigTable: distributed storage system for
structured data
Accumulo is an open-source implementation of
BigTable

Distributed Structured Data
structured data should be
– distributed for parallel processing
– indexed for fast retrieval (“structured” means that it has
some kind of “primary key”)
– tabular for easy processing of complex data, each row can
potentially have many columns

databases offer indexes and tables but don’t
scale without significant effort
key-value stores can easily be distributed but
have limited index support over keys and don’t
have support for tabular format out of the box

Accumulo
Accumulo is a key-value store with support for
tabular data
– keys are column identifiers, i.e. each key uniquely identifies
one column of one row
– a row is composed of multiple key-value pairs grouped by the
prefix of the key, the row id

Example
EMAIL                     NAME    LASTNAME  COMPANY
olismith85@gmail.com      Olivia  Smith     Winsystems
emily.brown@facebook.com  Emily   Brown     Jones Inc.
⇓
KEY (composed of row id and column id)  VALUE
olismith85@gmail.com      NAME          Olivia
olismith85@gmail.com      LASTNAME      Smith
olismith85@gmail.com      COMPANY       Winsystems
emily.brown@facebook.com  NAME          Emily
emily.brown@facebook.com  LASTNAME      Brown
emily.brown@facebook.com  COMPANY       Jones Inc.
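
The flattening shown in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not Accumulo client code: the names `flatten` and `get_row` are hypothetical, and a plain dictionary stands in for the table.

```python
# Hypothetical helper: turn tabular records into (row id, column id) -> value entries.
def flatten(records, row_id_field):
    entries = {}
    for record in records:
        row_id = record[row_id_field]
        for column, value in record.items():
            if column == row_id_field:
                continue  # the row id is part of the key, not a column of its own
            entries[(row_id, column)] = value
    return entries


# Hypothetical helper: rebuild one row by collecting every entry that shares its row id.
def get_row(entries, row_id):
    return {col: val for (rid, col), val in entries.items() if rid == row_id}


people = [
    {"EMAIL": "olismith85@gmail.com", "NAME": "Olivia",
     "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    {"EMAIL": "emily.brown@facebook.com", "NAME": "Emily",
     "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
]

entries = flatten(people, "EMAIL")
olivia = get_row(entries, "olismith85@gmail.com")
```

Each record with three non-key columns becomes three key-value entries, and a row is recovered by grouping on the row id.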

Composite Keys
Keys in Accumulo are composite and have the following components:
row id: the row the key belongs to
column family: the “column group” the key belongs to
column qualifier: the column id
column visibility: who can access this column
timestamp: the version of the key
A single key-value is stored as
KEY                                                      VALUE
row id   column                             timestamp
         family   qualifier   visibility
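
A minimal sketch of such a composite key in Python, assuming the sort order described in the Accumulo documentation (components ascending, timestamps descending so that the newest version comes first). The `Key` class, `sort_key` function, the family name "contact" and the visibility label "public" are made up for illustration; this is not the Accumulo API.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(k: Key):
    # Ascending on row, family, qualifier and visibility;
    # the timestamp is negated so the newest version sorts first.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


# Hypothetical entries: two versions of the same column plus another row.
entries = [
    (Key("olismith85@gmail.com", "contact", "NAME", "public", 1), "Olivia"),
    (Key("olismith85@gmail.com", "contact", "NAME", "public", 2), "Olivia S."),
    (Key("emily.brown@facebook.com", "contact", "NAME", "public", 1), "Emily"),
]

ordered = sorted(entries, key=lambda entry: sort_key(entry[0]))
values_in_order = [value for _, value in ordered]
```

With this ordering, all the versions of a column sit next to each other and a reader meets the most recent one first.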

Accumulo features
range queries: keys are stored in lexicographical order,
which makes it possible to query “semantically close” data
– e.g. temporal data can be stored such that aggregation of
close days is local and fast
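
The effect of sorted keys on range queries can be sketched with a sorted list and binary search. This is a toy model, not how Accumulo is implemented: the "city|date" key layout, the counts and the `scan` function are assumptions made for the example.

```python
from bisect import bisect_left

# Toy table: (key, value) pairs kept in lexicographical key order.
# Hypothetical key layout: "<city>|<ISO date>", value = some daily count.
table = sorted([
    ("zurich|2016-03-05", 12),
    ("zurich|2016-03-06", 7),
    ("zurich|2016-03-07", 9),
    ("zurich|2016-04-01", 4),
    ("bern|2016-03-06", 3),
])
keys = [key for key, _ in table]


def scan(start, end):
    """Return all entries with start <= key < end as one contiguous slice."""
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, end)
    return table[lo:hi]


# Aggregating three consecutive days touches only neighbouring entries.
days = scan("zurich|2016-03-05", "zurich|2016-03-08")
total = sum(value for _, value in days)
```

Because ISO dates sort in chronological order as strings, close days end up next to each other and the aggregation reads one contiguous block.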
fast: with proper key schemas a query can take
milliseconds
scalable: designed to store huge amounts of data across
multiple tables
built-in cache for recently queried data
many others, such as bulk imports, iterators, fault
tolerance, large rows, multiple-batch queries, testing
utilities (mocks, miniclusters) . . .

Example
we want to store and analyze tweets from all around
the world.

Example: Tweets analysis
A tweet has the following (simplified) fields
– coordinate: geospatial information composed of longitude
and latitude
– created_at: UTC time of the tweet
– id: unique identifier of the tweet
– user information, such as
user.id: unique identifier of the user
user.screen_name: user name
. . .
– entities, such as hashtags, urls, . . .
– text: tweet content
– . . .
how do we store this data in Accumulo?

there is no single way to do it; it depends on
the queries you need to run
two good practices
– work with denormalized data
– specialize tables for each kind of query

Example: Twitter User Timeline
schema
KEY                                                         VALUE
row id                       column                  timestamp
                             family        qualifier
user.id + created_at + id    "coordinate"                      lon/lat
                             "entities"    "hashtags"          hashtags
                             "entities"    "urls"              urls
                             "text"                            text
Easy to process the entire timeline or a time
interval for the same user
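
A possible way to build this row id, sketched in Python. The slide does not say how the three parts are encoded, so the encoding here is an assumption: numbers are zero-padded and the time is written in a sortable format, so that string order matches (user, time, id) order. The names `row_id`, `user_interval` and `utc` are hypothetical, and a sorted list stands in for the table.

```python
from bisect import bisect_left
from datetime import datetime, timezone


def row_id(user_id: int, created_at: datetime, tweet_id: int) -> str:
    # Fixed-width fields keep lexicographical order equal to numeric/time order.
    ts = created_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"{user_id:020d}|{ts}|{tweet_id:020d}"


def user_interval(rows, user_id, start, end):
    """Row ids of one user with start <= created_at < end (rows must be sorted)."""
    lo = bisect_left(rows, row_id(user_id, start, 0))
    hi = bisect_left(rows, row_id(user_id, end, 0))
    return rows[lo:hi]


def utc(*args):
    # Small convenience wrapper for timezone-aware datetimes.
    return datetime(*args, tzinfo=timezone.utc)


# Made-up tweets from two users.
rows = sorted([
    row_id(42, utc(2016, 3, 1, 10, 0, 0), 1001),
    row_id(42, utc(2016, 3, 5, 9, 30, 0), 1002),
    row_id(42, utc(2016, 4, 1, 8, 0, 0), 1003),
    row_id(7, utc(2016, 3, 2, 12, 0, 0), 1004),
])

march = user_interval(rows, 42, utc(2016, 3, 1), utc(2016, 4, 1))
```

Without the padding, user 7 would sort after user 42 as a string and the tweets of one user would no longer be guaranteed to form one contiguous range.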
Not good for other kinds of analysis
– find all the tweets with a given hashtag
– find all the tweets in New York
– . . .
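
Following the practice of specializing a table for each kind of query, the hashtag case could be served by a second, denormalized table keyed by hashtag. The sketch below is one hypothetical layout, not something the slides prescribe: the `build_hashtag_index` function, the sample tweets and the choice to lowercase hashtags are all assumptions.

```python
from collections import defaultdict


def build_hashtag_index(tweets):
    """Map each hashtag to the sorted timeline row ids of the tweets using it."""
    index = defaultdict(list)
    for timeline_row_id, tweet in tweets.items():
        for tag in tweet["hashtags"]:
            # Lowercasing is a design choice of this sketch, so lookups ignore case.
            index[tag.lower()].append(timeline_row_id)
    return {tag: sorted(ids) for tag, ids in index.items()}


# Made-up timeline entries: row id -> tweet fields.
tweets = {
    "42|20160301T100000|1001": {"text": "Trying #Accumulo", "hashtags": ["Accumulo"]},
    "42|20160305T093000|1002": {"text": "#accumulo and #hadoop",
                                "hashtags": ["accumulo", "hadoop"]},
    "7|20160302T120000|1004": {"text": "Hello", "hashtags": []},
}

index = build_hashtag_index(tweets)
accumulo_tweets = index["accumulo"]
```

The same data is stored twice under different keys, which is the price of making both the timeline query and the hashtag query a simple range lookup.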

Summary
Accumulo is great for storing large amounts of
structured data
Accumulo is good for interactive queries as well
as for batch queries
Accumulo is a low-level system
– NoSQL (that’s not good!), which means no high-level
language to query the data
– a lot of flexibility, which can easily backfire