6. History
To analyze large amounts of data on commodity hardware, Google
developed three main distributed systems:
– GFS: distributed filesystem
– MapReduce: distributed data processing
– BigTable: distributed storage system for structured data
Accumulo is an open-source implementation of BigTable
9. Distributed Structured Data
Structured data should be
– distributed, for parallel processing
– indexed, for fast retrieval (“structured” means that it has some kind of “primary key”)
– tabular, for easy processing of complex data: each row can potentially have many columns
Databases offer indexes and tables but don’t scale without significant effort
Key-value stores can easily be distributed, but they have limited index support over keys and no support for a tabular format out of the box
10. Accumulo
Accumulo is a key-value store with support for tabular data
– keys are column identifiers, i.e. each key uniquely identifies a column of a row
– a row is composed of multiple key-value pairs grouped by the prefix of the key, the row id
11. Example
EMAIL                      NAME     LASTNAME   COMPANY
olismith85@gmail.com       Olivia   Smith      Winsystems
emily.brown@facebook.com   Emily    Brown      Jones Inc.
⇓
KEY (composed of row id and column id)   VALUE
olismith85@gmail.com       NAME          Olivia
olismith85@gmail.com       LASTNAME      Smith
olismith85@gmail.com       COMPANY       Winsystems
emily.brown@facebook.com   NAME          Emily
emily.brown@facebook.com   LASTNAME      Brown
emily.brown@facebook.com   COMPANY       Jones Inc.
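The transformation in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not Accumulo code: the helper name to_key_values and the choice of EMAIL as the row id column are assumptions made for the sketch.

```python
# Illustrative sketch only: plain Python, not the Accumulo API.
# The two rows are the ones shown in the example table.
ROWS = [
    {"EMAIL": "olismith85@gmail.com", "NAME": "Olivia",
     "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    {"EMAIL": "emily.brown@facebook.com", "NAME": "Emily",
     "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
]


def to_key_values(rows, row_id_column="EMAIL"):
    """Turn each row into one ((row_id, column_id), value) pair per column."""
    pairs = []
    for row in rows:
        row_id = row[row_id_column]
        for column_id, value in row.items():
            if column_id != row_id_column:
                pairs.append(((row_id, column_id), value))
    # Sorting by key keeps all the pairs of one row next to each other,
    # because they share the same row id prefix.
    return sorted(pairs)


for (row_id, column_id), value in to_key_values(ROWS):
    print(row_id, column_id, value)
```

Because the pairs are sorted by key, reading one row back amounts to reading a contiguous run of pairs that start with the same row id.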
13. Composite Keys
Keys in Accumulo are composite and have the following components
– row id: the row to which the key belongs
– column family: the “column group” to which the key belongs
– column qualifier: the column id
– column visibility: who can access this column
– timestamp: the version of the key
A single key-value is stored as
| KEY                                                     | VALUE |
| row id | column                            | timestamp |       |
|        | family | qualifier | visibility   |           |       |
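A composite key can be modeled as a small record whose components are compared in order. The sketch below is plain Python, not the Accumulo API; the family name "contact", the visibility label "public", the timestamps and the second COMPANY version are made up for illustration. It assumes that, for otherwise equal keys, the most recent timestamp sorts first, which is how Accumulo orders versions.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical model of a composite key (not an Accumulo class)."""
    row_id: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(key):
    # Components are compared left to right; the timestamp is negated
    # so that the newest version of a key comes first.
    return (key.row_id, key.family, key.qualifier,
            key.visibility, -key.timestamp)


# Made-up entries: two versions of one column plus one other row.
entries = {
    Key("olismith85@gmail.com", "contact", "COMPANY", "public", 1): "Winsystems",
    Key("olismith85@gmail.com", "contact", "COMPANY", "public", 2): "Jones Inc.",
    Key("emily.brown@facebook.com", "contact", "NAME", "public", 1): "Emily",
}

ordered = sorted(entries, key=sort_key)
for key in ordered:
    print(key, "->", entries[key])
```

With this ordering, a reader that only wants the current value of a column can stop at the first entry it finds for that row id, family, qualifier and visibility.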
18. Accumulo features
range queries: keys are stored in lexicographical order, so “semantically close” data can be queried together
– e.g. temporal data can be stored such that aggregating nearby days is local and fast
fast: with a proper key schema, a query can take milliseconds
scalable: designed to store huge amounts of data over multiple tables
built-in cache for recently queried data
many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters) . . .
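The range query feature can be illustrated with a sorted list and binary search. This is a minimal sketch in plain Python, not Accumulo code: the key layout (an ISO date followed by an event id), the values and the range_scan helper are all assumptions chosen to show why lexicographic order makes nearby days contiguous.

```python
from bisect import bisect_left

# Made-up entries: key = "<ISO date>/<event id>", value = a count.
table = sorted([
    ("2015-03-01/evt7", 4),
    ("2015-03-02/evt1", 10),
    ("2015-03-02/evt9", 3),
    ("2015-03-03/evt2", 5),
    ("2015-04-01/evt5", 8),
])
keys = [key for key, _ in table]


def range_scan(start, end):
    """Return the entries whose key k satisfies start <= k < end."""
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, end)
    return table[lo:hi]


# Aggregate 2 and 3 March: the matching entries sit next to each other,
# so the scan touches one contiguous slice of the sorted data.
total = sum(value for _, value in range_scan("2015-03-02", "2015-03-04"))
print(total)
```

The same slice-based lookup would not work if the date were placed at the end of the key, which is why the key schema has to be designed around the query.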
19. Example
We want to store and analyze tweets from all around the world.
20. Example: Tweets analysis
A tweet has the following (simplified) fields
– coordinate: geospatial information composed of longitude and latitude
– created_at: UTC time of the tweet
– id: unique identifier of the tweet
– user information, such as
    user.id: unique identifier of the user
    user.screen_name: user name
    . . .
– entities, such as hashtags, urls, . . .
– text: tweet content
– . . .
How do we store this data in Accumulo?
22. Example: Tweets analysis
There is no single way to do it: it depends on the query
Two good practices
– work with denormalized data
– specialize tables for each kind of query
24. Example: Twitter User Timeline
Schema
row id: user.id + created_at + id
| column family | column qualifier | column visibility | timestamp | VALUE    |
| "coordinate"  |                  |                   |           | lon/lat  |
| "entities"    | "hashtags"       |                   |           | hashtags |
| "entities"    | "urls"           |                   |           | urls     |
| "text"        |                  |                   |           | text     |
Easy to process the entire timeline, or a time interval, for the same user
Not good for other kinds of analysis
– find all the tweets with a given hashtag
– find all the tweets in New York
– . . .
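The timeline schema can be sketched as follows. This is plain Python rather than Accumulo code, and several details are assumptions that the slides do not specify: the "/" separator, the zero padding of the ids, the ISO 8601 format of the "created at" field (written created_at in the code), and the sample tweets themselves.

```python
from bisect import bisect_left


def timeline_row_id(user_id, created_at, tweet_id):
    # Assumption: zero-padded ids and ISO 8601 UTC timestamps, so that
    # lexicographic order matches chronological order for each user.
    return f"{user_id:012d}/{created_at}/{tweet_id:020d}"


# Made-up tweets: (user.id, created_at, id, text).
tweets = [
    (42, "2015-03-02T10:00:00Z", 1001, "hello"),
    (42, "2015-03-05T08:30:00Z", 1002, "accumulo"),
    (7, "2015-03-03T12:00:00Z", 1003, "other user"),
    (42, "2015-04-01T09:00:00Z", 1004, "april"),
]

# One entry per tweet for the "text" column family, sorted by row id.
table = sorted(
    (timeline_row_id(user, created, tweet_id), ("text", "", text))
    for user, created, tweet_id, text in tweets
)
row_ids = [row_id for row_id, _ in table]


def user_interval(user_id, start, end):
    """Texts of one user's tweets with start <= created_at < end."""
    lo = bisect_left(row_ids, f"{user_id:012d}/{start}")
    hi = bisect_left(row_ids, f"{user_id:012d}/{end}")
    return [value[2] for _, value in table[lo:hi]]


print(user_interval(42, "2015-03-01", "2015-04-01"))
```

A query by user and time interval maps to one contiguous slice of row ids. A query by hashtag or by location has no matching prefix in this layout, so it would have to read every row, which is the reason for specializing a separate table for each kind of query.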
25. Summary
Accumulo is great for storing large amounts of structured data
Accumulo is good for interactive queries as well as batch queries
Accumulo is a low-level system
– NoSQL (that’s not good!), which means no high-level language to query the data
– a lot of flexibility, which can easily backfire