Introduction to Accumulo
Mario Pastorelli
mario.pastorelli@teralytics.ch
March 7, 2016

History
To support the analysis of large amounts of data on commodity hardware, Google developed three main distributed systems:
– GFS: a distributed filesystem
– MapReduce: a distributed data-processing framework
– BigTable: a distributed storage system for structured data
Accumulo is an open-source implementation of the BigTable design.

Distributed Structured Data
Structured data should be:
– distributed, for parallel processing
– indexed, for fast retrieval (“structured” here means that each record has some kind of “primary key”)
– tabular, for easy processing of complex data; each row can potentially have many columns
Databases offer indexes and tables but do not scale without significant effort.
Key-value stores are easy to distribute, but they offer limited index support over keys and no tabular format out of the box.

Accumulo
Accumulo is a key-value store with support for tabular data:
– keys are column identifiers, i.e. each key uniquely identifies one column of one row
– a row is made up of all the key-value pairs that share the same key prefix, the row id

Example

EMAIL                     NAME    LASTNAME  COMPANY
olismith85@gmail.com      Olivia  Smith     Winsystems
emily.brown@facebook.com  Emily   Brown     Jones Inc.

⇓

KEY (composed of row id and column id)   VALUE
row id                    column id
olismith85@gmail.com      NAME           Olivia
olismith85@gmail.com      LASTNAME       Smith
olismith85@gmail.com      COMPANY        Winsystems
emily.brown@facebook.com  NAME           Emily
emily.brown@facebook.com  LASTNAME       Brown
emily.brown@facebook.com  COMPANY        Jones Inc.

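The mapping from a table to key-value pairs can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Accumulo code: the `flatten` and `row` helpers and the `SEP` separator are hypothetical names, and a real Accumulo key keeps its parts as separate fields rather than joining them into one string.

```python
from typing import Dict, List, Tuple

# Hypothetical separator between row id and column id (an assumption made
# for this sketch so that the two parts can be split apart again).
SEP = "\x00"


def flatten(table: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Turn {row_id: {column: value}} into a list of (key, value) pairs sorted by key."""
    pairs = []
    for row_id, columns in table.items():
        for column, value in columns.items():
            pairs.append((row_id + SEP + column, value))
    return sorted(pairs)


def row(pairs: List[Tuple[str, str]], row_id: str) -> Dict[str, str]:
    """Rebuild one row by collecting every pair whose key starts with the row id."""
    prefix = row_id + SEP
    return {k[len(prefix):]: v for k, v in pairs if k.startswith(prefix)}


contacts = {
    "olismith85@gmail.com": {"NAME": "Olivia", "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    "emily.brown@facebook.com": {"NAME": "Emily", "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
}
pairs = flatten(contacts)
```

Calling `row(pairs, "emily.brown@facebook.com")` gathers the three pairs that share that prefix back into a single row, which is the grouping described on the previous slide.
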
Composite Keys
Keys in Accumulo are composite and have the following components:
– row id: the row the key belongs to
– column family: the “column group” the key belongs to
– column qualifier: the column id
– column visibility: who can access this column
– timestamp: the version of the key
A single key-value pair is stored as:

KEY                                                    VALUE
row id | column                           | timestamp
       | family | qualifier | visibility  |

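A composite key can be modelled as a small record type. The sketch below is plain Python, not the Accumulo API; the `Key` tuple and `sort_key` function are hypothetical names, and the family `contact`, the visibility label `public` and the second company value are made up for illustration. It assumes the ordering Accumulo documents for its keys: ascending by row id, family, qualifier and visibility, then descending by timestamp so that the newest version of a cell comes first.

```python
from typing import NamedTuple, Tuple


class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(k: Key) -> Tuple[str, str, str, str, int]:
    # Negating the timestamp makes newer versions sort before older ones.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


entries = [
    (Key("olismith85@gmail.com", "contact", "COMPANY", "public", 1), "Winsystems"),
    # Hypothetical newer version of the same cell.
    (Key("olismith85@gmail.com", "contact", "COMPANY", "public", 2), "Acme"),
    (Key("emily.brown@facebook.com", "contact", "NAME", "public", 1), "Emily"),
]
entries.sort(key=lambda e: sort_key(e[0]))
```

After sorting, the two versions of the `COMPANY` cell sit next to each other with the timestamp-2 entry first.
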
Accumulo features
– range queries: keys are stored in lexicographical order, which makes it possible to query “semantically close” data together
  – e.g. temporal data can be stored so that aggregating over adjacent days is local and fast
– fast: with a proper key schema, a query can take milliseconds
– scalable: designed to store huge amounts of data across multiple tables
– built-in cache for recently queried data
– many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters), ...

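Why sorted keys make range queries cheap can be shown with a sorted list and binary search. This is a minimal sketch in plain Python rather than an Accumulo scan; `range_scan` is a hypothetical helper, and the daily counts are invented sample data keyed by ISO dates, whose string order matches their chronological order.

```python
import bisect
from typing import List, Tuple


def range_scan(sorted_pairs: List[Tuple[str, int]], start: str, end: str) -> List[Tuple[str, int]]:
    """Return the pairs with start <= key < end from a list sorted by key."""
    keys = [k for k, _ in sorted_pairs]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return sorted_pairs[lo:hi]


# Invented sample data: one counter per day.
daily_counts = sorted([
    ("2016-03-05", 12),
    ("2016-02-28", 7),
    ("2016-03-06", 9),
    ("2016-03-07", 4),
    ("2016-04-01", 30),
])

first_week = range_scan(daily_counts, "2016-03-01", "2016-03-08")
total = sum(v for _, v in first_week)
```

Because adjacent days are adjacent in the sorted order, the aggregation touches one contiguous slice and never looks at the February or April entries.
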
Example
We want to store and analyze tweets from all around the world.

Example: Tweets analysis
A tweet has the following (simplified) fields:
– coordinate: geospatial information consisting of longitude and latitude
– created_at: UTC time of the tweet
– id: unique identifier of the tweet
– user information, such as
  – user.id: unique identifier of the user
  – user.screen_name: user name
  – ...
– entities, such as hashtags, urls, ...
– text: tweet content
– ...
How do we store this data in Accumulo?

Example: Tweets analysis
There is no single way to do it; it depends on the queries you need to run.
Two good practices:
– work with denormalized data
– specialize tables for each kind of query

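The two practices can be sketched as writing the same denormalized tweet into more than one table, each keyed for a different query. Everything here is hypothetical: the tweet is invented, the table names `timeline` and `by_hashtag` and the `/` separator are assumptions, and plain dictionaries stand in for Accumulo tables.

```python
from typing import Dict, List

# Invented sample tweet using the simplified fields listed earlier.
tweet = {
    "id": "42",
    "created_at": "2016-03-07T10:15:00Z",
    "user": {"id": "1001", "screen_name": "olivia"},
    "entities": {"hashtags": ["accumulo", "bigdata"]},
    "text": "Trying out #accumulo #bigdata",
}


def timeline_row(t: dict) -> str:
    """Row id for per-user queries: user id first, then time."""
    return "/".join([t["user"]["id"], t["created_at"], t["id"]])


def hashtag_rows(t: dict) -> List[str]:
    """One row id per hashtag, so that a hashtag lookup is a prefix scan."""
    return ["/".join([tag, t["created_at"], t["id"]]) for tag in t["entities"]["hashtags"]]


# Plain dicts standing in for two specialized tables.
tables: Dict[str, Dict[str, str]] = {"timeline": {}, "by_hashtag": {}}
tables["timeline"][timeline_row(tweet)] = tweet["text"]
for r in hashtag_rows(tweet):
    # The text is duplicated on purpose: each table answers its query alone.
    tables["by_hashtag"][r] = tweet["text"]
```

The cost of this design is extra storage and extra writes per tweet; the benefit is that each query reads from a table whose key order already matches it.
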
Example: Twitter User Timeline
Schema (column visibility and timestamp are left unspecified):

KEY                                                            VALUE
row id                      column family   column qualifier
user.id + created_at + id   “coordinate”                       lon/lat
                            “entities”      “hashtags”         hashtags
                            “entities”      “urls”             urls
                            “text”                             text

Easy to process the entire timeline, or a time interval, for the same user.
Not good for other kinds of analysis:
– find all the tweets with a given hashtag
– find all the tweets in New York
– ...

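Reading one user's tweets for a time interval then becomes a single range over row ids. The sketch below is plain Python with a sorted list in place of a table; the `row_id` and `timeline` helpers, the `|` separator, the zero-padding widths and the sample ids are all assumptions. Padding the numeric parts is one way to make string order agree with numeric order.

```python
import bisect
from typing import List


def row_id(user_id: int, created_at: str, tweet_id: int) -> str:
    """Build 'user.id + created_at + id' with fixed-width numeric parts."""
    return f"{user_id:012d}|{created_at}|{tweet_id:020d}"


# Invented sample rows for two users.
rows = sorted([
    row_id(7, "2016-03-01T08:00:00Z", 1),
    row_id(7, "2016-03-05T09:30:00Z", 2),
    row_id(7, "2016-03-09T12:00:00Z", 3),
    row_id(42, "2016-03-05T10:00:00Z", 4),
])


def timeline(sorted_rows: List[str], user_id: int, start: str, end: str) -> List[str]:
    """Return the row ids of one user with start <= created_at < end."""
    lo = bisect.bisect_left(sorted_rows, f"{user_id:012d}|{start}")
    hi = bisect.bisect_left(sorted_rows, f"{user_id:012d}|{end}")
    return sorted_rows[lo:hi]


march_first_week = timeline(rows, 7, "2016-03-01", "2016-03-08")
```

A hashtag or location query gets no help from this key order: the matching tweets are scattered across every user's range, so answering it from this table means scanning all rows.
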
Summary
– Accumulo is great for storing large amounts of structured data
– Accumulo is good for interactive queries as well as for batch queries
– Accumulo is a low-level system
  – NoSQL (that’s not good!), which means there is no high-level language to query the data
  – a lot of flexibility, which can easily backfire

Thank you
Questions?