AN INTRODUCTION TO
APACHE ACCUMULO
HOW IT WORKS, WHY IT EXISTS, AND HOW IT IS USED
Donald Miner
CTO, ClearEdge IT Solutions
@donaldpminer
August 5th, 2014
The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp
-inf to D:
Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
E to H:
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
J to +inf:
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp
Accumulo Master
TabletServer TabletServer TabletServer
ZooKeeper
KEY VALUE
Adelaide Bartkowski 91294124
Alyssa Files 491294
Beatriz Palmore 4124124124
Cecilia Ours 419120
Craig Avalos 940124
Dianna Lapointe 4921
Erma Davis 050194
Fermina Smead 10024599949
Garrett Harsh 140095931
Gaylene Sherry 914815
Gilberto Pardue 412414124124
Hui Nodal 962195192
Janell Tomita 12121
Jannette Betters 9192012
Jeana Delk 9120150
Madlyn Radke 4921
Peggie Allis 944944
Rhona Zygmont 123103
Tran Degarmo 9499494
Wilhelmina Papp 11221
Lookup “Garrett Harsh” (by key): FAST
Lookup “4921” (by value): SLOW
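The difference between the two lookups can be sketched with a plain sorted map. This is not Accumulo code: `SortedLookupSketch` and its methods are hypothetical names, and a `java.util.TreeMap` stands in for a sorted table. A key lookup can seek through the sort order, while a value lookup has to inspect every entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SortedLookupSketch {
    // A few rows from the slide, held in a map that keeps its keys sorted.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("Garrett Harsh", "140095931");
        table.put("Dianna Lapointe", "4921");
        table.put("Madlyn Radke", "4921");
        table.put("Adelaide Bartkowski", "91294124");
        return table;
    }

    // Fast: the sort order lets the map seek to the key in O(log n).
    static String lookupByKey(TreeMap<String, String> table, String key) {
        return table.get(key);
    }

    // Slow: nothing is ordered by value, so every entry is inspected (O(n)).
    static List<String> lookupByValue(TreeMap<String, String> table, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            if (e.getValue().equals(value)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = sampleTable();
        System.out.println(lookupByKey(table, "Garrett Harsh"));
        System.out.println(lookupByValue(table, "4921"));
    }
}
```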
MIT Lincoln Lab study:
100 million inserts per second using Accumulo
http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf
Booz Allen Hamilton study:
942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours
94 MB/s, 15 TB/hr, 80 million inserts per second
11 tablet servers went down with no interruption
Showed linear scalability for write throughput
22,000 queries per second
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
Apache Accumulo is based on Google's BigTable design
and is built on top of Apache Hadoop, ZooKeeper, and
Thrift. Apache Accumulo features a few novel
improvements on the BigTable design in the form of
cell-based access control and a server-side programming
mechanism that can modify key/value pairs at various
points in the data management process. Other notable
improvements and features are outlined here.
Google published the design of BigTable in 2006. Several
other open source projects have implemented aspects of
this design including HBase, Hypertable, and Cassandra.
Accumulo began its development in 2008 and joined the
Apache community in 2011.
HBase vs. Accumulo
• Slight differences in visibility labels
• Coprocessors vs. Iterators
• Accumulo has faster write throughput*
• HBase’s reads are faster*
• HBase has more ecosystem integration
• BatchScanner
• Accumulo can shift around locality groups after the fact
• Accumulo has been shown to work with no problems at 1,000
nodes (BAH paper). Facebook and others run a “cell”
design for HBase. Largest clusters in the hundreds*.
* We believe
Disclaimer: I am biased
(admin & developer) | analyst
Column Visibility Syntax
Label Description
A & B Both ‘A’ and ‘B’ are required
A | B Either ‘A’ or ‘B’ is required
A & (C | B) ‘A’ and ‘C’ or ‘A’ and ‘B’ is required
A | (B & C) ‘A’ or ‘B’ and ‘C’ is required
(A | B) & (C & D) ?
A & (B & (C | D)) ?
Patient has schizophrenia: insurer | (MD & psych)
Patient has stomach ulcers: insurer | doctor
Patient has cavity: insurer | dentist
Patient has consent for general anesthesia: surgeon
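The label syntax above can be worked through with a small evaluator. This is a minimal sketch, not Accumulo's implementation: `VisibilitySketch` and `canSee` are hypothetical names, labels are assumed to be plain letters and digits, and the rule that `&` and `|` cannot be mixed at one nesting level without parentheses is an assumption modeled on how Accumulo treats such expressions.

```java
import java.util.Set;

final class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", "");
        this.auths = auths;
    }

    // True if a reader holding `auths` satisfies the visibility label `expr`.
    static boolean canSee(String expr, Set<String> auths) {
        VisibilitySketch p = new VisibilitySketch(expr, auths);
        boolean result = p.parseExpr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected character at " + p.pos);
        }
        return result;
    }

    // expr := term (op term)*, where all ops at one nesting level must match.
    private boolean parseExpr() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | needs parentheses");
            }
            op = next;
            boolean rhs = parseTerm();
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := label | '(' expr ')'
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing ')'");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a label at " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(canSee("(admin & developer) | analyst", Set.of("analyst")));
        System.out.println(canSee("(A | B) & (C & D)", Set.of("A", "C")));
    }
}
```

Running the two "?" rows from the table through `canSee` with different authorization sets is a quick way to check an answer.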
More cool features
• Constraints: user-defined Java functions that allow or
prevent new writes based on a condition
• Large rows: no limit on data stored in a row
• Multiple masters & FATE: able to execute table operations
in a fault-tolerant manner
• MapReduce InputFormats
• Bulk import utilities: write directly to Accumulo file formats
• Batch scanner: client scans multiple ranges at once
• Batch writer: client buffers and organizes data before
writing in parallel
More cool features
• Thrift proxy: access Accumulo through Ruby, Python, …
• Monitor page: shows performance, status, errors, more
• Locality groups: group column families together on disk
for performance tuning (changeable later)
• On-HDFS at rest encryption (work in progress)
• Table import and export
Scalability & Performance
• Multiple HDFS volumes: Accumulo can use multiple
NameNodes to store its data
• Master stores metadata in an Accumulo table
• Native in-memory map: data is first written into a buffer
written in C++, outside of Java
• Relative encoding: consecutive keys with the same values
are flagged instead of rewritten
• Scan pipelines: stages of the read path are parallelized
into separate threads
• Caching: data recently scanned is cached
HOW IT WORKS
Data Model
KEY: ROW ID, COLUMN (FAMILY, QUALIFIER, VISIBILITY), TIMESTAMP
VALUE
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public 12423523 @donaldpminer
don info height public 12314514 5’ 9”
don info SSN private 12314514 123-45-6789
erica … … … … …
Data Model
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public | private 12423523 @donaldpminer
don info height public | private 12314514 5’ 9”
don info SSN private 12314514 123-45-6789
erica … … … … …
Name email twitter picture height SSN
derek de…@ad….com 9efe23aa… 6’2”
don dm…@cl….com @donaldpminer 5’ 9”
erica @erica aef319eaf…
Data Model: parts of a key/value pair
Row ID: lookup key
Column family: collection of data that is kept together
Column qualifier: what the data is
Column visibility: who can see the data
Timestamp: when the data was created
Key (row ID, column, timestamp): uniqueness, sorted
Value: some piece of information
Writing data into Accumulo
Row ID Family Qualifier Visibility Timestamp Value
don info picture public 13119103 dd3ae1d3b951a33f…
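import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// conn is an existing Connector; MyPictureObj holds the picture data
Text rowID = new Text("don");
Text colFam = new Text("info");
Text colQual = new Text("picture");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value(MyPictureObj.getBytes());
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);
BatchWriterConfig config = new BatchWriterConfig();
BatchWriter writer = conn.createBatchWriter("usertable", config);
writer.add(mutation);
// close() flushes any buffered mutations before releasing the writer
writer.close();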
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public 12423523 @donaldpminer
don info height public 12314514 5’ 9”
don info picture public 13119103 dd3ae1d3b951a33f…
don info SSN private 12314514 123-45-6789
erica … … … … …
Writing data into Accumulo
1. A new record is appended to the Write Ahead Log (WAL) and inserted into the sorted MemTable.
2. Minor Compaction: the MemTable is written out as a sorted RFile (minc). Each later minor compaction adds another RFile (minc).
3. Major Compaction: the RFiles (minc) are merged into a single sorted RFile (majc).
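The write path in these slides (WAL, MemTable, minor and major compaction) can be sketched in memory. This is a toy model under stated assumptions, not Accumulo's code: `WritePathSketch` is a hypothetical class, keys are plain strings with one value each, and sorted maps stand in for RFiles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class WritePathSketch {
    final List<String> writeAheadLog = new ArrayList<>();            // append-only, unsorted
    TreeMap<String, String> memTable = new TreeMap<>();              // sorted, in memory
    final List<TreeMap<String, String>> rFiles = new ArrayList<>();  // sorted stand-ins for RFiles

    // A new record is appended to the log, then inserted into the sorted MemTable.
    void write(String key, String value) {
        writeAheadLog.add(key + "=" + value);
        memTable.put(key, value);
    }

    // Minor compaction: flush the MemTable to a new sorted file.
    void minorCompact() {
        if (memTable.isEmpty()) {
            return;
        }
        rFiles.add(memTable);
        memTable = new TreeMap<>();
        writeAheadLog.clear(); // simplification: flushed data no longer needs the log
    }

    // Major compaction: merge all files into one sorted file.
    void majorCompact() {
        if (rFiles.size() < 2) {
            return;
        }
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : rFiles) {
            merged.putAll(file); // files are oldest first, so newer values win
        }
        rFiles.clear();
        rFiles.add(merged);
    }

    public static void main(String[] args) {
        WritePathSketch tablet = new WritePathSketch();
        tablet.write("don", "picture");
        tablet.minorCompact();
        tablet.write("derek", "email");
        tablet.minorCompact();
        tablet.majorCompact();
        System.out.println(tablet.rFiles);
    }
}
```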
Reading data
Range Family Visibilities
don-don info public
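import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;
// conn is an existing Connector
Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("usertable", auths);
scan.setRange(new Range("don", "don"));
scan.fetchColumnFamily(new Text("info"));
for (Entry<Key,Value> entry : scan) {
String row = entry.getKey().getRow().toString();
Value value = entry.getValue();
}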
Reading data
Tablet: c - f
Sources read: MemTable, RFile (minc) x3, RFile (majc)
Range Family Visibilities
don-don info public
Reading data
Range Family Visibilities
don-don info public, user, tech
Reading data: Scan
Range Visibilities
don-don public, user, tech
d-e public, user, tech
Iterators
• Iterators run tablet server side at these times:
1. Scan Time
2. Minor Compaction
3. Major Compaction
• Multiple iterators are included with Accumulo
• Custom iterators can be created using the Iterator API
Scan Time Iterator
Minor Compaction Iterator
Major Compaction Iterator
Age-Off Iterator
Row ID Column Family Column Qualifier Column Visibility Timestamp Value
bob attribute score public 1005 24
bob attribute score public 1004 55
bob attribute score public 1003 71
bob attribute score public 1002 66
bob attribute score public 1001 39
bob attribute score public 1000 33
Current Time: 1102
Entries < 100s old
Entries > 100s old
Scan time: server side filtering
Major compaction time: age off
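The age-off decision on this slide reduces to one comparison per entry. The sketch below is plain Java rather than the Accumulo Iterator API; `AgeOffSketch` and `ageOff` are hypothetical names. The slide does not say which side an entry that is exactly 100s old falls on, so keeping it is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

final class AgeOffSketch {
    // Keeps timestamps whose age is at most ttl; older entries are dropped.
    // Assumption: an entry exactly ttl old is kept.
    static List<Long> ageOff(List<Long> timestamps, long currentTime, long ttl) {
        List<Long> kept = new ArrayList<>();
        for (long ts : timestamps) {
            if (currentTime - ts <= ttl) {
                kept.add(ts);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Timestamps from the slide, with current time 1102 and a 100s limit.
        List<Long> slide = List.of(1005L, 1004L, 1003L, 1002L, 1001L, 1000L);
        System.out.println(ageOff(slide, 1102L, 100L));
    }
}
```

At scan time the dropped entries are only hidden from the reader; at major compaction time they are left out of the new file.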
Combiner Iterators
Apply a function to all available versions of a particular key
Row ID Column Family Column Qualifier Column Visibility Timestamp Value
bob attribute score public 1005 33
bob attribute score public 1004 65
bob attribute score public 1003 71
bob attribute score public 1002 59
bob attribute score public 1001 57
bob attribute score public 1000 51
MAX 71
Scan time: server side combining
Minor & Major compaction time: consolidation
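A combiner folds every version of one key into a single value. The sketch below shows that fold in plain Java, not through Accumulo's Combiner classes; `CombinerSketch` and `combine` are hypothetical names, and the input is the list of values from the slide.

```java
import java.util.List;
import java.util.function.LongBinaryOperator;

final class CombinerSketch {
    // Applies fn across all versions of one key and returns the single result.
    static long combine(List<Long> versions, LongBinaryOperator fn) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("no versions to combine");
        }
        long result = versions.get(0);
        for (int i = 1; i < versions.size(); i++) {
            result = fn.applyAsLong(result, versions.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Values of bob / attribute / score, newest first, as on the slide.
        List<Long> scores = List.of(33L, 65L, 71L, 59L, 57L, 51L);
        System.out.println("MAX " + combine(scores, Math::max));
    }
}
```

Swapping `Math::max` for `Long::sum` gives a summing combiner with the same structure.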
USE CASES
Basic Structured Data
Row ID Column Family Column Qualifier Column Visibility Timestamp Value
bob attribute surname public Jul 2013 doe
bob attribute height public Jun 2012 5’11”
bob insurance dental private Sep 2009 MetLife
jane attribute bloodType public Jul 2011 ab-
jane attribute surname public Aug 2013 doe
jane contact cellPhone public Dec 2010 (808) 345-9876
jane insurance vision private Jan 2008 VSP
john allergy major private Feb 1988 amoxicillin
john attribute weight public Sep 2013 180
john contact homeAddr public Mar 2003 34 Baker LN
Indexing Everything
Event Table
Row ID Column Fam Column Qual Visibility Time value
Index Table
index Column Fam Column Qual:Row ID Visibility Time -
to Column Fam Column Qual:Row ID Visibility Time -
values Column Fam Column Qual:Row ID Visibility Time -
Index Table
Row ID Column Family Column Qualifier Column Visibility Timestamp Value
(808) 345-9876 contact cellPhone:jane public Dec 2010 -
180 attribute weight:john public Sep 2013 -
34 Baker LN contact homeAddr:john public Mar 2003 -
5’11” attribute height:bob public Jun 2012 -
MetLife insurance dental:bob private Sep 2009 -
VSP insurance vision:jane private Jan 2008 -
ab- attribute bloodType:jane public Jul 2011 -
amoxicillin allergy major:john private Feb 1988 -
doe attribute surname:bob public Jul 2013 -
doe attribute surname:jane public Aug 2013 -
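The index table above is the event table turned around: each value becomes a row ID, and the original row ID moves to the end of the qualifier. A minimal sketch of that inversion follows, using a sorted set of tab-joined strings in place of a real table. `IndexTableSketch` and its methods are hypothetical, visibility and timestamp are left out, and row IDs are assumed to contain no colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class IndexTableSketch {
    // Each event is {rowId, family, qualifier, value}.
    // Index entry: value TAB family TAB qualifier:rowId
    static TreeSet<String> buildIndex(List<String[]> events) {
        TreeSet<String> index = new TreeSet<>();
        for (String[] e : events) {
            index.add(String.join("\t", e[3], e[1], e[2] + ":" + e[0]));
        }
        return index;
    }

    // Finds row IDs by value with a seek into the sorted index, not a full scan.
    static List<String> rowsWithValue(TreeSet<String> index, String value) {
        List<String> rows = new ArrayList<>();
        String prefix = value + "\t";
        for (String entry : index.tailSet(prefix)) {
            if (!entry.startsWith(prefix)) {
                break; // past the last entry for this value
            }
            rows.add(entry.substring(entry.lastIndexOf(':') + 1));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> events = new ArrayList<>();
        events.add(new String[] {"bob", "attribute", "surname", "doe"});
        events.add(new String[] {"jane", "attribute", "surname", "doe"});
        events.add(new String[] {"john", "allergy", "major", "amoxicillin"});
        System.out.println(rowsWithValue(buildIndex(events), "doe"));
    }
}
```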
Data Lake
PATIENTS MEDICINES DOCTORS
INDEX
Tell me everything you know of amoxicillin
Data Lake
PATIENTS DISEASES DOCTORS
INDEX: amoxicillin
bob:allergy:amoxicillin
larry:takes:amoxicillin
Stomach ulcer:treatment:amoxicillin
smith:prescribed:amoxicillin
Infection:treatment:amoxicillin
Diarrhea:side effect:amoxicillin
Graphs
Adjacency matrix (columns: Start Nodes, rows: End Nodes)
  a b c d e
a - 0 1 0 0
b 1 - 0 0 0
c 0 0 - 1 0
d 1 0 1 - 1
e 0 0 0 0 -
Row ID Column Family Column Qualifier Value
a edge b 1
a edge d 1
c edge a 1
c edge d 1
d edge c 1
e edge d 1
Term-Partitioned Index
Query: “baseball” AND “football” AND “soccer”
Tablet Server 1 (knows about the term “baseball”)
Row ID Column Family Value
baseball document docid_3
baseball document docid_2
bat document docid_2
RESULTS: [docid_2, docid_3]
Tablet Server 2 (knows about the term “football”)
Row ID Column Family Value
football document docid_1
football document docid_3
glove document docid_1
RESULTS: [docid_1, docid_3]
Tablet Server 3 (knows about the term “soccer”)
Row ID Column Family Value
nba document docid_1
shoes document docid_1
soccer document docid_3
RESULTS: [docid_3]
Client: client-side set intersection of [docid_2, docid_3], [docid_1, docid_3], [docid_3]
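The client-side step can be sketched as a set intersection over the per-term results. This is plain Java with the index held in a map, not a query against tablet servers; `TermIndexSketch` and `queryAnd` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TermIndexSketch {
    // Intersects the document sets of all terms (an AND query).
    static Set<String> queryAnd(Map<String, Set<String>> index, List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            // In the slide, each of these lookups is answered by a different tablet server.
            Set<String> docs = index.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = Map.of(
            "baseball", Set.of("docid_2", "docid_3"),
            "football", Set.of("docid_1", "docid_3"),
            "soccer", Set.of("docid_3"));
        System.out.println(queryAnd(index, List.of("baseball", "football", "soccer")));
    }
}
```

Because the index is partitioned by term, the intersection has to happen on the client: no single tablet server sees the results for all three terms.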
Geospatial Indexing: Z-Order Curve
33.333W, 55.555N = 3535.353535
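The example above interleaves the decimal digits of the two coordinates, so points that are close on the map tend to share a key prefix. A minimal sketch of that interleaving follows. `ZOrderSketch` and `interleave` are hypothetical names, and both inputs are assumed to be formatted with the same number of digits on each side of the decimal point. Z-order keys are often built from the bits of integer-encoded coordinates instead; digits are used here only to match the slide.

```java
final class ZOrderSketch {
    // Interleaves the digits of two equally formatted coordinates.
    static String interleave(String lon, String lat) {
        if (lon.length() != lat.length() || lon.indexOf('.') != lat.indexOf('.')) {
            throw new IllegalArgumentException("coordinates must share one format");
        }
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < lon.length(); i++) {
            char c = lon.charAt(i);
            if (c == '.') {
                key.append('.');                     // keep a single decimal point
            } else {
                key.append(c).append(lat.charAt(i)); // one digit from each coordinate
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(interleave("33.333", "55.555"));
    }
}
```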
WHERE TO GO FROM HERE
Resources
Apache Accumulo website
accumulo.apache.org
Accumulo Summit 2014
accumulosummit.com
slideshare.net/AccumuloSummit
Multi-day in-person training
UMBC Training Centers
ClearEdge IT Solutions
Sqrrl
Find a job
 
OpenStack NSA
OpenStack NSAOpenStack NSA
OpenStack NSA
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache Accumulo
 
Accumulo meetup 20130109
Accumulo meetup 20130109Accumulo meetup 20130109
Accumulo meetup 20130109
 
Introduction to Accumulo
Introduction to AccumuloIntroduction to Accumulo
Introduction to Accumulo
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Accumulo design
Accumulo designAccumulo design
Accumulo design
 

Ähnlich wie An Introduction to Accumulo

How to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHow to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHortonworks
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleHarald Erb
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!Cloudera, Inc.
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009yhadoop
 
Dallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: HadoopDallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: Hadooplamont_lockwood
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
Get started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languagesGet started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languagesJanBask Training
 
The Big Picture on Hadoop
The Big Picture on HadoopThe Big Picture on Hadoop
The Big Picture on HadoopStackIQ
 
Cloud computing major project
Cloud computing major projectCloud computing major project
Cloud computing major projectayk115
 
The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?J Langley
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchVMware Tanzu
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseJeffrey T. Pollock
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 

Ähnlich wie An Introduction to Accumulo (20)

How to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHow to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDB
 
Apache Hive - Introduction
Apache Hive - IntroductionApache Hive - Introduction
Apache Hive - Introduction
 
Big Data A La Carte Menu
Big Data A La Carte MenuBig Data A La Carte Menu
Big Data A La Carte Menu
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by Example
 
hadoop_module
hadoop_modulehadoop_module
hadoop_module
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
Dallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: HadoopDallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: Hadoop
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Get started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languagesGet started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languages
 
The Big Picture on Hadoop
The Big Picture on HadoopThe Big Picture on Hadoop
The Big Picture on Hadoop
 
Cloud computing major project
Cloud computing major projectCloud computing major project
Cloud computing major project
 
The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?The Big Data Puzzle, Where Does the Eclipse Piece Fit?
The Big Data Puzzle, Where Does the Eclipse Piece Fit?
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
 
What is apache_pig
What is apache_pigWhat is apache_pig
What is apache_pig
 
What is apache_pig
What is apache_pigWhat is apache_pig
What is apache_pig
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San Jose
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
What is apache pig
What is apache pigWhat is apache pig
What is apache pig
 

Mehr von Donald Miner

Machine Learning Vital Signs
Machine Learning Vital SignsMachine Learning Vital Signs
Machine Learning Vital SignsDonald Miner
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MDDonald Miner
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Survey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataSurvey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataDonald Miner
 
Data, The New Currency
Data, The New CurrencyData, The New Currency
Data, The New CurrencyDonald Miner
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest Donald Miner
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 

Mehr von Donald Miner (11)

Machine Learning Vital Signs
Machine Learning Vital SignsMachine Learning Vital Signs
Machine Learning Vital Signs
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Survey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataSurvey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing Data
 
SQL on Accumulo
SQL on AccumuloSQL on Accumulo
SQL on Accumulo
 
Data, The New Currency
Data, The New CurrencyData, The New Currency
Data, The New Currency
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 

Kürzlich hochgeladen

IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 

Kürzlich hochgeladen (20)

IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 

An Introduction to Accumulo

  • 1. AN INTRODUCTION TO APACHE ACCUMULO: HOW IT WORKS, WHY IT EXISTS, AND HOW IT IS USED. Donald Miner, CTO, ClearEdge IT Solutions, @donaldpminer. August 5th, 2014
  • 2. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • 3. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Sorted keys: Adelaide Bartkowski, Alyssa Files, Beatriz Palmore, Cecilia Ours, Craig Avalos, Dianna Lapointe, Erma Davis, Fermina Smead, Garrett Harsh, Gaylene Sherry, Gilberto Pardue, Hui Nodal, Janell Tomita, Jannette Betters, Jeana Delk, Madlyn Radke, Peggie Allis, Rhona Zygmont, Tran Degarmo, Wilhelmina Papp
  • 4. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. The sorted keys are distributed across three ranges:
      • -inf to D: Adelaide Bartkowski, Alyssa Files, Beatriz Palmore, Cecilia Ours, Craig Avalos, Dianna Lapointe
      • E to H: Erma Davis, Fermina Smead, Garrett Harsh, Gaylene Sherry, Gilberto Pardue, Hui Nodal
      • J to +inf: Janell Tomita, Jannette Betters, Jeana Delk, Madlyn Radke, Peggie Allis, Rhona Zygmont, Tran Degarmo, Wilhelmina Papp
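The range split on slide 4 can be sketched as a binary search over split points. This is a minimal illustration, not Accumulo's implementation: the split points "E" and "J" and the function name `tablet_index` are assumptions chosen so that the slide's names land in the same three groups, with each range treated as inclusive of its end row.

```python
from bisect import bisect_left

# Assumed split points (not from the slides): rows <= "E" go to range 0,
# rows in ("E", "J"] go to range 1, and everything after "J" goes to range 2.
SPLITS = ["E", "J"]


def tablet_index(row, splits=SPLITS):
    """Return the index of the range that holds `row`.

    bisect_left finds the first split point that is >= row, which makes
    each range exclusive of its start row and inclusive of its end row.
    """
    return bisect_left(splits, row)


if __name__ == "__main__":
    for name in ["Adelaide Bartkowski", "Erma Davis", "Janell Tomita"]:
        print(name, "-> range", tablet_index(name))
```

Because the keys are sorted, finding the range for any row costs only a search over the split points, no matter how many rows each range holds.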
  • 5. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Components shown: Accumulo Master, three TabletServers, ZooKeeper
  • 6. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
      KEY                  VALUE
      Adelaide Bartkowski  91294124
      Alyssa Files         491294
      Beatriz Palmore      4124124124
      Cecilia Ours         419120
      Craig Avalos         940124
      Dianna Lapointe      4921
      Erma Davis           050194
      Fermina Smead        10024599949
      Garrett Harsh        140095931
      Gaylene Sherry       914815
      Gilberto Pardue      412414124124
      Hui Nodal            962195192
      Janell Tomita        12121
      Jannette Betters     9192012
      Jeana Delk           9120150
      Madlyn Radke         4921
      Peggie Allis         944944
      Rhona Zygmont        123103
      Tran Degarmo         9499494
      Wilhelmina Papp      11221
      Lookup “Garrett Harsh” (by key): FAST
      Lookup “4921” (by value): SLOW
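The fast/slow contrast on slide 6 comes from the sort order: a key lookup can binary-search the sorted keys, while a value lookup has to visit every entry. The sketch below uses four rows from the slide; the function names are hypothetical and the in-memory list only stands in for a sorted table.

```python
from bisect import bisect_left

# A few key/value pairs taken from slide 6, kept sorted by key.
TABLE = sorted([
    ("Alyssa Files", "491294"),
    ("Dianna Lapointe", "4921"),
    ("Garrett Harsh", "140095931"),
    ("Madlyn Radke", "4921"),
])
KEYS = [key for key, _ in TABLE]


def lookup_by_key(key):
    """Binary search over the sorted keys: O(log n) comparisons."""
    i = bisect_left(KEYS, key)
    if i < len(KEYS) and KEYS[i] == key:
        return TABLE[i][1]
    return None


def lookup_by_value(value):
    """No ordering on values, so every entry must be scanned: O(n)."""
    return [key for key, stored in TABLE if stored == value]


if __name__ == "__main__":
    print(lookup_by_key("Garrett Harsh"))
    print(lookup_by_value("4921"))
```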
  • 12. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
      • MIT Lincoln Lab study: 100 million inserts per second using Accumulo (http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf)
      • Booz Allen Hamilton study (http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf): 942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours; 94 MB/sec, 15 TB/hr, 80 million inserts per second; 11 tablet servers went down with no interruption; showed linear scalability for write throughput; 22,000 queries per second
  • 13. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and features are outlined here. Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, Hypertable, and Cassandra. Accumulo began its development in 2008 and joined the Apache community in 2011.
  • 17. HBase vs. Accumulo
      • Slight differences in visibility labels
      • Coprocessors vs. Iterators
      • Accumulo has faster write throughput*
      • HBase’s reads are faster*
      • HBase has more ecosystem integration
      • BatchScanner
      • Accumulo can shift around locality groups after the fact
      • Accumulo has been shown to work with no problems at 1,000 nodes (BAH paper). Facebook and others run a “cell” design for HBase. Largest clusters in the hundreds*.
      * We believe
      Disclaimer: I am biased
  • 19. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control. Example label: (admin & developer) | analyst
  • 20. Column Visibility Syntax
      Label                Description
      A & B                Both ‘A’ and ‘B’ are required
      A | B                Either ‘A’ or ‘B’ is required
      A & (C | B)          ‘A’ and ‘C’, or ‘A’ and ‘B’, are required
      A | (B & C)          ‘A’, or ‘B’ and ‘C’, is required
      (A | B) & (C & D)    ?
      A & (B & (C | D))    ?
      Examples:
      Patient has schizophrenia: insurer | (MD & psych)
      Patient has stomach ulcers: insurer | doctor
      Patient has cavity: insurer | dentist
      Patient has consent for general anesthesia: surgeon
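The label syntax on slide 20 can be illustrated with a small evaluator that checks an expression against a set of authorizations. This is only a sketch: Accumulo's own parsing is done in its Java `ColumnVisibility` class, and the Python function `is_visible` is a hypothetical stand-in. It assumes labels made of letters, digits, and underscores, and it rejects expressions that mix `&` and `|` at the same level without parentheses.

```python
import re

# A token is a label (letters, digits, underscore) or one of & | ( )
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad character at position {pos} in {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, auths):
    """Return True if the authorization set `auths` satisfies `expr`."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an empty label is visible to everyone
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected {tok!r}")
        return tok in auths

    def expression():
        nonlocal pos
        result = term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = term()  # always parse the right side before combining
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    result = expression()
    if pos != len(tokens):
        raise ValueError(f"unexpected {tokens[pos]!r}")
    return result


if __name__ == "__main__":
    print(is_visible("insurer | (MD & psych)", {"MD", "psych"}))
    print(is_visible("(A | B) & (C & D)", {"A", "C"}))
```

The two rows marked "?" on the slide can be worked out the same way: call `is_visible` with different authorization sets and see which combinations pass.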
  • 23. More cool features
      • Constraints: user-defined Java functions that allow or prevent new writes based on a condition
      • Large rows: no limit on data stored in a row
      • Multiple masters & FATE: able to execute table operations in a fault-tolerant manner
      • MapReduce InputFormats
      • Bulk import utilities: write directly to Accumulo file formats
      • Batch scanner: client scans multiple ranges at once
      • Batch writer: client buffers and organizes data before writing in parallel
  • 25. More cool features
      • Thrift proxy: access Accumulo through Ruby, Python, …
      • Monitor page: shows performance, status, errors, more
      • Locality groups: group column families together on disk for performance tuning (changeable later)
      • On-HDFS at rest encryption (work in progress)
      • Table import and export
  • 27. Scalability & Performance • Multiple HDFS volumes: Accumulo can use multiple NameNodes to store its data • Master stores metadata in an Accumulo table • Native in-memory map: data is first written into a buffer implemented in C++, outside of Java • Relative encoding: fields that repeat across consecutive keys are flagged instead of rewritten • Scan pipelines: stages of the read path are parallelized into separate threads • Caching: recently scanned data is cached
  • 29. Data Model. KEY: ROW ID, COLUMN (FAMILY, QUALIFIER, VISIBILITY), TIMESTAMP. Each key maps to a VALUE.
      Row ID  Family   Qualifier  Visibility       Timestamp  Value
      derek   …        …          …                …          …
      don     contact  email      admin | private  11905014   dminer@gopivotal.com
      don     contact  email      admin | private  12412412   dminer@clearedgeit.com
      don     contact  email      public           12412412   dm…@cl....com
      don     contact  twitter    public           12423523   @donaldpminer
      don     info     height     public           12314514   5’ 9”
      don     info     SSN        private          12314514   123-45-6789
      erica   …        …          …                …          …
  • 30. Data Model. The key/value entries:
      Row ID  Family   Qualifier  Visibility        Timestamp  Value
      derek   …        …          …                 …          …
      don     contact  email      admin | private   11905014   dminer@gopivotal.com
      don     contact  email      admin | private   12412412   dminer@clearedgeit.com
      don     contact  email      public            12412412   dm…@cl....com
      don     contact  twitter    public | private  12423523   @donaldpminer
      don     info     height     public | private  12314514   5’ 9”
      don     info     SSN        private           12314514   123-45-6789
      erica   …        …          …                 …          …
    The same data shown as one row per name:
      Name   email        twitter        picture     height  SSN
      derek  de…@ad….com                 9efe23aa…   6’2”
      don    dm…@cl….com  @donaldpminer              5’ 9”
      erica               @erica         aef319eaf…
  • 31. Data Model (same table as slide 29). Row ID: lookup key
  • 32. Data Model (same table as slide 29). Column Family: collection of data that is kept together
  • 33. Data Model (same table as slide 29). Column Qualifier: what the data is
  • 34. Data Model (same table as slide 29). Column Visibility: who can see the data
  • 35. Data Model (same table as slide 29). Timestamp: when the data was created
  • 36. Data Model (same table as slide 29). Key: UNIQUENESS
  • 37. Data Model (same table as slide 29). Key: SORTED
  • 38. Data Model (same table as slide 29). Value: some piece of information
  • 39. Writing data into Accumulo. Starting from the table on slide 29, the following new entry is to be written:
      Row ID  Family  Qualifier  Visibility  Timestamp  Value
      don     info    picture    public      13119103   dd3ae1d3b951a33f…
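  • 40. Writing data into Accumulo. Entry to write: don, info, picture, public, 13119103, dd3ae1d3b951a33f…
      import org.apache.accumulo.core.client.BatchWriter;
      import org.apache.accumulo.core.client.BatchWriterConfig;
      import org.apache.accumulo.core.data.Mutation;
      import org.apache.accumulo.core.data.Value;
      import org.apache.accumulo.core.security.ColumnVisibility;
      import org.apache.hadoop.io.Text;
      // conn is an existing Connector; MyPictureObj supplies the picture bytes
      Text rowID = new Text("don");
      Text colFam = new Text("info");
      Text colQual = new Text("picture");
      ColumnVisibility colVis = new ColumnVisibility("public");
      long timestamp = System.currentTimeMillis();
      Value value = new Value(MyPictureObj.getBytes());
      Mutation mutation = new Mutation(rowID);
      mutation.put(colFam, colQual, colVis, timestamp, value);
      BatchWriterConfig config = new BatchWriterConfig();
      BatchWriter writer = conn.createBatchWriter("usertable", config);
      writer.add(mutation);
      writer.close();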
  • 41. Writing data into Accumulo. The table after the write, with the new don / info / picture entry in place:
      Row ID  Family   Qualifier  Visibility       Timestamp  Value
      derek   …        …          …                …          …
      don     contact  email      admin | private  11905014   dminer@gopivotal.com
      don     contact  email      admin | private  12412412   dminer@clearedgeit.com
      don     contact  email      public           12412412   dm…@cl....com
      don     contact  twitter    public           12423523   @donaldpminer
      don     info     height     public           12314514   5’ 9”
      don     info     picture    public           13119103   dd3ae1d3b951a33f…
      don     info     SSN        private          12314514   123-45-6789
      erica   …        …          …                …          …
  • 42. Writing data into Accumulo: New Record
  • 43. Writing data into Accumulo: the New Record is appended to the Write Ahead Log (WAL) and inserted in sorted order into the MemTable
  • 46. Writing data into Accumulo: a Minor Compaction flushes the MemTable into a new sorted RFile (minc)
  • 47. Writing data into Accumulo: a second Minor Compaction produces a second RFile (minc)
  • 48. Writing data into Accumulo: a third Minor Compaction produces a third RFile (minc)
  • 49. Writing data into Accumulo: a Major Compaction merges the three RFiles (minc) into one sorted RFile (majc)
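The write path on the preceding slides (WAL append, sorted MemTable, minor and major compaction) can be sketched in plain Java. This is a toy model under stated assumptions, not Accumulo's implementation: the class `WritePathSketch`, its fields, and the entry-count threshold are all hypothetical, keys are plain strings, and a newer value simply overwrites an older one instead of being kept as a separate timestamped version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a tablet's write path; none of these names are Accumulo classes.
class WritePathSketch {
    final List<String> writeAheadLog = new ArrayList<>();            // append-only log
    TreeMap<String, String> memTable = new TreeMap<>();              // kept sorted by key
    final List<TreeMap<String, String>> rFiles = new ArrayList<>();  // flushed sorted files
    final int maxMemTableEntries;                                    // hypothetical flush threshold

    WritePathSketch(int maxMemTableEntries) {
        this.maxMemTableEntries = maxMemTableEntries;
    }

    // A write is appended to the WAL, then inserted into the sorted MemTable.
    void write(String key, String value) {
        writeAheadLog.add(key + "=" + value);
        memTable.put(key, value);
        if (memTable.size() >= maxMemTableEntries) {
            minorCompact();
        }
    }

    // Minor compaction: the MemTable becomes a new sorted file and is replaced by an empty one.
    void minorCompact() {
        if (memTable.isEmpty()) {
            return;
        }
        rFiles.add(memTable);
        memTable = new TreeMap<>();
    }

    // Major compaction: merge all files into one sorted file.
    // Files are merged oldest first, so newer values overwrite older ones.
    void majorCompact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : rFiles) {
            merged.putAll(file);
        }
        rFiles.clear();
        rFiles.add(merged);
    }
}
```

With a threshold of 2, four writes produce two flushed files, and `majorCompact()` then leaves a single sorted file.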
  • 50. Reading data. The table from slide 41 is read with these scan parameters: Range don-don, Family info, Visibilities public
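  • 51. Reading data. Scan parameters: Range don-don, Family info, Visibilities public
      import java.util.Map.Entry;
      import org.apache.accumulo.core.client.Scanner;
      import org.apache.accumulo.core.data.Key;
      import org.apache.accumulo.core.data.Range;
      import org.apache.accumulo.core.data.Value;
      import org.apache.accumulo.core.security.Authorizations;
      import org.apache.hadoop.io.Text;
      // conn is an existing Connector
      Authorizations auths = new Authorizations("public");
      Scanner scan = conn.createScanner("usertable", auths);
      scan.setRange(new Range("don", "don"));
      scan.fetchColumnFamily(new Text("info"));
      for (Entry<Key,Value> entry : scan) {
          String row = entry.getKey().getRow().toString();
          Value value = entry.getValue();
      }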
  • 52. Reading data. The scan (Range don-don, Family info, Visibilities public) goes to the tablet covering c - f, which reads from its MemTable, three RFiles (minc), and one RFile (majc)
  • 53. Reading data (same table as slide 41). Scan parameters: Range don-don, Family info, Visibilities public, user, tech
  • 54. Reading data (same table as slide 41). Scan: Range don-don, Visibilities public, user, tech
  • 55. Reading data (same table as slide 41). Scan: Range d-e, Visibilities public, user, tech
  • 56. Iterators • Iterators run on the tablet server at these times: 1. Scan Time 2. Minor Compaction 3. Major Compaction • Multiple iterators are included with Accumulo • Custom iterators can be created using the Iterator API
  • 60. Age-Off Iterator. Current Time: 1102
      Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
      bob     attribute      score             public             1005       24
      bob     attribute      score             public             1004       55
      bob     attribute      score             public             1003       71
      bob     attribute      score             public             1002       66
      bob     attribute      score             public             1001       39
      bob     attribute      score             public             1000       33
    The entries are split into two groups: Entries < 100s old and Entries > 100s old
    Scan time: server side filtering
    Major compaction time: age off
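The age-off decision on this slide boils down to a per-entry timestamp comparison. The following plain-Java sketch shows only that logic; it does not use the Accumulo Iterator API, the names `AgeOffSketch`, `Entry`, and `ageOff` are hypothetical, and dropping an entry that is exactly `ttl` old is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative filter logic only; a real age-off iterator runs inside the tablet server.
class AgeOffSketch {
    // One table entry, reduced to the two fields the filter looks at.
    static final class Entry {
        final long timestamp;
        final String value;

        Entry(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Keep only entries whose age (currentTime - timestamp) is strictly below ttl.
    static List<Entry> ageOff(List<Entry> entries, long currentTime, long ttl) {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (currentTime - e.timestamp < ttl) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

Applied at scan time, such a filter only hides old entries from the reader; applied during a compaction, the filtered entries are not written to the new file.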
  • 61. Combiner Iterators. Apply a function to all available versions of a particular key
      Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
      bob     attribute      score             public             1005       33
      bob     attribute      score             public             1004       65
      bob     attribute      score             public             1003       71
      bob     attribute      score             public             1002       59
      bob     attribute      score             public             1001       57
      bob     attribute      score             public             1000       51
    MAX: 71
    Scan time: server side combining
    Minor & Major compaction time: consolidation
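The MAX example above reduces all versions of one key to a single value. A minimal plain-Java sketch of that reduction, assuming the values are integers; `MaxCombinerSketch` and `combineMax` are hypothetical names, and the Accumulo Combiner API itself is not used here.

```java
import java.util.List;

// Illustrative reduction only: all versions of one key collapse into their maximum.
class MaxCombinerSketch {
    // Returns the largest value among the versions; Long.MIN_VALUE for an empty list.
    static long combineMax(List<Long> versions) {
        long max = Long.MIN_VALUE;
        for (long v : versions) {
            max = Math.max(max, v);
        }
        return max;
    }
}
```

Because the maximum of partial maxima equals the overall maximum, the same function can be applied to partial results at compaction time and again at scan time.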
  • 63. Basic Structured Data
      Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
      bob     attribute      surname           public             Jul 2013   doe
      bob     attribute      height            public             Jun 2012   5’11”
      bob     insurance      dental            private            Sep 2009   MetLife
      jane    attribute      bloodType         public             Jul 2011   ab-
      jane    attribute      surname           public             Aug 2013   doe
      jane    contact        cellPhone         public             Dec 2010   (808) 345-9876
      jane    insurance      vision            private            Jan 2008   VSP
      john    allergy        major             private            Feb 1988   amoxicillin
      john    attribute      weight            public             Sep 2013   180
      john    contact        homeAddr          public             Mar 2003   34 Baker LN
  • 64. Indexing Everything
    Event Table:
      Row ID  Column Fam  Column Qual         Visibility  Time  Value
      Row ID  Column Fam  Column Qual         Visibility  Time  value
    Index Table:
      index   Column Fam  Column Qual:Row ID  Visibility  Time  -
      to      Column Fam  Column Qual:Row ID  Visibility  Time  -
      values  Column Fam  Column Qual:Row ID  Visibility  Time  -
  • 65. Index Table
      Row ID          Column Family  Column Qualifier  Column Visibility  Timestamp  Value
      (808) 345-9876  contact        cellPhone:jane    public             Dec 2010   -
      180             attribute      weight:john       public             Sep 2013   -
      34 Baker LN     contact        homeAddr:john     public             Mar 2003   -
      5’11”           attribute      height:bob        public             Jun 2012   -
      MetLife         insurance      dental:bob        private            Sep 2009   -
      VSP             insurance      vision:jane       private            Jan 2008   -
      ab-             attribute      bloodType:jane    public             Jul 2011   -
      amoxicillin     allergy        major:john        private            Feb 1988   -
      doe             attribute      surname:bob       public             Jul 2013   -
      doe             attribute      surname:jane      public             Aug 2013   -
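The index table above follows a mechanical rule: the event value becomes the index Row ID, and the event Row ID is appended to the Column Qualifier after a colon. A minimal sketch of that transformation in plain Java; `IndexTableSketch`, `Entry`, `toIndexEntry`, and `eventRow` are hypothetical names, and visibility and timestamp are omitted for brevity.

```java
// Illustrative mapping from an event-table entry to its index-table entry.
class IndexTableSketch {
    // A table entry reduced to row, family, qualifier, and value.
    static final class Entry {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // The value becomes the row; the original row is appended to the qualifier.
    // The index value is the "-" placeholder shown on the slide.
    static Entry toIndexEntry(Entry event) {
        return new Entry(event.value, event.family, event.qualifier + ":" + event.row, "-");
    }

    // Recover the event-table Row ID from an index entry's qualifier.
    static String eventRow(Entry index) {
        int sep = index.qualifier.lastIndexOf(':');
        return index.qualifier.substring(sep + 1);
    }
}
```

Looking up "doe" in the index then amounts to scanning the index rows equal to "doe" and reading the event Row IDs (bob, jane) from the qualifiers.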
  • 67. Data Lake PATIENTS MEDICINES DOCTORS INDEX Tell me everything you know of amoxicillin amoxicillin
  • 68. Data Lake. The INDEX lookup for amoxicillin returns: PATIENTS: bob:allergy:amoxicillin, larry:takes:amoxicillin. DISEASES: Stomach ulcer:treatment:amoxicillin, Infection:treatment:amoxicillin, Diarrhea:side effect:amoxicillin. DOCTORS: smith:prescribed:amoxicillin
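  • 69. Graphs. Nodes: a, b, c, d, e. Adjacency matrix (columns are Start Nodes, rows are End Nodes):
             a  b  c  d  e
          a  -     1
          b  1  -
          c        -  1
          d  1     1  -  1
          e              -
    The same graph as a table, one entry per edge:
      Row ID  Column Family  Column Qualifier  Value
      a       edge           b                 1
      a       edge           d                 1
      c       edge           a                 1
      c       edge           d                 1
      d       edge           c                 1
      e       edge           d                 1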
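Because the edge table is sorted by Row ID, all outgoing edges of a node sit next to each other and can be read with one range scan. The sketch below imitates this with a `TreeMap` standing in for the table; `GraphTableSketch`, `addEdge`, and `neighbors` are hypothetical names, and flattening the key into a single space-separated string assumes node names contain no spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sorted map standing in for the edge table; key layout is "startNode edge endNode".
class GraphTableSketch {
    final TreeMap<String, Integer> table = new TreeMap<>();

    // One entry per edge, with the value 1 as on the slide.
    void addEdge(String start, String end) {
        table.put(start + " edge " + end, 1);
    }

    // All outgoing neighbors of a node: a range scan over the keys sharing the node's prefix.
    List<String> neighbors(String start) {
        String prefix = start + " edge ";
        List<String> result = new ArrayList<>();
        for (String key : table.subMap(prefix, prefix + Character.MAX_VALUE).keySet()) {
            result.add(key.substring(prefix.length()));
        }
        return result;
    }
}
```

Finding incoming edges with this layout would require a full scan, which is one reason to also store each edge in the reverse direction or in a separate index table.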
  • 70. Term-Partitioned Index. Query: “baseball” AND “football” AND “soccer”
    Tablet Server 1 (knows about the term “baseball”):
      Row ID    Column Family  Value
      baseball  document       docid_3
      baseball  document       docid_2
      bat       document       docid_2
      RESULTS: [docid_2, docid_3]
    Tablet Server 2 (knows about the term “football”):
      Row ID    Column Family  Value
      football  document       docid_1
      football  document       docid_3
      glove     document       docid_1
      RESULTS: [docid_1, docid_3]
    Tablet Server 3 (knows about the term “soccer”):
      Row ID    Column Family  Value
      nba       document       docid_1
      shoes     document       docid_1
      soccer    document       docid_3
      RESULTS: [docid_3]
    Client: Client-side Set Intersection of [docid_2, docid_3], [docid_1, docid_3], [docid_3]
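The final step of the term-partitioned query happens on the client: each tablet server returns the document IDs for its term, and the client intersects the lists. A minimal sketch of that step in plain Java; `TermIntersectionSketch` and `intersect` are hypothetical names, and the per-term lists are assumed to have been fetched already.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Client-side AND over per-term result lists.
class TermIntersectionSketch {
    // Returns the document IDs present in every list; empty if no lists are given.
    static Set<String> intersect(List<List<String>> perTermResults) {
        if (perTermResults.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(perTermResults.get(0));
        for (List<String> docs : perTermResults.subList(1, perTermResults.size())) {
            result.retainAll(docs);
        }
        return result;
    }
}
```

With the three result lists from the slide, only docid_3 appears in all of them, so it is the sole match for the query.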
  • 71. Geospatial Indexing: Z-Order Curve 33.333W, 55.555N = 3535.353535
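The slide's example interleaves the decimal digits of the two coordinates, so that points close together in both dimensions tend to get keys that sort near each other. The sketch below reproduces exactly that digit interleaving; `ZOrderSketch` and `interleave` are hypothetical names, both inputs are assumed to be formatted with the same number of digits, and production Z-order encodings more commonly interleave bits than decimal digits.

```java
// Digit-level interleaving as shown on the slide.
class ZOrderSketch {
    // Interleave two equally formatted decimal strings character by character.
    // A decimal point at the same position in both inputs is emitted once.
    static String interleave(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("inputs must have the same length");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < a.length(); i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca == '.' || cb == '.') {
                if (ca != cb) {
                    throw new IllegalArgumentException("decimal points must align");
                }
                out.append('.');
            } else {
                out.append(ca).append(cb);
            }
        }
        return out.toString();
    }
}
```

Calling `ZOrderSketch.interleave("33.333", "55.555")` yields the slide's key, 3535.353535.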
  • 72. WHERE TO GO FROM HERE
  • 73. Resources Apache Accumulo website accumulo.apache.org Accumulo Summit 2014 accumulosummit.com slideshare.net/AccumuloSummit Multi-day in-person training UMBC Training Centers ClearEdge IT Solutions Sqrrl
  • 75. AN INTRODUCTION TO APACHE ACCUMULO HOW IT WORKS, WHY IT EXISTS, AND HOW IT IS USED Donald Miner CTO, ClearEdge IT Solutions @donaldpminer August 5th, 2014

Editor's notes

  1. Two basic operators: the AND operator, represented by &, and the OR operator, represented by |. In the examples, A, B, C, and D are security tokens. Security tokens are strings of alphanumeric characters. Tokens are user defined. Parentheses are required to use nested logic.
  2. A Minor Compaction is triggered when the Tablet’s MemTable reaches its maximum size, at which point the MemTable is flushed. A Minor Compaction iterator is applied while the MemTable is flushed and a new RFile is created. Since the iterator is applied during a Minor Compaction, the iterator does affect the persistence of the data.
  3. A Major Compaction periodically merges a set of RFiles into one. If a Major Compaction iterator is enabled, the iterator runs after the merge to filter data before writing the new RFile. Since the iterator is applied during a Major Compaction, the iterator does affect the persistence of the data.
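Note 1 describes visibility expressions built from tokens, & (AND), | (OR), and parentheses. The following plain-Java sketch evaluates such an expression against a user's set of tokens. It is an illustration of the rules in the note, not Accumulo's own parser: `VisibilitySketch` and `canSee` are hypothetical names, tokens are limited to letters and digits, and spaces (as in "admin | private" on the slides) are simply ignored.

```java
import java.util.Set;

// Small recursive-descent evaluator for visibility expressions such as "(A&B)|C".
class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr;
        this.auths = auths;
    }

    // True if a user holding the tokens in auths satisfies the expression.
    static boolean canSee(String expression, Set<String> auths) {
        VisibilitySketch p = new VisibilitySketch(expression.replace(" ", ""), auths);
        boolean result = p.parseExpression();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected character at " + p.pos);
        }
        return result;
    }

    // expression := term ('&' term)* | term ('|' term)*
    // Mixing & and | at the same level is rejected: nested logic needs parentheses.
    private boolean parseExpression() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | requires parentheses");
            }
            op = next;
            boolean rhs = parseTerm();
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := token | '(' expression ')'
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpression();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing closing parenthesis");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a token at " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }
}
```

For example, a user holding only the token private satisfies "admin | private", while "A&B|C" is rejected until it is written as "(A&B)|C" or "A&(B|C)".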