Out of the box, Accumulo's strengths are difficult to appreciate without first building an application that showcases its capabilities to handle massive amounts of data. Unfortunately, building such an application is non-trivial for many would-be users, which affects Accumulo's adoption.
In this talk, we introduce Datawave, a complete ingest, query, and analytic framework for Accumulo. Datawave, recently open-sourced by the National Security Agency, capitalizes on Accumulo's capabilities, provides an API for working with structured and unstructured data, and boasts a robust, flexible, and scalable backend.
We'll do a deep dive into Datawave's project layout, table structures, and APIs in addition to demonstrating the Datawave quickstart—a tool that makes it incredibly easy to hit the ground running with Accumulo and Datawave without having to develop a complete application.
2. Agenda
Getting started with Datawave
● Essential Knowledge
● Expectations:
○ What it is / What it isn’t
○ Use Cases / Not-so-great use cases
● What is this stuff? (High Level Architecture)
● Key Concepts
● Getting Started with the Datawave Quickstart
● The good stuff (Ingest / Query / Table Structures)
● What to try next
5. What is Datawave?
Storage & Retrieval engine built on Apache Accumulo
● Provides an Ingest Workflow & MapReduce Framework
● Exposes an API framework for Query, Analytics
● Performs Query Parsing, Planning and Execution
● Supports a variety of data types and formats
6. Setting Expectations
Datawave is not any of these things
● A SQL Database
○ No SQL language support, no ‘tables’ per se.
● A NoSQL Database
○ No generic put / get API operations a la S3 or MongoDB
● A Search Engine
○ Supports full-text retrieval but implements no relevance model
Note: Any of these could be implemented using Datawave, but none were required for the use cases it currently supports
7. Datawave Overview
● Storage & Retrieval engine built on Apache Accumulo
[Diagram: various structured and unstructured data formats (text, image, html, csv, ...) flow in through Datawave Ingest and are served out through Datawave Query to analytics such as graph and search.]
8. Datawave Architecture
[Diagram: many data sources reach the ingest cluster through Ingress (HAProxy) and Dataflow (NiFi) and land in HDFS; the Datawave Flag Maker launches Datawave Ingest Jobs on YARN/MapReduce, and the Datawave Bulk Loader writes the Datawave Tables into Accumulo (with Datawave Iterators, coordinated by Zookeeper) on the warehouse cluster; Datawave Web runs in Wildfly, exposing Datawave Query through the Datawave REST API. Many data sources / multiple Hadoop clusters / multiple HDFS instances / many web front-ends.]
9. Datawave Query
● Query Parsing, Planning and Execution
○ Extended Boolean query syntax with additional operators
○ JEXL or Lucene query syntax with functions
○ Query execution shaped by data characteristics
○ Iterators to perform low level functions (e.g: Intersection)
○ REST API
Example query in JEXL and Lucene syntax:
FIELD == ‘VALUE’ AND FOO == ‘BAR’
field:value AND foo:bar
REST endpoints:
● POST Query/create: parses, validates, and optimizes the query; considers field cardinality, index, ...
● GET Query/id/next: returns pages of results
● POST Query/id/close: frees up resources
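The create/next/close lifecycle can be sketched as a paging session. The class below is a hypothetical in-memory stand-in for the three REST endpoints, not Datawave's actual client or server API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Toy stand-in for the Query/create, Query/id/next, Query/id/close REST
// lifecycle: create registers a query and returns an id, next returns
// successive pages of results, close frees the server-side resources.
class QueryLifecycleSketch {
    private final Map<String, List<String>> sessions = new HashMap<>();
    private final Map<String, Integer> cursors = new HashMap<>();

    // POST Query/create: plan the query, return a query id.
    String create(List<String> matchingRecords) {
        String id = UUID.randomUUID().toString();
        sessions.put(id, matchingRecords);
        cursors.put(id, 0);
        return id;
    }

    // GET Query/id/next: return the next page of results (empty when exhausted).
    List<String> next(String id, int pageSize) {
        List<String> results = sessions.get(id);
        int cursor = cursors.get(id);
        int end = Math.min(cursor + pageSize, results.size());
        cursors.put(id, end);
        return new ArrayList<>(results.subList(cursor, end));
    }

    // POST Query/id/close: free resources held by the query.
    void close(String id) {
        sessions.remove(id);
        cursors.remove(id);
    }
}
```

Clients loop on next until an empty page comes back, then call close so the server can release scanners.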
11. Data Model
● Records
○ May be referred to as a RawRecordContainer, Event, or Document.
○ A collection of fields and content data
● Fields
○ One Record may have multiple fields; fields can be multi-valued, with each value of the same type
○ Fields can be indexed for query or simply stored.
○ Special fields such as index-only, tokenized, and virtual fields
● Field Types & Normalizers
○ Raw field data is turned into index entries using a Normalizer.
○ Normalizers exist for numbers, IP addresses, and geospatial coordinates, and produce lexically sortable values.
○ Fields are stored in both normalized and non-normalized form in Accumulo.
[Diagram: Record → Field → Field Type → Normalizer]
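A numeric normalizer, for instance, must emit strings whose lexicographic order matches numeric order ("9" sorts after "10" as text, which would break range scans). A minimal zero-padding sketch (not Datawave's actual number normalizer):

```java
// Toy normalizer: zero-pads non-negative integers to a fixed width so that
// lexicographic ordering of index entries matches numeric ordering.
class ToyNumberNormalizer {
    private static final int WIDTH = 10;

    static String normalize(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        return String.format("%0" + WIDTH + "d", value);
    }
}
```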
12. Tables, Indexes & Shards
● Overall Structure
○ Data is partitioned into shards
○ Each Shard is contained in a single tablet
○ Each Shard has its own Field & Term Index, Data & Record Storage
○ Shards are inherently date-based
○ Certain Field values tracked in Edges
[Diagram: the Shard Table holds many Shards, each with its own Field Index, Term Index, Record Storage, and Data Storage; the Global Index Table, Edge Table, and Metadata Table sit alongside.]
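Date-based sharding can be sketched as a shard id built from the event date plus a hash-derived partition, so one day's data spreads across a bounded number of tablets and whole days can be aged off by dropping their shards. This is a simplified illustration, not Datawave's exact scheme:

```java
// Sketch of date-based shard id assignment: row = date + "_" + partition.
class ShardIdSketch {
    static String shardId(String yyyyMMdd, String recordUid, int partitions) {
        // floorMod keeps the partition non-negative even for negative hashes.
        int partition = Math.floorMod(recordUid.hashCode(), partitions);
        return yyyyMMdd + "_" + partition;
    }
}
```

The same record uid always lands in the same shard, and all shards for a date share the date prefix, which is what makes date-range query planning and age-off cheap.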
13. Metadata & Indexes
● Datawave Metadata
○ Tracks fields, data types, normalizers
○ (distinct from accumulo.metadata)
● Global Index
○ Index of data per shard, e.g: which shards contain the word ‘nutella’
● Field Index
○ Index of terms within a shard, e.g: which records in a shard contain ‘nutella’
● Term Index
○ Index of terms within a record, e.g: which fields contain ‘nutella’, and at which positions
[Diagram: the Metadata Table and Global Index Table alongside the Shard Table, which contains the Field Index, Term Index, Record Storage, and Data Storage.]
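Conceptually, the global index is an inverted index from term to shard, letting the planner restrict a scan to only the shards that can match. A toy in-memory version (not the actual Accumulo key layout):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy global index: maps each term to the set of shards containing it.
class ToyGlobalIndex {
    private final Map<String, Set<String>> termToShards = new TreeMap<>();

    // Record that a shard contains these terms.
    void index(String shard, List<String> terms) {
        for (String term : terms) {
            termToShards.computeIfAbsent(term, t -> new TreeSet<>()).add(shard);
        }
    }

    // Which shards could match a term? (empty set: prune the term entirely)
    Set<String> shardsFor(String term) {
        return termToShards.getOrDefault(term, new TreeSet<>());
    }
}
```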
14. Records & Edges
● Shard Record / Event Storage
○ All fields for a single object
○ Raw form across multiple keys
○ Used for object metadata display
● Shard Data Storage
○ All content for a single object in a single K/V pair
○ Lookup content by id
● Edges
○ Bipartite relationships between field values in a record
○ Allows all records with the same field values to be grouped together
○ Different density characteristics from Record storage
○ Supports iterative graph building
[Diagram: the Shard Table (Field Index, Term Index, Record Storage, Data Storage) with the Edge Table and Global Index Table alongside.]
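Edge creation can be sketched as emitting a source/sink pair from selected field values of each record. The field names and "A->B" encoding below are purely illustrative; real Datawave edge keys carry more structure (types, directions, attributes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy edge builder: for each record, emit a pair between two configured
// field values (e.g. a hypothetical FROM -> TO), so all records sharing
// those values group together in the Edge Table.
class ToyEdgeBuilder {
    static List<String> edges(List<Map<String, String>> records,
                              String sourceField, String sinkField) {
        List<String> out = new ArrayList<>();
        for (Map<String, String> record : records) {
            String source = record.get(sourceField);
            String sink = record.get(sinkField);
            if (source != null && sink != null) {
                out.add(source + "->" + sink);
            }
        }
        return out;
    }
}
```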
15. Query Abstractions
Query Syntax vs. Query Logic
● Syntax: How a query is ‘described’ to Datawave
○ E.g.: Lucene vs. JEXL
● Logic: How a query is ‘executed’ in Datawave
○ E.g.: which tables are used, what the plan is, what results are returned
■ EventQuery - record-oriented retrieval
■ LookupUUID - find records given an ID
■ EdgeQuery - find records given edge members and attributes
■ DiscoveryQuery - record counts by attribute
■ MetricsQuery - find query metrics
REST endpoints: Query/create, Query/next, Query/close
16. Analytics
● Implemented as Query Logics
○ Multi-step Query Execution: EdgeEventQuery
○ Iterative Query Execution: Graphs, etc.
■ Challenge: Implement ‘6-degrees of Kevin Bacon’ using Edges
● Implemented as MapReduce Jobs
○ See BulkResults / MapReduce API
○ Supports one-off jobs or Oozie workflows
■ Challenge: Implement an Apache Spark runner.
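The '6-degrees' challenge above amounts to iterated expansion over edge lookups: each hop is one query against the Edge Table for the current frontier's neighbors. A toy breadth-first sketch over an in-memory adjacency map (the real logic would issue per-hop edge queries instead):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy iterative graph build: starting from a seed vertex, repeatedly look up
// neighbors (one edge lookup per hop) until maxHops hops away.
class SixDegreesSketch {
    static Set<String> reachable(Map<String, List<String>> edges,
                                 String seed, int maxHops) {
        Set<String> visited = new HashSet<>();
        visited.add(seed);
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            Queue<String> next = new ArrayDeque<>();
            for (String vertex : frontier) {
                for (String neighbor : edges.getOrDefault(vertex, List.of())) {
                    if (visited.add(neighbor)) {   // true only on first visit
                        next.add(neighbor);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}
```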
18. Datawave Quickstart
A quick path to getting a working Datawave
● Single node / VirtualBox VM
● Downloads, installs, configures almost everything
○ Java, Maven, Hadoop, Zookeeper, Accumulo, Wildfly
○ Datawave Ingest / Datawave Web
● Test Framework, Sample Queries
● Test SSL Certs
○ Install client certs - See quickstart reference page
● Prepares you for the guided tour of ingest and query.
● The troubleshooting guide is your friend.
19. Quickstart Anatomy
A good way to understand how things fit together
● A general purpose framework for managing ‘services’
○ Java, Maven, Hadoop, Accumulo, Datawave
○ Defines the required materials for managing each service.
● .../contrib/datawave-quickstart/bin/env.sh
○ The entrypoint to all this
○ Registers shell functions to manage services
○ Getting around:
■ Tab-complete is your friend.
■ git grep is your friend
20. Ingesting Your Data into Datawave
1. Create a config.xml for your datatype (myNewType-config.xml)
Define how a given datatype should be processed.
Configure:
a. InputFormat (examples: JsonInputFormat, CSVInputFormat, custom)
b. IngestHelper
c. DataTypeHandlers
d. Fields to index
e. Fields to reverse index
f. Different field processors, special handling of certain fields
g. Virtual Fields
...
21. Ingesting Your Data into Datawave
2. Add the new datatype values to properties and specify which ingest flow:

BULK_INGEST_DATA_TYPES=type2,type3
LIVE_INGEST_DATA_TYPES=type1,type4,myNewType

3. Add to the corresponding FlagMaker.xml:

<flagCfg>
  <dataName>myNewType</dataName>
  <distributionArgs>none</distributionArgs>
  <extraIngestArgs>extra ingest args</extraIngestArgs>
  <folder>myNewFolderName</folder>
  <inputFormat>mySpecialInputFormatForThisNewType</inputFormat>
  <ingestPool>myNewIngestPool</ingestPool>
  <maxFlags>maximum number of flags for my new type</maxFlags>
</flagCfg>
22. Ingesting Your Data into Datawave
If existing classes don’t meet your needs:
4. Create an InputFormat + RecordReader (implementing EventRecordReader)
● Parses raw records from blocks to be passed as k,v pairs to the mapper
5. Create an IngestHelper (extending BaseIngestHelper)
● Parses field names and field values from a single raw record into a multimap of fields and values
6. Implement a DataTypeHandler
● Creates Accumulo entries from the multimap above
● ShardedDataTypeHandler performs indexing on the RawRecordContainer fields and creates entries for the shard and global index tables
● ContentIndexingColumnBasedHandler for tokenization
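The IngestHelper step (5) can be sketched as parsing one raw record into a multimap of field names to values. This is a simplified standalone illustration; the real BaseIngestHelper API and its CSV handling differ:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified ingest-helper step: turn one raw CSV record into a multimap of
// field name -> values, the shape a DataTypeHandler would consume to build
// Accumulo entries.
class CsvHelperSketch {
    static Map<String, List<String>> parse(String[] header, String rawRecord) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        String[] values = rawRecord.split(",");
        for (int i = 0; i < header.length && i < values.length; i++) {
            // Multi-valued fields use ';' as an intra-field separator here.
            for (String value : values[i].split(";")) {
                fields.computeIfAbsent(header[i], k -> new ArrayList<>()).add(value);
            }
        }
        return fields;
    }
}
```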
24. Other Things to Explore
This only really scratches the surface
● Cached Results
● Authorizations
● Administration
● Metrics
● Age-Off
Coming Soon
● Muchos: https://github.com/apache/fluo-muchos
25. Where to find it
Datawave:
http://code.nsa.gov/datawave
GitHub Project:
https://github.com/NationalSecurityAgency/datawave/
Questions and Pull Requests Welcome
29. Datawave?
● Storage & Retrieval engine built on Apache Accumulo
○ Table Layout & Key Structure for partitioned indexes, data properties
and storage.
● Performs Query Parsing, Planning and Execution
○ Extended Boolean query syntax with additional operators
○ JEXL query syntax with functions
○ Query execution shaped by data characteristics
○ Iterators to perform low level functions (e.g: Intersection)
● Provides an Ingest Workflow & MapReduce Framework
○ Job shaping using batch sizing for trading latency with throughput
● Exposes an API framework for Query, Analytics
○ REST API
● Supports a variety of data types and formats
○ Strings, Text, Numbers, IP Addresses, Geospatial Coordinates
30. Datawave Architecture
[Diagram: data sources enter via Ingress (e.g: HAProxy) and Dataflow (e.g: NiFi) and land in HDFS; the Datawave Flag Maker launches Datawave Ingest Jobs on YARN/MapReduce, feeding Datawave Ingest into Accumulo (with Datawave Iterators, coordinated by Zookeeper) on the warehouse side; Wildfly hosts Datawave Web, exposing Datawave Query through the Datawave REST API. Many web front-ends, many data sources, potentially independent ingest and warehouse clusters, and potentially multiple HDFS instances on the warehouse side.]
31. Datawave Ingest
● Provides Ingest workflow with MapReduce framework
○ Job shaping using batch sizing for trading latency with throughput
[Diagram: input files (text, image, html, csv, ...) land in HDFS; the Datawave Flag Maker groups them into M/R jobs, ranging from small low-latency jobs to large high-throughput jobs fed to the Datawave Bulk Loader.]
32. Capabilities of Note
Some unique properties of Datawave
- Accumulo gives us...
- Transparent migration of data across nodes
- Field-level markings for access and data management
- Datawave adds...
- Date-oriented sharding, supporting data age-off
- Multiple Modes of Ingest: Live, Bulk, Mixed.
Editor’s notes
When you put it all together, it looks something like this. Typically Datawave is fed using Apache NiFi to stage data on HDFS.