Introducing Datawave
Drew Farris - Hannah Pellon
Agenda
Getting started with Datawave
● Essential Knowledge
● Expectations:
○ What it is / What it isn’t
○ Use Cases / Not-so-great use cases
● What is this stuff? (High Level Architecture)
● Key Concepts
● Getting Started with DataWave Quick Start
● The good stuff (Ingest / Query / Table Structures)
● What to try next
Essential Knowledge
● Hadoop (HDFS, YARN, MapReduce)
● Accumulo (General Architecture, Iterators, Authorizations, Shell)
● ZooKeeper
● WildFly
Datawave Foundations
What is Datawave?
Storage & Retrieval engine built on Apache Accumulo
● Provides an Ingest Workflow & MapReduce Framework
● Exposes an API framework for Query, Analytics
● Performs Query Parsing, Planning and Execution
● Supports a variety of data types and formats
Setting Expectations
Datawave is not any of these things
● A SQL Database
○ No SQL language support, no ‘tables’ per se.
● A NoSQL Database
○ No generic put / get API operations a la S3 or MongoDB
● A Search Engine
○ Supports full-text retrieval but implements no relevance model
Note: Any of these could be implemented using Datawave, but
none were required for the use cases it currently supports
Datawave Overview
● Storage & Retrieval engine built on Apache Accumulo
[Figure: Datawave Ingest and Query. Inputs: various structured and unstructured data formats (text, image, html, csv, ...). Outputs: Analytics, Graph, Search.]
Datawave Architecture
[Figure: architecture diagram with an ingest side and a warehouse side. Labeled components: Ingress (HAProxy), Dataflow (NiFi), HDFS, YARN, MapReduce, ZooKeeper, Accumulo, WildFly, Datawave Flag Maker, Datawave Ingest, Datawave Ingest Job, Datawave Bulk Loader, Datawave Tables, Datawave Iterators, Datawave Query, Datawave Web, Datawave REST API. Inputs: text, html, image, csv, ...]
Many Data Sources / Multiple Hadoop Clusters / Multiple HDFS Instances / Many Web Front-ends
Datawave Query
● Query Parsing, Planning and Execution
○ Extended Boolean query syntax with additional operators
○ JEXL or Lucene query syntax with functions
○ Query execution shaped by data characteristics
○ Iterators to perform low-level functions (e.g., intersection)
○ REST API
● Example syntax
○ JEXL: FIELD == 'VALUE' AND FOO == 'BAR'
○ Lucene: field:value AND foo:bar
● REST endpoints
○ POST Query/create: parses, validates and optimizes the query; considers field cardinality, index, ...
○ GET Query/id/next: returns pages of results
○ POST Query/id/close: frees up resources
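The create / next / close lifecycle on this slide can be sketched in Python as follows. This is a minimal sketch, not the Datawave client: the `QuerySession` class, the injected `send` callable, the `query` parameter name and the `id` / `results` response fields are all assumptions made for illustration. Only the three endpoint paths and their HTTP methods come from the slide.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical transport: send(method, path, params) -> decoded response dict.
Send = Callable[[str, str, Optional[Dict[str, str]]], dict]


class QuerySession:
    """Walks one query through create -> next ... -> close."""

    def __init__(self, send: Send) -> None:
        self._send = send
        self.query_id: Optional[str] = None

    def create(self, query: str) -> str:
        # POST Query/create: the server parses, validates and optimizes the query.
        response = self._send("POST", "Query/create", {"query": query})
        self.query_id = response["id"]  # assumed response field
        return self.query_id

    def pages(self) -> Iterator[List[dict]]:
        # GET Query/<id>/next: one page of results per call.
        # An empty page is this sketch's end-of-results signal (an assumption).
        while True:
            page = self._send("GET", f"Query/{self.query_id}/next", None)
            results = page.get("results", [])
            if not results:
                return
            yield results

    def close(self) -> None:
        # POST Query/<id>/close: frees server-side resources.
        self._send("POST", f"Query/{self.query_id}/close", None)
        self.query_id = None


def run_query(send: Send, query: str) -> List[dict]:
    """Collect every result, closing the query even if paging fails."""
    session = QuerySession(send)
    session.create(query)
    try:
        return [record for page in session.pages() for record in page]
    finally:
        session.close()
```

Injecting `send` keeps the sketch independent of any HTTP library and of the real request and response formats, which should be taken from the Datawave documentation.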
Conceptual Models
Data Model
● Records
○ May be referred to as a RawRecordContainer, Event or Document.
○ A collection of fields and content data.
● Fields
○ One Record may have multiple fields. Fields can be multi-valued, with each value of the same type.
○ Fields can be indexed for query or simply stored.
○ Special fields include index-only, tokenized, and virtual fields.
● Field Types & Normalizers
○ A Normalizer converts raw field data into index entries.
○ Normalizers exist for numbers, IP addresses and geospatial coordinates, and produce lexically sortable values.
○ Fields are stored in both normalized and non-normalized form in Accumulo.
[Figure: Record, Field, Field Type, Normalizer]
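To make "lexically sortable" concrete, here is a small Python sketch of what a normalizer does. The function names and the zero-padding scheme are assumptions chosen for illustration; they are not Datawave's normalizer classes or its actual encodings.

```python
import ipaddress


def normalize_ipv4(address: str) -> str:
    """Zero-pad each octet so string order matches numeric order."""
    octets = ipaddress.IPv4Address(address).packed  # also validates the input
    return ".".join(f"{octet:03d}" for octet in octets)


def normalize_unsigned(value: int, width: int = 12) -> str:
    """Zero-pad a non-negative integer to a fixed width."""
    if value < 0 or len(str(value)) > width:
        raise ValueError("value out of range for this sketch")
    return f"{value:0{width}d}"


def normalize_text(value: str) -> str:
    """Lower-case and trim text so lookups are case-insensitive."""
    return value.strip().lower()
```

Sorting raw strings puts `10.0.0.20` before `10.0.0.3`; sorting by the normalized form restores numeric order, which is what a sorted key-value store needs for range scans.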
Tables, Indexes & Shards
● Overall Structure
○ Data is partitioned into shards
○ Each Shard is contained in a single tablet
○ Each Shard has its own Field & Term Index, Data & Record Storage
○ Shards are inherently date-based
○ Certain Field values are tracked in Edges
[Figure: the Shard Table holds multiple Shards, each with a Field Index, Term Index, Record Storage and Data Storage; shown alongside are the Global Index Table, Edge Table and Metadata Table.]
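The idea of date-based shards can be sketched as below. The `<yyyymmdd>_<n>` id format, the use of CRC32 and the `shards_per_day` parameter are assumptions for illustration only; consult the Datawave source for the real partitioning scheme.

```python
import zlib
from datetime import date
from typing import List


def shard_id(event_date: date, record_uid: str, shards_per_day: int) -> str:
    """Return a '<yyyymmdd>_<n>' partition id for a record."""
    # crc32 is used only because it is stable across processes;
    # Python's built-in hash() is salted per run.
    bucket = zlib.crc32(record_uid.encode("utf-8")) % shards_per_day
    return f"{event_date:%Y%m%d}_{bucket}"


def shards_for_day(event_date: date, shards_per_day: int) -> List[str]:
    """All shard ids a date-bounded query would have to consider for one day."""
    return [f"{event_date:%Y%m%d}_{n}" for n in range(shards_per_day)]
```

Because the date is the leading part of the id, a date-bounded query only has to consider the shards for the days in its range, and old data can be dropped by date.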
Metadata & Indexes
● Datawave Metadata
○ Tracks fields, data types, normalizers.
○ Distinct from accumulo.metadata.
● Global Index
○ Index of data per shard, e.g., which shards contain the word ‘nutella’
● Field Index
○ Index of terms within a shard, e.g., which records in the shard contain ‘nutella’
● Term Index
○ Index of terms within a record, e.g., which fields contain ‘nutella’, and at what position
[Figure: Shard Table (Field Index, Term Index, Record Storage, Data Storage), Metadata Table, Global Index Table]
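The three index levels narrow a lookup from shards, to records, to positions. The toy model below uses plain in-memory dictionaries to show that narrowing; the structures, the whitespace tokenizer and the function names are assumptions for illustration and say nothing about how the indexes are laid out in Accumulo.

```python
from collections import defaultdict

# Global index: term -> set of shard ids that contain it.
global_index = defaultdict(set)
# Field index: (shard, term) -> set of record ids in that shard.
field_index = defaultdict(set)
# Term index: (record, term) -> list of (field, position) occurrences.
term_index = defaultdict(list)


def index_record(shard, record_id, fields):
    """Tokenize each field on whitespace and update all three indexes."""
    for field, text in fields.items():
        for position, token in enumerate(text.lower().split()):
            global_index[token].add(shard)
            field_index[(shard, token)].add(record_id)
            term_index[(record_id, token)].append((field, position))


def lookup(term):
    """Narrow from shards, to records, to field positions."""
    term = term.lower()
    hits = {}
    for shard in sorted(global_index.get(term, ())):
        for record_id in sorted(field_index[(shard, term)]):
            hits[(shard, record_id)] = term_index[(record_id, term)]
    return hits
```

A lookup for a term that no shard contains stops at the global index and never touches the per-shard structures.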
Records & Edges
● Shard Record / Event Storage
○ All fields for a single object
○ Raw form across multiple keys.
○ Used for object metadata display.
● Shard Data Storage
○ All content for a single object in a single K/V pair
○ Lookup content by id.
● Edges
○ Bipartite Relationship between field values in a record.
○ Allows all records with the same field values to be grouped together.
○ Different density characteristics from Record storage.
○ Supports Iterative graph building
[Figure: Shard Table (Field Index, Term Index, Record Storage, Data Storage), Edge Table, Global Index Table]
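An edge, as described on this slide, links two field values that occur in the same record, and groups every record that shares that pair. A minimal Python sketch follows; the `build_edges` function and the record layout are hypothetical, and real edge definitions are configured in Datawave rather than computed this way.

```python
from collections import defaultdict
from itertools import product


def build_edges(records, source_field, sink_field):
    """Map (source value, sink value) -> ids of records containing both."""
    edges = defaultdict(set)
    for record_id, fields in records.items():
        sources = fields.get(source_field, [])
        sinks = fields.get(sink_field, [])
        for source, sink in product(sources, sinks):
            edges[(source, sink)].add(record_id)
    return dict(edges)
```

Many records can collapse into a single edge, which is one way to read the slide's point that edges have different density characteristics from record storage.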
Query Abstractions
Query Syntax vs. Query Logic
● Syntax: How a query is ‘described’ to Datawave
○ E.g.: Lucene vs. JEXL
● Logic: How a query is ‘executed’ in Datawave
○ E.g.: Which tables, what the plan is, what results?
■ EventQuery - record oriented retrieval
■ LookupUUID - find records given an ID
■ EdgeQuery - find records given edge members and attributes
■ DiscoveryQuery - record counts by attribute
■ MetricsQuery - find query metrics
[Figure: Query/create, Query/next, Query/close]
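Syntax is only how the query is described, so the same query can be written in more than one form. The toy translator below turns the simplest Lucene-style terms into the JEXL-style form shown earlier in the deck. It is an illustrative sketch, not Datawave's parser: it handles only bare `field:value` terms joined by AND / OR, and upper-casing the field name simply mirrors the slide's example.

```python
import re

# Matches a single field:value term with no spaces or quoting.
_TERM = re.compile(r"^(\w+):(\w+)$")
_OPERATORS = {"AND", "OR"}


def lucene_to_jexl(query: str) -> str:
    """Translate 'field:value AND foo:bar' into FIELD == 'value' form."""
    parts = []
    for token in query.split():
        if token in _OPERATORS:
            parts.append(token)
            continue
        match = _TERM.match(token)
        if match is None:
            raise ValueError(f"unsupported token: {token!r}")
        field, value = match.groups()
        parts.append(f"{field.upper()} == '{value}'")
    return " ".join(parts)
```

Which tables are read and what comes back is decided separately, by the query logic named when the query is created.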
Analytics
● Implemented as Query Logics
○ Multi-step Query Execution: EdgeEventQuery
○ Iterative Query Execution: Graphs, etc.
■ Challenge: Implement ‘six degrees of Kevin Bacon’ using Edges
● Implemented as MapReduce Jobs
○ See BulkResults / MapReduce API
○ Supports one-off jobs or Oozie workflows
■ Challenge: Implement an Apache Spark runner.
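As a starting point for the Kevin Bacon challenge, iterative graph building amounts to a breadth-first search in which each round expands the frontier by one hop. The sketch below runs over an in-memory edge list; in Datawave each round would presumably be one or more edge queries, which is an assumption and not something the slides specify.

```python
from collections import defaultdict, deque


def degrees_of_separation(edges, start, target):
    """Breadth-first search over an undirected edge list; None if unreachable."""
    neighbours = defaultdict(set)
    for left, right in edges:
        neighbours[left].add(right)
        neighbours[right].add(left)

    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == target:
            return hops
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            frontier.append((nxt, hops + 1))
    return None
```

With actor-to-movie edges the graph is bipartite, so two actors who share a film are two hops apart; halve the hop count to get the usual "degrees" figure.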
Getting Started
Datawave Quickstart
A quick path to getting a working DataWave
● Single node / VirtualBox VM
● Downloads, installs, configures almost everything
○ Java, Maven, Hadoop, ZooKeeper, Accumulo, WildFly
○ Datawave Ingest / Datawave Web
● Test Framework, Sample Queries
● Test SSL Certs
○ Install client certs - See quickstart reference page
● Prepares you for the guided tour of ingest and query.
● The troubleshooting guide is your friend.
Quickstart Anatomy
A good way to understand how things fit together
● A general purpose framework for managing ‘services’
○ Java, Maven, Hadoop, Accumulo, Datawave
○ Defines the required materials for managing each service.
● .../contrib/datawave-quickstart/bin/env.sh
○ The entrypoint to all this
○ Registers shell functions to manage services
○ Getting around:
■ Tab-complete is your friend.
■ git grep is your friend
Ingesting Your Data into Datawave
1. Create a config.xml for your datatype (myNewType-config.xml)
This defines how a given datatype should be processed.
Configure:
a. InputFormat (examples: JsonInputFormat, CSVInputFormat, custom)
b. IngestHelper
c. DataTypeHandlers
d. Fields to index
e. Fields to reverse index
f. Different field processors, special handling of certain fields
g. Virtual Fields
...
2. Add the new datatype to the ingest properties and specify which ingest flow it uses
BULK_INGEST_DATA_TYPES=type2,type3
LIVE_INGEST_DATA_TYPES=type1,type4,myNewType
3. Add it to the corresponding FlagMaker.xml
<flagCfg>
    <dataName>myNewType</dataName>
    <distributionArgs>none</distributionArgs>
    <extraIngestArgs>extra ingest args</extraIngestArgs>
    <folder>myNewFolderName</folder>
    <inputFormat>mySpecialInputFormatForThisNewType</inputFormat>
    <ingestPool>myNewIngestPool</ingestPool>
    <maxFlags>maximum number of flags for my new type</maxFlags>
</flagCfg>
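Steps 2 and 3 have to agree: the `dataName` in the flag configuration should be a datatype registered in one of the two property lists. The Python sketch below checks that consistency. It is a hypothetical helper, not a Datawave tool; it reuses the property names and XML elements from the slide, and the trimmed `FLAG_CFG` sample is for illustration only.

```python
import xml.etree.ElementTree as ET

PROPERTIES = """\
BULK_INGEST_DATA_TYPES=type2,type3
LIVE_INGEST_DATA_TYPES=type1,type4,myNewType
"""

FLAG_CFG = """\
<flagCfg>
  <dataName>myNewType</dataName>
  <folder>myNewFolderName</folder>
  <ingestPool>myNewIngestPool</ingestPool>
</flagCfg>
"""


def ingest_flows(properties_text):
    """Map each datatype to 'bulk' or 'live' based on the two property lists."""
    flows = {}
    for line in properties_text.splitlines():
        key, _, value = line.partition("=")
        flow = {"BULK_INGEST_DATA_TYPES": "bulk",
                "LIVE_INGEST_DATA_TYPES": "live"}.get(key.strip())
        if flow is None:
            continue
        for datatype in filter(None, (v.strip() for v in value.split(","))):
            flows[datatype] = flow
    return flows


def flow_for_flag_cfg(flag_cfg_xml, properties_text):
    """Return the ingest flow for the flagCfg's dataName, or None if unregistered."""
    data_name = ET.fromstring(flag_cfg_xml).findtext("dataName")
    return ingest_flows(properties_text).get(data_name)
```

For the sample inputs, `flow_for_flag_cfg(FLAG_CFG, PROPERTIES)` reports that `myNewType` goes through the live flow; a `None` result would mean step 2 was skipped for that datatype.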
If existing classes don’t meet your needs:
4. Create an InputFormat + RecordReader (implements EventRecordReader)
● Parses raw records from blocks, to be passed as key/value pairs to the mapper
5. Create an IngestHelper (extends BaseIngestHelper)
● Parses field names and field values from a single raw record into a Multimap of fields and values
6. Implement a DataTypeHandler
● Creates Accumulo entries from the Multimap above
● ShardedDataTypeHandler performs indexing on the RawRecordContainer fields and creates entries for the shard and global index tables
● ContentIndexingColumnBasedHandler handles tokenization
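The division of labor between the three custom classes can be sketched as three small Python functions, one per role. Everything here is a simplified assumption for illustration: the `NAME=value;NAME=value` record format, the function names, and especially the table names and key layout in `to_entries`, which are stand-ins and should not be read as Datawave's actual schema. The real classes are Java and plug into MapReduce.

```python
from collections import defaultdict

NUL = "\x00"  # separator used inside keys in this sketch


def read_records(block: str):
    """RecordReader role: split a raw block into one raw record per line."""
    for line in block.splitlines():
        if line.strip():
            yield line


def parse_fields(raw_record: str):
    """IngestHelper role: 'NAME=value;NAME=value' -> multimap of field -> values."""
    multimap = defaultdict(list)
    for pair in raw_record.split(";"):
        name, _, value = pair.partition("=")
        multimap[name.strip().upper()].append(value.strip())
    return multimap


def to_entries(shard, datatype, uid, multimap, indexed_fields):
    """DataTypeHandler role: emit (table, row, column family, column qualifier)."""
    entries = []
    for field, values in multimap.items():
        for value in values:
            # Record storage: every field of the record, under its shard.
            entries.append(("shard", shard, f"{datatype}{NUL}{uid}", f"{field}{NUL}{value}"))
            if field in indexed_fields:
                # Field index within the shard, plus a global index entry.
                entries.append(("shard", shard, f"fi{NUL}{field}",
                                f"{value}{NUL}{datatype}{NUL}{uid}"))
                entries.append(("shardIndex", value, field, f"{shard}{NUL}{datatype}"))
    return entries
```

Only fields listed in `indexed_fields` produce index entries, which mirrors the "fields to index" choice made in the datatype's config.xml.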
Next Steps
Other Things to Explore
This only really scratches the surface
● Cached Results
● Authorizations
● Administration
● Metrics
● Age-Off
Coming Soon
● Muchos: https://github.com/apache/fluo-muchos
Where to find it
Datawave:
http://code.nsa.gov/datawave
GitHub Project:
https://github.com/NationalSecurityAgency/datawave/
Questions and Pull Requests Welcome
Thanks!
Backup Slides
EventMapper Anatomy: Raw Records to K,V Pairs
Datawave?
● Storage & Retrieval engine built on Apache Accumulo
○ Table Layout & Key Structure for partitioned indexes, data properties and storage.
● Performs Query Parsing, Planning and Execution
○ Extended Boolean query syntax with additional operators
○ JEXL query syntax with functions
○ Query execution shaped by data characteristics
○ Iterators to perform low-level functions (e.g., intersection)
● Provides an Ingest Workflow & MapReduce Framework
○ Job shaping using batch sizing to trade latency against throughput
● Exposes an API framework for Query, Analytics
○ REST API
● Supports a variety of data types and formats
○ Strings, Text, Numbers, IP Addresses, Geospatial Coordinates
Datawave Architecture
[Figure: architecture diagram with an ingest side and a warehouse side. Labeled components: Ingress (e.g., HAProxy), Dataflow (e.g., NiFi), HDFS, YARN, MapReduce, ZooKeeper, Accumulo, WildFly, Datawave Flag Maker, Datawave Ingest, Datawave Ingest Job, Datawave Iterators, Datawave Query, Datawave Web, Datawave REST API.]
● Many web front-ends
● Many data sources
● Potentially independent Ingest and Warehouse clusters
● Potentially multiple HDFS instances on the Warehouse side
Datawave Ingest
● Provides an Ingest workflow with a MapReduce framework
○ Job shaping using batch sizing to trade latency against throughput
[Figure: input files (text, image, html, csv, ...), HDFS, Datawave Flag Maker, M/R Jobs, Datawave Bulk Loader; scale from Low Latency to High Throughput]
Capabilities of Note
Some unique properties of Datawave
● Accumulo gives us...
○ Transparent migration of data across nodes
○ Field-level markings for access and data management
● Datawave adds...
○ Date-oriented sharding, supporting data age-off
○ Multiple modes of ingest: Live, Bulk, Mixed.

Editor's notes

  6. When you put it all together, it looks something like this. Typically, Datawave is fed using Apache NiFi to stage data on HDFS.