Eduardo Alonso
eduardoalonso@stratio.com
Andrés de la Peña
andres@stratio.com
GEOSPATIAL AND BITEMPORAL SEARCH IN C* WITH PLUGGABLE LUCENE INDEX
@a_de_la_pena @eAlonsoDB
WHO WE ARE
•  Stratio is a Big Data Company
•  Certified Spark distribution
•  Founded in 2013
•  120+ employees in Madrid
•  Offices in Madrid and San Francisco
#CassandraSummit 2015
CONTENTS
1. Pluggable Lucene based 2i
2. Geospatial Search
3. Bitemporal Indexes
PLUGGABLE LUCENE 2i
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges plotted by throughput against expressiveness]
Cassandra query methods by use case
[Figure: primary key, secondary indexes and token ranges grouped under the "Real time" and "Analytics" use cases]
Cassandra query methods trade offs
Real time:
•  Pure-range queries limited to partition
•  No Boolean logic
•  No Full text search
•  Sorting limited to partition
Analytics:
•  Full-table scan
•  High load
•  High latency
•  Low concurrency
A third use case
[Figure: "Search" shown as a use case between "Real time" and "Analytics"]
•  Not as fast as primary key queries
•  Not as expressive as map reduce
•  Search can be used for both cases
CQL + Lucene
A Lucene based secondary index implementation
•  Proven stable and fast indexing solution
•  Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
•  Mature distributed search solutions built on top of it
- Solr, ElasticSearch
•  Just a small embeddable library
•  Easily extensible
•  Published under the Apache License
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges mapped to the "Real time", "Search" and "Analytics" use cases]
Real time:
•  Low expressiveness
•  Low latency
•  Low load
Search:
•  Mid expressiveness
•  Mid latency
•  Mid load
Analytics:
•  High expressiveness
•  High latency
•  High load
A Lucene based secondary index implementation
[Diagram: a client connected to three C* nodes, each shown with its own Lucene index and JVM]
•  Each node indexes its own data
•  Keep P2P architecture
•  Distribution and replication managed by C*
•  Just a single pluggable JAR file
- CASSANDRA-8717
Create index
CREATE TABLE tweets (
    id bigint,
    created timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, created, id)
);
ALTER TABLE tweets ADD lucene TEXT;
CREATE CUSTOM INDEX tweets_idx ON tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '10',
    'schema' : '{
        fields : {
            created : {type : "date", pattern : "yyyy-MM-dd"},
            message : {type : "text", analyzer : "english"},
            userid : {type : "string"},
            username : {type : "string"}
        }
    }'
};
•  Built in the background at any moment
•  Real time updates
•  Mapping eases ETL
•  Language aware
Searching for rows
SELECT * FROM tweets WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {type : "range", field : "created", lower : "2015-01-01"},
            {type : "wildcard", field : "username", value : "a*"}
        ],
        not : [
            {type : "match", field : "username", value : "andres"}
        ]
    },
    sort : {
        fields: [
            {field : "created", reverse : true},
            {field : "username", reverse : false}
        ]
    }
}' LIMIT 10000;
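Writing the search JSON by hand inside a CQL string is error-prone, so a client can assemble it from plain data structures instead. The following is a minimal sketch, not part of the plugin: `build_search` and `to_cql` are hypothetical helper names, the field names are taken from the columns of the tweets table, and it assumes the index accepts strict JSON (quoted keys) as well as the relaxed syntax used on the slides.

```python
import json


def build_search(filter_clause, sort_fields=None):
    """Return the JSON text to place in the `lucene = '...'` condition."""
    search = {"filter": filter_clause}
    if sort_fields:
        search["sort"] = {"fields": sort_fields}
    return json.dumps(search)


def to_cql(table, search_json, limit):
    """Embed the JSON in a CQL statement, doubling single quotes for the CQL literal."""
    escaped = search_json.replace("'", "''")
    return "SELECT * FROM {} WHERE lucene = '{}' LIMIT {};".format(table, escaped, limit)


# Same boolean filter and sort as the slide, using the tweets table columns.
search_json = build_search(
    {
        "type": "boolean",
        "must": [
            {"type": "range", "field": "created", "lower": "2015-01-01"},
            {"type": "wildcard", "field": "username", "value": "a*"},
        ],
        "not": [{"type": "match", "field": "username", "value": "andres"}],
    },
    sort_fields=[
        {"field": "created", "reverse": True},
        {"field": "username", "reverse": False},
    ],
)
statement = to_cql("tweets", search_json, 10000)
print(statement)
```

With a real driver it is usually safer to bind the JSON text as a query parameter than to concatenate it into the statement.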
Integrating Lucene & Spark
[Diagram: a client and a Spark master on top of three C* nodes, each with its own Lucene index]
•  Compute large amounts of data
•  Filtering push-down
•  Avoid systematic full scan
•  Reduces the amount of data to be processed
Index performance in Spark
[Figure: time in seconds against millions of collected rows, with one series for the index and one for the full scan]
SPATIAL SEARCH
Lucene spatial module
•  Spatial4J shapes
-  Points, rectangles, circles, etc.
•  Spatial search strategies
-  BBox, RecursivePrefixTree, PointVector, etc.
•  Not only geographical data
-  Numbers, dates
•  It can be combined with other searches
Indexing geographical locations
CREATE TABLE restaurants (
    name text PRIMARY KEY,
    stars bigint,
    lat double,
    lon double,
    lucene text
);
CREATE CUSTOM INDEX restaurants_idx
ON restaurants (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            location : {
                type : "geo_point",
                latitude : "lat",
                longitude : "lon"
            },
            stars : {type : "integer"}
        }
    }'
};
•  No native shape data types in CQL
•  Many-to-one column mapping
•  Just points. For now.
Bounding box search
SELECT * FROM restaurants
WHERE lucene = '{
    filter : {
        type : "geo_bbox",
        field : "location",
        min_latitude : 40.425978,
        max_latitude : 40.445886,
        min_longitude : -3.808252,
        max_longitude : -3.770999
    }
}';
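To make the meaning of the four bounding box parameters concrete, here is a small sketch of the containment test they describe. It is only an illustration of the semantics, not how the index evaluates the filter (that is done by Lucene's spatial module); `in_bbox` is a hypothetical helper, and the inclusive edges and the "box does not cross the 180° meridian" restriction are assumptions.

```python
def in_bbox(lat, lon, min_latitude, max_latitude, min_longitude, max_longitude):
    """True when the point lies inside the box, edges included.

    Assumes the box does not cross the 180 degree meridian.
    """
    return (min_latitude <= lat <= max_latitude
            and min_longitude <= lon <= max_longitude)


# Box from the query above.
box = dict(min_latitude=40.425978, max_latitude=40.445886,
           min_longitude=-3.808252, max_longitude=-3.770999)

inside = in_bbox(40.443270, -3.800498, **box)   # a point inside the box
outside = in_bbox(40.500000, -3.800498, **box)  # latitude above max_latitude
print(inside, outside)
```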
Distance search
SELECT * FROM restaurants
WHERE lucene = '{
    filter : {
        type : "geo_distance",
        field : "location",
        latitude : 40.443270,
        longitude : -3.800498,
        min_distance : "100m",
        max_distance : "2km"
    }
}';
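A distance filter keeps the points that fall in a ring around the reference location. The sketch below illustrates that idea with the haversine great-circle formula. It is an assumption-laden illustration rather than the plugin's implementation: `haversine_m` and `within_distance` are hypothetical helpers, the Earth is treated as a sphere, and both bounds are taken as inclusive.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_distance(lat, lon, center_lat, center_lon, min_m, max_m):
    """True when the point is between min_m and max_m metres from the centre."""
    return min_m <= haversine_m(lat, lon, center_lat, center_lon) <= max_m


center = (40.443270, -3.800498)  # reference point from the query above
# A point 0.01 degrees further north is roughly 1.1 km away.
print(within_distance(40.453270, -3.800498, *center, min_m=100, max_m=2000))
```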
Combining geospatial searches
SELECT * FROM restaurants WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {
                type : "geo_distance",
                field : "location",
                latitude : 40.443270,
                longitude : -3.800498,
                max_distance : "10km"
            },
            {
                type : "range",
                field : "stars",
                lower : 2,
                upper : 4
            }
        ]
    }
}';
Lucene spatial is not only geospatial…
•  General geometry
•  Numeric ranges
-  NumberRangePrefixTree
•  Date ranges/durations
-  DateRangePrefixTree
Temporal/Date durations
•  A pair composed of a start date and a stop date
-  Can be indexed as points in a 2D space
•  David Smiley's DateRangePrefixTree
-  Levels for common date-ranges: years, months, days…
-  Spatial operations: intersects, is_within, contains
[Figure: a date range from 27 Nov 2015 to 29 Dec 2015 illustrating the intersects, is_within and contains operations]
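The three operations can be read as simple interval predicates. The sketch below spells them out for closed date intervals; it assumes the usual Lucene spatial convention that each operation describes the indexed range relative to the query range (for example, `is_within` means the indexed range lies inside the query range). The function names mirror the operation names but are otherwise hypothetical, and this is not the prefix-tree algorithm itself.

```python
from datetime import date


def intersects(idx_from, idx_to, q_from, q_to):
    """The indexed range and the query range share at least one day."""
    return idx_from <= q_to and q_from <= idx_to


def is_within(idx_from, idx_to, q_from, q_to):
    """The indexed range lies completely inside the query range."""
    return q_from <= idx_from and idx_to <= q_to


def contains(idx_from, idx_to, q_from, q_to):
    """The indexed range completely covers the query range."""
    return idx_from <= q_from and q_to <= idx_to


# Indexed range from the figure: 27 Nov 2015 to 29 Dec 2015.
indexed = (date(2015, 11, 27), date(2015, 12, 29))

print(intersects(*indexed, date(2015, 12, 20), date(2016, 1, 10)))
print(is_within(*indexed, date(2015, 11, 1), date(2015, 12, 31)))
print(contains(*indexed, date(2015, 12, 1), date(2015, 12, 10)))
```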
Indexing date ranges
CREATE TABLE breakdowns (
    system text PRIMARY KEY,
    cause text,
    start_date timestamp,
    stop_date timestamp,
    lucene text
);
CREATE CUSTOM INDEX breakdowns_idx
ON breakdowns (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            duration : {
                type : "date_range",
                from : "start_date",
                to : "stop_date",
                pattern : "yyyy-MM-dd"
            },
            cause : {type : "string"}
        }
    }'
};
•  No native date range type in CQL
•  Many-to-one column mapping
•  Spatial operations
Searching for date ranges
SELECT * FROM breakdowns
WHERE lucene = '{
    filter : {
        type : "date_range",
        field : "duration",
        from : "2015-01-01",
        to : "2015-01-05",
        operation : "intersects"
    }
}';
SELECT * FROM breakdowns
WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {
                type : "date_range",
                field : "duration",
                from : "2015-01-01",
                to : "2015-01-05",
                operation : "is_within"
            },
            {
                type : "match",
                field : "cause",
                value : "human error"
            }
        ]
    }
}';
INDEXING BITEMPORAL DATA
The bitemporal data model
•  Stores WHAT and WHEN
•  Support for corrections
•  Reproducible business perspective history at a point in time
•  Trace why a decision was made
•  Valid Time
- The application period
- WHAT happened, the real time fact period
•  Transaction Time
- The system period
- WHEN the system considers it true
The bitemporal data model: example
person  city        vt_from      vt_to        tt_from      tt_to
John    Smallville  3-Apr-1975   ∞            4-Apr-1975   26-Dec-1994
John    Smallville  3-Apr-1975   25-Aug-1994  27-Dec-1994  ∞
John    Bigtown     26-Aug-1994  ∞            27-Dec-1994  1-Feb-2001
John    Bigtown     26-Aug-1994  30-May-1995  2-Feb-2001   ∞
John    Beachy      1-Jun-1995   3-Sep-2000   2-Feb-2001   ∞
John    Bigtown     3-Sep-2000   ∞            2-Feb-2001   31-Mar-2001
John    Mediumtown  1-Apr-2001   ∞            1-Apr-2001   ∞
Modified example from Wikipedia
https://en.wikipedia.org/wiki/Temporal_database
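To see how the two time axes interact, the example rows can be loaded into plain Python and filtered on both periods at once. This is a sketch of the data model only, under stated assumptions: `date.max` stands in for the open-ended ∞ value, a row matches when both its valid-time and transaction-time periods overlap the queried ones, and `cities` is a hypothetical helper that has nothing to do with how the index evaluates queries.

```python
from datetime import date

INF = date.max  # stands in for the open-ended value shown as ∞

# (person, city, vt_from, vt_to, tt_from, tt_to), copied from the example table.
ROWS = [
    ("John", "Smallville", date(1975, 4, 3), INF, date(1975, 4, 4), date(1994, 12, 26)),
    ("John", "Smallville", date(1975, 4, 3), date(1994, 8, 25), date(1994, 12, 27), INF),
    ("John", "Bigtown", date(1994, 8, 26), INF, date(1994, 12, 27), date(2001, 2, 1)),
    ("John", "Bigtown", date(1994, 8, 26), date(1995, 5, 30), date(2001, 2, 2), INF),
    ("John", "Beachy", date(1995, 6, 1), date(2000, 9, 3), date(2001, 2, 2), INF),
    ("John", "Bigtown", date(2000, 9, 3), INF, date(2001, 2, 2), date(2001, 3, 31)),
    ("John", "Mediumtown", date(2001, 4, 1), INF, date(2001, 4, 1), INF),
]


def overlaps(a_from, a_to, b_from, b_to):
    """True when two closed periods share at least one day."""
    return a_from <= b_to and b_from <= a_to


def cities(rows, vt_from, vt_to, tt_from, tt_to):
    """Cities whose valid-time and transaction-time periods both overlap the query."""
    return [city
            for _, city, r_vt_from, r_vt_to, r_tt_from, r_tt_to in rows
            if overlaps(r_vt_from, r_vt_to, vt_from, vt_to)
            and overlaps(r_tt_from, r_tt_to, tt_from, tt_to)]


# What the system currently believes about where John lives right now.
print(cities(ROWS, INF, INF, INF, INF))
# What the system currently believes about 1999.
print(cities(ROWS, date(1999, 1, 1), date(1999, 12, 31), INF, INF))
# What the system believed on 1 Jan 2000 about 1999.
print(cities(ROWS, date(1999, 1, 1), date(1999, 12, 31), date(2000, 1, 1), date(2000, 1, 1)))
```

The third query differs from the second only in transaction time, which is what lets a bitemporal model reproduce what was believed before a later correction.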
A naïve approach
Using 4 dates
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            vt_from : { type : "date", pattern : "yyyyMMdd" },
            vt_to : { type : "date", pattern : "yyyyMMdd" },
            tt_from : { type : "date", pattern : "yyyyMMdd" },
            tt_to : { type : "date", pattern : "yyyyMMdd" }
        }
    }'
};
A naive approach
SELECT * FROM census WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            { type : "boolean",
              should : [
                { type : "range", field : "vt_from", lower : "", upper : "",
                  include_lower : true, include_upper : true },
                { type : "range", field : "vt_to", lower : "", upper : "",
                  include_lower : true, include_upper : true },
                { type : "boolean",
                  must : [
                    { type : "range", field : "vt_from", upper : "", include_upper : true },
                    { type : "range", field : "vt_to", lower : "", include_lower : true }
                  ] }
              ] },
            { type : "boolean",
              should : [
                { type : "range", field : "tt_from", lower : "", upper : "",
                  include_lower : true, include_upper : true },
                { type : "range", field : "tt_to", lower : "", upper : "",
                  include_lower : true, include_upper : true },
                { type : "boolean",
                  must : [
                    { type : "range", field : "tt_from", upper : "", include_upper : true },
                    { type : "range", field : "tt_to", lower : "", include_lower : true }
                  ] }
              ] }
        ]
    }
}' AND person = 'John';
A naive approach: Issues
•  Very difficult to understand/build the query.
•  Now value (∞) using Long.MAX_VALUE is costly.
A spatial approach
Using 2 date ranges
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'schema' : '{
        fields : {
            vt : {
                type : "date_range", pattern : "yyyyMMdd",
                from : "vt_from", to : "vt_to"
            },
            tt : {
                type : "date_range", pattern : "yyyyMMdd",
                from : "tt_from", to : "tt_to"
            }
        }
    }'
};
A spatial approach
SELECT * FROM census WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {
                type : "date_range", field : "vt",
                from : "20150501", to : "99999999",
                operation : "intersects"
            },
            {
                type : "date_range", field : "tt",
                from : "20150501", to : "99999999",
                operation : "intersects"
            }
        ]
    }
}';
A spatial approach: performance issues
•  Very difficult to understand/build the query.
•  Now value (∞) using Long.MAX_VALUE is costly.
4R-Tree to the rescue
•  Based on Bliujute, R., Jensen, C. S., & Slivinskas, G. (2000). Light-weight indexing of general bitemporal data
•  The Now Value is never stored.
•  The data is stored in 4 R-Trees.
•  Queries are transformed and distributed among the trees.
[Figure: the shapes stored in the four trees]
Point(vt_from, tt_from)
Line(vt_from, vt_to, tt_to)
Rectangle(vt_from, vt_to, tt_from, tt_to)
Line(vt_from, vt_to, tt_to)
4R-Tree to the rescue: storing data
R1: TT_TO==NOW && VT_TO==NOW
R2: TT_TO==NOW && VT_TO!=NOW
R3: TT_TO!=NOW && VT_TO==NOW
R4: TT_TO!=NOW && VT_TO!=NOW
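The storage rule amounts to routing each row to one of four trees depending on which of its end dates are still open. A minimal sketch of that routing follows, assuming the four conditions map to R1 through R4 in the order they are listed, and using the string "99999999" as a stand-in for the NOW marker; `choose_tree` is a hypothetical name, not a function of the plugin.

```python
NOW = "99999999"  # assumed marker for an open-ended ("now") end date


def choose_tree(vt_to, tt_to):
    """Pick the tree for a row from its valid-time and transaction-time end dates."""
    if tt_to == NOW and vt_to == NOW:
        return "R1"
    if tt_to == NOW:
        return "R2"  # transaction time still open, valid time closed
    if vt_to == NOW:
        return "R3"  # valid time still open, transaction time closed
    return "R4"      # both periods closed


# End dates in yyyyMMdd form, as (vt_to, tt_to).
print(choose_tree(NOW, NOW))                # both open
print(choose_tree("20000903", NOW))         # only valid time closed
print(choose_tree(NOW, "20010201"))         # only transaction time closed
print(choose_tree("19950530", "20010201"))  # both closed
```

Because the open end is implied by the tree a row lives in, the NOW marker never has to be indexed as a very large date.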
4R-Tree to the rescue: searching data
IF (TT_FROM != NOW) && (TT_TO >= VT_FROM):
    searchR1(0, TT_TO, 0, VT_TO) U
    searchR2(0, TT_TO, VT_FROM, VT_TO) U
    searchR3(max(TT_FROM, VT_FROM), TT_TO, 0, VT_TO) U
    searchR4(TT_FROM, TT_TO, VT_FROM, VT_TO)
IF (TT_FROM != NOW) && (TT_TO < VT_FROM):
    searchR2(0, TT_TO, VT_FROM, VT_TO) U
    searchR4(TT_FROM, TT_TO, VT_FROM, VT_TO)
IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] ≠ [0, MAX]) && (TT_TO >= VT_FROM):
    searchR1(0, TT_TO, 0, VT_TO) U
    searchR2(0, TT_TO, VT_FROM, VT_TO)
IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] ≠ [0, MAX]) && (TT_TO < VT_FROM):
    searchR2(0, TT_TO, VT_FROM, VT_TO)
IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] = [0, MAX]):
    R1 U R2
4R-Tree to the rescue:
•  Problem!!! Lucene does not have support for R-Tree
•  Our Solution:
- Use 2 DateRangePrefixTrees for each R-Tree
•  Future Work: Experiment with other Lucene spatial trees and strategies.
Indexing bitemporal data
CREATE TABLE census (
    person text,
    city text,
    vt_from text,
    vt_to text,
    tt_from text,
    tt_to text,
    lucene text,
    PRIMARY KEY ((person), vt_from, tt_from)
);
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'schema' : '{
        fields : {
            bitemporal : {
                type : "bitemporal",
                vt_from : "vt_from",
                vt_to : "vt_to",
                tt_from : "tt_from",
                tt_to : "tt_to",
                pattern : "yyyyMMdd",
                now_value : "99999999"
            },
            city : { type : "string" }
        }
    }'
};
Searching for bitemporal data, several queries
Where does the system currently think that John lives right now?
SELECT * FROM census WHERE lucene = '{
    filter : {
        type : "bitemporal",
        field : "bitemporal",
        vt_from : "99999999",
        vt_to : "99999999",
        tt_from : "99999999",
        tt_to : "99999999"
    }
}' AND person = 'John';
person  city        vt_from     vt_to  tt_from     tt_to
John    Mediumtown  1-Apr-2001  ∞      1-Apr-2001  ∞
Searching for bitemporal data
Where does the system currently think that John lived in 1999?
SELECT * FROM census WHERE lucene = '{
    filter : {
        type : "bitemporal",
        field : "bitemporal",
        vt_from : "19990101",
        vt_to : "19991231",
        tt_from : "99999999",
        tt_to : "99999999"
    }
}' AND person = 'John';
person  city    vt_from     vt_to       tt_from     tt_to
John    Beachy  1-Jun-1995  3-Sep-2000  2-Feb-2001  ∞
Searching for bitemporal data
On 01-Jan-2000, where did the system think John was living back in 1999?
SELECT * FROM census WHERE lucene = '{
    filter : {
        type : "bitemporal",
        field : "bitemporal",
        vt_from : "19990101",
        vt_to : "19991231",
        tt_from : "20000101",
        tt_to : "20000101"
    }
}' AND person = 'John';
person  city     vt_from      vt_to  tt_from      tt_to
John    Bigtown  26-Aug-1994  ∞      27-Dec-1994  1-Feb-2001
Searching for bitemporal data
Who currently lives in Smallville?
SELECT * FROM census WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            { type : "bitemporal", field : "bitemporal",
              vt_from : "99999999", vt_to : "99999999",
              tt_from : "99999999", tt_to : "99999999"
            },
            { type : "match",
              field : "city",
              value : "Smallville" }
        ]
    }
}';
CONCLUSIONS
•  Pluggable Lucene features in Cassandra
•  Basic geospatial features
•  Date/Time durations
•  Bitemporal data model indexing
•  Compatible with MapReduce frameworks
•  Preserves Cassandra's functionality
It's open source
github.com/stratio/cassandra-lucene-index
•  Published as a plugin for Apache Cassandra
•  Apache License Version 2.0
BIG DATA CHILD'S PLAY
Andrés de la Peña
andres@stratio.com
@a_de_la_pena
Eduardo Alonso
eduardoalonso@stratio.com
@eAlonsoDB

Geospatial and bitemporal search in cassandra with pluggable lucene index

  • 1. 1 Eduardo Alonso eduardoalonso@stratio.com Andrés de la Peña andres@stratio.com GEOSPATIAL AND BITEMPORAL SEARCH IN C* WITH PLUGGABLE LUCENE INDEX @a_de_la_pena @eAlonsoDB
  • 2. •  Stratio is a Big Data Company •  Certified Spark distribution •  Founded in 2013 •  120+ employees in Madrid •  Offices in Madrid and San Francisco #CassandraSummit 2015 WHO WE ARE
  • 3. Pluggable Lucene based 2i Geospatial Search Bitemporal Indexes 1 2 3 CONTENTS
  • 5. primary key secondary indexes token ranges Throughput Expressiveness Cassandra query methods #CassandraSummit 2015 5
  • 6. primary key secondary indexes token ranges Cassandra query methods by use case #CassandraSummit 2015 6 primary key secondary indexes token ranges Real time Analytics
  • 7. Cassandra query methods trade offs #CassandraSummit 2015 7 •  Pure-range queries limited to partition •  No Boolean logic •  No Full text search •  Sorting limited to partition •  Full-table scan •  High load •  High latency •  Low concurrency primary key secondary indexes token ranges primary key secondary indexes token ranges Real time Analytics
  • 8. A third use case #CassandraSummit 2015 8 AnalyticsReal-time Search •  Not as fast as primary key queries •  Not as expressive as map reduce •  Search can be used for both cases
  • 9. #CassandraSummit 2015 9 CQL + Lucene A Lucene based secondary index implementation
  • 10. A Lucene based secondary index implementation •  Proven stable and fast indexing solution •  Expressive queries - Multivariable, ranges, full text, sorting, top-k, etc. •  Mature distributed search solutions built on top of it - Solr, ElasticSearch •  Just a small embeddable library •  Easily extensible •  Published under the Apache License #CassandraSummit 2015 10
  • 11. Cassandra query methods #CassandraSummit 2015 11 primary key token ranges primary key secondary indexes token ranges primary key secondary indexes token ranges •  Mid expressiveness •  Mid latency •  Mid load •  Low expressiveness •  Low latency •  Low load •  High expressiveness •  High latency •  High load Real time AnalyticsSearch
  • 12. A Lucene based secondary index implementation CLIENT C* node C* node C* node Lucene index Lucene index Lucene index #CassandraSummit 2015 12 •  Each node indexes its own data •  Keep P2P architecture •  Distribution and replication managed by C* •  Just a single pluggable JAR file - CASSANDRA-8717 JVM JVM JVM
  • 13. CREATE TABLE tweets ( id bigint, created timestamp, message text, userid bigint, username text, PRIMARY KEY (userid, created, id) ); Create index •  Built in the background in any moment •  Real time updates •  Mapping eases ETL •  Language aware #CassandraSummit 2015 13 ALTER TABLE tweets ADD lucene TEXT; CREATE CUSTOM INDEX tweets_idx ON tweets (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '10', 'schema' : ' fields : { created : {type : "date", pattern : "yyyy-MM-dd"}, message : {type : "text", analyzer : "english"}, userid : {type : "string"}, username : {type : "string"} } '};
  • 14. SELECT * FROM tweets WHERE lucene = '{ filter : { type : "boolean", must : [ {type : "range", field : "created_at", lower : "2015/01/01"}, {type : "wildcard", field : "user", value : "a*"} ], not : [ {type : "match", field : "user", value : "andres"} ] }, sort : { fields: [ {field : "time", reverse : true}, {field : "user", reverse : false} ] } }' LIMIT 10000; Searching for rows #CassandraSummit 2015 14
  • 15. Integrating Lucene & Spark CLIENT Spark master C* node C* node C* node Lucene Lucene Lucene •  Compute large amounts of data •  Filtering push-down •  Avoid systematic full scan •  Reduces the amount of data to be processed #CassandraSummit 2015 15
  • 16. Index performance in Spark #CassandraSummit 2015 16 0 500 1000 1500 2000 2500 0 10 20 30 40 50 60 70 80 90 100 seconds millions of collected rows index full scan
  • 18. Lucene spatial module •  Spatial4J shapes -  Points, rectangles, circles, etc. •  Spatial search strategies -  BBox, RecursivePrefixTree, PointVector, etc. •  Not only geographical data -  Numbers, dates •  It can be combined with other searches #CassandraSummit 2015 18
  • 19. Indexing geographical locations #CassandraSummit 2015 19 CREATE CUSTOM INDEX restaurants_idx ON restaurants (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { location : { type : "geo_point", latitude : "lat", longitude : "lon" }, stars: {type : "integer" } } } '}; CREATE TABLE restaurants( name text PRIMARY KEY, stars bigint, lat double, lon double); •  No native shape data types in CQL •  Many-to-one column mapping •  Just points. For now.
  • 20. Bounding box search #CassandraSummit 2015 20 SELECT * FROM restaurants WHERE lucene = '{ filter : { type : "geo_bbox", field : "location", min_latitude : 40.425978, max_latitude : 40.445886, min_longitude : -3.808252, max_longitude : -3.770999 } }';
  • 21. Distance search #CassandraSummit 2015 21 SELECT * FROM restaurants WHERE lucene = '{ filter : { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, min_distance : "100m", max_distance : "2km" } }';
  • 22. Combining geospatial searches #CassandraSummit 2015 22 SELECT * FROM restaurants WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, max_distance : "10km" }, { type : "range", field : "stars", lower : 2, upper : 4 } ] } }';
  • 23. Lucene spatial is not only geospatial… #CassandraSummit 2015 23 •  General geometry •  Numeric ranges -  NumberRangePrefixTree •  Date ranges/durations -  DateRangePrefixTree
  • 24. Temporal/Date durations #CassandraSummit 2015 24 •  A pair composed by a start-date and a stop-date -  Can be indexed as points in a 2D space •  David Smiley's DateRangePrefixTree -  Levels for common date-ranges: years, months, days… -  Spatial operations: intersects, is_within, contains 27 Nov 2015 29 Dec 2015 intersects is - within contains
  • 25. Indexing date ranges #CassandraSummit 2015 25 CREATE CUSTOM INDEX breakdowns_idx ON breakdowns (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { duration: { type : "date_range", from : "start_date", to : "stop_date", pattern : "yyyy-MM-dd" }, cause: {type : "string" } } } '}; CREATE TABLE breakdowns ( system text PRIMARY KEY, cause text, start_date timestamp, stop_date timestamp); •  No native date range type in CQL •  Many-to-one column mapping •  Spatial operations
  • 26. Searching for date ranges
      SELECT * FROM breakdowns WHERE lucene = '{
          filter : {
              type : "date_range",
              field : "duration",
              from : "2015-01-01",
              to : "2015-01-05",
              operation : "intersects"
          }
      }';
      SELECT * FROM breakdowns WHERE lucene = '{
          filter : {
              type : "boolean",
              must : [
                  { type : "date_range", field : "duration", from : "2015-01-01", to : "2015-01-05", operation : "is_within" },
                  { type : "match", field : "cause", value : "human error" }
              ]
          }
      }';
  • 28. The bitemporal data model
      •  Stores WHAT and WHEN
      •  Supports corrections
      •  Reproduces the business view of history as of any point in time
      •  Traces why a decision was made
  • 29. The bitemporal data model
      •  Valid Time
         -  The application period
         -  WHAT happened: the real-world period during which the fact holds
      •  Transaction Time
         -  The system period
         -  WHEN the system considers it true
  • 30. The bitemporal data model: example
      person  city        vt_from      vt_to        tt_from      tt_to
      John    Smallville  3-Apr-1975   ∞            4-Apr-1975   26-Dec-1994
      John    Smallville  3-Apr-1975   25-Aug-1994  27-Dec-1994  ∞
      John    Bigtown     26-Aug-1994  ∞            27-Dec-1994  1-Feb-2001
      John    Bigtown     26-Aug-1994  30-May-1995  2-Feb-2001   ∞
      John    Beachy      1-Jun-1995   3-Sep-2000   2-Feb-2001   ∞
      John    Bigtown     3-Sep-2000   ∞            2-Feb-2001   31-Mar-2001
      John    Mediumtown  1-Apr-2001   ∞            1-Apr-2001   ∞
      Modified example from Wikipedia: https://en.wikipedia.org/wiki/Temporal_database
  • 31. A naive approach: using 4 dates
      CREATE CUSTOM INDEX census_idx ON census (lucene)
      USING 'com.stratio.cassandra.lucene.Index'
      WITH OPTIONS = {
          'refresh_seconds' : '1',
          'schema' : '{
              fields : {
                  vt_from : { type : "date", pattern : "yyyyMMdd" },
                  vt_to : { type : "date", pattern : "yyyyMMdd" },
                  tt_from : { type : "date", pattern : "yyyyMMdd" },
                  tt_to : { type : "date", pattern : "yyyyMMdd" }
              }
          }'
      };
  • 32. A naive approach
      SELECT * FROM census WHERE lucene = '{
          filter : {
              type : "boolean",
              must : [
                  { type : "boolean", should : [
                      { type : "range", field : "vt_from", lower : "", upper : "", include_lower : true, include_upper : true },
                      { type : "range", field : "vt_to", lower : "", upper : "", include_lower : true, include_upper : true },
                      { type : "boolean", must : [
                          { type : "range", field : "vt_from", upper : "", include_upper : true },
                          { type : "range", field : "vt_to", lower : "", include_lower : true } ] } ] },
                  { type : "boolean", should : [
                      { type : "range", field : "tt_from", lower : "", upper : "", include_lower : true, include_upper : true },
                      { type : "range", field : "tt_to", lower : "", upper : "", include_lower : true, include_upper : true },
                      { type : "boolean", must : [
                          { type : "range", field : "tt_from", upper : "", include_upper : true },
                          { type : "range", field : "tt_to", lower : "", include_lower : true } ] } ] }
              ]
          }
      }' AND person = 'John Doe';
  • 33. A naive approach: issues
      •  The query is very difficult to understand and build.
      •  Representing the now value (∞) as Long.MAX_VALUE is costly.
  • 34. A spatial approach: using 2 date ranges
      CREATE CUSTOM INDEX census_idx ON census (lucene)
      USING 'com.stratio.cassandra.lucene.Index'
      WITH OPTIONS = {
          'schema' : '{
              fields : {
                  vt : { type : "date_range", pattern : "yyyyMMdd", from : "vt_from", to : "vt_to" },
                  tt : { type : "date_range", pattern : "yyyyMMdd", from : "tt_from", to : "tt_to" }
              }
          }'
      };
  • 35. A spatial approach
      SELECT * FROM census WHERE lucene = '{
          filter : {
              type : "boolean",
              must : [
                  { type : "date_range", field : "vt", from : "20150501", to : "99999999", operation : "intersects" },
                  { type : "date_range", field : "tt", from : "20150501", to : "99999999", operation : "intersects" }
              ]
          }
      }';
  • 36. A spatial approach: performance issues
      •  Very difficult to understand/build the query.
      •  Now value (∞) using Long.MAX_VALUE is costly.
  • 37. 4R-Tree to the rescue
      •  Based on Bliujute, R., Jensen, C. S., & Slivinskas, G. (2000). Light-weight indexing of general bitemporal data
      •  The Now Value is never stored.
      •  The data is stored in 4 R-Trees.
      •  Queries are transformed and distributed among the trees.
  • 38. 4R-Tree to the rescue: storing data
      •  R1: TT_TO==NOW && VT_TO==NOW → Point(vt_from, tt_from)
      •  R2: TT_TO==NOW && VT_TO!=NOW → Line(vt_from,vt_to,tt_to)
      •  R3: TT_TO!=NOW && VT_TO==NOW → Line(vt_from,vt_to,tt_to)
      •  R4: TT_TO!=NOW && VT_TO!=NOW → Rectangle(vt_from,vt_to, tt_from,tt_to)
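The storage rule amounts to a four-way classification on whether each end bound is still open. A minimal sketch follows, assuming the now value is encoded as the sentinel `99999999` (the `now_value` used in the index definition elsewhere in this deck); `choose_tree` is a hypothetical name, not the plugin's API.

```python
NOW = 99999999  # sentinel standing for "now" (assumed yyyyMMdd encoding)


def choose_tree(vt_to, tt_to, now=NOW):
    """Pick the tree that stores a record, based on which end bounds are open."""
    tt_open = tt_to == now
    vt_open = vt_to == now
    if tt_open and vt_open:
        return "R1"  # both periods open-ended: only the start bounds are known
    if tt_open:
        return "R2"  # transaction time open, valid time closed
    if vt_open:
        return "R3"  # valid time open, transaction time closed
    return "R4"      # both periods closed


print(choose_tree(vt_to=NOW, tt_to=NOW))            # R1
print(choose_tree(vt_to=20000903, tt_to=NOW))       # R2
print(choose_tree(vt_to=NOW, tt_to=19941226))       # R3
print(choose_tree(vt_to=19950530, tt_to=20010201))  # R4
```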
  • 39. 4R-Tree to the rescue: searching data
      IF (TT_FROM != NOW) && (TT_TO >= VT_FROM):
          searchR1(0, TT_TO, 0, VT_TO) U
          searchR2(0, TT_TO, VT_FROM, VT_TO) U
          searchR3(max(TT_FROM, VT_FROM), TT_TO, 0, VT_TO) U
          searchR4(TT_FROM, TT_TO, VT_FROM, VT_TO)
      IF (TT_FROM != NOW) && (TT_TO < VT_FROM):
          searchR2(0, TT_TO, VT_FROM, VT_TO) U
          searchR4(TT_FROM, TT_TO, VT_FROM, VT_TO)
      IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] ≠ [0, MAX]) && (TT_TO >= VT_FROM):
          searchR1(0, TT_TO, 0, VT_TO) U
          searchR2(0, TT_TO, VT_FROM, VT_TO)
      IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] ≠ [0, MAX]) && (TT_TO < VT_FROM):
          searchR2(0, TT_TO, VT_FROM, VT_TO)
      IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] = [0, MAX]):
          R1 U R2
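The case analysis above can be transcribed into a small dispatch function that reports which trees a query touches. This is a direct, simplified transcription of the slide and not the plugin's code: it returns only the tree names and leaves out the transformed search bounds. The sentinel `NOW`, the assumption that `MAX` equals `NOW`, and the name `trees_to_search` are all illustrative.

```python
NOW = 99999999  # sentinel standing for "now" (assumed yyyyMMdd encoding)


def trees_to_search(vt_from, vt_to, tt_from, tt_to,
                    now=NOW, min_value=0, max_value=NOW):
    """Return the names of the trees a query must visit, per the slide's cases."""
    whole_valid_time = (vt_from, vt_to) == (min_value, max_value)
    if tt_from != now:
        if tt_to >= vt_from:
            return ["R1", "R2", "R3", "R4"]
        return ["R2", "R4"]
    # From here on the query's transaction time starts at "now".
    if whole_valid_time:
        return ["R1", "R2"]
    if tt_to >= vt_from:
        return ["R1", "R2"]
    return ["R2"]


# "As of 1 Jan 2000, what was believed about 1999?" may touch every tree.
print(trees_to_search(19990101, 19991231, 20000101, 20000101))
# "What is currently believed about 1999?" only needs the trees with open transaction time.
print(trees_to_search(19990101, 19991231, NOW, NOW))
```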
  • 40. 4R-Tree to the rescue
      •  Problem: Lucene has no R-Tree support
      •  Our solution:
         -  Use 2 DateRangePrefixTrees for each R-Tree
      •  Future work: experiment with other Lucene spatial trees and strategies.
  • 41. The bitemporal data model: example
      person  city        vt_from      vt_to        tt_from      tt_to
      John    Smallville  3-Apr-1975   ∞            4-Apr-1975   26-Dec-1994
      John    Smallville  3-Apr-1975   25-Aug-1994  27-Dec-1994  ∞
      John    Bigtown     26-Aug-1994  ∞            27-Dec-1994  1-Feb-2001
      John    Bigtown     26-Aug-1994  30-May-1995  2-Feb-2001   ∞
      John    Beachy      1-Jun-1995   3-Sep-2000   2-Feb-2001   ∞
      John    Bigtown     3-Sep-2000   ∞            2-Feb-2001   31-Mar-2001
      John    Mediumtown  1-Apr-2001   ∞            1-Apr-2001   ∞
      Modified example from Wikipedia: https://en.wikipedia.org/wiki/Temporal_database
  • 42. Indexing bitemporal data
      CREATE TABLE census (
          person text,
          city text,
          vt_from text,
          vt_to text,
          tt_from text,
          tt_to text,
          lucene text,
          PRIMARY KEY ((person), vt_from, tt_from)
      );
      CREATE CUSTOM INDEX census_idx ON census (lucene)
      USING 'com.stratio.cassandra.lucene.Index'
      WITH OPTIONS = {
          'schema' : '{
              fields : {
                  bitemporal : {
                      type : "bitemporal",
                      vt_from : "vt_from",
                      vt_to : "vt_to",
                      tt_from : "tt_from",
                      tt_to : "tt_to",
                      pattern : "yyyyMMdd",
                      now_value : "99999999"
                  },
                  city : { type : "string" }
              }
          }'
      };
  • 43. Searching for bitemporal data, several queries
      Where does the system currently think that John lives right now?
      SELECT * FROM census WHERE lucene = '{
          filter : {
              type : "bitemporal",
              field : "bitemporal",
              vt_from : "99999999",
              vt_to : "99999999",
              tt_from : "99999999",
              tt_to : "99999999"
          }
      }' AND person = 'John Doe';
      person  city        vt_from     vt_to  tt_from     tt_to
      John    Mediumtown  1-Apr-2001  ∞      1-Apr-2001  ∞
  • 44. Searching for bitemporal data
      Where does the system currently think that John lived in 1999?
      SELECT * FROM census WHERE lucene = '{
          filter : {
              type : "bitemporal",
              field : "bitemporal",
              vt_from : "19990101",
              vt_to : "19991231",
              tt_from : "99999999",
              tt_to : "99999999"
          }
      }' AND person = 'John Doe';
      person  city    vt_from     vt_to       tt_from     tt_to
      John    Beachy  1-Jun-1995  3-Sep-2000  2-Feb-2001  ∞
  • 45. Searching for bitemporal data
      On 01-Jan-2000, where did the system think John was living back in 1999?
      SELECT * FROM census WHERE lucene = '{
          filter : {
              type : "bitemporal",
              field : "bitemporal",
              vt_from : "19990101",
              vt_to : "19991231",
              tt_from : "20000101",
              tt_to : "20000101"
          }
      }' AND person = 'John Doe';
      person  city     vt_from      vt_to  tt_from      tt_to
      John    Bigtown  26-Aug-1994  ∞      27-Dec-1994  1-Feb-2001
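The three questions about John can be checked against the example census table with a small in-memory model. The sketch assumes that a bitemporal filter matches rows whose valid-time and transaction-time periods both intersect the queried periods, with ∞ encoded as the sentinel `99999999`. That assumption reproduces the answers shown on the slides, but it is not taken from the plugin's source, and the names `CENSUS`, `overlaps` and `bitemporal_search` are illustrative.

```python
NOW = 99999999  # sentinel for ∞ / "now", dates encoded as yyyyMMdd integers

# (person, city, vt_from, vt_to, tt_from, tt_to), copied from the example table
CENSUS = [
    ("John", "Smallville", 19750403, NOW,      19750404, 19941226),
    ("John", "Smallville", 19750403, 19940825, 19941227, NOW),
    ("John", "Bigtown",    19940826, NOW,      19941227, 20010201),
    ("John", "Bigtown",    19940826, 19950530, 20010202, NOW),
    ("John", "Beachy",     19950601, 20000903, 20010202, NOW),
    ("John", "Bigtown",    20000903, NOW,      20010202, 20010331),
    ("John", "Mediumtown", 20010401, NOW,      20010401, NOW),
]


def overlaps(a_from, a_to, b_from, b_to):
    """Inclusive interval intersection test."""
    return a_from <= b_to and b_from <= a_to


def bitemporal_search(rows, vt_from, vt_to, tt_from, tt_to):
    """Cities of rows whose valid and transaction periods both intersect the query."""
    return [city for (_, city, r_vt_from, r_vt_to, r_tt_from, r_tt_to) in rows
            if overlaps(r_vt_from, r_vt_to, vt_from, vt_to)
            and overlaps(r_tt_from, r_tt_to, tt_from, tt_to)]


print(bitemporal_search(CENSUS, NOW, NOW, NOW, NOW))                    # ['Mediumtown']
print(bitemporal_search(CENSUS, 19990101, 19991231, NOW, NOW))          # ['Beachy']
print(bitemporal_search(CENSUS, 19990101, 19991231, 20000101, 20000101))  # ['Bigtown']
```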
  • 46. Searching for bitemporal data
      Who currently lives at Smallville?
      SELECT * FROM census WHERE lucene = '{
          filter : {
              type : "boolean",
              must : [
                  { type : "bitemporal", field : "bitemporal", vt_from : "99999999", vt_to : "99999999", tt_from : "99999999", tt_to : "99999999" },
                  { type : "match", field : "city", value : "smallville" }
              ]
          }
      }';
  • 48. Conclusions
      •  Pluggable Lucene features in Cassandra
      •  Basic geospatial features
      •  Date/Time durations
      •  Bitemporal data model indexing
      •  Compatible with MapReduce frameworks
      •  Preserves Cassandra's functionality
  • 49. It's open source
      •  github.com/stratio/cassandra-lucene-index
      •  Published as a plugin for Apache Cassandra
      •  Apache License Version 2.0
  • 50. BIG DATA CHILD'S PLAY
      Andrés de la Peña, andres@stratio.com, @a_de_la_pena
      Eduardo Alonso, eduardoalonso@stratio.com, @eAlonsoDB