SlideShare a Scribd company logo
1 of 65
Download to read offline
Massively Scalable Real-time
Geospatial Data Processing with
Apache Kafka and Cassandra
Paul Brebner
instaclustr.com Technology Evangelist
ApacheCon Geospatial Track 2020
© Instaclustr Pty Limited, 2020
Instaclustr Overview
©Instaclustr Pty Limited, 2020
Overview
• “Location” in the news in 2020
• Anomaly Detection: baseline throughput
• Spatial Anomaly Detection problem
• Solutions: location representation and querying/indexing
o Bounding boxes and secondary indexes
o Geohashes
o Lucene index
o Results
o 3D
© 2020 Instaclustr Pty Limited
In the
News
John Conway
Legendary Polymath
Passed away this
year
© 2020 Instaclustr Pty Limited
Game of
Life
Next state of each cell depends on state of
immediate neighbors
© 2020 Instaclustr Pty Limited
Simple rules but complex patterns
Game of
Life
© 2020 Instaclustr Pty Limited
Also In
the News
Social distancing and COVID-19 tracing
Uncle Ron’s social
distancing 3000
invention
Or COVIDSAFE App?
© 2020 Instaclustr Pty Limited
Uncle Ron’s social distancing
3000 invention
Also In
the News
And “planet-killer”
asteroid missed the
Earth in April
(16x moon orbit)
© 2020 Instaclustr Pty Limited
Previously…
KAFKA,
CASSANDRA AND
KUBERNETES AT
SCALE: REAL-
TIME ANOMALY
DETECTION ON 19
BILLION EVENTS
A DAY
Anomaly
Detection
Spot the difference
At speed (< 1s RT) and scale
(high throughput, lots of data)
© 2020 Instaclustr Pty Limited
How
Does It
Work?
• CUSUM (Cumulative Sum Control Chart)
• Statistical analysis of historical data
• Data for a single variable/key at a time
• Potentially billions of keys
© 2020 Instaclustr Pty Limited
Pipeline
Design
• Kafka, Cassandra and Kubernetes Clusters – Kafka handles
streaming, Cassandra for data storage, Kubernetes for
application scaling
• Efficient Cassandra Data writes and reads with key,
a unique “account ID” or similar
© 2020 Instaclustr Pty Limited
Kubernetes
Cassandra
Data
Model
• Events are timeseries
• Id is Partition Key - Time is clustering key (order)
• Read gets most recent 50 values for id, very fast
create table event_stream (
id bigint,
time timestamp,
value double,
primary key (id, time)
) with clustering order by (time desc);
select value from event_stream where id=314159265 limit 50;
© 2020 Instaclustr Pty Limited
Cluster
details –
CPU cores
Cassandra 384 +
Application pipeline
(workers) 118 +
Kafka 72 =
574 Cores Total
384
118
72
0
100
200
300
400
500
600
700
Cores
Cores Used
574 Total
Cassandra Workers Kafka
Baseline
Throughput
19 Billion Anomaly Checks/Day = 100%
0
20
40
60
80
100
120
Baseline (single transaction ID)
Normalised (%)
© 2020 Instaclustr Pty Limited
Harder
Problem:
Spot the
Differences
in Space
Space is big. Really big. You just won’t believe how vastly,
hugely, mind-bogglingly big it is. I mean, you may think it’s a
long way down the road to the chemist, but that’s just
peanuts to space.
Douglas Adams, The Hitchhiker’s Guide to the Galaxy
© 2020 Instaclustr Pty Limited
Spatial
Anomalies
Many and varied
© 2020 Instaclustr Pty Limited
Real
Example:
John Snow
No, not this one!
© 2020 Instaclustr Pty Limited
John Snow’s
1854 Cholera Map
• Death’s per household + location
• Used to identify a polluted pump (X)
• Some outliers—brewers drank beer not water! X
© 2020 Instaclustr Pty Limited
But…
First you have to
know where you are:
Location
To usefully represent location need:
• Coordinate system
• Map
• Scale
© 2020 Instaclustr Pty Limited
Better
• <lat, long> coordinates
• Scale
• Interesting locations,
“bulk of treasure here”
© 2020 Instaclustr Pty Limited
Geospatial Anomaly Detection
South Atlantic Geomagnetic Anomaly
New problem…
• Rather than a single ID, events now have a location (and a value)
• The problem now is to
o find the nearest 50 events to each new event
o Quickly (< 1s RT)
• Can’t make any assumptions about geospatial
properties of events
o including location, density or distribution – i.e. where, or how many
o Need to search from smallest to increasingly larger areas
o E.g. South Atlantic Geomagnetic Anomaly is BIG
• Uber (ApacheCon 2019 talk) uses similar
• technologies to
o forecast demand
o Increase area until they have sufficient data for predictions
• Can we use <lat, long> as Cassandra partition key?
o Yes, compound partition keys are allowed.
o But can only select the exact locations. © 2020 Instaclustr Pty Limited
How to Compute Nearness
To compute distance between locations
you need coordinate system
e.g. Mercator map
Flat earth, distortion nearer poles
© 2020 Instaclustr Pty Limited
World is (approx.) Spherical
Calculation of distance between two
latitudinal/longitudinal points is non-trivial
© 2020 Instaclustr Pty Limited
Bounding Box
Approximation of distance using inequalities
© 2020 Instaclustr Pty Limited
Bounding Boxes and Cassandra?
• Use ”country” partition key, Lat/long/time clustering keys
• But can’t run the query with multiple inequalities
CREATE TABLE latlong (
country text,
lat double,
long double,
time timestamp,
PRIMARY KEY (country, lat, long, time)
) WITH CLUSTERING ORDER BY (lat ASC, long ASC, time DESC);
select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67
and long >= 175.18 and long <= 176.08 limit 50;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column
"long" cannot be restricted (preceding column "lat" is restricted by a non-EQ relation)"
© 2020 Instaclustr Pty Limited
Secondary
Indexes to
the
Rescue?
Secondary Indexes
o create index i1 on latlong (lat);
o create index i2 on latlong (long);
• But same restrictions as clustering columns.
SASI - SSTable Attached Secondary Index
• Supports more complex queries more efficiently
o create custom index i1 on latlong (long) using
'org.apache.cassandra.index.sasi.SASIIndex';
o create custom index i2 on latlong (lat) using
'org.apache.cassandra.index.sasi.SASIIndex’;
• select * from latlong where country='nz' and lat>= -39.58 and lat
<= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow
filtering;
• “allow filtering” may be inefficient (if many rows have to be
retrieved prior to filtering) and isn’t suitable for production.
• But SASI docs say
o even though “allow filtering” must be used with 2 or more column
inequalities, there is actually no filtering taking place
© 2020 Instaclustr Pty Limited
Results
Very slow (< 1%)
0
20
40
60
80
100
120
Normalised (%)
Baseline (single transaction ID) SASI
© 2020 Instaclustr Pty Limited
Geohashes
to the
Rescue?
• Divide maps into named and hierarchical areas
• We’ve been something similar already: “country” partition key
E.g. plate tectonics
© 2020 Instaclustr Pty Limited
Geohashes
Rectangular areas
Variable length base-32 string
Single char regions 5,000km x 5,000km
Each extra letter gives 32 sub-areas
8 chars is 40mx20m
En/de-code lat/long to/from geohash
But: Edges cases, non-linear near poles
© 2020 Instaclustr Pty Limited
Some
Geohashes
Are Words
“ketchup” is in Africa
© 2020 Instaclustr Pty Limited
Some
Geohashes
Are Words
153mx153m153m x 153m
© 2020 Instaclustr Pty Limited
“Trump”
Is in
Kazakhstan!
Not to scale
5kmx5km
© 2020 Instaclustr Pty Limited
Modifications for Geohashes
• Lat/long encoded as geohash • Geohash is new key • Geohash used to query Cassandra
© 2020 Instaclustr Pty Limited
Geohashes and Cassandra
In theory Geohashes work well for
database indexes
CREATE TABLE geohash1to8 (
geohash1 text,
time timestamp,
geohash2 text,
geohash3 text,
geohash4 text,
geohash5 text,
geohash6 text,
geohash7 text,
geohash8 text,
value double,
PRIMARY KEY (hash1, time)
) WITH CLUSTERING ORDER BY (time DESC);
CREATE INDEX i8 ON geohash1to8 (geohash8);
CREATE INDEX i7 ON geohash1to8 (geohash7);
CREATE INDEX i6 ON geohash1to8 (geohash6);
CREATE INDEX i5 ON geohash1to8 (geohash5);
CREATE INDEX i4 ON geohash1to8 (geohash4);
CREATE INDEX i3 ON geohash1to8 (geohash3);
CREATE INDEX i2 ON geohash1to8 (geohash2);
Option 1:
Multiple Indexed Geohash Columns
© 2020 Instaclustr Pty Limited
• Query from smallest to largest areas
select * from geohash1to8 where geohash1=’e’ and geohash7=’everywh’ limit 50;
select * from geohash1to8 where geohash1=’e’ and geohash6=’everyw’ limit 50;
select * from geohash1to8 where geohash1=’e’ and geohash5=’every’ limit 50;
select * from geohash1to8 where geohash1=’e’ and geohash4=’ever’ limit 50;
select * from geohash1to8 where geohash1=’e’ and geohash3=’eve’ limit 50;
select * from geohash1to8 where geohash1=’e’ and geohash2=’ev’ limit 50;
select * from geohash1to8 where geohash1=’e’ limit 50;
Tradeoffs?
Multiple secondary columns/indexes, multiple queries, accuracy and
number of queries depends on spatial distribution and density
• Stop when 50 rows found
© 2020 Instaclustr Pty Limited
Results
Slightly better
Option 1 = 10%
0
20
40
60
80
100
120
Normalised (%)
Baseline (single transaction ID) SASI Geohash Option 1
© 2020 Instaclustr Pty Limited
Geohashes and Cassandra
Denormalization is “Normal” in Cassandra
Create 8 tables, one for each geohash length
CREATE TABLE geohash1 (
geohash text,
time timestamp,
value double,
PRIMARY KEY (geohash, time)
) WITH CLUSTERING ORDER BY (time DESC);
…
CREATE TABLE geohash8 (
geohash text,
time timestamp,
value double,
PRIMARY KEY (geohash, time)
) WITH CLUSTERING ORDER BY (time DESC);
Option 2:
Denormalized Multiple Tables
© 2020 Instaclustr Pty Limited
• Select from smallest to largest areas using corresponding table
select * from geohash8 where geohash=’everywhe’ limit 50;
select * from geohash7 where geohash=’everywh’ limit 50;
select * from geohash6 where geohash=’everyw’ limit 50;
select * from geohash5 where geohash=’every’ limit 50;
select * from geohash4 where geohash=’ever’ limit 50;
select * from geohash3 where geohash=’eve’ limit 50;
select * from geohash2 where geohash=’ev’ limit 50;
select * from geohash1 where geohash=’e’ limit 50;
Tradeoffs?
Multiple tables and writes, multiple queries
© 2020 Instaclustr Pty Limited
Results
Getting better
Option 2 = 20%
0
20
40
60
80
100
120
Normalised (%)
Baseline (single transaction ID) SASI
Geohash Option 1 Geohash Option 2
© 2020 Instaclustr Pty Limited
Geohashes and Cassandra
Similar to Option 1 but using
clustering columns
CREATE TABLE geohash1to8_clustering (
geohash1 text,
time timestamp,
geohash2 text,
gephash3 text,
geohash4 text,
geohash5 text,
geohash6 text,
geohash7 text,
geohash8 text,
value double,
PRIMARY KEY (geohash1, geohash2, geohash3, geohash4,
geohash5, geohash6, geohash7, geohash8, time)
) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC,
geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC,
geohash8 DESC, time DESC);
Option 3:
Clustering Column(s)
© 2020 Instaclustr Pty Limited
How Do Clustering Columns Work?
• Clustering columns are good for modelling and efficient querying of
hierarchical/nested data
• Query must include higher level columns with equality operator, ranges are
only allowed on last column in query, lower level columns don’t have to be
included.
For example:
o select * from geohash1to8_clustering where geohash1=’e’ and
geohash2=’ev’ and geohash3 >= ’ev0’ and geohash3 <= ‘evz’ limit 50;
• But why have multiple clustering columns when one is actually enough…
Good for Hierarchical Data
© 2020 Instaclustr Pty Limited
Better: Single Geohash
Clustering Column
Geohash8 and time are
clustering keys CREATE TABLE geohash_clustering (
geohash1 text,
time timestamp,
geohash8 text,
lat double,
long double,
PRIMARY KEY (geohash1, geohash8, time)
) WITH CLUSTERING ORDER BY (geohash8 DESC,
time DESC);
© 2020 Instaclustr Pty Limited
Inequality Range Query
• With decreasing length
geohashes
• Stop when result has 50 rows
select * from geohash_clustering where geohash1=’e’ and
geohash8=’everywhe’ limit 50;
select * from geohash_clustering where geohash1=’e’ and
geohash8>=’everywh0’ and geohash8 <=’everywhz’ limit 50;
select * from geohash_clustering where geohash1=’e’ and
geohash8>=’everyw0’ and geohash8 <=’everywz’ limit 50;
select * from geohash_clustering where geohash1=’e’ and
geohash8>=’every0’ and geohash8 <=’everyz’ limit 50;
select * from geohash_clustering where geohash1=’e’ and
geohash8>=’ever0’ and geohash8 <=’everz’ limit 50;
select * from geohash_clustering where geohash1=’e’ and
geohash8>=’eve0’ and geohash8 <=’evez’ limit 50;
select * from geohash_clustering where geohash1=’e’ and
geohash8>=’ev0’ and geohash8 <=’evz’ limit 50;
select * from geohash_clustering where geohash1=’e’ limit 50;
© 2020 Instaclustr Pty Limited
Geohash
Results
Option 3 is best = 34%
0
20
40
60
80
100
120
Normalised (%)
Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3
© 2020 Instaclustr Pty Limited
Issues?
■ Cardinality for partition key
● should be > 100,000
● >= 4 character geohash
■ Unbounded partitions are bad
● May need composite partition key in
production
● e.g. extra time bucket (hour, day, etc)
■ Space vs time
● could have different sized buckets for
different sized spaces
● E.g. bigger areas with more frequent
events may need shorter time buckets
to limit size
● This may depend on the space-time
scales of underlying
systems/processes
● E.g. Spatial and temporal scales of
oceanographic processes (left)
Another
Option:
Cassandra
Apache
Lucene
Index Plugin
A Concordance
© 2020 Instaclustr Pty Limited
Cassandra Lucene Index Plugin
• The Cassandra Lucene Index is a plugin for Apache Cassandra:
o That extends its index functionality to provide near real-time search, including full-text search capabilities and free
multivariable, geospatial and bitemporal search
o It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of
the cluster indexes its own data.
• Instaclustr now supports the plugin
o Optional add-on to managed Cassandra service
o And code support
https://github.com/instaclustr/cassandra-lucene-index
• How does this help for Geospatial queries?
o Has very rich geospatial semantics including geo points, geo shapes, geo distance search, geo bounding box search,
geo shape search, multiple distance units, geo transformations, and complex geo shapes.
© 2020 Instaclustr Pty Limited
Cassandra Table and
Lucene Indexes
• Geopoint Example
• Under the hood indexing is
done using a tree structure
with geohashes
(configurable precision)
CREATE TABLE latlong_lucene (
geohash1 text,
value double,
time timestamp,
latitude double,
longitude double,
Primary key (geohash1, time)
) WITH CLUSTERING ORDER BY (time DESC);
CREATE CUSTOM INDEX latlong_index ON latlong_lucene ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields: {
geohash1: {type: "string"},
value: {type: "double"},
time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"},
place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
}'
};
Search Options
• Sort
• Sophisticated but complex
semantics (see the docs)
SELECT value FROM latlong_lucene
WHERE expr(latlong_index,
'{ sort: [ {field: "place", type:
"geo_distance", latitude: " + <lat> + ",
longitude: " + <long> + "}, {field: "time",
reverse: true} ] }') and
geohash1=<geohash> limit 50;
© 2020 Instaclustr Pty Limited
Search Options
• Building Box Filter
• Need to compute box
corners
SELECT value FROM latlong_lucene
WHERE expr(latlong_index, '{ filter: {
type: "geo_bbox", field: "place",
min_latitude: " + <minLat> + ",
max_latitude: " + <maxLat> + ",
min_longitude: " + <minLon> + ",
max_longitude: " + <maxLon> + " }}’)
limit 50;
© 2020 Instaclustr Pty Limited
Search Options
• Geo Distance Filter
SELECT value FROM latlong_lucene
WHERE expr(latlong_index, '{ filter: { type:
"geo_distance", field: "place", latitude: " +
<lat> + ", longitude: " + <long> + ",
max_distance: " <distance> + "km" } }') and
geohash1=' + <hash1> + ' limit 50;
© 2020 Instaclustr Pty Limited
Search Options: Prefix Filter
Prefix search is useful for searching larger areas over a single geohash column
as you can search for a substring
SELECT value FROM latlong_lucene
WHERE expr(latlong_index, '{ filter: [ {type:
"prefix", field: "geohash1", value:
<geohash>} ] }') limit 50
Similar to inequality over clustering column
© 2020 Instaclustr Pty Limited
Lucene
Results
Options = 2-25%
Best is prefix filter (25%)
0
20
40
60
80
100
120
Normalised (%)
Baseline (single
transaction ID)
SASI
Geohash Option 1
Geohash Option 2
Geohash Option 3
Lucene sort
Lucene filter
bounded box
Lucene filter geo
distance
Lucene filter prefix
over geohash
© 2020 Instaclustr Pty Limited
Overall
Geohash options are
faster (25%, 34%)
34% -> 6.5 Billion
Anomaly checks/day
0
20
40
60
80
100
120
Normalised (%)
Baseline (single
transaction ID)
SASI
Geohash Option 1
Geohash Option 2
Geohash Option 3
Lucene sort
Lucene filter bounded
box
Lucene filter geo
distance
Lucene filter prefix
over geohash
G
e
o
h
a
s
h
G
e
o
h
a
s
h
© 2020 Instaclustr Pty Limited
Overall
Geohash options are
faster (25%, 34%)
Lucene bounded box/geo
distance most accurate but only
5% of baseline performance
0
20
40
60
80
100
120
Normalised (%)
Baseline (single
transaction ID)
SASI
Geohash Option 1
Geohash Option 2
Geohash Option 3
Lucene sort
Lucene filter bounded
box
Lucene filter geo
distance
Lucene filter prefix
over geohash
L
u
c
e
n
e
L
u
c
e
n
e
© 2020 Instaclustr Pty Limited
3D
(Up and
Down)
Who needs it?
© 2020 Instaclustr Pty Limited
Location, Altitude and Volume
• 3D Geohashes
represent 2D location, altitude,
and volume
• A 3D geohash is a cube
© 2020 Instaclustr Pty Limited
Application?
3D Drone
Proximity
Detection
Proximity
rules
> 50m from people and
property
>150m from congested
areas
> 1000m from airports
> 5000m from exclusion
zones
Just happen to
correspond to different
length 3D geohashes,
3D Geohashes
Works with all the
geohash index options
Therefore reasonably fast
to compute 3D proximity
More accurate slower
options can be improved
with bigger Cassandra
clusters
© 2020 Instaclustr Pty Limited
0
20
40
60
80
100
120
Normalised (%)
Baseline (single
transaction ID)
SASI
Geohash Option 1
Geohash Option 2
Geohash Option 3
Lucene sort
Lucene filter bounded
box
Lucene filter geo
distance
Lucene filter prefix over
geohash
3
D
G
e
o
h
a
s
h
3
D
G
e
o
h
a
s
h
3
D
G
e
o
h
a
s
h
3
D
G
e
o
h
a
s
h
• Demo 3D Geohash java code
o https://gist.github.com/paulbrebner/a67243859d2cf38bd9038a12a7b14762
o produces valid 3D geohashes for altitudes from 13km below sea level to
geostationary satellite orbit
o Can be used for any of the geohash options
More Information?
© 2020 Instaclustr Pty Limited
More Information?
© 2020 Instaclustr Pty Limited
• Blogs: https://www.instaclustr.com/paul-brebner/
• Try us out! Free trial at Instaclustr.com
©Instaclustr Pty Limited, 2020
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!

More Related Content

What's hot

Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...DataStax
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterprisePatrick McFadin
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffTimescale
 
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...DataStax Academy
 
Storm 2012-03-29
Storm 2012-03-29Storm 2012-03-29
Storm 2012-03-29Ted Dunning
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseAll Things Open
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassDataStax
 
Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0J.B. Langston
 
InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing inside-BigData.com
 
Druid realtime indexing
Druid realtime indexingDruid realtime indexing
Druid realtime indexingSeoeun Park
 
Time series with apache cassandra strata
Time series with apache cassandra   strataTime series with apache cassandra   strata
Time series with apache cassandra strataPatrick McFadin
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Florian Lautenschlager
 
Introduction to the Oakforest-PACS Supercomputer in Japan
Introduction to the Oakforest-PACS Supercomputer in JapanIntroduction to the Oakforest-PACS Supercomputer in Japan
Introduction to the Oakforest-PACS Supercomputer in Japaninside-BigData.com
 
Lightning Talk: MongoDB Sharding
Lightning Talk: MongoDB ShardingLightning Talk: MongoDB Sharding
Lightning Talk: MongoDB ShardingMongoDB
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBMongoDB
 
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...DataStax
 
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016DataStax
 

What's hot (20)

Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
 
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr...
 
Storm 2012-03-29
Storm 2012-03-29Storm 2012-03-29
Storm 2012-03-29
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break Glass
 
Aggregation - What's it to The HDF Group
Aggregation - What's it to The HDF GroupAggregation - What's it to The HDF Group
Aggregation - What's it to The HDF Group
 
Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0Cassandra Troubleshooting 3.0
Cassandra Troubleshooting 3.0
 
InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing
 
Druid realtime indexing
Druid realtime indexingDruid realtime indexing
Druid realtime indexing
 
Time series with apache cassandra strata
Time series with apache cassandra   strataTime series with apache cassandra   strata
Time series with apache cassandra strata
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
Introduction to the Oakforest-PACS Supercomputer in Japan
Introduction to the Oakforest-PACS Supercomputer in JapanIntroduction to the Oakforest-PACS Supercomputer in Japan
Introduction to the Oakforest-PACS Supercomputer in Japan
 
Lightning Talk: MongoDB Sharding
Lightning Talk: MongoDB ShardingLightning Talk: MongoDB Sharding
Lightning Talk: MongoDB Sharding
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
 
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
 
Boston hug
Boston hugBoston hug
Boston hug
 
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
 

Similar to Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)

New idc architecture
New idc architectureNew idc architecture
New idc architectureMason Mei
 
RasterFrames: Enabling Global-Scale Geospatial Machine Learning
RasterFrames: Enabling Global-Scale Geospatial Machine LearningRasterFrames: Enabling Global-Scale Geospatial Machine Learning
RasterFrames: Enabling Global-Scale Geospatial Machine LearningAstraea, Inc.
 
RasterFrames - FOSS4G NA 2018
RasterFrames - FOSS4G NA 2018RasterFrames - FOSS4G NA 2018
RasterFrames - FOSS4G NA 2018Simeon Fitch
 
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...InfluxData
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechRob Emanuele
 
Cracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksCracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksTimothy Spann
 
Database@Home - Maps and Spatial Analyses: How to use them
Database@Home - Maps and Spatial Analyses: How to use themDatabase@Home - Maps and Spatial Analyses: How to use them
Database@Home - Maps and Spatial Analyses: How to use themTammy Bednar
 
Gvf Future Sat Africa 2017 Newspace
Gvf  Future Sat Africa 2017 NewspaceGvf  Future Sat Africa 2017 Newspace
Gvf Future Sat Africa 2017 NewspaceMyles Freedman
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationRob Emanuele
 
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterStockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterACSG Section Montréal
 
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsJason Riedy
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
The Many Facets of Apache Solr - Yonik Seeley
The Many Facets of Apache Solr - Yonik SeeleyThe Many Facets of Apache Solr - Yonik Seeley
The Many Facets of Apache Solr - Yonik Seeleylucenerevolution
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Big Data Spain
 
Big Data and High Performance Computing
Big Data and High Performance ComputingBig Data and High Performance Computing
Big Data and High Performance ComputingAbzetdin Adamov
 
Data Wars: The Bloody Enterprise strikes back
Data Wars: The Bloody Enterprise strikes backData Wars: The Bloody Enterprise strikes back
Data Wars: The Bloody Enterprise strikes backVictor_Cr
 

Similar to Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020) (20)

New idc architecture
New idc architectureNew idc architecture
New idc architecture
 
RasterFrames: Enabling Global-Scale Geospatial Machine Learning
RasterFrames: Enabling Global-Scale Geospatial Machine LearningRasterFrames: Enabling Global-Scale Geospatial Machine Learning
RasterFrames: Enabling Global-Scale Geospatial Machine Learning
 
RasterFrames - FOSS4G NA 2018
RasterFrames - FOSS4G NA 2018RasterFrames - FOSS4G NA 2018
RasterFrames - FOSS4G NA 2018
 
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...
 
Crocotta R&D - Virtual Universe
Crocotta R&D - Virtual UniverseCrocotta R&D - Virtual Universe
Crocotta R&D - Virtual Universe
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
 
Cracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksCracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworks
 
Database@Home - Maps and Spatial Analyses: How to use them
Database@Home - Maps and Spatial Analyses: How to use themDatabase@Home - Maps and Spatial Analyses: How to use them
Database@Home - Maps and Spatial Analyses: How to use them
 
Gvf Future Sat Africa 2017 Newspace
Gvf  Future Sat Africa 2017 NewspaceGvf  Future Sat Africa 2017 Newspace
Gvf Future Sat Africa 2017 Newspace
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
 
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS RasterStockage, manipulation et analyse de données matricielles avec PostGIS Raster
Stockage, manipulation et analyse de données matricielles avec PostGIS Raster
 
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
hadoop事例紹介
hadoop事例紹介hadoop事例紹介
hadoop事例紹介
 
The Many Facets of Apache Solr - Yonik Seeley
The Many Facets of Apache Solr - Yonik SeeleyThe Many Facets of Apache Solr - Yonik Seeley
The Many Facets of Apache Solr - Yonik Seeley
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Big Data and High Performance Computing
Big Data and High Performance ComputingBig Data and High Performance Computing
Big Data and High Performance Computing
 
Data Wars: The Bloody Enterprise strikes back
Data Wars: The Bloody Enterprise strikes backData Wars: The Bloody Enterprise strikes back
Data Wars: The Bloody Enterprise strikes back
 
Vaex pygrunn
Vaex pygrunnVaex pygrunn
Vaex pygrunn
 

More from Paul Brebner

The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...Paul Brebner
 
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining PhilosophersApache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining PhilosophersPaul Brebner
 
Spinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache KafkaSpinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache KafkaPaul Brebner
 
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Paul Brebner
 
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/HardScaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/HardPaul Brebner
 
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/HardOPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/HardPaul Brebner
 
A Visual Introduction to Apache Kafka
A Visual Introduction to Apache KafkaA Visual Introduction to Apache Kafka
A Visual Introduction to Apache KafkaPaul Brebner
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Paul Brebner
 
Grid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and PotentialGrid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and PotentialPaul Brebner
 
Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...Paul Brebner
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Paul Brebner
 
0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980's0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980'sPaul Brebner
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...Paul Brebner
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...Paul Brebner
 
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...Paul Brebner
 
How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 
Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...Paul Brebner
 
Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...Paul Brebner
 
Introduction to programming class 13
Introduction to programming   class 13Introduction to programming   class 13
Introduction to programming class 13Paul Brebner
 

More from Paul Brebner (20)

The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
 
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining PhilosophersApache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
 
Spinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache KafkaSpinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache Kafka
 
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
 
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/HardScaling Open Source Big Data Cloud Applications is Easy/Hard
Scaling Open Source Big Data Cloud Applications is Easy/Hard
 
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/HardOPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
 
A Visual Introduction to Apache Kafka
A Visual Introduction to Apache KafkaA Visual Introduction to Apache Kafka
A Visual Introduction to Apache Kafka
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
 
Grid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and PotentialGrid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and Potential
 
Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
 
0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980's0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980's
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
 
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
 
How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...
 
Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...
 
Introduction to programming class 13
Introduction to programming   class 13Introduction to programming   class 13
Introduction to programming class 13
 

Recently uploaded

AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Paige Cruz
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideStefan Dietze
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...FIDO Alliance
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 

Recently uploaded (20)

AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 

Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)

  • 1. Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and Cassandra Paul Brebner instaclustr.com Technology Evangelist ApacheCon Geospatial Track 2020 © Instaclustr Pty Limited, 2020
  • 3. Overview • “Location” in the news in 2020 • Anomaly Detection: baseline throughput • Spatial Anomaly Detection problem • Solutions: location representation and querying/indexing o Bounding boxes and secondary indexes o Geohashes o Lucene index o Results o 3D © 2020 Instaclustr Pty Limited
  • 4. In the News John Conway Legendary Polymath Passed away this year © 2020 Instaclustr Pty Limited
  • 5. Game of Life Next state of each cell depends on state of immediate neighbors © 2020 Instaclustr Pty Limited
  • 6. Simple rules but complex patterns Game of Life © 2020 Instaclustr Pty Limited
  • 7. Also In the News Social distancing and COVID-19 tracing Uncle Ron’s social distancing 3000 invention Or COVIDSAFE App? © 2020 Instaclustr Pty Limited
  • 8. Uncle Ron’s social distancing 3000 invention Also In the News And “planet-killer” asteroid missed the Earth in April (16x moon orbit) © 2020 Instaclustr Pty Limited
  • 9. Previously… KAFKA, CASSANDRA AND KUBERNETES AT SCALE: REAL- TIME ANOMALY DETECTION ON 19 BILLION EVENTS A DAY
  • 10. Anomaly Detection Spot the difference At speed (< 1s RT) and scale (high throughput, lots of data) © 2020 Instaclustr Pty Limited
  • 11. How Does It Work? • CUSUM (Cumulative Sum Control Chart) • Statistical analysis of historical data • Data for a single variable/key at a time • Potentially billions of keys © 2020 Instaclustr Pty Limited
  • 12. Pipeline Design • Kafka, Cassandra and Kubernetes Clusters – Kafka handles streaming, Cassandra for data storage, Kubernetes for application scaling • Efficient Cassandra Data writes and reads with key, a unique “account ID” or similar © 2020 Instaclustr Pty Limited Kubernetes
  • 13. Cassandra Data Model • Events are timeseries • Id is Partition Key - Time is clustering key (order) • Read gets most recent 50 values for id, very fast create table event_stream ( id bigint, time timestamp, value double, primary key (id, time) ) with clustering order by (time desc); select value from event_stream where id=314159265 limit 50; © 2020 Instaclustr Pty Limited
  • 14. Cluster details – CPU cores Cassandra 384 + Application pipeline (workers) 118 + Kafka 72 = 574 Cores Total 384 118 72 0 100 200 300 400 500 600 700 Cores Cores Used 574 Total Cassandra Workers Kafka
  • 15. Baseline Throughput 19 Billion Anomaly Checks/Day = 100% 0 20 40 60 80 100 120 Baseline (single transaction ID) Normalised (%) © 2020 Instaclustr Pty Limited
  • 16. Harder Problem: Spot the Differences in Space Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Douglas Adams, The Hitchhiker’s Guide to the Galaxy © 2020 Instaclustr Pty Limited
  • 17. Spatial Anomalies Many and varied © 2020 Instaclustr Pty Limited
  • 18. Real Example: John Snow No, not this one! © 2020 Instaclustr Pty Limited
  • 19. John Snow’s 1854 Cholera Map • Death’s per household + location • Used to identify a polluted pump (X) • Some outliers—brewers drank beer not water! X © 2020 Instaclustr Pty Limited
  • 20. But… First you have to know where you are: Location To usefully represent location need: • Coordinate system • Map • Scale © 2020 Instaclustr Pty Limited
  • 21. Better • <lat, long> coordinates • Scale • Interesting locations, “bulk of treasure here” © 2020 Instaclustr Pty Limited
  • 22. Geospatial Anomaly Detection South Atlantic Geomagnetic Anomaly New problem… • Rather than a single ID, events now have a location (and a value) • The problem now is to o find the nearest 50 events to each new event o Quickly (< 1s RT) • Can’t make any assumptions about geospatial properties of events o including location, density or distribution – i.e. where, or how many o Need to search from smallest to increasingly larger areas o E.g. South Atlantic Geomagnetic Anomaly is BIG • Uber (ApacheCon 2019 talk) uses similar • technologies to o forecast demand o Increase area until they have sufficient data for predictions • Can we use <lat, long> as Cassandra partition key? o Yes, compound partition keys are allowed. o But can only select the exact locations. © 2020 Instaclustr Pty Limited
  • 23. How to Compute Nearness To compute distance between locations you need coordinate system e.g. Mercator map Flat earth, distortion nearer poles © 2020 Instaclustr Pty Limited
  • 24. World is (approx.) Spherical Calculation of distance between two latitudinal/longitudinal points is non-trivial © 2020 Instaclustr Pty Limited
  • 25. Bounding Box Approximation of distance using inequalities © 2020 Instaclustr Pty Limited
  • 26. Bounding Boxes and Cassandra? • Use ”country” partition key, Lat/long/time clustering keys • But can’t run the query with multiple inequalities CREATE TABLE latlong ( country text, lat double, long double, time timestamp, PRIMARY KEY (country, lat, long, time) ) WITH CLUSTERING ORDER BY (lat ASC, long ASC, time DESC); select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50; InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "long" cannot be restricted (preceding column "lat" is restricted by a non-EQ relation)" © 2020 Instaclustr Pty Limited
  • 27. Secondary Indexes to the Rescue? Secondary Indexes o create index i1 on latlong (lat); o create index i2 on latlong (long); • But same restrictions as clustering columns. SASI - SSTable Attached Secondary Index • Supports more complex queries more efficiently o create custom index i1 on latlong (long) using 'org.apache.cassandra.index.sasi.SASIIndex'; o create custom index i2 on latlong (lat) using 'org.apache.cassandra.index.sasi.SASIIndex’; • select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow filtering; • “allow filtering” may be inefficient (if many rows have to be retrieved prior to filtering) and isn’t suitable for production. • But SASI docs say o even though “allow filtering” must be used with 2 or more column inequalities, there is actually no filtering taking place © 2020 Instaclustr Pty Limited
  • 28. Results Very slow (< 1%) 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI © 2020 Instaclustr Pty Limited
  • 29. Geohashes to the Rescue? • Divide maps into named and hierarchical areas • We’ve been something similar already: “country” partition key E.g. plate tectonics © 2020 Instaclustr Pty Limited
  • 30. Geohashes Rectangular areas Variable length base-32 string Single char regions 5,000km x 5,000km Each extra letter gives 32 sub-areas 8 chars is 40mx20m En/de-code lat/long to/from geohash But: Edges cases, non-linear near poles © 2020 Instaclustr Pty Limited
  • 31. Some Geohashes Are Words “ketchup” is in Africa © 2020 Instaclustr Pty Limited
  • 32. Some Geohashes Are Words 153mx153m153m x 153m © 2020 Instaclustr Pty Limited
  • 33. “Trump” Is in Kazakhstan! Not to scale 5kmx5km © 2020 Instaclustr Pty Limited
  • 34. Modifications for Geohashes • Lat/long encoded as geohash • Geohash is new key • Geohash used to query Cassandra © 2020 Instaclustr Pty Limited
  • 35. Geohashes and Cassandra In theory Geohashes work well for database indexes CREATE TABLE geohash1to8 ( geohash1 text, time timestamp, geohash2 text, geohash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (hash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE INDEX i8 ON geohash1to8 (geohash8); CREATE INDEX i7 ON geohash1to8 (geohash7); CREATE INDEX i6 ON geohash1to8 (geohash6); CREATE INDEX i5 ON geohash1to8 (geohash5); CREATE INDEX i4 ON geohash1to8 (geohash4); CREATE INDEX i3 ON geohash1to8 (geohash3); CREATE INDEX i2 ON geohash1to8 (geohash2); Option 1: Multiple Indexed Geohash Columns © 2020 Instaclustr Pty Limited
  • 36. • Query from smallest to largest areas select * from geohash1to8 where geohash1=’e’ and geohash7=’everywh’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash6=’everyw’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash5=’every’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash4=’ever’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash3=’eve’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash2=’ev’ limit 50; select * from geohash1to8 where geohash1=’e’ limit 50; Tradeoffs? Multiple secondary columns/indexes, multiple queries, accuracy and number of queries depends on spatial distribution and density • Stop when 50 rows found © 2020 Instaclustr Pty Limited
  • 37. Results Slightly better Option 1 = 10% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 © 2020 Instaclustr Pty Limited
  • 38. Geohashes and Cassandra Denormalization is “Normal” in Cassandra Create 8 tables, one for each geohash length CREATE TABLE geohash1 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); … CREATE TABLE geohash8 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); Option 2: Denormalized Multiple Tables © 2020 Instaclustr Pty Limited
  • 39. • Select from smallest to largest areas using corresponding table select * from geohash8 where geohash=’everywhe’ limit 50; select * from geohash7 where geohash=’everywh’ limit 50; select * from geohash6 where geohash=’everyw’ limit 50; select * from geohash5 where geohash=’every’ limit 50; select * from geohash4 where geohash=’ever’ limit 50; select * from geohash3 where geohash=’eve’ limit 50; select * from geohash2 where geohash=’ev’ limit 50; select * from geohash1 where geohash=’e’ limit 50; Tradeoffs? Multiple tables and writes, multiple queries © 2020 Instaclustr Pty Limited
  • 40. Results Getting better Option 2 = 20% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 © 2020 Instaclustr Pty Limited
  • 41. Geohashes and Cassandra Similar to Option 1 but using clustering columns CREATE TABLE geohash1to8_clustering ( geohash1 text, time timestamp, geohash2 text, gephash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (geohash1, geohash2, geohash3, geohash4, geohash5, geohash6, geohash7, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC, geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC, geohash8 DESC, time DESC); Option 3: Clustering Column(s) © 2020 Instaclustr Pty Limited
  • 42. How Do Clustering Columns Work? • Clustering columns are good for modelling and efficient querying of hierarchical/nested data • Query must include higher level columns with equality operator, ranges are only allowed on last column in query, lower level columns don’t have to be included. For example: o select * from geohash1to8_clustering where geohash1=’e’ and geohash2=’ev’ and geohash3 >= ’ev0’ and geohash3 <= ‘evz’ limit 50; • But why have multiple clustering columns when one is actually enough… Good for Hierarchical Data © 2020 Instaclustr Pty Limited
  • 43. Better: Single Geohash Clustering Column Geohash8 and time are clustering keys CREATE TABLE geohash_clustering ( geohash1 text, time timestamp, geohash8 text, lat double, long double, PRIMARY KEY (geohash1, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash8 DESC, time DESC); © 2020 Instaclustr Pty Limited
  • 44. Inequality Range Query • With decreasing length geohashes • Stop when result has 50 rows select * from geohash_clustering where geohash1=’e’ and geohash8=’everywhe’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everywh0’ and geohash8 <=’everywhz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everyw0’ and geohash8 <=’everywz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’every0’ and geohash8 <=’everyz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ever0’ and geohash8 <=’everz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’eve0’ and geohash8 <=’evez’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ev0’ and geohash8 <=’evz’ limit 50; select * from geohash_clustering where geohash1=’e’ limit 50; © 2020 Instaclustr Pty Limited
  • 45. Geohash Results Option 3 is best = 34% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 © 2020 Instaclustr Pty Limited
  • 46. Issues? ■ Cardinality for partition key ● should be > 100,000 ● >= 4 character geohash ■ Unbounded partitions are bad ● May need composite partition key in production ● e.g. extra time bucket (hour, day, etc) ■ Space vs time ● could have different sized buckets for different sized spaces ● E.g. bigger areas with more frequent events may need shorter time buckets to limit size ● This may depend on the space-time scales of underlying systems/processes ● E.g. Spatial and temporal scales of oceanographic processes (left)
  • 48. Cassandra Lucene Index Plugin • The Cassandra Lucene Index is a plugin for Apache Cassandra: o That extends its index functionality to provide near real-time search, including full-text search capabilities and free multivariable, geospatial and bitemporal search o It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. • Instaclustr now supports the plugin o Optional add-on to managed Cassandra service o And code support https://github.com/instaclustr/cassandra-lucene-index • How does this help for Geospatial queries? o Has very rich geospatial semantics including geo points, geo shapes, geo distance search, geo bounding box search, geo shape search, multiple distance units, geo transformations, and complex geo shapes. © 2020 Instaclustr Pty Limited
  • 49. Cassandra Table and Lucene Indexes • Geopoint Example • Under the hood indexing is done using a tree structure with geohashes (configurable precision) CREATE TABLE latlong_lucene ( geohash1 text, value double, time timestamp, latitude double, longitude double, Primary key (geohash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE CUSTOM INDEX latlong_index ON latlong_lucene () USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds': '1', 'schema': '{ fields: { geohash1: {type: "string"}, value: {type: "double"}, time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"}, place: {type: "geo_point", latitude: "latitude", longitude: "longitude"} }' };
  • 50. Search Options • Sort • Sophisticated but complex semantics (see the docs) SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ sort: [ {field: "place", type: "geo_distance", latitude: " + <lat> + ", longitude: " + <long> + "}, {field: "time", reverse: true} ] }') and geohash1=<geohash> limit 50; © 2020 Instaclustr Pty Limited
  • 51. Search Options • Building Box Filter • Need to compute box corners SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_bbox", field: "place", min_latitude: " + <minLat> + ", max_latitude: " + <maxLat> + ", min_longitude: " + <minLon> + ", max_longitude: " + <maxLon> + " }}’) limit 50; © 2020 Instaclustr Pty Limited
  • 52. Search Options • Geo Distance Filter SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_distance", field: "place", latitude: " + <lat> + ", longitude: " + <long> + ", max_distance: " <distance> + "km" } }') and geohash1=' + <hash1> + ' limit 50; © 2020 Instaclustr Pty Limited
  • 53. Search Options: Prefix Filter Prefix search is useful for searching larger areas over a single geohash column as you can search for a substring SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: [ {type: "prefix", field: "geohash1", value: <geohash>} ] }') limit 50 Similar to inequality over clustering column © 2020 Instaclustr Pty Limited
  • 54. Lucene Results Options = 2-25% Best is prefix filter (25%) 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash © 2020 Instaclustr Pty Limited
  • 55. Overall Geohash options are faster (25%, 34%) 34% -> 6.5 Billion Anomaly checks/day 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash G e o h a s h G e o h a s h © 2020 Instaclustr Pty Limited
  • 56. Overall Geohash options are faster (25%, 34%) Lucene bounded box/geo distance most accurate but only 5% of baseline performance 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash L u c e n e L u c e n e © 2020 Instaclustr Pty Limited
  • 57. 3D (Up and Down) Who needs it? © 2020 Instaclustr Pty Limited
  • 58. Location, Altitude and Volume • 3D Geohashes represent 2D location, altitude, and volume • A 3D geohash is a cube © 2020 Instaclustr Pty Limited
  • 60. Proximity rules > 50m from people and property >150m from congested areas > 1000m from airports > 5000m from exclusion zones Just happen to correspond to different length 3D geohashes,
  • 61. 3D Geohashes Works with all the geohash index options Therefore reasonably fast to compute 3D proximity More accurate slower options can be improved with bigger Cassandra clusters © 2020 Instaclustr Pty Limited 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h
  • 62. • Demo 3D Geohash java code o https://gist.github.com/paulbrebner/a67243859d2cf38bd9038a12a7b14762 o produces valid 3D geohashes for altitudes from 13km below sea level to geostationary satellite orbit o Can be used for any of the geohash options More Information? © 2020 Instaclustr Pty Limited
  • 63. More Information? © 2020 Instaclustr Pty Limited • Blogs: https://www.instaclustr.com/paul-brebner/ • Try us out! Free trial at Instaclustr.com