3. About: David Smiley
• Software Engineer (16 years)
• Search (7 years)
• Java (full-stack), Web, Spatial
• Freelance search consultant / developer
• Apache Lucene / Solr committer & PMC
• Wrote first book on Solr, updated twice
4. Agenda
• About this project
• Architecture
• Solr & time sharding
• Experiences with:
– Kotlin, Dropwizard, Swagger
– Kafka
– Docker, Kontena
• Solr for geo-enrichment
• Solr adapter for Lucene BKD Lat-Lon point search & sort
• Heatmaps
– Existing functionality
• demo
– New functionality
5. H-Hypermap / BOP
• Harvard University, CGA: Center for Geospatial Analysis http://gis.harvard.edu
• Harvard Hypermap Project
– Managed by Ben Lewis
• BOP “Billion Object Platform”
– Funded by the Sloan Foundation
6. BOP Requirements Summary
• Most recent ~billion geo-tweets
• Realtime search (<5 sec latency)
• Sub-second queries
– Including heatmaps!
• On the cheap: ~6 mediocre boxes
Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.
7. Logical High-Level Architecture
[Diagram labels: Archival; Realtime; Harvesting; Enrichment; “BOP”; various clients...]
Data flows via Apache Kafka
Systems expose HTTP web services
8. The BOP
[Diagram labels: Kafka Topic; Ingester; ZooKeeper; Web-Service; Shard: W51; Shard: W52; Shard: W53; Shard: W54; ...; Shard: RT]
Kafka Streams
• Create Solr doc
• Routes to shard
REST/JSON API
• Keyword search
• Faceting
• Heatmaps
• CSV export
...
9. BOP Solr Sharding Architecture
[Diagram labels: G_North_America and G_Elsewhere, each shown with time shards T2016_05_20, T2016_05_06, T2016_04_22, T2016_04_08, … (4-5 mo.); a lone Realtime Collection/Shard (1-25 hrs); copy then delete, at night]
• Realtime shard is where realtime search happens. No caches, but small.
• Primary collections have useful caches
• Housekeeping Tasks:
– Move data from RT to primary
– Create new shards; expire old
– Merge/optimize shards
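The time-shard naming shown on this slide can be sketched in code. This is a minimal illustration, not the project's actual routing logic. It assumes that shards start every 14 days (inferred only from the shard names on the slide), that a document is routed by its UTC timestamp, and that 2016-04-08 is a valid shard start date. The names `ANCHOR`, `PERIOD_DAYS`, and `shard_for` are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# Assumptions (not confirmed by the slides): shards begin every 14 days,
# and 2016-04-08 is one known shard start date.
ANCHOR = date(2016, 4, 8)
PERIOD_DAYS = 14


def shard_for(ts: datetime) -> str:
    """Return a time-shard name such as 'T2016_05_20' for a tz-aware timestamp."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    day = ts.astimezone(timezone.utc).date()
    # Floor division also handles dates before the anchor correctly.
    periods = (day - ANCHOR).days // PERIOD_DAYS
    start = ANCHOR + timedelta(days=periods * PERIOD_DAYS)
    return f"T{start:%Y_%m_%d}"
```

Under these assumptions, a tweet timestamped 2016-05-25 UTC would land in `T2016_05_20`, and the ingester would only need the timestamp to pick a shard.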
10. Building a Search Web-Service
• Kotlin language (JVM based)
– Nullity as first-class language feature
• Dropwizard framework
– Designed for web-services
• Swagger
– Dynamically generated dev UI for web-services
11. Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
– For storage; time partitioning
• Lots of benefits yet serious limitations
12. Docker
• Easy to find/try/use software
– No installation
– Simplified configuration (env variables)
– Common logging
– Isolated
• Ideal for:
– Continuous Integration servers
– Trying new software
– Production advantages
• But “new”
13. Docker in Production
• I use “Kontena”
• Common logging, machine/proc stats, security
– VPN to secure network; access everything as local
• No longer need to care about:
– Ansible, Chef, Puppet, etc.
– Security at network or proxy; not service specific
• Challenges: state & big-data
14. Enrichment
Geo: Query Solr via spatial point query; attach related metadata to tweet
[Diagram labels: Kafka Topic; Enrich; Kafka Topic; Twitter Sentiment Classifier; Geo: Solr with regional polygons & metadata]
15. Solr for Geo Enrichment
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
– Gazetteer lookup (point-in-polygon)

Data Set                     Features  Raw size  Index time  Index size
Admin2                       46,311    824 MB    510 min     892 MB
US States                    74,002    747 MB    4.9 min     840 MB
Massachusetts Census Blocks  154,621   152 MB    5.9 min     507 MB
16. Fast Point-in-Polygon Tricks
Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
– precisionModel="floating_single"
– autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …
Search
• Embed Solr (in-process)
• Use docValues, not stored
– fl=block:field(GEOID10)
Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))
Sub-Millisecond!
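The query on this slide can be assembled per point as plain request parameters. The sketch below only builds the `q` and `fl` values shown above; the function name `point_in_polygon_params` is hypothetical, and the URL encoding step is included just to show the on-the-wire form. The slide recommends embedding Solr in-process, so an HTTP request is not necessarily how the project issues this query.

```python
from urllib.parse import urlencode


def point_in_polygon_params(lon: float, lat: float) -> dict:
    """Build the Solr params from the slide for a single lon/lat point."""
    return {
        # Field query against the WKT shape field, bypassing the filter cache.
        # WKT points are written as "x y", i.e. longitude first.
        "q": f"{{!field cache=false f=WKT}}Intersects(POINT({lon} {lat}))",
        # Return the GEOID10 value from docValues, aliased as "block".
        "fl": "block:field(GEOID10)",
    }


params = point_in_polygon_params(-71.06, 42.36)
query_string = urlencode(params)
```

Note the longitude-then-latitude order inside `POINT(...)`, which is easy to get backwards when the input arrives as lat/lon.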
17. Lucene “LatLonPoint”
• Uses new PointValues (BKD index) in Lucene 6
• Fastest: http://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module
• Some limitations: WGS84 points only
• Credit to Rob Muir and Mike McCandless
18. Solr Adapter For LatLonPoint
• New Solr FieldType for Lucene LatLonPoint
– Filter points by circle, rect, polygon
– Distance sort; but no boosting
Coming soon! Solr 6.4?
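A circle filter plus distance sort might look like the request below. This sketch assumes the new field type accepts the same `{!geofilt}` query parser and `geodist()` function that Solr's existing spatial fields use; the slides do not confirm that. The field name `coord` and the function `nearby_params` are hypothetical.

```python
def nearby_params(field: str, lat: float, lon: float, radius_km: float) -> dict:
    """Build Solr params: filter to a circle around a point, nearest first."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",       # circle filter; reads sfield, pt, and d below
        "sfield": field,          # spatial field to filter and sort on
        "pt": f"{lat},{lon}",     # center point as "lat,lon"
        "d": str(radius_km),      # radius in kilometers
        "sort": "geodist() asc",  # distance sort (no boosting, per the slide)
    }


example = nearby_params("coord", 42.36, -71.06, 5)
```

Unlike WKT, the `pt` parameter takes latitude first.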
19. Heatmaps: Spatial Grid Faceting
• Spatial density summary grid faceting, also useful for point-plotting search results
• Lucene & Solr APIs
• Scalable & fast usually…
• Usually rendered with a gradient radius
• See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html
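A client typically requests a heatmap with Solr's `facet.heatmap` parameters and then turns the returned count grid into cell centers for rendering. The sketch below is illustrative only. It assumes the heatmap section of the response has already been converted to a Python dict with the keys `columns`, `rows`, `minX`, `maxX`, `minY`, `maxY`, and `counts_ints2D`, that all-zero rows may arrive as `None`, and that the first row is the northernmost one. The field name `coord` and both function names are hypothetical.

```python
def heatmap_params(field: str, geom: str = '["-180 -90" TO "180 90"]') -> dict:
    """Build Solr heatmap faceting params for a spatial field."""
    return {
        "q": "*:*",
        "rows": "0",                        # only the facet is wanted, no docs
        "facet": "true",
        "facet.heatmap": field,
        "facet.heatmap.geom": geom,         # region to compute the grid over
        "facet.heatmap.format": "ints2D",   # counts as a 2D integer array
    }


def heatmap_cells(hm: dict) -> list:
    """Return (center_x, center_y, count) for every non-empty grid cell."""
    cell_w = (hm["maxX"] - hm["minX"]) / hm["columns"]
    cell_h = (hm["maxY"] - hm["minY"]) / hm["rows"]
    cells = []
    for r, row in enumerate(hm["counts_ints2D"] or []):
        if row is None:  # assumed: an all-zero row is sent as null
            continue
        for c, count in enumerate(row):
            if count:
                x = hm["minX"] + (c + 0.5) * cell_w
                y = hm["maxY"] - (r + 0.5) * cell_h  # assumed: row 0 is the top
                cells.append((x, y, count))
    return cells
```

The resulting triples can be handed to a map layer that draws each cell with a gradient radius, as in the example linked above.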
21. New HeatmapSpatialField
• Why?
– With new BKD/PointValues, no “RPT” field to use
– Scalable for heatmaps; don’t worry about search
• Scalable at all resolutions; many millions of docs/shard
– Can be specific about grid resolutions
Coming soon! Solr 6.4?
22. Heatmaps with Stats
• Instead of counting docs; calculate a metric
– Ex: avg(minuteOfDay)
• Will require JSON Facet API
• Inherently slower than just doc counts
Coming soon! Solr 6.4?
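The idea of a per-cell metric, as opposed to a per-cell document count, can be shown with a small client-side sketch. This is a conceptual illustration only and is not the Solr feature described on the slide, which was not yet released. The function `avg_grid` and its parameters are hypothetical; each input point carries a value such as minuteOfDay.

```python
from collections import defaultdict


def avg_grid(points, min_x: float, min_y: float, cell_size: float) -> dict:
    """Average a per-point value within each grid cell.

    points: iterable of (lon, lat, value) tuples.
    Returns {(col, row): average_value} for cells that contain points.
    """
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [running total, count]
    for lon, lat, value in points:
        cell = (int((lon - min_x) // cell_size), int((lat - min_y) // cell_size))
        sums[cell][0] += value
        sums[cell][1] += 1
    return {cell: total / n for cell, (total, n) in sums.items()}


grid = avg_grid([(0.5, 0.5, 600), (0.7, 0.2, 660), (1.5, 0.5, 30)], 0.0, 0.0, 1.0)
```

Each cell needs a running total as well as a count, which hints at why a metric is inherently more work than counting documents.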
25. Final Remarks
• Open-Source
– https://github.com/dsmiley/hhypermap-bop
• In-progress
• Improvements to Solr expected to be available before December; officially in Solr 6.4.