MongoDB San Francisco 2013: Geo Searches for Healthcare Pricing Data, presented by Robert Stewart, Castlight Health

Geo Searches for Health Care Pricing Data
Robert Stewart
Senior Architect, Castlight Health
rstewart@castlighthealth.com
@wombatnation

Castlight Health
The Business and Technical Problems
Initial Solution
MongoDB, Geo Haystack Index and SSDs
Replica Set Flipping

Castlight Health
• Hosted web and mobile applications providing unbiased information on health care cost and quality
• Customers are employers and health plans
• Founded in 2008, raised $181 million in VC funding
• #1 on the Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011
• Hiring!

Business Problem
• Support searches for
  • Prices for a procedure performed by any in-network provider in a geographical area
  • Prices for all procedures performed by a single provider
• Sub-second response, even when returning data on thousands of prices

Technical Problems
• Need a very fast geo index
• Rate count doubled in the last 3 months to 600 million
• Major rate updates monthly
• Difficult to index data to ensure sequential reads
• Sometimes lots of random reads

Pricing Retrieval Architecture
[Diagram. Components shown: User; Castlight Web Browser, Mobile Web Browser, and Native Mobile Application; Castlight Web App and Castlight Mobile Web App; Proxy Service; Search Service; Pricing Service; Prices.]

Initial Solution
• Store pricing data in MySQL
• When Pricing Service starts, create two in-memory indexes and cache most of the rates
  • 55 GB JVM heap with lots of GC tuning
  • 20-minute service startup time to build indexes
  • 3 hours for background caching of most rates
• Trouble brewing:
  • Total rates growing quickly
  • Rolling restart becoming unacceptably slow
  • If rates were not in the Java or MySQL cache, retrieval was very slow

Enter the Mongo

Geo Indexes
• Tried standard geo 2D indexes in MongoDB
  • Too slow for my use case
• Geo Haystack index
  • Conceptually similar
  • From docs.mongodb.org: “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”

[Figure: Mercator projection with a 10-degree grid]

Geo Haystack
• We chose degrees long-lat for the x-y coordinate system
• 25 miles is our default search radius
  • Roughly 0.5 degrees in the middle of the US

    db.priceables_1.ensureIndex(
      { loc: "geoHaystack", pm: 1 },
      { bucketSize: 0.5 })

    db.runCommand(
      { geoSearch: "priceables_1",
        near: [-122.4, 37.79],
        maxDistance: 0.5,
        search: { pm: 6757 },
        limit: 50000 })

• maxDistance calculated using great circle algorithm
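
The slides do not show how the 0.5 degree figure was derived. As a rough check, here is a minimal sketch that converts a radius in miles to degrees, assuming a spherical Earth with a mean radius of 3958.8 miles. The function name `radiusInDegrees` and the sample latitude of 39 degrees (roughly the middle of the US) are assumptions for illustration, not part of the talk.

```javascript
// Assumption: spherical Earth, mean radius in miles.
const EARTH_RADIUS_MILES = 3958.8;

// Convert a search radius in miles to degrees of latitude and longitude.
// A degree of longitude shrinks with cos(latitude), so the same distance
// spans more degrees east-west than north-south.
function radiusInDegrees(miles, latitudeDeg) {
  const latDegrees = (miles / EARTH_RADIUS_MILES) * (180 / Math.PI);
  const lonDegrees = latDegrees / Math.cos((latitudeDeg * Math.PI) / 180);
  return { latDegrees, lonDegrees };
}

const r = radiusInDegrees(25, 39);
console.log(r.latDegrees.toFixed(2), r.lonDegrees.toFixed(2));
```

At this latitude, 25 miles comes out to about 0.36 degrees of latitude and about 0.47 degrees of longitude, so 0.5 is a reasonable round figure that slightly over-covers the radius.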

Geo Haystack Pros
• Very fast when retrieving many documents in a relatively small search radius
• Great when you also need to apply a secondary filter
  • Compound 2dsphere index in MongoDB 2.4 has even better support
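
One way to picture why a bucketed index suits small search areas: locations are grouped into grid cells of `bucketSize`, and a search only needs to look at the cells its radius touches. The sketch below is a simplified illustration of that idea, not MongoDB's actual haystack implementation. The function names and the bucket key format are assumptions.

```javascript
// Illustrative grid bucketing only; NOT MongoDB's real haystack code.
function bucketOf(lon, lat, bucketSize) {
  return [Math.floor(lon / bucketSize), Math.floor(lat / bucketSize)];
}

// List the buckets that overlap the bounding box of a search circle.
function bucketsToScan(lon, lat, maxDistance, bucketSize) {
  const [minX, minY] = bucketOf(lon - maxDistance, lat - maxDistance, bucketSize);
  const [maxX, maxY] = bucketOf(lon + maxDistance, lat + maxDistance, bucketSize);
  const buckets = [];
  for (let x = minX; x <= maxX; x++) {
    for (let y = minY; y <= maxY; y++) {
      buckets.push(`${x}_${y}`);
    }
  }
  return buckets;
}

// Same values as the geoSearch example: near [-122.4, 37.79],
// maxDistance 0.5, bucketSize 0.5.
const buckets = bucketsToScan(-122.4, 37.79, 0.5, 0.5);
console.log(buckets.length); // 9
```

With the bucket size matched to the default search radius, a query touches only a small, fixed neighborhood of cells no matter how large the full data set is.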

Geo Haystack Cons
• Supports only one extra filter in the index
  • SERVER-2979
• Bug: an otherwise unindexed query on only the second part of the key fails with an assertion
  • SERVER-8645

    > db.priceables_1.find({pm: 6757})
    error: { "$err" : "assertion src/mongo/db/geo/haystack.cpp:178" }

• Second part of the index can’t have an array value
• Location part of the key can’t be null
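
The last two constraints can be checked before a document is loaded. This is a minimal sketch of such a check, assuming the field names `loc` and `pm` from the index definition; the function `haystackProblems` is hypothetical and not something shown in the talk.

```javascript
// Hypothetical pre-load check for the two data constraints listed above.
function haystackProblems(doc) {
  const problems = [];
  // The location part of the key can't be null (treat missing the same way).
  if (doc.loc === null || doc.loc === undefined) {
    problems.push("loc is null or missing");
  }
  // The second part of the index can't hold an array value.
  if (Array.isArray(doc.pm)) {
    problems.push("pm is an array");
  }
  return problems;
}

console.log(haystackProblems({ loc: [-122.4, 37.79], pm: 6757 })); // []
console.log(haystackProblems({ loc: null, pm: [6757, 6758] }));
```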

SSDs
• For uncached data on HDD, Geo Haystack was twice as fast as the custom Java geo index and MySQL
  • Still close to 1 minute for big queries with the full data set
  • Death by random read
• Tested with a $200 Samsung SSD
  • Typical query dropped to 20 ms
  • Big query only about 150 ms

Mongoperf on SSDs

Random 4k block reads, 5 GB file, 16 threads
Env      SSD                           Read Ops/s   Read MB/s
Prod     Samsung 200GB SLC             74k          288
QA VM    Samsung 200GB SLC             30k          117
Dev      Samsung 830 256GB SATA MLC    47k          183

Sequential write of the 5 GB file
Env      SSD                           Write Ops/s  Write MB/s
Prod     Samsung 200GB SLC             1074         289
QA VM    Samsung 200GB SLC             405          196
Dev      Samsung 830 256GB SATA MLC    438          210
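
The read columns are consistent with each other: ops per second times the block size gives the throughput. A small sketch of that arithmetic follows, assuming "4k" means 4096 bytes and "MB" means 2^20 bytes (neither is stated on the slide). The ops figures are the rounded values from the table.

```javascript
// Assumptions: 4k block = 4096 bytes, 1 MB = 1024 * 1024 bytes.
const BLOCK_BYTES = 4 * 1024;
const BYTES_PER_MB = 1024 * 1024;

// Throughput implied by a random-read rate at a fixed block size.
function readMBPerSecond(opsPerSecond) {
  return (opsPerSecond * BLOCK_BYTES) / BYTES_PER_MB;
}

for (const [env, ops] of [["Prod", 74000], ["QA VM", 30000], ["Dev", 47000]]) {
  console.log(env, readMBPerSecond(ops).toFixed(0));
}
```

This yields roughly 289, 117, and 184 MB/s, within rounding of the reported 288, 117, and 183. The write table does not follow the same relation because it measures a sequential write rather than 4k random operations.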

Low Impact Pricing Updates
• Requirements
  • Major price updates monthly
  • Minor updates more frequently
  • Huge bulk loads with no impact on the active replica set
  • I/O bound, not CPU bound

Replica Set Flipping Solution
• Two replica sets
  • Lowered cost with two SSDs on each pricing server
• scp compressed files from QA to the passive replica set
  • Protip: to compress and uncompress

    tar cvf - pricing | pigz > ~/pricing.tgz
    pigz -dc pricing.tgz | tar xvf -

• Page in index and data

    db.runCommand({ touch: "priceables_1", index: true, data: true })

• Pricing Service operation to atomically flip
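
The talk does not show how the Pricing Service performs the flip. As a rough illustration of the idea only, the sketch below keeps a reference to the active and passive replica sets and swaps them in a single synchronous step, so every later lookup sees the new set. The class `PricingStore` and its methods are hypothetical; the replica set names come from the architecture slide.

```javascript
// Hypothetical holder for the active/passive replica set names.
class PricingStore {
  constructor(active, passive) {
    this.active = active;
    this.passive = passive;
  }

  // Swap both references in one synchronous assignment. In single-threaded
  // JavaScript no reader can observe a half-updated state.
  flip() {
    [this.active, this.passive] = [this.passive, this.active];
    return this.active;
  }

  // Queries would be routed to whichever set is currently active.
  currentReplicaSet() {
    return this.active;
  }
}

const store = new PricingStore("prodpricing1", "prodpricing2");
store.flip();
console.log(store.currentReplicaSet()); // prodpricing2
```

In a multi-threaded service the same effect would need an atomic reference or a lock around the swap.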

Replica Set Architecture
Replica sets: prodpricing1, prodpricing2

Physical server   mongod port   Role
pricing1          28001         primary
pricing1          28002         secondary
pricing2          28001         secondary
pricing2          28002         primary
db1               28001         arbiter
db2               28002         arbiter

Replica Set Flipping Drawbacks
• Obviously, increased cost, but only for SSDs
• Recently added caching of remote pricing lookups
  • TTL collections
  • Cache is lost during a flip
  • But we usually flip late at night
  • Cache eviction time is only a few hours
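
A TTL collection is driven by an index with the `expireAfterSeconds` option on a date field. The sketch below builds such an index definition. The helper `ttlIndexSpec`, the field name `cachedAt`, the collection name `remote_price_cache`, and the 3-hour value are all assumptions for illustration; the slides say only that eviction time is "a few hours".

```javascript
// Build the key pattern and options for a TTL index.
// The indexed field must hold a date for MongoDB to expire documents.
function ttlIndexSpec(dateField, hours) {
  return {
    keys: { [dateField]: 1 },
    options: { expireAfterSeconds: hours * 60 * 60 },
  };
}

// Assumed values: field "cachedAt", 3-hour eviction.
const spec = ttlIndexSpec("cachedAt", 3);
console.log(JSON.stringify(spec));

// In the mongo shell this could be applied to a hypothetical cache collection:
//   db.remote_price_cache.ensureIndex(spec.keys, spec.options)
```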

Overall Results
• Geo search speed with cold cache acceptable
• Geo search speed with warm cache awesome
• Pricing Service startup down to a few seconds
• No production impact for major rate updates
• Lowered risk for minor rate updates

Summary
• Geo Haystack Index great for …
  • Retrieving lots of documents in a constrained search area
  • Geo searches with a secondary filter
• SSDs great for …
  • Random reads
  • Reducing need for lots of complex indexes
• Replica set flipping great for …
  • Instant swap of large amounts of data
  • Primarily, if not solely, read only
  • Trading cost for operational flexibility

Q & A
