Wanderu is a consumer-focused search engine for buses and trains. In this webinar, we will recount the architectural, modeling and other technical "lessons learned" and "lessons unlearned" in implementing our geospatial and search features using Neo4j in the context of a NoSQL polyglot solution.
Speaker: Eddy Wong, CTO, Wanderu
Eddy is a technologist, innovator and entrepreneur who has architected products and web sites for companies like Hasbro, Maark, Allurent, Macromedia, Allaire, Open Sesame, Philips and AT&T. He was the Chief Architect at Open Sesame, where he built one of the first attribute-based personalization engines. Eddy has over 15 years of experience as a software architect and is a Boston tech-community leader in the areas of NoSQL, Big Data and Personalization. He is also the organizer of the Boston GraphDB Meetup.
4. From pt A to pt B
[Diagram: route from A (Boston) to B (DC) via NYC. Nomenclature: Stations, Trips. Example trips, all dated 09/26/2013: Amtrak, $101; Bolt, $25; Mega, $24.]
5. From pt A to pt B
[Diagram: door-to-door route from A (Cambridge, MA) to B (Brooklyn, NY). Stations shown: South Station, Boston; 31st & 9th Ave, NYC; 28th & 7th Ave, NYC; 34th & 8th Ave, NYC.]
6. Our Story
• Tech started about 1+ yr ago
• Beta in Mar, launch in Aug
• Knew nothing about Neo4j when we started (Jun 2012)
• Did not like the relational model: wanted schema-less and no self-joins
• Wanted a graph model
10. Our Story
• Started with MongoDB as a general store: easy to manipulate and organize data
• Wanted a db that could preserve the Graph Model
• Debated: Document vs. Graph
• Could not find one single db that could do both: general store + graph
12. NoSQL
• You need to make a choice of one NoSQL database
• You need ONE (centralized) database
• The word “database” is a loaded term
• Lots of (very different) NoSQL db options
13. Our Situation
• Data is written only in one direction
• Users search for paths, then segments
• Searches are done by date
• Needed online capability
• Trip info (price/avail) could change on some
14. Our Solution
• Use Both: MongoDB + Neo4j
• “Docugraph” = Document + Graph
• Syncing two kinds of databases
• Eventual consistency
16. MongoConnector
• MongoDB Labs project, open source, unsupported
• Uses Replica Mechanism: Oplog
• Eventually Consistent (not real time)
• Written in Python
• Main methods: upserts and deletes; each is passed the doc
• Implement DocMgr -> Neo4jDocMgr -> py2neo
• Other impls: MongoDocMgr, SolrDocMgr, ESDocMgr
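The doc-manager idea above can be sketched as follows. This is a minimal illustration, not the Mongo Connector or py2neo API: `InMemoryGraph` is a hypothetical stand-in for the Neo4j client, and the method names (`upsert`, `remove`) and document fields (`_id`, `depart`, `arrive`, `price`) are assumptions chosen to match the slide's wording.

```python
class InMemoryGraph:
    """Hypothetical stand-in for a Neo4j client such as py2neo."""

    def __init__(self):
        self.edges = {}  # edge id -> light properties

    def put_edge(self, edge_id, props):
        self.edges[edge_id] = props

    def delete_edge(self, edge_id):
        self.edges.pop(edge_id, None)


class Neo4jDocManager:
    """Receives oplog-derived events and mirrors them into the graph."""

    def __init__(self, graph):
        self.graph = graph

    def upsert(self, doc):
        # Copy only what path queries need; the full trip stays in MongoDB.
        props = {"depart": doc["depart"], "arrive": doc["arrive"]}
        self.graph.put_edge(doc["_id"], props)

    def remove(self, doc):
        self.graph.delete_edge(doc["_id"])


graph = InMemoryGraph()
manager = Neo4jDocManager(graph)
manager.upsert({"_id": "BOS-NYC", "depart": "BOS", "arrive": "NYC", "price": 25})
manager.upsert({"_id": "NYC-DC", "depart": "NYC", "arrive": "DC", "price": 24})
manager.remove({"_id": "NYC-DC"})
print(sorted(graph.edges))
```

Because the manager only ever sees documents replayed from the oplog, the graph lags MongoDB slightly, which is the eventual consistency noted above.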
17. Populating Neo4j (2)
• Created our own way of creating Edges
• Auto Node creation when Edge is created: could add Stations (nodes) on the fly
• py2neo requires 2 “node ref”s to create an edge, i.e. might need two round trips to Neo4j
18. Edge Creator P-code
hashtable allStations = load_stations
w_create_edge(station_id a, station_id b, otherdata)
    look_up a in allStations
    if found     -> ref_a = allStations.get(a)
    if not found -> ref_a = py2neo.create_node(a)
                    add a to allStations
    ...
    py2neo.create_edge(ref_a, ref_b, ...)
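A runnable sketch of the same idea, under stated assumptions: `FakeGraphClient` is a hypothetical in-memory stand-in (not py2neo) that counts simulated round trips, and the elided step in the p-code is assumed to repeat the lookup for station `b`.

```python
class FakeGraphClient:
    """Hypothetical stand-in for a graph client; counts simulated round trips."""

    def __init__(self):
        self.round_trips = 0
        self.nodes = []
        self.edges = []

    def create_node(self, station_id):
        self.round_trips += 1
        self.nodes.append(station_id)
        return len(self.nodes) - 1  # node reference

    def create_edge(self, ref_a, ref_b, **props):
        self.round_trips += 1
        self.edges.append((ref_a, ref_b, props))


class EdgeCreator:
    def __init__(self, client, known_stations=None):
        self.client = client
        # station id -> node reference, loaded once up front
        self.all_stations = dict(known_stations or {})

    def _node_ref(self, station_id):
        ref = self.all_stations.get(station_id)
        if ref is None:
            # Unknown station: create the node on the fly and cache its ref.
            ref = self.client.create_node(station_id)
            self.all_stations[station_id] = ref
        return ref

    def create_edge(self, a, b, **other_data):
        ref_a = self._node_ref(a)
        ref_b = self._node_ref(b)
        self.client.create_edge(ref_a, ref_b, **other_data)


client = FakeGraphClient()
creator = EdgeCreator(client)
creator.create_edge("BOS", "NYC", carrier="Bolt")
creator.create_edge("NYC", "DC", carrier="Mega")
print(client.round_trips, client.nodes)
```

The cache means a station already seen costs no extra call: the second edge reuses the NYC reference and only creates DC.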
19. Pipeline
[Diagram: Bus websites -> scraping -> JSON (non-uniform data) -> MongoDB -> Mongo Connector (replica mechanism) -> Neo4j (nodes & edges) -> REST server. Example station pairs: (BOS, NYC), (BOS, PHL), (NYC, DC), (NYC, PHL).]
21. Our Story
• We tried to “dump” all data into Neo4j
• Stations -> Nodes, Trips -> Edges
• Problem: Edges had dates -> too many Edges -> “Super Node”
• Query perf was terrible (1+ mins) and got worse as # edges increased
22. Our Story (2)
• Went from Cypher to Gremlin, thinking that would improve performance
• Needed range queries on Edges
23. Our Solution
• Don’t store everything in Neo4j, only metadata
• Use Neo4j as an index
• Don’t store entities in Nodes, only keys
• Don’t store heavy properties in Edges
25. Neo4j Runtime Model
• Relationships are in a linked list
• Properties are in a linked list
• Therefore: there is NO random access for Relationships or Properties
• A range query on relationships required a full scan
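To illustrate why a linked chain forces a full scan, here is a simplified model, not Neo4j's actual storage format. `RelNode`, `range_scan` and `range_indexed` are hypothetical names; the comparison is between walking every relationship record and binary-searching a sorted list, as an ordered index would.

```python
import bisect
from datetime import date


class RelNode:
    """One relationship record in a singly linked chain (simplified model)."""

    def __init__(self, depart_date, next_rel=None):
        self.depart_date = depart_date
        self.next_rel = next_rel


def range_scan(head, start, end):
    """Walk the whole chain; return (matches, number of records visited)."""
    matches, visited = [], 0
    rel = head
    while rel is not None:
        visited += 1
        if start <= rel.depart_date <= end:
            matches.append(rel.depart_date)
        rel = rel.next_rel
    return matches, visited


def range_indexed(sorted_dates, start, end):
    """Binary-search a sorted list, the way an ordered index would."""
    lo = bisect.bisect_left(sorted_dates, start)
    hi = bisect.bisect_right(sorted_dates, end)
    return sorted_dates[lo:hi]


dates = [date(2013, 9, d) for d in range(1, 31)]  # 30 daily trips
head = None
for d in dates:
    head = RelNode(d, head)  # prepend each new relationship to the chain

matches, visited = range_scan(head, date(2013, 9, 26), date(2013, 9, 27))
indexed = range_indexed(dates, date(2013, 9, 26), date(2013, 9, 27))
print(visited, len(matches), len(indexed))
```

Both approaches find the same two trips, but the chain walk touches all 30 records, and that cost grows with every dated edge added to a station.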
26. Our Solution (2)
• Needed ability to do range queries on Edges
• Serve paths from Neo4j, segments from MongoDB
• The one thing we tried to avoid we ended up doing: Joins
• Came up with “Docugraph” approach
27. Docugraph
• MongoDB Collections for Nodes and Edges
• Neo4j: Only keys for nodes
• Neo4j: Only Properties relevant for queries
30. Joins across DBs
MongoDB: Stations    Neo4j: Nodes
BOS                  BOS
NYC                  NYC
DC                   DC
...                  ...

MongoDB: Trips       Neo4j: Edges
BOS-NYC              BOS-NYC
BOS-DC               BOS-DC
NYC-DC               NYC-DC
...                  ...
• Forget seq id generated by dbs
• Use a human-created long string for id
• Convert pair into id: depart-arrive
• For example: BOS-NYC
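The cross-database join can be sketched as below. Everything here is an in-memory stand-in: `GRAPH` plays the role of Neo4j (station keys only) and `TRIPS` the role of a MongoDB trips collection keyed by the `depart-arrive` id. The function names are hypothetical, and the assignment of carriers and prices to segments is made up for illustration.

```python
from datetime import date

# Graph side (stand-in for Neo4j): station keys only.
GRAPH = {"BOS": ["NYC", "DC"], "NYC": ["DC"], "DC": []}

# Document side (stand-in for MongoDB), keyed by "depart-arrive".
TRIPS = {
    "BOS-NYC": [{"carrier": "Bolt", "price": 25, "date": date(2013, 9, 26)}],
    "NYC-DC": [{"carrier": "Mega", "price": 24, "date": date(2013, 9, 26)}],
    "BOS-DC": [{"carrier": "Amtrak", "price": 101, "date": date(2013, 9, 26)}],
}


def segment_id(depart, arrive):
    """Human-readable join key shared by both stores."""
    return f"{depart}-{arrive}"


def find_paths(graph, start, goal, path=None):
    """Depth-first enumeration of simple paths between two stations."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:
            paths.extend(find_paths(graph, nxt, goal, path))
    return paths


def trips_for_path(path, travel_date):
    """The join: look up each segment of a path by id, filtered by date."""
    legs = []
    for depart, arrive in zip(path, path[1:]):
        sid = segment_id(depart, arrive)
        legs.append([t for t in TRIPS.get(sid, []) if t["date"] == travel_date])
    return legs


paths = find_paths(GRAPH, "BOS", "DC")
itineraries = {"-".join(p): trips_for_path(p, date(2013, 9, 26)) for p in paths}
print(list(itineraries))
```

The graph answers "which station sequences connect A to B", and the date filter runs only against the handful of segment ids those paths produce.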
36. Geo
• Neo4j geo func was not out of the box
• Requires jar install
• Run a Java program to index
• Needed better doc
• Ended up using MongoDB geo instead
• Make geo func out of the box
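As a rough illustration of the MongoDB route, the sketch below builds a `$near` filter for a GeoJSON point field. The slides do not say which index type or field names were used, so the `loc` field, a 2dsphere index on it, and the helper name `near_filter` are all assumptions; the coordinates are approximate and illustrative only.

```python
def near_filter(lng, lat, max_meters, field="loc"):
    """Build a MongoDB $near filter for a 2dsphere-indexed GeoJSON field."""
    return {
        field: {
            "$near": {
                # GeoJSON order is longitude first, then latitude.
                "$geometry": {"type": "Point", "coordinates": [lng, lat]},
                "$maxDistance": max_meters,
            }
        }
    }


# Approximate coordinates for Cambridge, MA (illustrative only).
query = near_filter(-71.11, 42.37, 5000)
print(query)
```

With a driver such as pymongo, a filter like this could be passed to `find` on a stations collection to return nearby stations, nearest first.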
37. Conclusions
• Even with a join across dbs -> solution better than relational
• 10s paths x 100s segments vs. 500k x 500k
• Glad to have picked Neo4j: doing content gen and more geo features now
• Graph model will be useful for future analytics -> Big Data