2. Who is Wordnik
   - Word + Meaning Discovery Engine
   - Clustered application
     - Built with Scala/Java/Jetty
     - Only way in is via REST
     - 19M API calls/day @ 7ms/query average
   - Physical servers
     - 72GB RAM, 8 core
     - 4.3TB DAS
   - We've been MongoDB users for ~1.5 yrs
     - Used in master/slave
     - 14B documents in MongoDB
3. Why a graph for words
   - Technique to model network relationships
     - Properties are dynamic
     - Links are “arbitrary”
   - Runtime performance
     - Answers in < 5ms/request
   - Routing functions based on goals
     - “find most likely word for X”
     - “find more common form of Y”
4. Why a graph for words
   - Misspellings, abbreviations, texting, Twitter
5. More about graphs
   - Different types of graphs
     - Decisions have a huge impact on design + implementation
   - Nodes (vertices)
     - String and numeric properties
   - Edges (links)
     - Finite set of labeled edge types (~30)
     - Multiple target nodes per edge
     - Each with a potentially different weight
     - Directed, non-symmetrical
6. Why build on MongoDB?
   - Word Graph is core to Wordnik
   - Many ways to build a graph
     - Dedicated graph DBs
     - Relational DBs
   - MongoDB document storage
     - Uber-flexible
     - Successfully routes in < 5ms
     - Long runway for scale-out
     - Limit storage infrastructure components
     - Easy to implement
7. Wordnik graph data model
   - Nodes
     - _id field holds name, object type
       - Index at no extra cost
     - Arbitrary number of properties
       - Only two datatypes for us: String, Double
     - Node type info in node ID (_id)
       - na_corpusCount => Double
       - sa_source => String
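A node document along these lines can be sketched in plain JavaScript. The helper names `nodeId` and `makeNode` are hypothetical, the property values are made up, and the sketch assumes the `na_` and `sa_` prefixes mark Double and String properties, as the two examples on the slide suggest.

```javascript
// Hypothetical helper: build the node _id from a name and an object type.
// The "name|type" layout follows the slide's example id "cat|word".
function nodeId(name, type) {
  return `${name}|${type}`;
}

// Hypothetical helper: build a node document.
// Assumption: "na_" prefixes numeric (Double) properties, "sa_" prefixes String ones.
function makeNode(name, type, numericProps = {}, stringProps = {}) {
  const doc = { _id: nodeId(name, type) };
  for (const [key, value] of Object.entries(numericProps)) {
    if (typeof value !== "number") throw new TypeError(`${key} must be a number`);
    doc[`na_${key}`] = value;
  }
  for (const [key, value] of Object.entries(stringProps)) {
    if (typeof value !== "string") throw new TypeError(`${key} must be a string`);
    doc[`sa_${key}`] = value;
  }
  return doc;
}

// Sample values are invented for illustration only.
const cat = makeNode("cat", "word", { corpusCount: 12345 }, { source: "example-corpus" });
console.log(cat);
```

Because the name and type live in `_id`, the lookup key is covered by the index every collection already has on `_id`.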
8. Wordnik graph data model
   - Edges
     - Destination(s)
     - Weight
     - Link properties
   - Stored in Mongo arrays
     - Array size is app-limited
     - Use $push, $pop
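One possible shape for such an edge document, with an app-enforced array limit, is sketched below. The field names (`targets`, `to`, `weight`), the edge `_id` layout, and the limit of 3 are assumptions for illustration. `$push` and `$pop` are MongoDB's own update operators; `applyUpdate` is only an in-memory stand-in so the sketch runs without a server.

```javascript
// Hypothetical edge document: one document per (source node, edge type),
// with every destination held in an array. Field names are illustrative.
const edge = {
  _id: "cat|word|synonym",
  targets: [
    { to: "feline|word", weight: 0.9 },
    { to: "kitty|word", weight: 0.7 },
  ],
};

// Update documents using the $push and $pop operators.
// $pop with -1 removes the first array element; with 1, the last.
const appendTarget = (to, weight) => ({ $push: { targets: { to, weight } } });
const dropOldestTarget = { $pop: { targets: -1 } };

// In-memory stand-in for the server; supports only the two update shapes above.
function applyUpdate(doc, update) {
  const copy = { ...doc, targets: [...doc.targets] };
  if (update.$push) copy.targets.push(update.$push.targets);
  if (update.$pop) {
    if (update.$pop.targets === -1) copy.targets.shift();
    else copy.targets.pop();
  }
  return copy;
}

// App-enforced size limit (the value 3 is arbitrary).
const MAX_TARGETS = 3;
function addTarget(doc, to, weight) {
  let next = applyUpdate(doc, appendTarget(to, weight));
  while (next.targets.length > MAX_TARGETS) next = applyUpdate(next, dropOldestTarget);
  return next;
}

const grown = addTarget(addTarget(edge, "puss|word", 0.5), "moggy|word", 0.4);
console.log(grown.targets.map((t) => t.to));
```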
9. Access to Mongo
   - Mongo access via DAO layer
     - Limit queries to ones that work “well”
     - ALL queries use an index
     - Find node “cat” of type “word”: db.node.findOne({_id:"cat|word"})
     - Find edge types for above: db.edge.find({_id:/^cat\|word/},{_id:1})
   - Serialization/deserialization
     - Done “the old-fashioned way”
     - BasicDBObject, BasicDBList faster than mappers for our use case
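A DAO-style helper for the edge lookup might build its prefix pattern as sketched here. An id such as "cat|word" contains "|", which is the alternation operator in a regex, so the prefix has to be escaped before it is anchored. The function names and the sample edge ids are hypothetical, and the array stands in for the edge collection.

```javascript
// Escape regex metacharacters so an id such as "cat|word" is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an anchored, case-sensitive "starts with" pattern for an _id prefix.
function idPrefixPattern(prefix) {
  return new RegExp("^" + escapeRegex(prefix));
}

// In-memory stand-in for the edge collection; the ids are made-up samples.
const edges = [
  { _id: "cat|word|synonym" },
  { _id: "cat|word|hypernym" },
  { _id: "category|word|synonym" },
];

// Hypothetical DAO method: ids of all edge documents that belong to a node.
// Against a real collection the same pattern would be passed as
// db.edge.find({_id: pattern}, {_id: 1}).
function findEdgeIds(nodeId) {
  const pattern = idPrefixPattern(nodeId + "|");
  return edges.filter((e) => pattern.test(e._id)).map((e) => e._id);
}

console.log(findEdgeIds("cat|word"));
```

Without the escaping, `/^cat|word/` would read as "starts with cat, or contains word" and would also match `category|word|synonym`.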
11. Routing, traversals, functions
    - Typically find path from A to B
      - Routes have costs
      - Low cost or high probability
    - Our use case is atypical
      - LinkedIn vs. Maps
      - Not from A to B
      - More like “from A with 3 hops”
      - This matters!
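The "from A with 3 hops" style of query can be sketched as a hop-bounded breadth-first expansion. This is a generic illustration over an in-memory adjacency list with made-up node ids, not Wordnik's routing code; edge weights and types are left out for brevity.

```javascript
// Adjacency list keyed by source node id; all ids and edges are sample data.
const graph = new Map([
  ["a", ["b", "c"]],
  ["b", ["d"]],
  ["c", ["d", "e"]],
  ["d", ["f"]],
  ["e", []],
  ["f", ["g"]],
]);

// Breadth-first expansion from `start`, stopping after `maxHops` levels.
// Returns a Map of reachable node id -> hop count at which it was first seen.
function withinHops(graph, start, maxHops) {
  const seen = new Map([[start, 0]]);
  let frontier = [start];
  for (let hop = 1; hop <= maxHops && frontier.length > 0; hop++) {
    const next = [];
    for (const node of frontier) {
      for (const neighbor of graph.get(node) || []) {
        if (!seen.has(neighbor)) {
          seen.set(neighbor, hop);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  seen.delete(start); // report only the nodes reached, not the origin
  return seen;
}

const reached = withinHops(graph, "a", 3);
console.log([...reached.keys()]);
```

There is no destination in this query: the work is bounded by the hop count and the fan-out per node, which is why the per-edge array limit matters for latency.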
13. Performance + scaling
    - Query by index only
    - Use regex syntax in restricted fashion
      - Starts-with only
      - No look-behind
      - Case sensitive
    - Boring? Fast?
    - Sharding is a no-brainer
      - What about ObjectId()?
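One way to see why the restricted regex form stays fast: a case-sensitive starts-with match is equivalent to a range over the sorted key, which an ordered index can answer without scanning everything. The sketch below computes that range for a literal prefix. The function name is hypothetical, and it assumes plain ASCII ids so that JavaScript string ordering and byte ordering agree; it does not show how the server plans the query internally.

```javascript
// Turn a literal prefix into half-open bounds [$gte, $lt) that cover
// exactly the strings starting with that prefix (ASCII ids assumed).
function prefixRange(prefix) {
  if (prefix.length === 0) throw new Error("prefix must be non-empty");
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { $gte: prefix, $lt: upper };
}

const range = prefixRange("cat|");
const inRange = (id) => id >= range.$gte && id < range.$lt;

console.log(range);
console.log(inRange("cat|word"), inRange("category|word"));
```

A pattern that is unanchored or case-insensitive has no such single range, which is the reason for ruling those forms out.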
14. Performance + scaling
    - Horizontal? Vertical? Both? And when?
    - Separate collections by edge type/object type
      - Increases storage needs
      - Collections all have padding; 30 collections => ~30x padding
    - Sharding
      - Use slick, built-in Mongo sharding
      - Roll your own based on your data
    - What does Wordnik do?
      - Neither! (yet)
      - 30M nodes, 50M edges
      - One collection for nodes
      - One collection for edges
15. Performance + scaling
    - Selecting a shard key
      - Done in application logic based on OUR data
      - Depends on what you need
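As an illustration of choosing a shard in application logic, the sketch below routes by the word portion of the id, so a node and all of its edge documents land on the same shard. This is an assumed scheme, not one the deck describes; the function names, the shard count, and the hash (an FNV-1a-style hash over UTF-16 code units) are all illustrative choices.

```javascript
// Deterministic 32-bit string hash in the style of FNV-1a,
// applied to UTF-16 code units rather than bytes.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned result
}

// Hypothetical routing function: shard on the name before the first "|",
// so "cat|word" and "cat|word|synonym" resolve to the same shard.
function shardFor(id, shardCount) {
  const name = id.split("|")[0];
  return fnv1a(name) % shardCount;
}

const SHARDS = 4; // arbitrary
console.log(shardFor("cat|word", SHARDS), shardFor("cat|word|synonym", SHARDS));
```

Keeping a node next to its edges suits a traversal-heavy workload, at the cost of uneven shards if a few words carry most of the edges; a different workload could reasonably pick a different key.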
16. End result
    - Solves Wordnik Graph infrastructure needs
      - Store Word nodes with UGC, corpus, structured, analytical data
      - Batch fetch edges @ > 50k/second
      - Find edge + endpoints in 80mS
    - Powers our…
      - Word selection
      - Canonicalization
      - Misspelling “Did you mean” logic
      - Classification + Matching Engine
19. Examples: Applied Word Graph
    - Recall: “Computers are stupid”
      - English is complex
    - Clustering + classification algorithms:
      - Stink without consistent data
        - “The” => “the” (duh)
        - “geese” => “goose” (ok)
      - Stink when they’re slow
    - Graph + Clustering/Classification
      - Just add data
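The canonicalization step can be pictured as following the highest-weight edge of one type before the word reaches a clustering or classification algorithm. In this sketch the edge data, the weights, and the idea of a dedicated canonical-form edge type are assumptions; only the two word pairs come from the slide.

```javascript
// Made-up canonical-form edges: source word -> weighted candidate targets.
const canonicalEdges = new Map([
  ["The", [{ to: "the", weight: 1.0 }]],
  ["geese", [{ to: "goose", weight: 0.95 }, { to: "geese", weight: 0.05 }]],
]);

// Follow the highest-weight canonical edge; words without one pass through unchanged.
function canonicalize(word) {
  const targets = canonicalEdges.get(word);
  if (!targets || targets.length === 0) return word;
  return targets.reduce((best, t) => (t.weight > best.weight ? t : best)).to;
}

console.log(["The", "geese", "cat"].map(canonicalize));
```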
20. MongoDB makes a great graph back-end
    - See more about Wordnik APIs: http://developer.wordnik.com
    - Further reading
      - Migrating from MySQL to MongoDB: http://www.slideshare.net/fehguy/migrating-from-mysql-to-mongodb-at-wordnik
      - Maintaining your MongoDB Installation: http://www.slideshare.net/fehguy/mongo-sv-tony-tam
    - Source code
      - Mapping Benchmark: https://github.com/fehguy/mongodb-benchmark-tools
      - Wordnik OSS Tools: https://github.com/wordnik/wordnik-oss