MongoDB at Sailthru

         Ian White
        @eonwhite
      MongoNYC 2011
           6/7/11
Sailthru
• API-based transactional email led to...
• Mass campaign email led to...
• Intelligence and user behavior
• Three engineers built the ESP we always
  wanted to use
• Some Clients: Huffpo-AOL, Thrillist,
  Refinery 29, Flavorpill, Business Insider,
  Lot18, Fab, New York Observer
How We Got To
 MongoDB from SQL
• JSON was part of Sailthru infrastructure
  from start (SQL columns and S3)
• Kept a close eye on CouchDB project
• MongoDB felt like natural fit
• Used for user profiles and analytics initially
• Migrated one table at a time (very, very
  carefully)
Sailthru Architecture
• User interface to display stats, build
  campaigns and templates, etc (PHP/EC2)
• API, link rewriting, and onsite endpoints
  (PHP/EC2)
• Core mailer engine (Java/EC2 and colo)
• Modified-postfix SMTP servers (colo)
• 11 database servers on EC2 (for now)
MongoDB Overview

• 11 instances on EC2 (5 two-member
  replica sets, 1 backup server)
• About 40 collections
• About 1TB
• Largest single collection is 500m docs
Users are Documents

• Users aren’t records split among multiple
  tables
• End user’s lists, clickstream interests,
  geolocation, browser, time of day, purchase
  history becomes one ever-growing
  document
User Profile
{ "_id" : ObjectId("4b2d368aed948543a5fca4b4"), "browser" : { "Chrome" : 3, "Firefox" : 1, "iPhone" : 2 }, "click_count" : 1, "click_time" :
 "Wed Feb 17 2010 09:03:37 GMT-0500 (EST)", "client_id" : 450, "email" : "ibwhite@gmail.com", "email_hour" : { "13" : 1, "14" : 2, "16" : 2,
 "17" : 2, "18" : 3, "21" : 2 }, "geo" : { "city" : { "New York, NY US" : 3, "Sterling, VA US" : 1 }, "count" : 6, "country" : { "US" : 6 },
       "state" : { "NY US" : 3, "VA US" : 1 }, "zip" : { "10011 US" : 1, "10065 US" : 1 } }, "horizon" : { "admob" : 1, "advertising" : 3,
"afghanistan" : 1, "aig" : 2, "airline-industry" : 2, "alleyinsider" : 45, "analyst-research" : 1, "apple" : 25, "apple-tablet" : 5, "att" :
8, "bailout" : 5, "banks" : 6, "barack-obama" : 25, "ben-bernanke" : 1, "big-tech" : 17, "billionaires" : 1, "boats" : 1, "bonus" : 6, "bp" :
1, "budget" : 1, "cable" : 1, "caribbean" : 2, "cars" : 5, "chart-of-the-day" : 3, "china" : 3, "clusterstock" : 36, "cnbc" : 1, "comcast" :
   1, "commodities" : 3, "conan-obrien" : 6, "crime" : 2, "curbedcom" : 1, "death-of-tv" : 1, "debt" : 7, "deepwater-horizon-oil-spill" : 1,
      "dell" : 4, "development" : 1, "dick-fuld" : 1, "economy" : 10, "education" : 1, "employment" : 2, "entertainment" : 7, "europe" : 1,
   "facebook" : 4, "features" : 13, "financial-crisis" : 7, "financial-services" : 2, "fox" : 4, "fraud" : 1, "futures" : 1, "gadgets" : 21,
 "gas" : 1, "gawker" : 5, "gold" : 3, "goldman-sachs" : 1, "google" : 7, "green" : 5, "green-tech" : 2, "health" : 5, "health-care-reform" :
      7, "hedge-funds" : 3, "hires-and-fires" : 1, "housing-crisis" : 1, "hp" : 4, "hulu" : 2, "humor" : 1, "iad" : 1, "international" : 3,
    "investing" : 5, "ios" : 1, "ipad" : 2, "iphone" : 10, "jay-leno" : 5, "jim-cramer" : 1, "jobs" : 2, "john-gruber" : 2, "law-firms" : 1,
         "lawreview" : 3, "lehman-brothers" : 1, "litigation" : 5, "luxury" : 1, "mac" : 1, "magazines" : 1, "markets" : 7, "media" : 20,
        "mercedesbenz" : 4, "microsoft" : 1, "mining" : 1, "mobile" : 14, "mobile-ads" : 2, "moguls" : 1, "money" : 6, "money-media" : 2,
"moneygame" : 16, "morningstar" : 3, "mortgages" : 1, "mtv" : 1, "nbc" : 6, "new-york" : 1, "new-york-times" : 4, "news" : 9, "newspapers" :
   5, "nouriel-roubini" : 6, "oil" : 1, "online" : 10, "optimum-energy" : 4, "paul-krugman" : 3, "people" : 5, "politics" : 26, "radio" : 1,
     "real-estate" : 2, "recession" : 4, "regulation" : 12, "sai" : 15, "satellite-radio" : 1, "scandals" : 5, "security" : 1, "senate" : 4,
 "silicon-alley-insider" : 1, "sirius" : 1, "social-networking" : 3, "sports" : 1, "startups" : 1, "steve-jobs" : 1, "stimulus" : 1,
 "stock-market" : 5, "stocks" : 3, "tax-cuts" : 1, "taxes" : 1, "tbi" : 163, "tbi-live" : 3, "terrorism" : 3, "the-atlantic" : 1,
  "the-way-we-live-now" : 1, "themoneygame" : 3, "thewire" : 17, "tim-geithner" : 3, "time-warner-cable" : 1, "transportation" : 7, "treasury" : 2, "tv" : 7,
"tv-everywhere" : 1, "twitter" : 3, "uk" : 1, "unemployment" : 2, "us-government" : 8, "verizon" : 4, "video" : 6, "wall-st-cheat-sheet" : 1,
  "wall-street" : 25, "wall-street-journal" : 1, "warren-buffett" : 1, "white-house" : 4, "wwdc-2010" : 1, "yachts" : 1, "10gen" : 1,
   "2010-world-cup" : 1 }, "horizon_count" : 303, "horizon_time" : "Tue Dec 07 2010 15:26:35 GMT-0500 (EST)", "lists" : [ "TBI Research 1 - Beta",
"Dedicated Email", "TBI Research", "411" ], "lists_signup" : { "BI_iphone App" : null, "Clusterstock Chart Of The Day" : null, "Clusterstock
Select" : null, "Dedicated Email" : "Tue Dec 22 2009 13:29:43 GMT-0500 (EST)", "Dedicated Email - The Ladders" : null, "Green Sheet Select" :
null, "Insider 411" : null, "Insider 411 - Economist" : null, "Insider 411 - Ooyala" : null, "Insider 411 - The Wire Promo" : null, "Insider
  411- Economist" : null, "Law Review Select" : null, "Media Select" : null, "Silicon Alley Insider Chart Of The Day" : null, "Silicon Alley
     Insider Select" : null, "TBI Research" : "Tue Jan 05 2010 13:58:09 GMT-0500 (EST)", "TBI Research 1 - Beta" : "Mon Nov 09 2009 12:34:58
  GMT-0500 (EST)", "TBI Select" : null, "The Money Game Select" : null, "War Room Select" : null, "z_sailthru" : null, "10 Things Before the
      Opening Bell" : null, "411" : "Wed Jul 07 2010 11:28:03 GMT-0400 (EDT)" }, "open_count" : 11, "open_time" : "Tue Dec 07 2010 13:30:31
  GMT-0500 (EST)", "optout_templates" : [ ], "order" : 12, "signup_time" : "Mon Nov 09 2009 12:34:58 GMT-0500 (EST)", "site_hour" : { "20" :
 1 }, "status" : null, "status_time" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "ts" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "urls" :
                            [ "http://www.businessinsider.com/" ], "urls_count" : 1, "vars" : { "name" : "eonwhite" } }
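A profile like this can be kept current with one atomic update per event. The following is a minimal sketch, not Sailthru's actual code: the event shape and the helper name buildClickUpdate are assumptions, and only the field names (browser, geo.city, click_count, click_time) come from the profile above.

```javascript
// Build a MongoDB update document for one click event.
// The event shape and this helper are illustrative assumptions;
// the field names mirror the profile document shown above.
function buildClickUpdate(event) {
  return {
    $inc: {
      click_count: 1,
      ["browser." + event.browser]: 1,   // e.g. "browser.Chrome"
      ["geo.city." + event.city]: 1,     // e.g. "geo.city.New York, NY US"
    },
    $set: { click_time: event.time },
  };
}

const update = buildClickUpdate({
  browser: "Chrome",
  city: "New York, NY US",
  time: new Date("2010-02-17T14:03:37Z"),
});
console.log(JSON.stringify(update, null, 2));

// In the shell this could be applied as an upsert (collection name is hypothetical):
// db.profile.updateOne({ client_id: 450, email: "user@example.com" }, update, { upsert: true });
```

Because $inc creates missing counters, new browsers or cities need no schema change. Values containing a dot would need escaping before being used as keys.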
Profiles Accessible
       Everywhere
• Put abandoned shopping cart notifications
  within a mass email
{if profile.purchase_incomplete}
 <p>This is what’s in your cart:</p>
 {foreach profile.purchase_incomplete.items as item}
   {item.qty} <a href="{item.url}">{item.title}</a><br/>
 {/foreach}
{/if}
Profiles Accessible
       Everywhere
• Show a section of content conditional on
  the user’s location

{if profile.geo.city['New York, NY US']}
  <div>Come to the New York Meetup on the 27th!</div>
{/if}
Profiles Accessible
        Everywhere
• Show different content depending on user
   interests as measured by on-site behavior
{select}
  {case horizon_interest('black,dark')}
    <img src="http://example.com/dress-image-black.jpg" />
  {/case}
  {case horizon_interest('green')}
    <img src="http://example.com/dress-image-green.jpg" />
  {/case}
  {case horizon_interest('purple,polka_dot,pattern')}
    <img src="http://example.com/dress-image-polkadot.jpg" />
  {/case}
{/select}
Profiles Accessible
        Everywhere
• Pick top content from a data feed based on
   tags


{content = horizon_select(content,10)}

{foreach content as c}
  <a href="{c.url}">{c.title}</a><br/>
{/foreach}
Other Advantages of
     MongoDB
• High performance
• Take any parameters from our clients
• Really flexible development
• Great for analytics (internal and external)
• No more downtime for schema migrations
  or reindexing
How We Run mongod
•   mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log
    --logappend --fork --rest --replSet main1 --journal


• Don’t ever run without replication
• Don’t ever kill -9
• Don’t run without writing to a log
• Run behind a firewall
• Use journaling now that it’s there
• Use --rest, it’s handy
Separate DBs By
       Collections
• Lower-effort than auto-sharding
• Separate databases for different usage
  patterns
• Consider consequences of database failure/
  unavailability
• But make sure your backup and monitoring
  strategy is prepared for multiple DBs
Our Five Replica Sets
• main: most of the stuff on the UI, lots of
  small/medium collections
• horizon: realtime onsite browsing data
• profile: user profile data (60m user docs)
• message: last three months of emails
• archive: emails older than three months
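With collections split across replica sets, the application needs to know where each collection lives. A rough sketch of one way to do that, assuming a prefix-based routing table: the set names come from the list above, but the prefixes and the helper replicaSetFor are made up for illustration.

```javascript
// Hypothetical routing table: collection-name prefix -> replica set name.
// Set names come from the slide; the prefixes are illustrative guesses.
const ROUTES = [
  ["message.", "message"],
  ["archive.", "archive"],
  ["profile", "profile"],
  ["horizon", "horizon"],
];

function replicaSetFor(collection) {
  for (const [prefix, setName] of ROUTES) {
    if (collection.startsWith(prefix)) return setName;
  }
  return "main"; // assumption: everything else lives with the UI collections
}

console.log(replicaSetFor("message.blast.20110605"));
console.log(replicaSetFor("template"));
```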
Monitoring

• Some stuff to monitor: faults/sec, index
  misses, % locked, queue size, load average
• we check basic status once/minute on all
  database servers (SMS alerts if down), email
  warnings on thresholds every 10 minutes
• have been beta-ing 10gen’s MMS product
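Metrics such as faults/sec are usually derived from cumulative counters sampled at intervals. A small sketch of that arithmetic plus a threshold check; the metric names, sample shape, and threshold values are all placeholders, not Sailthru's settings.

```javascript
// Per-second rate of a cumulative counter between two samples.
// Each sample is { time: <ms since epoch>, value: <counter> } (assumed shape).
function ratePerSecond(prev, curr) {
  const seconds = (curr.time - prev.time) / 1000;
  if (seconds <= 0) return 0;
  return (curr.value - prev.value) / seconds;
}

// Return the names of metrics that exceed their thresholds.
function breaches(metrics, thresholds) {
  return Object.keys(thresholds).filter((name) => metrics[name] > thresholds[name]);
}

const faults = ratePerSecond({ time: 0, value: 100 }, { time: 10000, value: 150 });
const warnings = breaches(
  { faults_per_sec: faults, lock_pct: 12 },   // current readings (made up)
  { faults_per_sec: 10, lock_pct: 10 }        // thresholds (made up)
);
console.log(faults, warnings);
```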
Backups
• Used to use mongodump - don’t do that
  anymore
• Have single node of each replica set on a
  backup server
• Two-hour slave delay
• fsync/lock, freeze xfs file system, EBS
  snapshot, unfreeze, unlock
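The order of the last bullet matters, and the unfreeze and unlock must happen even if the snapshot fails. A sketch of that control flow, assuming a hypothetical steps object whose functions wrap the real commands (for instance db.fsyncLock() and db.fsyncUnlock() in the mongo shell, and xfs_freeze -f / -u on the host):

```javascript
// Run the backup steps in order and always undo freeze and lock,
// even if a later step throws. `steps` is a hypothetical interface.
function snapshotBackup(steps) {
  steps.fsyncLock();               // flush to disk and block writes
  try {
    steps.freezeFilesystem();      // e.g. xfs_freeze -f <mountpoint>
    try {
      steps.ebsSnapshot();         // start the EBS snapshot of the data volume
    } finally {
      steps.unfreezeFilesystem();  // e.g. xfs_freeze -u <mountpoint>
    }
  } finally {
    steps.fsyncUnlock();           // allow writes again
  }
}

// Dry run with stubs that only print the step names.
const names = ["fsyncLock", "freezeFilesystem", "ebsSnapshot", "unfreezeFilesystem", "fsyncUnlock"];
const dryRun = {};
for (const name of names) dryRun[name] = () => console.log(name);
snapshotBackup(dryRun);
```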
The Great EC2 EBS
  Outage Adventure
• We survived
• Most of our nodes unavailable for 2-4 days
• Were able to spin up new instances from
  backup server, snapshots, and get
  operational within hours
• Wasn’t fun
EC2 Future Plans
• EC2 is great overall
• EBS performance a little too inconsistent
  (even with RAID 0 or 10)
• Moving to relying on physical hardware
  (with SSD) in colo
• Retain some nodes and backups on EC2
• Let you know how it goes in a few months
DESIGN
Develop Your Mental
 Model of MongoDB

• You don’t need to look at the internals
• But try to gain a working understanding of
  how MongoDB operates, especially RAM
  and indexes
Big-Picture Design
        Questions
• What is the data I want to store?
• How will I want to use that data later?
• How big will the data get?
• If the answers are “I don’t know yet”, guess
  with your best YAGNI
“But premature
  optimization is evil”
• Knuth said that about code, which is
  flexible and easy to optimize later
• Data is not as flexible as code
• So doing some planning for performance is
  usually good when it comes to your data
Specific MongoDB
    Design Questions
• Embed vs top-level collection?
• Denormalize (double-store data)?
• How many/which indexes?
• Arrays vs hashes for embedding?
• Implicit schema (field names and types)
Short Field Names?
• Disk space: cheap
• RAM: not cheap
• Developer Time: expensive
• Err towards compact, readable fieldnames
• Might be worth writing a mapper
• Probably wish we’d used c instead of
  client_id
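A mapper lets code keep readable names while the stored documents use short ones. A minimal sketch, assuming a hand-written mapping table: only client_id to c is suggested above, the other entries and the helper names are invented for illustration, and only top-level keys are renamed.

```javascript
// Readable field name -> short stored name. Only client_id -> c comes from
// the slide; the other entries are made-up examples.
const FIELD_MAP = { client_id: "c", email: "e", signup_time: "st" };
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([long, short]) => [short, long])
);

// Rename top-level keys; fields without a mapping pass through unchanged.
function renameKeys(doc, map) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(map, key) ? map[key] : key;
    out[mapped] = value;
  }
  return out;
}

const toStorage = (doc) => renameKeys(doc, FIELD_MAP);
const fromStorage = (doc) => renameKeys(doc, REVERSE_MAP);

console.log(toStorage({ client_id: 450, email: "user@example.com", order: 12 }));
```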
Favor Human-Readable
     Foreign Keys
• DBRefs are a bit cumbersome
• Referencing by MongoId often means doing
  extra lookups
• Build human-readable references to save
  you doing lookups and manual joins
Example



• Store the Template and the Email as strings
    on the message object
•   { template: "Internal - Blast Notify",
      email: "support-alerts@sailthru.com" }


• No external reference lookups required
• The tradeoff is basically just disk space
Embed vs Top-Level
     Collections?
• Major question of MongoDB schema design
• If you can ask the question at all, you might
  want to err on the side of embedding
• Don’t embed if the embedding could get
  huge
• Don’t feel too bad about denormalizing by
  embedding AND storing in a top-level
  collection
Typical Properties of
Top-Level Collections

• Independence: They don’t “belong”
  conceptually to another collection
• Nouns: the building blocks of your system
• Easily referenceable and updatable
Embedding Pros
• Super-fast retrieval of document with
  related data
• Atomic updates
• “Ownership” of embedded document is
  obvious
• Usually maps well to code structures
Embedding Cons

• Harder to get at, do mass queries
• Does not size up infinitely, will hit 16MB
  limit
• Hard to create references to embedded
  object
• Limited ability to indexed-sort the
  embedded objects
If You Think You Can
          Embed
• You probably should
• I take advantage of embedding in my
  designs more often now than I did three
  years ago
• It’s a gift MongoDB gives you in exchange
  for giving up your joins
Design Example:
     User Permissions
• Users can have various broad permission
  levels for any number of clients
• For example, user ‘ploki’ might have
  permission level ‘admin’ for client 76 and
  permission level ‘reports_only’ for client
  450
How Will We Use This
      Data?

• Retrieve all clients for a given user
• Retrieve all users for a given client
• Retrieve a permission level for a given
  client for a given user
How Will This Data
      Grow?

• In the medium term, it will stay small
• Number of clients and number of users can
  both grow infinitely
Back in SQL-land

• There’s a fairly standard way to do it
• It’s a many-many relationship, so
• Use a join table (client_user)
Should We Use a New
Top-Level Collection?
  db.client.user.save( {
    client_id: 76,
    username: 'ploki',
    permission: 'admin'
  });
  db.client.user.save( {
    client_id: 450,
    username: 'ploki',
    permission: 'reports_only'
  });

  db.client.user.ensureIndex( { client_id: 1 } );
  db.client.user.ensureIndex( { username: 1 } );

  // get all users belonging to a client
  db.client.user.find( { client_id: 76 } );

  // get all clients a user has access to
  db.client.user.find( { username: 'ibwhite' } );

  // get the permission of our current user for one client
  // (client.id is a placeholder, like user.name)
  db.client.user.findOne( { username: user.name, client_id: client.id } );
Probably Not


• Only needed if we have lots of clients per
  user AND lots of users per client
• This is a case where we can embed, so let’s
  do so
Three Ways to Embed
• Object. Not good: can’t do a multikeys index on the keys of a hash
    'clients': {
      '76': 'admin',
      '450': 'reports_only'
    }
    index: ???

• Array of objects. Okay: but have to search through the array to find by _id on the retrieved doc
    'clients': [
      {'_id': 76, 'access': 'admin'},
      {'_id': 450, 'access': 'reports_only'}
    ]
    index: { 'clients._id': 1 }

• Array and object. Our approach: fields sit next to each other alphabetically
    'clients': [ 76, 450 ],
    'clients_access': {
      '76': 'admin',
      '450': 'reports_only'
    }
    index: { clients: 1 }
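The array-and-object layout only works if both fields are updated together. A sketch of the bookkeeping on an in-memory user document; the helper names grantAccess and permissionFor are hypothetical, while the field names follow the slide.

```javascript
// Grant or change a user's access to a client, keeping both fields in sync.
function grantAccess(user, clientId, level) {
  if (!user.clients.includes(clientId)) {
    user.clients.push(clientId);
  }
  user.clients_access[String(clientId)] = level;
  return user;
}

// Permission for one client: a direct key lookup on the retrieved document.
function permissionFor(user, clientId) {
  const level = user.clients_access[String(clientId)];
  return level === undefined ? null : level;
}

const ploki = { username: "ploki", clients: [], clients_access: {} };
grantAccess(ploki, 76, "admin");
grantAccess(ploki, 450, "reports_only");
console.log(permissionFor(ploki, 450));

// Query shape for "all users for a given client"; it can use the index { clients: 1 }.
const usersForClient76 = { clients: 76 };

// Server-side equivalent of grantAccess (collection name is hypothetical):
// db.user.update({ username: "ploki" },
//   { $addToSet: { clients: 76 }, $set: { "clients_access.76": "admin" } });
```

All three access patterns from the earlier slide are covered: the user document lists its clients, the multikey index on clients finds users per client, and the permission is a key lookup with no array scan.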
Indexes
• Index all highly frequent queries
• Do less-indexed queries only on
  secondaries
• Reduce the size of indexes wherever you
  can on big collections
• Don’t sweat the medium-sized collections,
  focus on the big wins
Take Advantage of
Multiple-Field Indexes
• Order matters
• If you have an index on {client_id: 1, email: 1}
• Then you also have the {client_id: 1} index “for free”
• but not {email: 1}
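The prefix rule can be stated as a tiny function. This is a deliberately simplified model (equality matches only, ignoring sorts and range conditions), and usablePrefixLength is an illustrative name, not a MongoDB API.

```javascript
// Count how many leading fields of a compound index a query can use.
// Simplified model: equality matches only, no sorts or ranges.
function usablePrefixLength(indexFields, query) {
  let n = 0;
  for (const field of indexFields) {
    if (!Object.prototype.hasOwnProperty.call(query, field)) break;
    n += 1;
  }
  return n;
}

const indexFields = ["client_id", "email"]; // { client_id: 1, email: 1 }
console.log(usablePrefixLength(indexFields, { client_id: 450 }));                            // 1
console.log(usablePrefixLength(indexFields, { client_id: 450, email: "user@example.com" })); // 2
console.log(usablePrefixLength(indexFields, { email: "user@example.com" }));                 // 0
```

A result of 0 means the query skips the leading field, so this index does not help it.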
Use your _id


• You must use an _id for every collection,
  which will cost you index size
• So do something useful with _id
Take advantage of fast
      ^indexes
• Messages have _ids like: 32423.00000341
• Need all messages in blast 32423:
• db.message.blast.find( { _id: /^32423\./ } );

•   (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
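A sketch of how such _ids and prefix queries might be built. The 8-digit padding is inferred from the example _id and is an assumption, as are the helper names. Note that the dot has to be escaped in the regex, otherwise blast 32423 would also match _ids of blast 324230.

```javascript
// Compose a message _id as "<blastId>.<8-digit sequence>".
// The padding width is inferred from "32423.00000341" and is an assumption.
function makeMessageId(blastId, seq) {
  return blastId + "." + String(seq).padStart(8, "0");
}

// Anchored regex with the dot escaped. blastId is assumed to be numeric,
// so it contains no regex metacharacters.
function blastPrefixRegex(blastId) {
  return new RegExp("^" + blastId + "\\.");
}

// Equivalent range form: "/" is the character right after "." in ASCII,
// so this covers exactly the strings that start with "<blastId>.".
function blastRangeQuery(blastId) {
  return { _id: { $gte: blastId + ".", $lt: blastId + "/" } };
}

const id = makeMessageId(32423, 341);
console.log(id, blastPrefixRegex(32423).test(id));
console.log(JSON.stringify(blastRangeQuery(32423)));
```

Both forms let the query walk a contiguous slice of the _id index instead of scanning the collection.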
Manual Range
              Partitioning
• We moved a big message.blast collection
    into per-day collections:
•   message.blast.20110605
    message.blast.20110606
    message.blast.20110607
    etc...


• Keeps working set indexes smaller
• When we move data into the archive,
    drop() is much faster than remove()
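The naming scheme makes it easy to compute which collection to write to and which ones are old enough to archive. A minimal sketch, assuming days are cut in UTC (the talk does not say which timezone is used); the helper names are hypothetical.

```javascript
// Name of the per-day collection for a given date (UTC), e.g. message.blast.20110605.
function dailyCollectionName(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  return "message.blast." + y + m + d;
}

// From a list of collection names, pick the daily ones strictly older than the cutoff.
// Fixed-width dates mean plain string comparison orders them correctly.
function collectionsOlderThan(names, cutoff) {
  const limit = dailyCollectionName(cutoff);
  return names.filter((n) => /^message\.blast\.\d{8}$/.test(n) && n < limit);
}

const names = ["message.blast.20110605", "message.blast.20110606", "message.blast.20110607"];
console.log(collectionsOlderThan(names, new Date(Date.UTC(2011, 5, 7))));

// After copying to the archive, each old collection can be dropped in the shell:
// db.getCollection(name).drop();
```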
Questions?
Looking for a job?
     ian@sailthru.com
   twitter.com/eonwhite

Presentation on how to chat with PDF using ChatGPT code interpreter
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Mongo at Sailthru (MongoNYC 2011)

  • 1. MongoDB at Sailthru Ian White @eonwhite MongoNYC 2011 6/7/11
  • 2. Sailthru • API-based transactional email led to... • Mass campaign email led to... • Intelligence and user behavior • Three engineers built the ESP we always wanted to use • Some Clients: Huffpo-AOL, Thrillist, Refinery 29, Flavorpill, Business Insider, Lot18, Fab, New York Observer
  • 3. How We Got To MongoDB from SQL • JSON was part of Sailthru infrastructure from start (SQL columns and S3) • Kept a close eye on CouchDB project • MongoDB felt like natural fit • Used for user profiles and analytics initially • Migrated one table at a time (very, very carefully)
  • 4. Sailthru Architecture • User interface to display stats, build campaigns and templates, etc (PHP/EC2) • API, link rewriting, and onsite endpoints (PHP/EC2) • Core mailer engine (Java/EC2 and colo) • Modified-postfix SMTP servers (colo) • 11 database servers on EC2 (for now)
  • 5. MongoDB Overview • 11 instances on EC2 (5 two-member replica sets, 1 backup server) • About 40 collections • About 1TB • Largest single collection is 500m docs
  • 6. Users are Documents • Users aren’t records split among multiple tables • End user’s lists, clickstream interests, geolocation, browser, time of day, and purchase history become one ever-growing document
  • 7. User Profile { "_id" : ObjectId("4b2d368aed948543a5fca4b4"), "browser" : { "Chrome" : 3, "Firefox" : 1, "iPhone" : 2 }, "click_count" : 1, "click_time" : "Wed Feb 17 2010 09:03:37 GMT-0500 (EST)", "client_id" : 450, "email" : "ibwhite@gmail.com", "email_hour" : { "13" : 1, "14" : 2, "16" : 2, "17" : 2, "18" : 3, "21" : 2 }, "geo" : { "city" : { "New York, NY US" : 3, "Sterling, VA US" : 1 }, "count" : 6, "country" : { "US" : 6 }, "state" : { "NY US" : 3, "VA US" : 1 }, "zip" : { "10011 US" : 1, "10065 US" : 1 } }, "horizon" : { "admob" : 1, "advertising" : 3, "afghanistan" : 1, "aig" : 2, "airline-industry" : 2, "alleyinsider" : 45, "analyst-research" : 1, "apple" : 25, "apple-tablet" : 5, "att" : 8, "bailout" : 5, "banks" : 6, "barack-obama" : 25, "ben-bernanke" : 1, "big-tech" : 17, "billionaires" : 1, "boats" : 1, "bonus" : 6, "bp" : 1, "budget" : 1, "cable" : 1, "caribbean" : 2, "cars" : 5, "chart-of-the-day" : 3, "china" : 3, "clusterstock" : 36, "cnbc" : 1, "comcast" : 1, "commodities" : 3, "conan-obrien" : 6, "crime" : 2, "curbedcom" : 1, "death-of-tv" : 1, "debt" : 7, "deepwater-horizon-oil-spill" : 1, "dell" : 4, "development" : 1, "dick-fuld" : 1, "economy" : 10, "education" : 1, "employment" : 2, "entertainment" : 7, "europe" : 1, "facebook" : 4, "features" : 13, "financial-crisis" : 7, "financial-services" : 2, "fox" : 4, "fraud" : 1, "futures" : 1, "gadgets" : 21, "gas" : 1, "gawker" : 5, "gold" : 3, "goldman-sachs" : 1, "google" : 7, "green" : 5, "green-tech" : 2, "health" : 5, "health-care-reform" : 7, "hedge-funds" : 3, "hires-and-fires" : 1, "housing-crisis" : 1, "hp" : 4, "hulu" : 2, "humor" : 1, "iad" : 1, "international" : 3, "investing" : 5, "ios" : 1, "ipad" : 2, "iphone" : 10, "jay-leno" : 5, "jim-cramer" : 1, "jobs" : 2, "john-gruber" : 2, "law-firms" : 1, "lawreview" : 3, "lehman-brothers" : 1, "litigation" : 5, "luxury" : 1, "mac" : 1, "magazines" : 1, "markets" : 7, "media" : 20, "mercedesbenz" : 4, "microsoft" : 1, "mining" : 1, "mobile" : 14, "mobile-ads" : 2, "moguls" : 1, "money" : 6, "money-media" : 2, "moneygame" : 16, "morningstar" : 3, "mortgages" : 1, "mtv" : 1, "nbc" : 6, "new-york" : 1, "new-york-times" : 4, "news" : 9, "newspapers" : 5, "nouriel-roubini" : 6, "oil" : 1, "online" : 10, "optimum-energy" : 4, "paul-krugman" : 3, "people" : 5, "politics" : 26, "radio" : 1, "real-estate" : 2, "recession" : 4, "regulation" : 12, "sai" : 15, "satellite-radio" : 1, "scandals" : 5, "security" : 1, "senate" : 4, "silicon-alley-insider" : 1, "sirius" : 1, "social-networking" : 3, "sports" : 1, "startups" : 1, "steve-jobs" : 1, "stimulus" : 1, "stock-market" : 5, "stocks" : 3, "tax-cuts" : 1, "taxes" : 1, "tbi" : 163, "tbi-live" : 3, "terrorism" : 3, "the-atlantic" : 1, "the-way-we-live-now" : 1, "themoneygame" : 3, "thewire" : 17, "tim-geithner" : 3, "time-warner-cable" : 1, "transportation" : 7, "treasury" : 2, "tv" : 7, "tv-everywhere" : 1, "twitter" : 3, "uk" : 1, "unemployment" : 2, "us-government" : 8, "verizon" : 4, "video" : 6, "wall-st-cheat-sheet" : 1, "wall-street" : 25, "wall-street-journal" : 1, "warren-buffett" : 1, "white-house" : 4, "wwdc-2010" : 1, "yachts" : 1, "10gen" : 1, "2010-world-cup" : 1 }, "horizon_count" : 303, "horizon_time" : "Tue Dec 07 2010 15:26:35 GMT-0500 (EST)", "lists" : [ "TBI Research 1 - Beta", "Dedicated Email", "TBI Research", "411" ], "lists_signup" : { "BI_iphone App" : null, "Clusterstock Chart Of The Day" : null, "Clusterstock Select" : null, "Dedicated Email" : "Tue Dec 22 2009 13:29:43 GMT-0500 (EST)", "Dedicated Email - The Ladders" : null, "Green Sheet Select" : null, "Insider 411" : null, "Insider 411 - Economist" : null, "Insider 411 - Ooyala" : null, "Insider 411 - The Wire Promo" : null, "Insider 411- Economist" : null, "Law Review Select" : null, "Media Select" : null, "Silicon Alley Insider Chart Of The Day" : null, "Silicon Alley Insider Select" : null, "TBI Research" : "Tue Jan 05 2010 13:58:09 GMT-0500 (EST)", "TBI Research 1 - Beta" : "Mon Nov 09 2009 12:34:58 GMT-0500 (EST)", "TBI Select" : null, "The Money Game Select" : null, "War Room Select" : null, "z_sailthru" : null, "10 Things Before the Opening Bell" : null, "411" : "Wed Jul 07 2010 11:28:03 GMT-0400 (EDT)" }, "open_count" : 11, "open_time" : "Tue Dec 07 2010 13:30:31 GMT-0500 (EST)", "optout_templates" : [ ], "order" : 12, "signup_time" : "Mon Nov 09 2009 12:34:58 GMT-0500 (EST)", "site_hour" : { "20" : 1 }, "status" : null, "status_time" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "ts" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "urls" : [ "http://www.businessinsider.com/" ], "urls_count" : 1, "vars" : { "name" : "eonwhite" } }
  • 8. Profiles Accessible Everywhere • Put abandoned shopping cart notifications within a mass email {if profile.purchase_incomplete} <p>This is what’s in your cart:</p> {foreach profile.purchase_incomplete.items as item} {item.qty} <a href="{item.url}">{item.title}</a><br/> {/foreach} {/if}
  • 9. Profiles Accessible Everywhere • Show a section of content conditional on the user’s location {if profile.geo.city['New York, NY US']} <div>Come to the New York Meetup on the 27th!</div> {/if}
  • 10. Profiles Accessible Everywhere • Show different content depending on user interests as measured by on-site behavior {select} {case horizon_interest('black,dark')} <img src="http://example.com/dress-image-black.jpg" /> {/case} {case horizon_interest('green')} <img src="http://example.com/dress-image-green.jpg" /> {/case} {case horizon_interest('purple,polka_dot,pattern')} <img src="http://example.com/dress-image-polkadot.jpg" /> {/case} {/select}
  • 11. Profiles Accessible Everywhere • Pick top content from a data feed based on tags {content = horizon_select(content,10)} {foreach content as c} <a href="{c.url}">{c.title}</a><br/> {/foreach}
  • 12. Other Advantages of MongoDB • High performance • Take any parameters from our clients • Really flexible development • Great for analytics (internal and external) • No more downtime for schema migrations or reindexing
  • 13. How We Run mongod • mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal • Don’t ever run without replication • Don’t ever kill -9 • Don’t run without writing to a log • Run behind a firewall • Use journaling now that it’s there • Use --rest, it’s handy
  • 14. Separate DBs By Collections • Lower-effort than auto-sharding • Separate databases for different usage patterns • Consider consequences of database failure/unavailability • But make sure your backup and monitoring strategy is prepared for multiple DBs
  • 15. Our Five Replica Sets • main: most of the stuff on the UI, lots of small/medium collections • horizon: realtime onsite browsing data • profile: user profile data (60m user docs) • message: last three months of emails • archive: emails older than three months
  • 16. Monitoring • Some stuff to monitor: faults/sec, index misses, % locked, queue size, load average • We check basic status once/minute on all database servers (SMS alerts if down), with email warnings on thresholds every 10 minutes • Have been beta-ing 10gen’s MMS product
  • 17. Backups • Used to use mongodump - don’t do that anymore • Have single node of each replica set on a backup server • Two-hour slave delay • fsync/lock, freeze xfs file system, EBS snapshot, unfreeze, unlock
  • 18. The Great EC2 EBS Outage Adventure • We survived • Most of our nodes unavailable for 2-4 days • Were able to spin up new instances from backup server, snapshots, and get operational within hours • Wasn’t fun
  • 19. EC2 Future Plans • EC2 is great overall • EBS performance a little too inconsistent (even with RAID 0 or 10) • Moving to relying on physical hardware (with SSD) in colo • Retain some nodes and backups on EC2 • Let you know how it goes in a few months
  • 21. Develop Your Mental Model of MongoDB • You don’t need to look at the internals • But try to gain a working understanding of how MongoDB operates, especially RAM and indexes
  • 22. Big-Picture Design Questions • What is the data I want to store? • How will I want to use that data later? • How big will the data get? • If the answers are “I don’t know yet”, guess with your best YAGNI
  • 23. “But premature optimization is evil” • Knuth said that about code, which is flexible and easy to optimize later • Data is not as flexible as code • So doing some planning for performance is usually good when it comes to your data
  • 24. Specific MongoDB Design Questions • Embed vs top-level collection? • Denormalize (double-store data)? • How many/which indexes? • Arrays vs hashes for embedding? • Implicit schema (field names and types)
  • 25. Short Field Names? • Disk space: cheap • RAM: not cheap • Developer Time: expensive • Err towards compact, readable fieldnames • Might be worth writing a mapper • Probably wish we’d used c instead of client_id
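The "might be worth writing a mapper" point on slide 25 can be sketched in a few lines. This is a minimal illustration, not Sailthru's code: only the `client_id` to `c` pairing comes from the slide, while the other short names and the `toStored`/`fromStored` helpers are assumptions made up for the example. It renames top-level fields only.

```javascript
// Hypothetical mapping from readable field names to compact stored names.
// Only client_id -> c is suggested by the slide; the rest are invented here.
const SHORT = { client_id: "c", email: "e", template: "t" };

// Reverse table, built once, for reading documents back out.
const LONG = Object.fromEntries(
  Object.entries(SHORT).map(([longName, shortName]) => [shortName, longName])
);

// Rename top-level keys found in the table; leave all other keys untouched.
function rename(doc, table) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(table, key) ? table[key] : key;
    out[mapped] = value;
  }
  return out;
}

// Application code keeps readable names; storage sees the compact ones.
const toStored = (doc) => rename(doc, SHORT);
const fromStored = (doc) => rename(doc, LONG);
```

The application keeps working with `client_id` while the stored document carries `c`, which is the tradeoff the slide describes between developer time and RAM.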
  • 26. Favor Human-Readable Foreign Keys • DBRefs are a bit cumbersome • Referencing by MongoId often means doing extra lookups • Build human-readable references to save you doing lookups and manual joins
  • 27. Example • Store the Template and the Email as strings on the message object • { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" } • No external reference lookups required • The tradeoff is basically just disk space
  • 28. Embed vs Top-Level Collections? • Major question of MongoDB schema design • If you can ask the question at all, you might want to err on the side of embedding • Don’t embed if the embedding could get huge • Don’t feel too bad about denormalizing by embedding AND storing in a top-level collection
  • 29. Typical Properties of Top-Level Collections • Independence: They don’t “belong” conceptually to another collection • Nouns: the building blocks of your system • Easily referenceable and updatable
  • 30. Embedding Pros • Super-fast retrieval of document with related data • Atomic updates • “Ownership” of embedded document is obvious • Usually maps well to code structures
  • 31. Embedding Cons • Harder to get at, do mass queries • Does not size up infinitely, will hit 16MB limit • Hard to create references to embedded object • Limited ability to indexed-sort the embedded objects
  • 32. If You Think You Can Embed • You probably should • I take advantage of embedding in my designs more often now than I did three years ago • It’s a gift MongoDB gives you in exchange for giving up your joins
  • 33. Design Example: User Permissions • Users can have various broad permission levels for any number of clients • For example, user ‘ploki’ might have permission level ‘admin’ for client 76 and permission level ‘reports_only’ for client 450
  • 34. How Will We Use This Data? • Retrieve all clients for a given user • Retrieve all users for a given client • Retrieve a permission level for a given client for a given user
  • 35. How Will This Data Grow? • In the medium term, it will stay small • Number of clients and number of users can both grow infinitely
  • 36. Back in SQL-land • There’s a fairly standard way to do it • It’s a many-many relationship, so • Use a join table (client_user)
  • 37. Should We Use a New Top-Level Collection?
      db.client.user.save( { client_id: 76, username: 'ploki', permission: 'admin' } );
      db.client.user.save( { client_id: 450, username: 'ploki', permission: 'reports_only' } );
      db.client.user.ensureIndex( { client_id: 1 } );
      db.client.user.ensureIndex( { username: 1 } );
      // get all users belonging to a client
      db.client.user.find( { client_id: 76 } );
      // get all clients a user has access to
      db.client.user.find( { username: 'ibwhite' } );
      // get permissions for our current user
      db.client.user.findOne( { username: user.name } );
  • 38. Probably Not • Only needed if we have lots of clients per user AND lots of users per client • This is a case where we can embed, so let’s do so
  • 39. Three Ways to Embed • Not good: Object. 'clients': { '76': 'admin', '450': 'reports_only' }; can’t do a multikey index on the keys of a hash; index: ??? • Okay: Array. 'clients': [ {'_id': 76, 'access': 'admin'}, {'_id': 450, 'access': 'reports_only'} ]; but have to search through the array of objects to find by _id on the retrieved doc; index: { 'clients._id': 1 } • Our approach: Array and object. 'clients': [ 76, 450 ], 'clients_access': { '76': 'admin', '450': 'reports_only' }; fields sit next to each other alphabetically; index: { clients: 1 }
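To make the "array and object" layout from slide 39 concrete, here is a small in-memory sketch of the three lookups listed on slide 34. It runs without a database. The sample documents (including the `ibwhite` entry and its permission) and the helper names are assumptions for illustration. Against MongoDB, the second lookup would correspond to a query on the `clients` array backed by the `{ clients: 1 }` index.

```javascript
// Made-up sample user documents using the slide's field names.
const users = [
  { username: "ploki", clients: [76, 450], clients_access: { "76": "admin", "450": "reports_only" } },
  { username: "ibwhite", clients: [450], clients_access: { "450": "admin" } },
];

// All clients for a given user: read the array straight off the document.
function clientsForUser(user) {
  return user.clients;
}

// All users for a given client: in-memory stand-in for a query on the
// "clients" array, which is the field the slide indexes.
function usersForClient(docs, clientId) {
  return docs.filter((u) => u.clients.includes(clientId)).map((u) => u.username);
}

// Permission for a given user and client: one hash lookup on the retrieved doc.
function permissionFor(user, clientId) {
  const key = String(clientId);
  const has = Object.prototype.hasOwnProperty.call(user.clients_access, key);
  return has ? user.clients_access[key] : null;
}
```

The array carries the indexable membership and the hash carries the per-client detail, so no scan through an array of objects is needed once the document is in hand.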
  • 40. Indexes • Index all highly frequent queries • Do less-indexed queries only on secondaries • Reduce the size of indexes wherever you can on big collections • Don’t sweat the medium-sized collections, focus on the big wins
  • 41. Take Advantage of Multiple-Field Indexes • Order matters • If you have an index on {client_id: 1, email: 1 } • Then you also have the {client_id: 1} index “for free” • but not { email: 1}
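The prefix rule on slide 41 can be stated as a tiny check. This is a deliberately simplified model that only considers equality lookups: the real query planner also weighs sorts, ranges, and other factors. The `canUseIndex` helper is a hypothetical name, not a MongoDB API.

```javascript
// Simplified model: an index can serve an equality query when the queried
// fields are exactly the first N fields of the index, in any query order.
function canUseIndex(indexSpec, query) {
  const indexFields = Object.keys(indexSpec);
  const queryFields = Object.keys(query);
  if (queryFields.length === 0 || queryFields.length > indexFields.length) {
    return false;
  }
  const prefix = new Set(indexFields.slice(0, queryFields.length));
  return queryFields.every((field) => prefix.has(field));
}

const index = { client_id: 1, email: 1 };
const byClient = canUseIndex(index, { client_id: 450 });
const byEmail = canUseIndex(index, { email: "someone@example.com" });
const byBoth = canUseIndex(index, { email: "someone@example.com", client_id: 450 });
```

Here `byClient` and `byBoth` are true while `byEmail` is false, which mirrors the slide: the `{client_id: 1}` index comes for free, but `{email: 1}` does not.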
  • 42. Use your _id • You must use an _id for every collection, which will cost you index size • So do something useful with _id
  • 43. Take advantage of fast ^indexes • Messages have _ids like: 32423.00000341 • Need all messages in blast 32423: • db.message.blast.find( { _id: /^32423./ } ); • (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
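One reading of the slide's caveat about the dot is that `.` is a regex metacharacter: unescaped, it matches any character, so `/^32423./` would also match ids from blast 324231. The sketch below shows the difference on made-up ids and builds an escaped, anchored prefix pattern. `blastIdPattern` and `escapeRegex` are hypothetical helper names.

```javascript
// Escape regex metacharacters so the prefix is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an anchored prefix pattern covering every message id in one blast.
function blastIdPattern(blastId) {
  return new RegExp("^" + escapeRegex(String(blastId) + "."));
}

// Made-up ids: two from blast 32423, one from blast 324231.
const ids = ["32423.00000341", "324231.00000007", "32423.00000342"];

// The unescaped dot lets the id from blast 324231 slip through.
const loose = ids.filter((id) => /^32423./.test(id));

// The escaped pattern keeps only blast 32423.
const strict = ids.filter((id) => blastIdPattern(32423).test(id));
```

The resulting pattern could be passed as the `_id` condition in a query like the one on the slide. Picking a separator that is not a regex metacharacter, as the slide suggests, avoids the escaping step entirely.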
  • 44. Manual Range Partitioning • We moved a big message.blast collection into per-day collections: • message.blast.20110605 message.blast.20110606 message.blast.20110607 etc... • Keeps working set indexes smaller • When we move data into the archive, drop() is much faster than remove()
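The per-day naming scheme on slide 44 is easy to compute from a date. The sketch below assumes days are cut on UTC boundaries, which the slide does not specify. The helper names are hypothetical.

```javascript
// Collection name for the day a message belongs to (UTC day boundaries assumed).
function blastCollectionName(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return `message.blast.${year}${month}${day}`;
}

// Every per-day collection a query over [start, end] would need to touch.
function collectionsInRange(start, end) {
  const names = [];
  const cursor = new Date(
    Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate())
  );
  while (cursor <= end) {
    names.push(blastCollectionName(cursor));
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return names;
}

const june5 = new Date(Date.UTC(2011, 5, 5));
const june7 = new Date(Date.UTC(2011, 5, 7));
const names = collectionsInRange(june5, june7);
```

Archiving a day then comes down to copying one named collection and dropping it, rather than removing matching documents from a single large collection.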
  • 45. Questions? Looking for a job? ian@sailthru.com twitter.com/eonwhite
