
MongoDB and Hadoop

Feb 29, 2012

  1. MongoDB & Hadoop
  2. Talking about: • MongoDB Intro & Fundamentals • Why MongoDB & Hadoop • Getting Started • Using MongoDB & Hadoop • Future of Big Data
  3. Steve Francia @spf13 • 15+ years building the internet • Father, husband, skateboarder • Chief Solutions Architect @ 10gen, responsible for drivers, integrations, web & docs
  4. 10gen, the company behind MongoDB • Offices in NYC, Palo Alto, London & Dublin • 100+ employees • Support, consulting, training • Management from Google/DoubleClick, Oracle, Apple, NetApp, MarkLogic • Well funded: Sequoia, Union Square, Flybridge
  5. Introduction to MongoDB
  6. MongoDB • Document oriented • High performance • Fully consistent • Horizontally scalable, e.g. { author: "steve", date: new Date(), text: "About MongoDB...", tags: ["tech", "database"] }
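      As a quick illustration of the document model, a minimal pymongo sketch (collection name and connection string are assumptions, not from the deck; API as in current pymongo):
      # Store and fetch a document like the one above.
      from datetime import datetime
      from pymongo import MongoClient

      client = MongoClient("mongodb://127.0.0.1")   # assumed local mongod
      posts = client.test.posts                     # hypothetical db/collection names

      # Documents are plain dicts; new fields need no schema migration.
      posts.insert_one({
          "author": "steve",
          "date": datetime.utcnow(),
          "text": "About MongoDB...",
          "tags": ["tech", "database"],
      })

      # Queries match directly against array members.
      for doc in posts.find({"tags": "tech"}):
          print(doc)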
  7. MongoDB philosophy • Keep functionality when we can (key/value stores are great, but we need more) • Non-relational (no joins) makes scaling horizontally practical • Document data models are good • Database technology should run anywhere: virtualized, cloud, metal, etc.
  8. Under the hood • Written in C++ • Runs nearly everywhere • Data serialized to BSON • Extensive use of memory-mapped files, i.e. read-through, write-through memory caching
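      A small sketch of the BSON point, using the bson package bundled with pymongo (bson.encode/decode as in recent pymongo; the document is slide 6's example):
      import bson

      doc = {"author": "steve", "text": "About MongoDB...", "tags": ["tech", "database"]}
      data = bson.encode(doc)           # length-prefixed binary, as stored and sent by MongoDB
      assert bson.decode(data) == doc   # lossless round trip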
  9. Database Landscape: chart positioning Memcached, MongoDB, and RDBMSs along two axes, Scalability & Performance vs. Depth of Functionality
  10. “MongoDB has the best features of key/value stores, document databases and relational databases in one.” (John Nunemaker)
  11. Relational made normalized data look like this: Article (Name, Slug, Publish date, Text), User (Name, Email Address), Comment (Comment, Date, Author), Category (Name, Url), Tag (Name, Url)
  12. Document databases make normalized data look like this: Article (Name, Slug, Publish date, Text, Author) with embedded Comment[] (Comment, Date, Author), Tag[] (Value), and Category[] (Value), plus a separate User (Name, Email Address)
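      A sketch of that denormalized article as an actual document (pymongo; field names mirror the slide, everything else is assumed):
      from pymongo import MongoClient

      articles = MongoClient("mongodb://127.0.0.1").test.articles  # assumed names

      # Comments, tags, and categories live inside the article itself.
      articles.insert_one({
          "name": "Intro Post",
          "slug": "intro-post",
          "publish_date": "2012-02-29",
          "text": "Hello...",
          "author": "steve",
          "comments": [{"comment": "Nice!", "date": "2012-03-01", "author": "kristen"}],
          "tags": ["tech"],
          "categories": ["databases"],
      })

      # One find returns the article and everything embedded in it (no joins);
      # dot notation reaches into the embedded array.
      print(articles.find_one({"comments.author": "kristen"})["name"])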
  13. MongoDB Use Cases
  14. CMS / Blog Needs: • Business needed a modern data store for rapid development and scale Solution: • Use PHP & MongoDB Results: • Real-time statistics • All data, images, etc. stored together: easy access, easy deployment, easy high availability • No need for complex migrations • Enabled very rapid development and growth
  15. Photo Meta-Data Problem: • Business needed more flexibility than Oracle could deliver Solution: • Use MongoDB instead of Oracle Results: • Developed application in one sprint cycle • 500% cost reduction compared to Oracle • 900% performance improvement compared to Oracle
  16. Customer Analytics Problem: • Deal with massive data volume across all customer sites Solution: • Use MongoDB to replace Google Analytics / Omniture options Results: • Less than one week to build prototype and prove business case • Rapid deployment of new features
  17. Archiving Why MongoDB: • Existing application built on MySQL • Lots of friction with RDBMS-based archive storage • Needed a more scalable archive storage backend Solution: • Keep MySQL for active data (100 million records) • MongoDB for archive (2+ billion records) Results: • No more alter table statements taking over 2 months to run • Sharding enabled horizontal scale • Very happily looking at other places to use MongoDB
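      For reference, a hypothetical sketch of the commands behind "sharding enabled horizontal scale"; the database, collection, and shard key here are assumptions, not from the deck:
      from pymongo import MongoClient

      client = MongoClient("mongodb://127.0.0.1")  # must point at a mongos router
      client.admin.command("enableSharding", "archive")
      client.admin.command("shardCollection", "archive.records", key={"_id": 1})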
  18. Online Dictionary Problem: • MySQL could not scale to handle their 5B+ documents Solution: • Switched from MySQL to MongoDB Results: • Massive simplification of code base • Eliminated need for external caching system • 20x performance improvement over MySQL
  19. E-commerce Problem: • Multi-vertical e-commerce is impossible to model (efficiently) in an RDBMS Solution: • Switched from MySQL to MongoDB Results: • Massive simplification of code base • Rapid build, halving time to market (and cost) • Eliminated need for external caching system • 50x+ performance improvement over MySQL
  20. Tons more: MongoDB casts a wide net, and people keep coming up with new and brilliant ways to use it
  21. In Good Company: [customer logos] and 1000s more
  22. Why MongoDB & Hadoop
  23. Applications have complex needs • Use the best tool for the job; often more than one tool is needed • MongoDB: ideal operational database, ideal for BIG data • But it is not a data processing engine • For heavy processing needs, use a tool designed for that job... Hadoop
  24. MongoDB Map Reduce • MongoDB map reduce is quite capable... but with limits: • JavaScript is not the best language for map reduce • JavaScript has limited external data processing libraries • Adds load to the data store • Sharded environments do parallel processing
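      For concreteness, a sketch of driving MongoDB's map reduce from Python (pymongo 3.x-era map_reduce, since removed in pymongo 4; the JS functions and collection names are illustrative, not from the deck):
      from bson.code import Code
      from pymongo import MongoClient

      db = MongoClient("mongodb://127.0.0.1").test  # assumed local instance

      # The JavaScript runs inside the server: exactly the added load noted above.
      mapper = Code("function () { emit(this.user.time_zone, 1); }")
      reducer = Code("function (key, values) { return Array.sum(values); }")

      # Results land in the (hypothetical) tz_counts collection.
      out = db.live.map_reduce(mapper, reducer, "tz_counts")
      for doc in out.find().sort("value", -1).limit(5):
          print(doc)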
  25. MongoDB Aggregation • Most uses of MongoDB Map Reduce were for aggregation • The Aggregation Framework is optimized for aggregate queries • Fixes some of the limits of MongoDB MR: • Can do realtime aggregation similar to SQL GROUP BY • Parallel processing on sharded clusters
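      As a sketch of the SQL GROUP BY analogy (pipeline and names assumed; aggregate returns a cursor in pymongo 3+), roughly: SELECT time_zone, COUNT(*) FROM live GROUP BY time_zone ORDER BY count DESC LIMIT 5
      from pymongo import MongoClient

      db = MongoClient("mongodb://127.0.0.1").test  # assumed local instance

      pipeline = [
          {"$group": {"_id": "$user.time_zone", "count": {"$sum": 1}}},
          {"$sort": {"count": -1}},
          {"$limit": 5},
      ]
      for row in db.live.aggregate(pipeline):
          print(row)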
  26. MongoDB Map Reduce flow: MongoDB data -> Map() (map iterates on documents, 1 at a time per shard; the document is $this) -> emit(k,v) -> Group(k) -> Sort(k) -> Reduce(k, values) (input matches output; can run multiple times) -> Finalize(k,v) -> k,v
  27. Hadoop Map Reduce flow: InputFormat -> Map(k1, v1, ctx) (many map operations, 1 at a time per input split) -> ctx.write(k2, v2) (similar to Mongo's emit) -> Combiner(k2, values2) (runs on the same thread as map; similar to Mongo's reducer) -> k2, v3 -> Partitioner(k2) (similar to Mongo's group) -> Sort(keys2) -> Reduce(k3, values4) (reducer threads; runs once per key; similar to Mongo's finalize) -> OutputFormat -> kf, vf
  28. MongoDB & Hadoop together: MongoDB (single server or sharded cluster) -> MongoDB shard chunks (64mb) -> InputFormat creates a list of input splits, with a RecordReader for each split -> many Map(k1, v1, ctx) operations (1 at a time per input split) -> ctx.write(k2, v2) -> Combiner(k2, values2) (runs on the same thread as map) -> k2, v3 -> Partitioner(k2) -> Sort(k2) -> Reduce(k2, values3) (reducer threads; runs once per key) -> OutputFormat -> kf, vf -> back into MongoDB
  29. DEMO
  30. DEMO • Install MongoDB • Install Hadoop & the MongoDB plugin • Import tweets from Twitter • Write a mapper in Python using Hadoop streaming • Write a reducer in Python using Hadoop streaming • PROFIT
  31. Installing MongoDB
      brew install mongodb
      sudo easy_install pip
      sudo pip install pymongo
  32. Installing Hadoop
      brew install hadoop
  33. Installing mongo-hadoop
      # https://gist.github.com/1887726
      hadoop_version='0.23'
      hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
      git clone git://github.com/mongodb/mongo-hadoop.git
      cd mongo-hadoop
      sed -i '' "s/default/$hadoop_version/g" build.sbt
      cd streaming
      ./build.sh
  34. Grokking Twitter
      curl https://stream.twitter.com/1/statuses/sample.json -u<login>:<password> | mongoimport -d test -c live
      ... let it run for about 2 hours
  35. Map Timezones in Python
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONMapper

      def mapper(documents):
          # Emit one {time_zone: 1} pair per tweet.
          for doc in documents:
              yield {'_id': doc['user']['time_zone'], 'count': 1}

      BSONMapper(mapper)
      print >> sys.stderr, "Done Mapping."
  36. Writing Reducer in Python
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONReducer

      def reducer(key, values):
          # Sum the per-tweet counts for this time zone.
          print >> sys.stderr, "Processing Timezone %s" % key
          _count = 0
          for v in values:
              _count += v['count']
          return {'_id': key, 'count': _count}

      BSONReducer(reducer)
  37. All together
      hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
        -mapper examples/twitter/twit_map.py \
        -reducer examples/twitter/twit_reduce.py \
        -inputURI mongodb://127.0.0.1/test.live \
        -outputURI mongodb://127.0.0.1/test.twit_reduction \
        -file examples/twitter/twit_map.py \
        -file examples/twitter/twit_reduce.py
  38. Popular time zones
      db.twit_reduction.find().sort({'count' : -1})
      { "_id" : ObjectId("4f45701903648ee13a565f9f"), "count" : 47912 }
      { "_id" : "Central Time (US & Canada)", "count" : 16374 }
      { "_id" : "Quito", "count" : 13708 }
      { "_id" : "Greenland", "count" : 12332 }
      { "_id" : "Santiago", "count" : 10153 }
      { "_id" : "Eastern Time (US & Canada)", "count" : 8823 }
      { "_id" : "Pacific Time (US & Canada)", "count" : 8530 }
      { "_id" : "Brasilia", "count" : 6621 }
      { "_id" : "London", "count" : 5617 }
      { "_id" : "Mountain Time (US & Canada)", "count" : 4479 }
      { "_id" : "Amsterdam", "count" : 4199 }
      { "_id" : "Hawaii", "count" : 3381 }
      { "_id" : "Tokyo", "count" : 2713 }
      { "_id" : "Alaska", "count" : 2543 }
      { "_id" : "Madrid", "count" : 2118 }
      { "_id" : "Paris", "count" : 1538 }
      { "_id" : "Buenos Aires", "count" : 1247 }
      { "_id" : "Mexico City", "count" : 1104 }
      { "_id" : "Caracas", "count" : 1089 }
  39. DEMO 2
  40. Map Hashtags in Python
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONMapper

      def mapper(documents):
          # Emit one {hashtag: 1} pair per hashtag occurrence.
          for doc in documents:
              for hashtag in doc['entities']['hashtags']:
                  yield {'_id': hashtag['text'], 'count': 1}

      BSONMapper(mapper)
      print >> sys.stderr, "Done Mapping."
  41. Reduce hashtags in Python
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONReducer

      def reducer(key, values):
          # Sum the per-occurrence counts for this hashtag.
          print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
          _count = 0
          for v in values:
              _count += v['count']
          return {'_id': key.encode('utf8'), 'count': _count}

      BSONReducer(reducer)
  42. All together
      hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
        -mapper examples/twitter/twit_hashtag_map.py \
        -reducer examples/twitter/twit_hashtag_reduce.py \
        -inputURI mongodb://127.0.0.1/test.live \
        -outputURI mongodb://127.0.0.1/test.twit_hashtags \
        -file examples/twitter/twit_hashtag_map.py \
        -file examples/twitter/twit_hashtag_reduce.py
  43. Popular Hash Tags
      db.twit_hashtags.find().sort({'count' : -1})
      { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
      { "_id" : "teamfollowback", "count" : 200 }
      { "_id" : "RT", "count" : 150 }
      { "_id" : "Arsenal", "count" : 148 }
      { "_id" : "milars", "count" : 145 }
      { "_id" : "sanremo", "count" : 145 }
      { "_id" : "LoseMyNumberIf", "count" : 139 }
      { "_id" : "RelationshipsShould", "count" : 137 }
      { "_id" : "Bahrain", "count" : 129 }
      { "_id" : "bahrain", "count" : 125 }
      { "_id" : "oomf", "count" : 117 }
      { "_id" : "BabyKillerOcalan", "count" : 106 }
      { "_id" : "TeamFollowBack", "count" : 105 }
      { "_id" : "WhyDoPeopleThink", "count" : 102 }
      { "_id" : "np", "count" : 100 }
  44. DEMO 3
  45. Aggregation in Mongo 2.1
      db.live.aggregate(
        { $unwind : "$entities.hashtags" },
        { $match : { "entities.hashtags.text" : { $exists : true } } },
        { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
        { $sort : { count : -1 } },
        { $limit : 10 }
      )
  46. Popular Hash Tags
      { "result" : [
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
        { "_id" : "teamfollowback", "count" : 200 },
        { "_id" : "RT", "count" : 150 },
        { "_id" : "Arsenal", "count" : 148 },
        { "_id" : "milars", "count" : 145 },
        { "_id" : "sanremo", "count" : 145 },
        { "_id" : "LoseMyNumberIf", "count" : 139 },
        { "_id" : "RelationshipsShould", "count" : 137 },
        { "_id" : "Bahrain", "count" : 129 },
        { "_id" : "bahrain", "count" : 125 }
      ], "ok" : 1 }
  47. Using MongoDB & Hadoop
  48. Production usage: Orbitz, Badgeville, foursquare, CityGrid, and more
  49. The Future of BIG Data
  50. What is BIG? BIG today is normal tomorrow
  51. Google 2000: “Google Inc. today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs...”
  52. Google 2008: “Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).”
  53. BIG 2012 & Beyond MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
  54. Hadoop is our first step
  55. MongoDB is committed to working with the best data tools, including Storm, Spark, & more
  56. http://spf13.com http://github.com/spf13 @spf13 Questions? Download at mongodb.org We’re hiring!! Contact us at jobs@10gen.com

Editor's Notes

  7. By reducing the transactional semantics the database provides, one can still solve an interesting set of problems where performance is very important, and horizontal scaling then becomes easier.