MongoDB and Hadoop
1. MongoDB & Hadoop
2. Talking about:
   • MongoDB Intro & Fundamentals
   • Why MongoDB & Hadoop
   • Getting Started
   • Using MongoDB & Hadoop
   • Future of Big Data
3. Steve Francia (@spf13)
   • 15+ years building the internet
   • Father, husband, skateboarder
   • Chief Solutions Architect @ 10gen, responsible for drivers, integrations, web & docs
4. 10gen, the company behind MongoDB
   • Offices in NYC, Palo Alto, London & Dublin
   • 100+ employees
   • Support, consulting, training
   • Mgt: Google/DoubleClick, Oracle, Apple, NetApp, Mark Logic
   • Well funded: Sequoia, Union Square, Flybridge
5. Introduction to MongoDB
6. MongoDB: Document Oriented · High Performance · Fully Consistent · Horizontally Scalable

    { author : "steve",
      date   : new Date(),
      text   : "About MongoDB...",
      tags   : ["tech", "database"] }
7. MongoDB philosophy
   • Keep functionality when we can (key/value stores are great, but we need more)
   • Non-relational (no joins) makes scaling horizontally practical
   • Document data models are good
   • Database technology should run anywhere: virtualized, cloud, metal, etc.
8. Under the hood
   • Written in C++
   • Runs nearly everywhere
   • Data serialized to BSON
   • Extensive use of memory-mapped files, i.e. read-through write-through memory caching
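To make the BSON point concrete, here is a minimal Python sketch (an illustration, not from the deck) using the bson package that ships with pymongo; the helper names vary slightly across pymongo releases:

    import bson

    doc = {"author": "steve", "tags": ["tech", "database"]}
    raw = bson.BSON.encode(doc)             # the binary form mongod memory-maps
    print(bson.BSON(raw).decode() == doc)   # True: round-trips back to the dict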
9. Database Landscape (chart): Scalability & Performance on one axis, Depth of Functionality on the other, with Memcached, MongoDB, and RDBMS plotted along the trade-off.
10. "MongoDB has the best features of key/value stores, document databases and relational databases in one." — John Nunemaker
11. Relational made normalized data look like this (entity diagram):
    • Category: Name, Url
    • Article: Name, Slug, Publish date, Text
    • User: Name, Email Address
    • Tag: Name, Url
    • Comment: Comment, Date, Author
12. Document databases make normalized data look like this (one document):
    • Article: Name, Slug, Publish date, Text, Author
      – User: Name, Email Address
      – Comment[]: Comment, Date, Author
      – Tag[]: Value
      – Category[]: Value
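A sketch of slide 12 as an actual document, expressed as a Python dict (the field values here are invented for illustration):

    from datetime import datetime

    article = {
        "name": "About MongoDB...",
        "slug": "about-mongodb",
        "publish_date": datetime.utcnow(),
        "text": "...",
        "author": {"name": "steve", "email_address": "steve@example.com"},
        "comments": [
            {"comment": "Nice post", "date": datetime.utcnow(), "author": "bob"}
        ],
        "tags": ["tech", "database"],
        "categories": ["databases"],
    }
    # One insert stores the whole aggregate; no joins needed to read it back.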
13. MongoDB Use Cases
14. CMS / Blog
    Needs:
    • Business needed a modern data store for rapid development and scale
    Solution:
    • Use PHP & MongoDB
    Results:
    • Real-time statistics
    • All data, images, etc. stored together: easy access, easy deployment, easy high availability
    • No need for complex migrations
    • Enabled very rapid development and growth
15. Photo Meta-Data
    Problem:
    • Business needed more flexibility than Oracle could deliver
    Solution:
    • Use MongoDB instead of Oracle
    Results:
    • Developed application in one sprint cycle
    • 500% cost reduction compared to Oracle
    • 900% performance improvement compared to Oracle
16. Customer Analytics
    Problem:
    • Deal with massive data volume across all customer sites
    Solution:
    • Use MongoDB to replace Google Analytics / Omniture options
    Results:
    • Less than one week to build prototype and prove business case
    • Rapid deployment of new features
17. Archiving
    Why MongoDB:
    • Existing application built on MySQL
    • Lots of friction with RDBMS-based archive storage
    • Needed a more scalable archive storage backend
    Solution:
    • Keep MySQL for active data (100 million)
    • MongoDB for archive (2+ billion)
    Results:
    • No more ALTER TABLE statements taking over 2 months to run
    • Sharding enabled horizontal scale
    • Very happily looking at other places to use MongoDB
18. Online Dictionary
    Problem:
    • MySQL could not scale to handle their 5B+ documents
    Solution:
    • Switched from MySQL to MongoDB
    Results:
    • Massive simplification of code base
    • Eliminated need for external caching system
    • 20x performance improvement over MySQL
19. E-commerce
    Problem:
    • Multi-vertical e-commerce impossible to model (efficiently) in an RDBMS
    Solution:
    • Switched from MySQL to MongoDB
    Results:
    • Massive simplification of code base
    • Rapid builds, halving time to market (and cost)
    • Eliminated need for external caching system
    • 50x+ performance improvement over MySQL
20. Tons more: MongoDB casts a wide net, and people keep coming up with new and brilliant ways to use it.
21. In Good Company ... and 1000s more (logo slide)
22. Why MongoDB & Hadoop
23. Applications have complex needs
    • Use the best tool for the job; often more than one tool is needed
    • MongoDB is an ideal operational database
    • MongoDB is ideal for BIG data
    • But it is not a data processing engine; for heavy processing needs, use a tool designed for that job ... Hadoop
24. MongoDB Map Reduce
    MongoDB map reduce is quite capable... but with limits:
    • JavaScript is not the best language for processing map reduce
    • JavaScript is limited in external data-processing libraries
    • Adds load to the data store
    • Sharded environments do parallel processing
25. MongoDB Aggregation
    • Most uses of MongoDB Map Reduce were for aggregation
    • The Aggregation Framework is optimized for aggregate queries
    • Fixes some of the limits of MongoDB MR:
      – Can do realtime aggregation similar to SQL GROUP BY (a pymongo sketch follows)
      – Parallel processing on sharded clusters
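A hedged pymongo sketch of that GROUP BY analogy, reusing the test.live tweet collection from the demos later in the deck (PyMongo 2.x returned a dict with a "result" key; modern drivers return a cursor):

    from pymongo import MongoClient

    db = MongoClient()["test"]
    pipeline = [
        {"$group": {"_id": "$user.time_zone", "count": {"$sum": 1}}},  # GROUP BY time_zone
        {"$sort": {"count": -1}},                                      # ORDER BY count DESC
        {"$limit": 5},
    ]
    for doc in db.live.aggregate(pipeline):
        print(doc)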
26. MongoDB Map Reduce (flow)

    MongoDB Data → Map() → Group(k) → Sort(k) → Reduce(k, values) → Finalize(k, v)

    • Map: emit(k, v); map iterates on documents, 1 at a time per shard; the document is $this
    • Reduce: input matches output (k, v); can run multiple times
    (a Python sketch follows)
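The same flow driven from Python, as a minimal sketch assuming a driver that still ships Collection.map_reduce (removed in PyMongo 4.0); the map and reduce bodies run as JavaScript inside mongod:

    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient()["test"]
    mapper = Code("function () { emit(this.user.time_zone, 1); }")       # document is `this`
    reducer = Code("function (key, values) { return Array.sum(values); }")

    out = db.live.map_reduce(mapper, reducer, out="tz_counts",
                             query={"user.time_zone": {"$ne": None}})
    for doc in out.find().sort("value", -1).limit(5):
        print(doc)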
27. Hadoop Map Reduce (flow)

    InputFormat → Map(k1, v1, ctx) → ctx.write(k2, v2) → Combiner(k2, values2) →
    Partitioner(k2) → Sort(keys2) → Reduce(k3, values4) → OutputFormat → (kf, vf)

    • Many map operations, 1 at a time per input split; ctx.write(k2, v2) is similar to Mongo's emit
    • Combiner runs on the same thread as map; similar to Mongo's reducer
    • Partitioner(k2) and Sort(keys2) are similar to Mongo's group and sort
    • Reduce(k3, values4) runs on reducer threads, once per key; similar to Mongo's finalize
    (a generic streaming sketch follows)
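For contrast with the BSON-based streaming used in the demos below, this is the classic plain-text Hadoop Streaming shape of that flow: a generic word-count mapper and reducer exchanging tab-separated key/value lines over stdin/stdout (an illustrative sketch, not part of mongo-hadoop):

    # mapper.py -- the Map step: one text line in, "word<TAB>1" records out
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- Hadoop sorts by key between the steps, so equal keys
    # arrive adjacent; sum each run of identical keys (the Reduce step)
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print("%s\t%d" % (current, total))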
28. MongoDB & Hadoop (combined flow)

    MongoDB (single server or sharded cluster) → InputFormat/RecordReader →
    many Map(k1, v1, ctx) → ctx.write(k2, v2) → Combiner(k2, values2) →
    Partitioner(k2) → Sort(k2) → Reduce(k2, values3) → OutputFormat → MongoDB

    • InputFormat creates a list of input splits, same as Mongo's shard chunks (64 MB)
    • Many map operations, 1 at a time per input split; Combiner runs on the same thread as map
    • Reduce runs on reducer threads, once per key, producing (kf, vf)
29. DEMO
30. DEMO
    • Install MongoDB
    • Install Hadoop & the MongoDB plugin
    • Import tweets from Twitter
    • Write mapper in Python using Hadoop Streaming
    • Write reducer in Python using Hadoop Streaming
    • PROFIT
31. Installing MongoDB

    brew install mongodb
    sudo easy_install pip
    sudo pip install pymongo
32. Installing Hadoop

    brew install hadoop
33. Installing mongo-hadoop (https://gist.github.com/1887726)

    hadoop_version="0.23"
    hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
    git clone git://github.com/mongodb/mongo-hadoop.git
    cd mongo-hadoop
    sed -i "s/default/$hadoop_version/g" build.sbt
    cd streaming
    ./build.sh
34. Grokking Twitter

    curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live

    ... let it run for about 2 hours
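Before wiring up Hadoop, a quick sanity check from Python that the import produced data (a hedged sketch; Collection.count() was removed in PyMongo 4.0 in favor of count_documents({})):

    from pymongo import MongoClient

    db = MongoClient()["test"]
    print(db.live.count())                          # how many tweets landed
    print(db.live.find_one()["user"]["time_zone"])  # the field the first demo maps on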
35. Map Timezones in Python

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        # Emit one {_id: time_zone, count: 1} record per tweet
        for doc in documents:
            yield {'_id': doc['user']['time_zone'], 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
36. Writing Reducer in Python

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        # Sum the per-tweet counts for one time zone
        print >> sys.stderr, "Processing Timezone %s" % key
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
37. All together

    hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
      -mapper examples/twitter/twit_map.py \
      -reducer examples/twitter/twit_reduce.py \
      -inputURI mongodb://127.0.0.1/test.live \
      -outputURI mongodb://127.0.0.1/test.twit_reduction \
      -file examples/twitter/twit_map.py \
      -file examples/twitter/twit_reduce.py
38. Popular time zones

    db.twit_reduction.find().sort({ count : -1 })
    { "_id" : ObjectId("4f45701903648ee13a565f9f"), "count" : 47912 }
    { "_id" : "Central Time (US & Canada)", "count" : 16374 }
    { "_id" : "Quito", "count" : 13708 }
    { "_id" : "Greenland", "count" : 12332 }
    { "_id" : "Santiago", "count" : 10153 }
    { "_id" : "Eastern Time (US & Canada)", "count" : 8823 }
    { "_id" : "Pacific Time (US & Canada)", "count" : 8530 }
    { "_id" : "Brasilia", "count" : 6621 }
    { "_id" : "London", "count" : 5617 }
    { "_id" : "Mountain Time (US & Canada)", "count" : 4479 }
    { "_id" : "Amsterdam", "count" : 4199 }
    { "_id" : "Hawaii", "count" : 3381 }
    { "_id" : "Tokyo", "count" : 2713 }
    { "_id" : "Alaska", "count" : 2543 }
    { "_id" : "Madrid", "count" : 2118 }
    { "_id" : "Paris", "count" : 1538 }
    { "_id" : "Buenos Aires", "count" : 1247 }
    { "_id" : "Mexico City", "count" : 1104 }
    { "_id" : "Caracas", "count" : 1089 }
39. DEMO 2
40. Map Hashtags in Python

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        # Emit one {_id: hashtag, count: 1} record per hashtag per tweet
        for doc in documents:
            for hashtag in doc['entities']['hashtags']:
                yield {'_id': hashtag['text'], 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
41. Reduce hashtags in Python

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        # Sum the per-tweet counts for one hashtag
        print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key.encode('utf8'), 'count': _count}

    BSONReducer(reducer)
42. All together

    hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
      -mapper examples/twitter/twit_hashtag_map.py \
      -reducer examples/twitter/twit_hashtag_reduce.py \
      -inputURI mongodb://127.0.0.1/test.live \
      -outputURI mongodb://127.0.0.1/test.twit_reduction \
      -file examples/twitter/twit_hashtag_map.py \
      -file examples/twitter/twit_hashtag_reduce.py
43. Popular Hash Tags

    db.twit_hashtags.find().sort({ count : -1 })
    { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
    { "_id" : "teamfollowback", "count" : 200 }
    { "_id" : "RT", "count" : 150 }
    { "_id" : "Arsenal", "count" : 148 }
    { "_id" : "milars", "count" : 145 }
    { "_id" : "sanremo", "count" : 145 }
    { "_id" : "LoseMyNumberIf", "count" : 139 }
    { "_id" : "RelationshipsShould", "count" : 137 }
    { "_id" : "Bahrain", "count" : 129 }
    { "_id" : "bahrain", "count" : 125 }
    { "_id" : "oomf", "count" : 117 }
    { "_id" : "BabyKillerOcalan", "count" : 106 }
    { "_id" : "TeamFollowBack", "count" : 105 }
    { "_id" : "WhyDoPeopleThink", "count" : 102 }
    { "_id" : "np", "count" : 100 }
44. DEMO 3
45. Aggregation in Mongo 2.1

    db.live.aggregate(
      { $unwind : "$entities.hashtags" },
      { $match : { "entities.hashtags.text" : { $exists : true } } },
      { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
      { $sort : { count : -1 } },
      { $limit : 10 }
    )
46. Popular Hash Tags

    db.twit_hashtags.aggregate(a)
    {
      "result" : [
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
        { "_id" : "teamfollowback", "count" : 200 },
        { "_id" : "RT", "count" : 150 },
        { "_id" : "Arsenal", "count" : 148 },
        { "_id" : "milars", "count" : 145 },
        { "_id" : "sanremo", "count" : 145 },
        { "_id" : "LoseMyNumberIf", "count" : 139 },
        { "_id" : "RelationshipsShould", "count" : 137 },
        { "_id" : "Bahrain", "count" : 129 },
        { "_id" : "bahrain", "count" : 125 }
      ],
      "ok" : 1
    }
47. Using MongoDB & Hadoop
48. Production usage: Orbitz, Badgeville, foursquare, CityGrid ... and more
49. The Future of BIG Data
50. What is BIG? BIG today is normal tomorrow.
51. Google 2000
    "Google Inc. today announced it has released the largest search engine on the Internet. Google's new index, comprising more than 1 billion URLs..."
52. Google 2008
    "Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day)."
53. BIG 2012 & Beyond
    MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
54. Hadoop is our first step.
55. MongoDB is committed to working with the best data tools, including Storm, Spark, & more.
56. Questions?
    http://spf13.com · http://github.com/spf13 · @spf13
    Download at mongodb.org
    We're hiring!! Contact us at jobs@10gen.com