MongoDB and Hadoop

41,657 views · Published in: Technology · 108 likes

Comments (4)
  • This is really good. Can I have a copy of this?
  • With basic authentication shut off in the older Twitter API, plain curl no longer works. For slide 34 you need to install Twurl (Twitter curl with OAuth support), authenticate according to the instructions in the GitHub project, and run the following:

    twurl --host stream.twitter.com /1.1/statuses/sample.json | mongoimport -d test -c live

    Twurl can be found at:
    https://github.com/marcel/twurl
  • This is good.
  • Great Slides
  • Speaker note: By reducing the transactional semantics the database provides, one can still solve an interesting set of problems where performance is very important, and horizontal scaling then becomes easier.
Transcript: MongoDB and Hadoop

    1. MongoDB & Hadoop
    2. Talking about:
        MongoDB intro & fundamentals
        Why MongoDB & Hadoop
        Getting started
        Using MongoDB & Hadoop
        The future of Big Data
    3. Steve (@spf13): 15+ years building the internet. Father, husband, skateboarder. Chief Solutions Architect @10gen, responsible for drivers, integrations, web & docs.
    4. The company behind MongoDB. Offices in NYC, Palo Alto, London & Dublin. 100+ employees. Support, consulting, training. Management from Google/DoubleClick, Oracle, Apple, NetApp, Mark Logic. Well funded: Sequoia, Union Square, Flybridge.
    5. Introduction to MongoDB
    6. MongoDB: document oriented, high performance, fully consistent, horizontally scalable. Example document:
        { author : "steve",
          date : new Date(),
          text : "About MongoDB...",
          tags : ["tech", "database"] }
    7. MongoDB philosophy:
        Keep functionality when we can (key/value stores are great, but we need more)
        Non-relational (no joins) makes scaling horizontally practical
        Document data models are good
        Database technology should run anywhere: virtualized, cloud, metal, etc.
    8. Under the hood: written in C++; runs nearly everywhere; data serialized to BSON; extensive use of memory-mapped files, i.e. read-through write-through memory caching. (A quick BSON sketch follows below.)
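    A minimal sketch of the BSON round-trip the slide alludes to, using the bson package bundled with pymongo (the slide names no library, so pymongo is an assumption here):

        # Encode a Python dict into MongoDB's binary BSON format and back.
        # (pymongo 2.x/3.x API; newer pymongo uses bson.encode/bson.decode)
        from bson import BSON

        doc = {"author": "steve", "tags": ["tech", "database"]}
        raw = BSON.encode(doc)      # bytes in BSON wire/storage format
        print(BSON(raw).decode())   # round-trips back to a dict equal to doc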
    9. Database landscape (chart): scalability & performance plotted against depth of functionality, with Memcached, MongoDB, and RDBMSs placed across the two axes.
    10. "MongoDB has the best features of key/value stores, document databases and relational databases in one." (John Nunemaker)
    11. Relational databases make normalized data look like this (entity diagram):
        Category: Name, Url
        Article: Name, Slug, Publish date, Text
        User: Name, Email Address
        Tag: Name, Url
        Comment: Comment, Date, Author
    12. Document databases make normalized data look like this (one embedded document; see the sketch below):
        Article: Name, Slug, Publish date, Text, Author
          User: Name, Email Address
          Comment[]: Comment, Date, Author
          Tag[]: Value
          Category[]: Value
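    As a concrete sketch of the model above, a hypothetical article document with its author, comments, tags, and categories embedded, inserted via pymongo (all field names illustrative):

        from datetime import datetime
        from pymongo import MongoClient

        db = MongoClient().test
        # One document holds the article plus its embedded arrays,
        # replacing the five joined tables of the relational model.
        # insert() matches the pymongo of this deck's era; newer code
        # uses insert_one().
        db.articles.insert({
            "name": "About MongoDB",
            "slug": "about-mongodb",
            "publish_date": datetime.utcnow(),
            "text": "...",
            "author": {"name": "steve", "email": "steve@example.com"},
            "comments": [
                {"comment": "Great post", "date": datetime.utcnow(), "author": "bob"},
            ],
            "tags": ["tech", "database"],
            "categories": ["databases"],
        })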
    13. MongoDB use cases
    14. CMS / Blog
        Needs:
        • Business needed a modern data store for rapid development and scale
        Solution:
        • Use PHP & MongoDB
        Results:
        • Real-time statistics
        • All data, images, etc. stored together: easy access, easy deployment, easy high availability
        • No need for complex migrations
        • Enabled very rapid development and growth
    15. Photo Meta-Data
        Problem:
        • Business needed more flexibility than Oracle could deliver
        Solution:
        • Use MongoDB instead of Oracle
        Results:
        • Developed application in one sprint cycle
        • 500% cost reduction compared to Oracle
        • 900% performance improvement compared to Oracle
    16. Customer Analytics
        Problem:
        • Deal with massive data volume across all customer sites
        Solution:
        • Use MongoDB to replace Google Analytics / Omniture options
        Results:
        • Less than one week to build prototype and prove business case
        • Rapid deployment of new features
    17. Archiving
        Why MongoDB:
        • Existing application built on MySQL
        • Lots of friction with RDBMS-based archive storage
        • Needed a more scalable archive storage backend
        Solution:
        • Keep MySQL for active data (100 million records)
        • MongoDB for the archive (2+ billion records)
        Results:
        • No more ALTER TABLE statements taking over 2 months to run
        • Sharding enabled horizontal scale
        • Very happily looking at other places to use MongoDB
    18. Online Dictionary
        Problem:
        • MySQL could not scale to handle their 5B+ documents
        Solution:
        • Switched from MySQL to MongoDB
        Results:
        • Massive simplification of code base
        • Eliminated need for external caching system
        • 20x performance improvement over MySQL
    19. E-commerce
        Problem:
        • Multi-vertical e-commerce impossible to model (efficiently) in an RDBMS
        Solution:
        • Switched from MySQL to MongoDB
        Results:
        • Massive simplification of code base
        • Rapid builds, halving time to market (and cost)
        • Eliminated need for external caching system
        • 50x+ performance improvement over MySQL
    20. Tons more: MongoDB casts a wide net, and people keep coming up with new and brilliant ways to use it.
    21. In good company (customer logos) ... and 1000s more
    22. Why MongoDB & Hadoop
    23. Applications have complex needs. Use the best tool for the job; often more than one tool is needed. MongoDB is the ideal operational database and ideal for BIG data, but it is not a data processing engine. For heavy processing needs, use a tool designed for that job... Hadoop.
    24. MongoDB Map Reduce
        MongoDB map reduce is quite capable... but with limits:
        • JavaScript is not the best language for map reduce processing
        • JavaScript has limited access to external data processing libraries
        • Adds load to the data store
        • Sharded environments do parallel processing
        (A minimal sketch follows below.)
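    A minimal sketch of MongoDB's built-in (JavaScript) map/reduce driven from Python; it assumes pymongo and the test.live tweet collection imported later in the demo, and counts tweets per time zone, the same job the Hadoop demo runs:

        from pymongo import MongoClient
        from bson.code import Code

        db = MongoClient().test
        # Map and reduce run as JavaScript inside the MongoDB server.
        map_js = Code("function () { emit(this.user.time_zone, 1); }")
        reduce_js = Code("function (key, values) { return Array.sum(values); }")

        # Results are written to the test.tz_counts collection.
        out = db.live.map_reduce(map_js, reduce_js, out="tz_counts")
        for doc in out.find().sort("value", -1)[:5]:
            print(doc)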
    25. MongoDB Aggregation
        Most uses of MongoDB map reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries and fixes some of the limits of MongoDB MR:
        • Can do real-time aggregation similar to SQL GROUP BY (see the sketch below)
        • Parallel processing on sharded clusters
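    The same per-time-zone count as an aggregation pipeline, sketched with pymongo (an assumption; the slide shows no driver code). Note that pymongo 2.x returns a dict with a "result" key, while pymongo 3.x+ returns a cursor:

        from pymongo import MongoClient

        db = MongoClient().test
        # Aggregation-framework analogue of: SELECT time_zone, COUNT(*)
        # FROM live GROUP BY time_zone ORDER BY COUNT(*) DESC LIMIT 5
        pipeline = [
            {"$group": {"_id": "$user.time_zone", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 5},
        ]
        for doc in db.live.aggregate(pipeline):  # pymongo 3.x+ cursor
            print(doc)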
    26. MongoDB Map Reduce (flow diagram):
        map() iterates on documents (the document is $this), 1 at a time per shard, and calls emit(k,v);
        then group(k) → sort(k) → reduce(k, values) → finalize(k, v).
        Reduce's input matches its output (k,v), so reduce can run multiple times.
    27. Hadoop Map Reduce (flow diagram):
        InputFormat → Map(k1, v1, ctx), many map operations, 1 at a time per input split; ctx.write(k2, v2) is similar to Mongo's emit
        → Combiner(k2, values2), runs on the same thread as map, similar to Mongo's reducer
        → Partitioner(k2), similar to Mongo's group
        → Sort(keys2)
        → Reduce(k3, values4) on reducer threads, runs once per key, similar to Mongo's finalize
        → OutputFormat (kf, vf)
    28. MongoDB & Hadoop (flow diagram; a streaming sketch follows below):
        A single server or sharded cluster feeds Hadoop through an InputFormat: MongoDB shard chunks (64 MB) define the list of input splits, a RecordReader reads each split, and the standard Hadoop pipeline runs:
        Map(k1, v1, ctx) → ctx.write(k2, v2) → Combiner(k2, values2) → Partitioner(k2) → Sort(k2) → Reduce(k2, values3) → OutputFormat, with results written back to MongoDB.
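    To make the flow concrete, a hypothetical minimal Hadoop Streaming word-count pair in Python (file name and paths illustrative). Hadoop pipes each input split to the mapper's stdin, sorts the mapper's tab-separated output by key, then pipes each key's group to the reducer:

        #!/usr/bin/env python
        # Usage (illustrative): hadoop jar .../hadoop-streaming.jar \
        #   -mapper "wc_stream.py map" -reducer "wc_stream.py reduce" ...
        import sys
        from itertools import groupby

        def mapper():
            for line in sys.stdin:
                for word in line.split():
                    print("%s\t1" % word)  # emit(k2, v2)

        def reducer():
            # Hadoop sorted the stream, so equal keys arrive consecutively.
            pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
            for word, group in groupby(pairs, key=lambda kv: kv[0]):
                print("%s\t%d" % (word, sum(int(v) for _, v in group)))

        if __name__ == "__main__":
            mapper() if sys.argv[1:] == ["map"] else reducer()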
    29. DEMO
    30. DEMO:
        Install MongoDB
        Install Hadoop & the MongoDB plugin
        Import tweets from Twitter
        Write a mapper in Python using Hadoop Streaming
        Write a reducer in Python using Hadoop Streaming
        PROFIT
    31. Installing MongoDB
        brew install mongodb
        sudo easy_install pip
        sudo pip install pymongo
    32. Installing Hadoop
        brew install hadoop
    33. Installing mongo-hadoop (https://gist.github.com/1887726)
        hadoop_version="0.23"
        hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
        git clone git://github.com/mongodb/mongo-hadoop.git
        cd mongo-hadoop
        sed -i "s/default/$hadoop_version/g" build.sbt
        cd streaming
        ./build.sh
    34. Grokking Twitter
        curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live
        ... let it run for about 2 hours
        (Basic auth has since been retired; see the Twurl note in the comments above.)
    35. Map time zones in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")

        from pymongo_hadoop import BSONMapper

        def mapper(documents):
            # Emit {_id: <time zone>, count: 1} for every tweet document.
            for doc in documents:
                yield {'_id': doc['user']['time_zone'], 'count': 1}

        BSONMapper(mapper)
        print >> sys.stderr, "Done Mapping."
    36. Writing the reducer in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")

        from pymongo_hadoop import BSONReducer

        def reducer(key, values):
            # Sum the per-tweet counts the mapper emitted for this key.
            print >> sys.stderr, "Processing Timezone %s" % key
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key, 'count': _count}

        BSONReducer(reducer)
    37. All together
        hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
            -mapper examples/twitter/twit_map.py \
            -reducer examples/twitter/twit_reduce.py \
            -inputURI mongodb://127.0.0.1/test.live \
            -outputURI mongodb://127.0.0.1/test.twit_reduction \
            -file examples/twitter/twit_map.py \
            -file examples/twitter/twit_reduce.py
    38. Popular time zones
        db.twit_reduction.find().sort({ count : -1 })
        { "_id" : ObjectId("4f45701903648ee13a565f9f"), "count" : 47912 }
        { "_id" : "Central Time (US & Canada)", "count" : 16374 }
        { "_id" : "Quito", "count" : 13708 }
        { "_id" : "Greenland", "count" : 12332 }
        { "_id" : "Santiago", "count" : 10153 }
        { "_id" : "Eastern Time (US & Canada)", "count" : 8823 }
        { "_id" : "Pacific Time (US & Canada)", "count" : 8530 }
        { "_id" : "Brasilia", "count" : 6621 }
        { "_id" : "London", "count" : 5617 }
        { "_id" : "Mountain Time (US & Canada)", "count" : 4479 }
        { "_id" : "Amsterdam", "count" : 4199 }
        { "_id" : "Hawaii", "count" : 3381 }
        { "_id" : "Tokyo", "count" : 2713 }
        { "_id" : "Alaska", "count" : 2543 }
        { "_id" : "Madrid", "count" : 2118 }
        { "_id" : "Paris", "count" : 1538 }
        { "_id" : "Buenos Aires", "count" : 1247 }
        { "_id" : "Mexico City", "count" : 1104 }
        { "_id" : "Caracas", "count" : 1089 }
    39. DEMO 2
    40. Map hashtags in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")

        from pymongo_hadoop import BSONMapper

        def mapper(documents):
            # Emit {_id: <hashtag>, count: 1} for every hashtag in every tweet.
            for doc in documents:
                for hashtag in doc['entities']['hashtags']:
                    yield {'_id': hashtag['text'], 'count': 1}

        BSONMapper(mapper)
        print >> sys.stderr, "Done Mapping."
    41. Reduce hashtags in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")

        from pymongo_hadoop import BSONReducer

        def reducer(key, values):
            # Sum the per-occurrence counts for this hashtag.
            print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key.encode('utf8'), 'count': _count}

        BSONReducer(reducer)
    42. All together
        hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
            -mapper examples/twitter/twit_hashtag_map.py \
            -reducer examples/twitter/twit_hashtag_reduce.py \
            -inputURI mongodb://127.0.0.1/test.live \
            -outputURI mongodb://127.0.0.1/test.twit_reduction \
            -file examples/twitter/twit_hashtag_map.py \
            -file examples/twitter/twit_hashtag_reduce.py
    43. Popular hash tags
        db.twit_hashtags.find().sort({ count : -1 })
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
        { "_id" : "teamfollowback", "count" : 200 }
        { "_id" : "RT", "count" : 150 }
        { "_id" : "Arsenal", "count" : 148 }
        { "_id" : "milars", "count" : 145 }
        { "_id" : "sanremo", "count" : 145 }
        { "_id" : "LoseMyNumberIf", "count" : 139 }
        { "_id" : "RelationshipsShould", "count" : 137 }
        { "_id" : "Bahrain", "count" : 129 }
        { "_id" : "bahrain", "count" : 125 }
        { "_id" : "oomf", "count" : 117 }
        { "_id" : "BabyKillerOcalan", "count" : 106 }
        { "_id" : "TeamFollowBack", "count" : 105 }
        { "_id" : "WhyDoPeopleThink", "count" : 102 }
        { "_id" : "np", "count" : 100 }
    44. DEMO 3
    45. Aggregation in MongoDB 2.1
        db.live.aggregate(
            { $unwind : "$entities.hashtags" },
            { $match : { "entities.hashtags.text" : { $exists : true } } },
            { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
            { $sort : { count : -1 } },
            { $limit : 10 }
        )
    46. Popular hash tags
        db.twit_hashtags.aggregate(a)
        { "result" : [
            { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
            { "_id" : "teamfollowback", "count" : 200 },
            { "_id" : "RT", "count" : 150 },
            { "_id" : "Arsenal", "count" : 148 },
            { "_id" : "milars", "count" : 145 },
            { "_id" : "sanremo", "count" : 145 },
            { "_id" : "LoseMyNumberIf", "count" : 139 },
            { "_id" : "RelationshipsShould", "count" : 137 },
            { "_id" : "Bahrain", "count" : 129 },
            { "_id" : "bahrain", "count" : 125 }
          ],
          "ok" : 1 }
    47. Using MongoDB & Hadoop
    48. Production usage: Orbitz, Badgeville, foursquare, CityGrid, and more
    49. The future of BIG data
    50. What is BIG? BIG today is normal tomorrow.
    51. Google, 2000: "Google Inc. today announced it has released the largest search engine on the Internet. Google's new index, comprising more than 1 billion URLs..."
    52. Google, 2008: "Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day)."
    53. BIG in 2012 & beyond: MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
    54. Hadoop is our first step
    55. MongoDB is committed to working with the best data tools, including Storm, Spark, & more
    56. http://spf13.com | http://github.com/spf13 | @spf13
        Questions? Download at mongodb.org
        We're hiring!! Contact us at jobs@10gen.com
