
Beyond the Fridge, The World of Connected Data - Dr Werner Vogels

3,209 views

Published in: Data & Analytics, Technology, Education


  1. Beyond the Fridge: The World of Connected Data. Dr. Werner Vogels, CTO, Amazon.com
  2. The amount of information generated during the first day of a baby’s life today is equivalent to 70 times the information contained in the Library of Congress.
  3. I. Science
  4. Observations – Theory – Models – Facts
  5. Human Genome Project: Collaborative project to sequence every single letter of the human genetic code. 13 years and $billions to complete. Gigabyte-scale datasets (transferred between sites on iPods).
  6. Beyond the Human Genome: 45+ species sequenced: mouse, rat, gorilla, rabbit, platypus, nematode, zebrafish... Compare genomes between species to identify biologically interesting areas of the genome. 100 GB-scale datasets. Increased computational requirements.
  7. The Next Generation: New sequencing instruments lead to a dramatic drop in the cost and time required to sequence a genome. Sequence and compare the genetic code of individuals to find areas of variation. Much more interesting. Terabyte-scale datasets. Significant computational requirements.
  8. The 1000 Genomes Project: Public/private consortium to build the world’s largest collection of human genetic variation. Hugely important dataset to drive new insight into known genetic traits, and the identification of new ones. Vast, complex data and computational resources required, beyond the reach of most research groups and hospitals.
  9. 1000 Genomes in the Cloud: The 1000 Genomes data made available to all on AWS. Stored for free as part of the Public Datasets program. Updated regularly. 200 TB. 1,700 individual genomes. As much compute and storage as required available to all.
  10. II. Consumer
  11. Dropcam is the biggest inbound video service on the Web • More data uploaded per minute than YouTube • Petabytes of data processed every month • Billions of motion events detected
  12. Lenddo’s Journey • Processes about 3.5 TB of social data • Social data grows with more users • Started with a MongoDB cluster on CR1 instance types on AWS, spending 10K USD/month • Re-architected to move all their data to S3 and keep caches in smaller MongoDB and DynamoDB clusters; uses EMR to process data • Now spending 3K/month
  13. III. Retail
  14. UNCERTAINTY
  16. Who is my customer really? What do people really like? What is happening socially with my products? Where do people consume my product? How do people really use your product?
  17. PERSONALIZE
  18. 75% of users select movies based on recommendations
  19. More than 27 million users • ~30 million plays per day • More than 40 billion events per day • ~4 million ratings per day • ~3 million searches per day • Geo-location data • Device information • Time of day and week (it can now verify that users watch more TV shows during the week and more movies during the weekend) • Metadata from third parties such as Nielsen • Social media data from Facebook and Twitter
  21. Wego • Search using flexible dates AND/OR locations and themes – FROM Singapore TO Beach FOR A Weekend Trip (theme location + flexible date) – FROM Singapore TO Paris FOR A Whole-week Vacation (specific destination + flexible date) – FROM Singapore TO Sydney IN Next Two Months (specific destination + flexible date) – FROM Singapore TO Family-friendly Destination ON 30-Apr to 05-May (theme location + fixed dates) • Need for a robust caching mechanism with millions of flight searches across 10 million+ different flight routes • Use the AWS cloud to rapidly spin up machines to scale to the requirements • AWS allows them to do this in a scalable and cost-effective manner
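
The caching need on the Wego slide can be illustrated with a minimal sketch. It assumes that equivalent flexible searches are normalized to one cache key, so repeated queries skip the expensive flight lookup. The names `FlexibleSearch`, `SearchCache`, and `get_or_compute` are hypothetical and are not taken from Wego's system; the in-memory dict stands in for whatever cache store is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FlexibleSearch:
    """One flight search in the FROM / TO / date form shown on the slide."""
    origin: str                 # e.g. "Singapore"
    destination: str            # a specific city, or a theme such as "Beach"
    destination_is_theme: bool  # True for theme locations
    date_spec: str              # e.g. "weekend", "next-two-months", "30-apr/05-may"

    def cache_key(self) -> str:
        # Normalize case and whitespace so equivalent queries share a key.
        kind = "theme" if self.destination_is_theme else "city"
        parts = [self.origin, kind, self.destination, self.date_spec]
        return "|".join(p.strip().lower() for p in parts)


class SearchCache:
    """Hypothetical read-through cache: compute on a miss, reuse on a hit."""

    def __init__(self) -> None:
        self._store: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self,
        query: FlexibleSearch,
        compute: Callable[[FlexibleSearch], List[str]],
    ) -> List[str]:
        key = query.cache_key()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(query)  # the expensive search happens only on a miss
        self._store[key] = result
        return result
```

A real deployment would also need expiry, since flight prices change; that is left out of this sketch.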
  22. Wego – Search
  23. awsofa.info
  24. The only Asian company that made it to the CODE_n finalist list for CeBIT 2014
  25. Platform Architecture (diagram labels) • Archival (Glacier) • Storage (S3) • Crawl Cluster (EC2) • File Server (EC2) • Processing Cluster (EC2) • Choice Engine Cluster (EC2) • Data Partners • End user interaction/Front End • On AWS • External to AWS • Integration Engine • Data Acquisition
  26. IV. Industrial
  27. Access Materials Data and Models from Global Partners, with Governance, Controllership, and Ownership • CEED Collaborative Federated Environment
  28. V. Sports
  29. VI. Location
  30. VII. The Pipeline
  31-36. COLLECT | STORE | ORGANIZE | ANALYZE | SHARE
  37. VIII. Real-time
  38. What was happening
  39. What ... right now? • trades are executing • is the exception rate • is the ad click-through • topics are trending • inventory remains • queries are slow • are the high scores
  40. Kinesis
  41. Kinesis architecture • Millions of sources producing 100s of terabytes per hour • Front End: Authentication, Authorization • Amazon Web Services: AZ, AZ, AZ • Durable, highly consistent storage replicates data across three data centers (availability zones) • Ordered stream of events supports multiple readers • Real-time dashboards and alarms • Machine learning algorithms or sliding window analytics • Aggregate analysis in Hadoop or a data warehouse • Aggregate and archive to S3 • Inexpensive: $0.028 per million puts
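
Two points from the Kinesis architecture slide can be sketched in a few lines: an ordered stream of events that supports multiple independent readers, and the put price quoted on the slide. This is an in-memory illustration only, not the Kinesis API; `OrderedStream`, `read`, and `put_cost_usd` are hypothetical names, and the default price is the slide's figure, which may not match current pricing.

```python
from typing import Any, Dict, List


class OrderedStream:
    """In-memory stand-in for an ordered event stream with per-reader cursors."""

    def __init__(self) -> None:
        self._events: List[Any] = []
        self._cursors: Dict[str, int] = {}  # reader name -> next position to read

    def put(self, event: Any) -> int:
        """Append an event and return its sequence number (position in the stream)."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, reader: str, limit: int = 10) -> List[Any]:
        """Return up to `limit` unread events for this reader, in put order."""
        start = self._cursors.get(reader, 0)
        batch = self._events[start:start + limit]
        self._cursors[reader] = start + len(batch)
        return batch


def put_cost_usd(num_puts: int, price_per_million: float = 0.028) -> float:
    """Cost of `num_puts` puts at the per-million price quoted on the slide."""
    return num_puts / 1_000_000 * price_per_million
```

Because each reader keeps its own cursor, a dashboard and an archiver can consume the same events at different speeds without affecting each other. At the quoted price, one billion puts would cost about 28 USD.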
  42. AWS Internal Metering Service • Diagram labels: Clients Submitting Data, Capture Submissions, Process in Realtime, Store in Redshift • Workload: tens of millions of records/sec; multiple TB per hour; 100,000s of sources • New features: scale with the business; provide real-time alerting; inexpensive; improved auditing
  43. Internal AWS Data Warehouse • Workload: daily load of billions of records from millions of files from hundreds of sources; 3-hour SLA to load and audit data; hundreds of customers; hundreds of queries per hour • New features: our data is fresh, we ingest every 6 hours; now processing triple the volume in less than 25% of the time; “Hammerstone” ETL solution (built on AWS Data Pipeline; builds business-specific marts; builds workload-specific clusters); supports a variety of analytics tools: Tableau, R, Toad, SQL Developer, etc. • Diagram labels: over 200 internal data sources; data staged in Amazon S3; "Hammerstone": custom ETL using AWS Data Pipeline; data processing Redshift cluster; batch reporting Redshift cluster; ad hoc query Redshift cluster
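
The data warehouse slide's claim of "triple the volume in less than 25% of the time" implies a throughput gain that is easy to work out. The helper below is a small sketch of that arithmetic; `throughput_gain` is a hypothetical name, and the inputs are the two ratios stated on the slide.

```python
def throughput_gain(volume_ratio: float, time_ratio: float) -> float:
    """Throughput multiplier when volume and elapsed time both change.

    volume_ratio: new volume divided by old volume (3.0 for "triple the volume").
    time_ratio:   new time divided by old time (0.25 for "25% of the time").
    """
    if time_ratio <= 0:
        raise ValueError("time_ratio must be positive")
    return volume_ratio / time_ratio


# Slide figures: 3x the volume in at most 25% of the time.
gain = throughput_gain(3.0, 0.25)  # 12.0
```

Since the slide says "less than 25%", 12x is a lower bound on the improvement in records loaded per unit of time.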
  44. IX. Beyond the Display
  46. Cloud enables connected data collection
  47. Cloud enables connected data processing
  48. Cloud enables connected data collaboration
  49. werner@amazon.com