The Evolution of Data Architecture

My view of the evolution of big data as a distributed systems researcher and engineer: how it got started, the scale-out paradigm, industry use cases, the open source development paradigm, and interesting future challenges.

Published in: Software

  1. The Evolution of Data Architecture. Wei-Chiu Chuang, 2017.10 @ NCKU
  2. Who’s Wei-Chiu?
  3. Data Value Chain (diagram labels: AI, Machine Learning, Data Science, Analytics, Big Data; Insight, Decision making, Automated decision making; Hype (?))
  4. Data is the new Oil. https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource
  5. Fastest way to transmit 5MB of data in 1956
  6. Fast forward 60 years… transmit 100PB of data in 2016
  7. Once upon a time, processors doubled in speed every 18 months …
      • “Moore’s Law” stopped 10 years ago.
      • CPU, RAM and disk speeds have almost stopped improving since then.
  8. Processor speed has been stagnant
      • But data is being generated at an ever-increasing rate.
      • Hardware improvement cannot keep up with data generation.
      • Multi-threaded systems and distributed systems are a must.
  9. Distributed Systems are hard: Programmability, Scalability, Consistency, Availability, Partition Tolerance, Fault Tolerance
  10. Big Data / Parallel Computing / Distributed Systems (diagram labels: HPC, Big Data, Cloud, Distributed Systems)
  11. Scale out
  12. Modern Data Architecture. How do you transmit, collect, store, and compute on petabyte+ storage across 1000+ compute nodes?
  13. Modern Data Center (network diagram labels: Internet, AR, Core, Aggr, ToR, and Server1 to Server10 under each ToR; link speeds 10Gbps, 10Gbps, 1Gbps)
  14. GFS
      • Master-slave architecture
      • Separation of control plane and data plane
      • Low-cost commodity hardware
      • Failures are the norm rather than the exception
      • Balances availability and network partition tolerance
      • Diagram labels: GFS client, GFS master, GFS chunkservers, file path /foo/bar, control messages, data messages
  15. MapReduce
      • A very simple yet powerful distributed programming model
      • Shared-nothing architecture
      • Programmability
      • Data locality: ship compute to data, rather than shipping data to compute
      • Fault tolerance: intermediate state is kept in storage, so failed tasks can be restarted easily
      • Diagram labels: input files (Split 0, Split 1, Split 2), map phase workers, intermediate files, reduce phase workers, output files (Output 0, Output 1); the master assigns map and reduce tasks
  16. Hadoop
  17. Hadoop
      • Inspired by GFS and MapReduce
      • Initially developed by Yahoo!
      • Released in 2006
      • Used by most large enterprises
      • Hadoop 3.0 beta 1!
  18. Evolution of the Hadoop Platform: the stack is continually evolving and growing! (timeline figure; each year's stack includes everything from the years before)
      • 2006: Core Hadoop (HDFS, MapReduce)
      • 2007: adds Solr, Pig
      • 2008: adds HBase, ZooKeeper
      • 2009: adds Hive, Mahout
      • 2010: adds Sqoop, Avro
      • 2011: adds Flume, Bigtop, Oozie, HCatalog, Hue, YARN
      • 2012: adds Spark, Tez, Impala, Kafka, Drill
      • 2013: adds Parquet, Sentry
      • 2014: adds Knox, Flink
      • 2015: adds Kudu, RecordService, Ibis, Falcon
  19. Mix and match
      • Resource management: YARN, Mesos, Kubernetes
      • Storage: HDFS, HBase, Kudu, S3, ADLS
      • Compute: MapReduce, Hive, Impala, Spark, Presto, Pig, Drill, Solr, Storm
      • Ingest: Kafka, Flume, Beam, Samza
  20. Open source in infra & platform
  21. Why open source?
      • It’s free ($$$)
      • No vendor lock-in
      • Faster development and faster adoption
      • A new approach to foster collaboration
      • Open source software is becoming the standard
  22. Sell open source software, really?
      • Water is free, but bottled water is not.
      • Cloudera sells the “bottle”: Cloudera’s Distribution of Hadoop, the integration of software, and the support and services.
      • The management software is proprietary. The OSS is free of charge.
  23. Market for open source software? (bar chart: revenue in million USD, axis 0 to 400, for Hortonworks, Cloudera, and MongoDB, FY2015, FY2016, FY2017, FY2018 (f))
  24. Open Source Business Model
      • Dual licensing: MySQL
      • Support + services: RedHat, Hortonworks
      • Open core: Java EE, Qt
      • Software as a Service: DataBricks, Amazon AWS, Microsoft Azure
      • Advertising-supported: Google Chrome, Android
      • Hybrid Open Source Software: Cloudera, Confluent, MongoDB
  25. Use Cases
  26. “Big Data” finds many applications across many industries: IT, Healthcare, Transportation, Retail, Utilities, Telecom, Public sector, Manufacturing
  27. Applications and Use cases
      • Realtime database for serving internet traffic: internet services (Facebook Messenger), Twitter, Uber, Airbnb …
      • Data analytics: assist in the development of new drugs by analyzing millions of medical records
      • Data science / machine learning: fraud detection, anti-money laundering, cybersecurity
  28. Fraud Detection System using Hadoop
  29. The Cloudera Platform for IoT: Data Management Value Chain (ENTERPRISE DATA HUB)
      • Stages: Data Sources, Data Ingest, Data Storage & Processing, Serving, Analytics & Machine Learning
      • Data sources: connected things / data sources, other data sources
      • Apache Kafka: stream or batch ingestion of IoT data
      • Apache Sqoop: ingestion of data from relational sources
      • Apache Hadoop: storage (HDFS) & deep batch processing
      • Apache Kudu: storage & serving for fast-changing data
      • Apache HBase: NoSQL data store for real-time applications
      • Apache Spark: stream & iterative processing, ML
      • Apache Impala: MPP SQL for fast analytics
      • Cloudera Search: real-time search
      • Security, scalability & easy management
      • Deployment flexibility: datacenter, cloud
  30. IoT Use Case 1: Predictive Maintenance
  31. Predictive Maintenance on Thousands of Industrial Machines in Real Time (case study: Industrial IoT, Predictive Maintenance)
      • Challenge: collect and analyze data from thousands of diverse manufacturing systems in real time
      • Solution: the iTrak application uses Cloudera in the cloud to monitor the performance of individual manufacturing systems in real time
      • Predictive maintenance: proactively identify and fix issues before equipment breaks
      • Labels: DATA-DRIVEN PROCESS, DATA-DRIVEN PRODUCTS; MANUFACTURING » INDUSTRIAL IoT » PREDICTIVE MAINTENANCE » IMPROVED EFFICIENCIES
  32. Use Case 2: Connected Vehicles
  33. Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime (case study: Connected Vehicles & Telematics)
      • Challenge: monitor the health of 180,000+ trucks in real time in order to minimize downtime
      • Solution: OnCommand Connection collects telematics and geolocation data across thousands of trucks
      • Identifies and corrects engine problems early, and increases fleet uptime
      • Reduced maintenance costs to $0.03 per mile from $0.12-$0.15 per mile
      • Labels: DATA-DRIVEN PROCESS, DATA-DRIVEN PRODUCTS; TRANSPORTATION » PREDICTIVE MAINTENANCE » TELEMETRY » LOWER TCO
  34. Use Case 3: Smart Cities & Smart Infrastructure
  35. Enabling the State of Kentucky to manage snow and ice events in real time (Data Driven Dept. of Transportation; 2016 Data Impact Award Winner, State of Kentucky Department of Transportation)
      • Challenge: the Kentucky Transportation Cabinet (KYTC) oversees the state’s transportation system, which includes 27,000 miles of highways, 230 airports and heliports, and more than three million drivers. It needed a more efficient approach to road management in inclement weather.
      • Solution: KYTC has built a real-time weather response system that incorporates real-time data from Waze, HERE, ESRI’s GeoEvent processor, and Automatic Vehicle Locations (providing sensor data from salt trucks).
      • KYTC aggregates 15-20 million records every day and processes more than a million records per second.
      • Source: http://www.routefifty.com/2016/09/data-drives-government/131821/
  36. Use Case 4: Connected Healthcare
  37. Improve Parkinson's Disease Monitoring and Treatment through IoT (case study: Connected Healthcare)
      • Challenge: collect and analyze data from wearables (more than 300 readings per second) from thousands of patients in real time
      • Solution: Cloudera on Intel architecture detects patterns in patient data streaming from wearables
      • Continuously monitors patients and symptoms to understand the progression of the disease objectively
      • Labels: DATA-DRIVEN PROCESS, DATA-DRIVEN PRODUCTS; HEALTHCARE » WEARABLES » PREDICTIVE ANALYTICS » IMPROVED CARE
  38. Building a Holistic Picture of the US Securities Market From 50 Billion Daily Events (CUSTOMER 360)
      • Saving $10-20M in operational efficiencies annually
      • 90-minute queries run in 10 seconds
      • Supporting future market growth and a dynamic regulatory environment
  39. Using Big Data to Help Consumers Save Hundreds of Millions in Utility Bills (CUSTOMER 360)
      • Relevant insight into household energy use improves energy consciousness
      • 2.7+ TWh (terawatt hours) saved to date
      • Motivated consumers to save enough energy to power every household in Salt Lake City and St. Louis for a year
      • Labels: ENERGY & UTILITIES » PRODUCT INNOVATION » SERVICE IMPROVEMENT » IOT
  40. Saving Lives by Detecting Sepsis Early Enough for Successful Treatment
      • Builds a more complete picture of patients, conditions, and trends
      • Has saved hundreds of lives already
      • Reduces hospital readmissions
      • 2PB+ in a multi-tenant environment supporting hundreds of clients
      • Secure yet explorable
      • Labels: HEALTHCARE » 360° CUSTOMER VIEW » PREDICTIVE ANALYTICS » IMPROVED SERVICE
  41. Improving Pediatric Care and Outcomes (CUSTOMER 360)
      • Quantifying the effect of ambient noise on children’s vital signs
      • Identifying cancerous genome variants in 20 minutes (vs. days before)
      • Performing fewer CT scans and higher quality surgeries
      • Labels: HEALTHCARE » MACHINE LEARNING » IOT » 360° CUSTOMER VIEW
  42. Government Revenue Service Increasing Customer Convenience (CUSTOMER 360)
      • Provides a view of the complete taxpayer journey
      • Creates the ability to pre-populate tax returns for increased ease of use
      • Supports the move to near-real-time oversight of operations and faster response
      • Labels: GOVERNMENT » SERVICE IMPROVEMENT » PROCESS IMPROVEMENT » 360° CUSTOMER VIEW
  43. Driving Growth and Innovation (CUSTOMER 360)
      • Combines 80+ years’ data spanning all business units and 50 states
      • Expedites holistic analysis and reports by 500X
      • Enables more accurate and detailed predictive models to customize offers, optimize pricing, and minimize risk
      • Labels: INSURANCE » 360° CUSTOMER VIEW » FRAUD DETECTION » PREDICTIVE ANALYTICS
  44. Re-Platformed 1,600 Operational Databases & Systems onto a Cloudera EDH (CUSTOMER 360)
      • Business & consumer data was spread over a dozen different customer databases
      • One daily ETL job (processing 1 billion customer records) used to take 24 hours
      • Increased data velocity by 15x (5 times the data in 1/3 of the time); the job now completes in 1½ hours
      • BT now has access to the most up-to-date and centralized data for all their customers
      • Labels: TELECOMMUNICATIONS » IMPROVED SERVICE » PROCESS IMPROVEMENT » IT COST REDUCTION
  45. Future
  46. Future
      • Hardware evolution: cloud; 40Gbps and 100Gbps networks; GPU, TPU; flash disk
      • Application-driven: machine learning, deep learning; realtime data stream processing (IoT)
  47. Future: How to scale by an order of magnitude in 5 years? In 10 years? (chart marker: “We are here today”)
  48. Taiwan Data Engineering Association (台灣資料工程協會)
  49. Taiwanese contributors to Apache (台灣人參與Apache): 葉祐欣, 謝良奇, 蔡東邦, 陳恩平, 戴資力, 莊偉赳, 蔡嘉平
  50. Apache Contributor talent development contest (Apache Contributor 育才賽)
  51. Takeaway. If you only remember 3 things from this talk:
      1. Data is the new Oil
      2. Open source is the standard
      3. Think big! Remember GFS: failures are the norm rather than the exception!
  52. Thank you. jojochuang@gmail.com / weichiu@apache.org / weichiu@cloudera.com
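
The MapReduce slide (slide 15) describes a map phase, intermediate files, a reduce phase, and failed tasks that can simply be restarted. The sketch below is not from the talk; it is a minimal single-process illustration of that model, assuming a word-count job. The names `map_task`, `shuffle`, `reduce_task`, `run_with_retry`, and `word_count` are hypothetical, and a real framework such as Hadoop would run the tasks on many workers and keep intermediate state in storage rather than in memory.

```python
from collections import defaultdict


def map_task(split):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Group intermediate pairs by key (stands in for the intermediate files)."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce phase: combine all values for one key into a single count."""
    return key, sum(values)


def run_with_retry(task, *args, attempts=3):
    """Re-run a failed task. This is safe because each task depends only on
    its own input, so a restart produces the same result."""
    last_error = None
    for _ in range(attempts):
        try:
            return task(*args)
        except RuntimeError as err:  # simulated worker failure
            last_error = err
    raise last_error


def word_count(splits):
    """Run the whole job: map every split, shuffle, then reduce every key."""
    mapped = [run_with_retry(map_task, split) for split in splits]
    grouped = shuffle(mapped)
    return dict(run_with_retry(reduce_task, key, values)
                for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data big ideas", "data is the new oil", "think big"]
    counts = word_count(splits)
    print(counts["big"], counts["data"])  # prints: 3 2
```

Because every map or reduce task reads only its own input and has no side effects, `run_with_retry` can rerun it after a failure without coordinating with other tasks, which is the property the slide's fault-tolerance bullet relies on.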