Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Achieving compute and storage independence for data-driven workloads

245 Aufrufe

Veröffentlicht am

Two Sigma Open Source Meetup
Presenters: Wenbo Zhao from Two Sigma and Bin Fan from Alluxio

Veröffentlicht in: Technologie
  • Als Erste(r) kommentieren

  • Gehören Sie zu den Ersten, denen das gefällt!

Achieving compute and storage independence for data-driven workloads

  1. 1. Achieving compute and storage independence for data-driven workloads Two SigmaOpen Source Meetup Wenbo Zhao | Software Engineer |Two Sigma Bin Fan | Founding Engineer &VP, Open Source | Alluxio
  2. 2. Agenda Why storage-independent compute? Alluxio Overview Demo: Spark + Alluxio + S3 Future work
  3. 3. A Stack to Enable Cost-Effective Elastic Compute Spark/Presto/TF.. Alluxio HDFS Public Cloud On-Prem Burst compute workload to Cloud Data in a fully controlled environment Improved Data Locality
  4. 4. The Alluxio Story Project started asTachyon, at the UC Berkley’s AMP Lab by then Ph.D. student & now Alluxio CEO, Haoyuan (H.Y.) Li. 2014 2015 Open Source project established & company to commercialize Alluxio founded Goal: Unify Data at Memory Speed for data driven applications such as Big Data Analytics, ML and AI. 2018 Top10 Hottest Data Storage Startup
  5. 5. From mainframes to Big Data Moving from tightly integrated to loosely integrated architectures Application, processing, data storage and hardware - All-in-one tightly coupled Client server architecture drives application separation. Processing and data storage still tightly coupled Data growth drives distributed MPP architectures but processing and data storage still tightly coupled Further data growth drives distributed file system architecture. Processing and data storage co- located but loosely coupled 1970s 1980s 2000s 2010s
  6. 6. The Big Data Ecosystem Co-located compute and storage for big data workloads § More defined and loosely coupled compute layer compared with relational databases § But compute / data processing still runs on the same node as where the data is stored. MapReduce runs on HDFS across the cluster § Compute layer and storage layer must be scaled out by the same factor
  7. 7. The Big Data Ecosystem Explodes Moving from tightly integrated to loosely integrated architectures STORAGE COMPUTE
  8. 8. Why independently scale compute and storage for data-driven applications? Flexible compute scaling based on application demands Flexible storage scaling based on data growth patterns Compute is CPU bound Storage is I/O bound
  9. 9. Why independently scale compute and storage for data-driven applications? Flexibility of using hardware / instances that match the workload CPU / GPU + IO + Memory S3 Leverage cheaper and newer storage like object stores for big data / AI workloads Orchestrate & automate compute for greater operational efficiency Protect & control your data on premises and leverage public cloud for compute
  10. 10. The challenges of independent scaling for data-driven workloads Data Locality Data Accessibility Data Abstraction Data is no more local to compute and workload processing time will increase particularly in hybrid cloud deployments Data is in multiple storage systems in multiple locations. Highly complex when all compute frameworks talk to all storage systems Data can still only be accessed using the specific storage system APIs
  11. 11. Java File API HDFS Interface S3 Interface REST APIPOSIX Interface HDFS Driver Swift Driver S3 Driver NFS Driver
  12. 12. Alluxio Key Innovations
  13. 13. Unified Namespace Bring all files into a single interface Interact with data using any API Accelerate & tier data transparently API Translation Intelligent Multi-tiering Key Innovations of theVirtual Unified File System
  14. 14. Data Abstraction via Unified Namespace Enables effective data management across different Under Store - Uses Mounting withTransparent Naming
  15. 15. Unified Namespace: Global Data Accessibility Transparent access to understorage makes all enterprise data available locally SUPPORTS • HDFS • NFS • OpenStack • Ceph • Amazon S3 • Azure • Google Cloud IT OPS FRIENDLY • Storage mounted into Alluxio by central IT • Security in Alluxio mirrors source data • Authentication through LDAP/AD • Wireline encryption HDFS #1 Object Store NFS HDFS #2 Interacting with data in Alluxio – Unified Namespace
  16. 16. Data Accessibility via Server-side API Translation Convert from Client-side Interface to native Storage Interface Java File API HDFS Interface S3 Interface REST APIFUSE Interface HDFS Driver Swift DriverS3 Driver NFS Driver
  17. 17. Interacting with data in Alluxio – variety of APIs Spark Hadoop POSIX Java Application have great flexibility to read / write data with many options > rdd = sc.textFile(“alluxio://localhost:19998/myInput”) $ hadoop fs -cat alluxio://localhost:19998/myInput $ cat /mnt/alluxio/myInput FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
  18. 18. Data Locality via Intelligent Multi-tiering Local performance from remote data using multi-tier storage Hot Warm Cold RAM SSD HDD Read & Write Buffering Transparent to App Policies for pinning, promotion/demotion,TTL
  19. 19. Interacting with data in Alluxio – flexible app patterns Reading Data • Data not in Alluxio (i.e. first time, or no cache) • Data on same node as client • Data on different node from client Writing Data • Write only to Alluxio • Write only to Under Store • Write synchronously to Alluxio and Under Store • Write to Alluxio and asynchronously write to Under Store • Write to Alluxio and replicate to N other workers • Write to Alluxio and async write to multiple Under stores Application have great flexibility to read / write data with many options
  20. 20. Interacting with data in Alluxio – data management Data Management • Pinning • Prefetch/free • Cross mount points cp/mv • TTL Application have great flexibility to read / write data with many options
  21. 21. Demo: Spark + Alluxio + S3
  22. 22. Deployment Approaches Spark Alluxio Storage Co-locate Alluxio Workers with Spark for optimal I/O performance Any Cloud Same instance / container Spark Alluxio Storage Deploy Alluxio as standalone cluster between Spark and Storage Any Cloud Same data center / region Presto
  23. 23. Alluxio Master Zookeeper / RAFT Standby Master Under Store 1 Under Store 2 WANAlluxio Client Application Object Store Alluxio Client Application Alluxio Worker RAM / SSD / HDD Alluxio Worker RAM / SSD / HDD Alluxio Reference Architecture
  24. 24. Read data in Alluxio, on same node as client 24 Alluxio Worker RAM / SSD / HDD Memory Speed Read of Data Application Alluxio Client Alluxio Master
  25. 25. Read data not in Alluxio 25 RAM / SSD / HDD Network / Disk Speed Read of Data Application Alluxio Client Alluxio Master Alluxio WorkerUnder Store
  26. 26. Write data only to Alluxio on same node as client 26 Alluxio Worker RAM / SSD / HDD Memory Speed Write of Data Application Alluxio Client Alluxio Master
  27. 27. Write data to Alluxio and Under Store synchronously 27 RAM / SSD / HDD Network / Disk Speed Write of Data Application Alluxio Client Alluxio Master Alluxio Worker Under Store
  28. 28. Real world Use cases
  29. 29. § Huya: 1300s (100% Hadoop nodes) § Sogou: 1000s (60% Hadoop nodes) § Momo: 850 nodes § JD: 600 nodes § Tencent: 400 nodes § Huawei Cloud: 150 nodes § China Unicom: 100 nodes § … Truly independent scaling of the data stack Alluxio Production Deployments
  30. 30. China Unicom Challenge Desired a central view of business data across multiple systems for big data workloads Solution Alluxio integrates data across multiple storage system to be accessed by Spark in a hybrid environment Impact Significantly faster workloads and faster innovation
  31. 31. Machine Learning Case Study Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency Solution – ETL Data from Teradata to Alluxio Impact – Faster Time to Market – “Now we don’t have to work Sundays” Use Case: http://bit.ly/2oMx95W SPARK TERADATA SPARK TERADATA
  32. 32. Analytics Use Case – Top Retailer Challenge – Bottleneck in Trend Analysis of mission critical daily sales and inventory management Queries were slow / not interactive, resulting in operational inefficiency Solution – With Alluxio, data queries are 10X faster Impact – Higher operational efficiency Use case: http://bit.ly/2ook8Nh SPARK HDFS SPARK HDFS
  33. 33. Truly independent scaling of the data stack Coming Online Events: POSIX API & 2.0 Preview https://go.alluxio.org/OH-Running-Tensorflow-Alluxio-S3 https://go.alluxio.com/webinar-alluxio-preview-release
  34. 34. Incredible Open Source Momentum with growing community 900+ contributors & growing 3900+ Git Stars Apache 2.0 Licensed Hundreds of thousands of downloads https://www.alluxio.org/slack

×