
Feature Store as a Data Foundation for Machine Learning

Looking to design and build a centralized, scalable Feature Store for your Data Science and Machine Learning teams? Learn how from experts at Provectus and Amazon Web Services (AWS).

Feature Store is a key component of the ML stack and data infrastructure that enables feature engineering and management. With a Feature Store, organizations can save significant resources, innovate faster, and run ML processes at scale. In this webinar, you will learn how to build a Feature Store with a data mesh pattern, how to achieve consistency between real-time and training features, and how to improve reproducibility with time travel for data.

Agenda
- Modern Data Lakes & Modern ML Infrastructure
- Existing and Emerging Architectural Shifts
- Feature Store: Overview and Reference Architecture
- AWS Perspective on Feature Store

Intended Audience
Technology executives & decision makers, manager-level tech roles, data architects & analysts, data engineers & data scientists, ML practitioners & ML engineers, and developers

Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Gandhi Raketla, Senior Solutions Architect, AWS
- German Osin, Senior Solutions Architect, Provectus

Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!

REQUEST WEBINAR: https://provectus.com/webinar-feature-store-as-data-foundation-for-ml-nov-2020/


Feature Store as a Data Foundation for Machine Learning

  1. Feature Store as a Data Foundation for ML. Presented by: Stepan Pushkarev, CTO @ Provectus; Gandhi Raketla, Senior Solutions Architect @ AWS
  2. Agenda
      1. Introductions
      2. Modern Data Lakes and Modern ML Infrastructure
      3. Emerging Architectural Shifts
      4. Feature Store: 200 LOD overview and reference architecture on AWS
      5. AWS Perspective on Feature Store
  3. Introductions
      - Stepan Pushkarev, Chief Technology Officer, Provectus
      - Gandhi Raketla, Senior Solutions Architect, AWS
      - German Osin, Senior Solutions Architect, Provectus
  4. AI-First Consultancy & Solutions Provider
      - Clients ranging from fast-growing startups to large enterprises
      - 450 employees and growing
      - Established in 2010
      - HQ in Palo Alto
      - Offices across the US, Canada, and Europe
      - We are obsessed with leveraging cloud, data, and AI to reimagine the way businesses operate, compete, and deliver customer value
  5. Our Clients
      - Innovative Tech Vendors: seeking niche expertise to differentiate and win the market
      - Midsize to Large Enterprises: seeking to accelerate innovation and achieve operational excellence
  6. Challenges of Modern Data Platforms
  7. Modern Data Lakes You Know
  8. Common Challenges: Data Access and Discoverability
      1. Data is scattered across multiple data sources and technologies
      2. Managing AWS IAM roles, Amazon S3 policies, API Gateways, and database permissions is tedious
      3. It gets even more complicated in an AWS multi-account setup
      4. Metadata is not discoverable
      5. As a result, all the investments into Data and ML are killed by data access issues
  9. Common Challenges: Monolithic Data Teams
      1. Lack of ownership and domain context: a disconnect between data producers and data consumers
      2. A backlogged data team struggles to keep pace with business demands
      3. No contracts between Data and ML Engineering
      4. As a result, fast end-to-end experimentation is killed by complex dependencies between teams
      Source: https://martinfowler.com/articles/data-monolith-to-mesh.html
  10. Common Challenges: ML Experimentation Infrastructure
      1. Inherited issues with Data Discovery and Data Access
      2. Reproducibility of datasets, ML pipelines, ML environments, and offline experiments is still an issue
      3. Production experimentation frameworks are still fairly immature
      4. As a result, an end-to-end experiment, from data to a production ML metric, costs 3-6 months
      Source: https://hbr.org/2020/03/building-a-culture-of-experimentation
  11. Common Challenges: Scaling ML Adoption in Production
      1. Online serving: there is no unified and consistent way to access features during model serving
      2. Features cannot be reused between multiple training pipelines and ML applications
      3. Monitoring and maintenance of ML applications
      4. As a result, the time and cost to scale from 1 to 100 models in production grow exponentially
      What is your cost per ML Model in Production?
  12. Emerging Architectural Shifts
  13. Emerging Architectural Shifts
      - Data Lake -> Hudi/Delta Lakes: Hudi/Delta Lakes bring managed ingestion, ACID transactions, and point-in-time queries into traditional Data Lakes
      - Data Lake -> Data Mesh: Ownership of data domains, data pipelines, metadata, and API is shifting from centralized teams to product teams
      - Data Lake -> Data Infrastructure as a platform: Unified reusable platform components and frameworks across the enterprise
      - Endpoint Protection -> Global Data Governance: Data security and privacy measures are becoming centralized as part of the Data Platform
      - Metadata Store -> Global Data Catalog: User experience around data discovery, lineage, and versioning requires investments into a metadata-rich Data Catalog
      - Feature Store: Scaling ML experimentation and operations requires a separate data management layer for ML features
      - ML Toolkit -> Complete ML Infrastructure: ML capabilities are democratized for ML Engineers and citizen Data Scientists
  14. ACID Data Lakes (Delta/Hudi Lakes)
      - Managed ingestion
      - Dataset versioning for ML training
      - Cheap "deletes" (common GDPR use case)
      - Audit log of any changes in datasets
      - Brings ACID transactions to your data lake
      - "Upserts" strategy on data ingestion
      - Enables schemas to enforce data quality
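
The upsert, delete, and dataset-versioning ideas on this slide can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `VersionedTable` and its methods are hypothetical names, not the Delta Lake or Hudi API, and a real lake stores commits as files with a transaction log rather than full Python snapshots.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VersionedTable:
    """Toy table (hypothetical): every commit appends an immutable snapshot."""

    key: str
    # snapshots[i] is the full table state after commit i (version 0 is empty)
    snapshots: List[Dict[str, dict]] = field(default_factory=lambda: [{}])

    def upsert(self, rows: List[dict]) -> int:
        """Insert new keys and overwrite existing ones in a single commit."""
        new_state = dict(self.snapshots[-1])  # copy, so older versions stay intact
        for row in rows:
            new_state[row[self.key]] = dict(row)
        self.snapshots.append(new_state)
        return len(self.snapshots) - 1  # version number of this commit

    def delete(self, keys: List[str]) -> int:
        """Remove keys in a new commit; older snapshots still contain them."""
        new_state = {k: v for k, v in self.snapshots[-1].items() if k not in keys}
        self.snapshots.append(new_state)
        return len(self.snapshots) - 1

    def read(self, version: Optional[int] = None) -> List[dict]:
        """Read the latest state, or the state as of a given version."""
        state = self.snapshots[-1 if version is None else version]
        return sorted(state.values(), key=lambda r: r[self.key])


table = VersionedTable(key="user_id")
v1 = table.upsert([{"user_id": "a", "clicks": 1}, {"user_id": "b", "clicks": 5}])
v2 = table.upsert([{"user_id": "a", "clicks": 2}])  # update of an existing key
v3 = table.delete(["b"])
```

Reading with `table.read(v1)` returns the dataset exactly as it was after the first commit, which is the property that makes a training run reproducible. Note that this toy keeps deleted rows in older snapshots; a real privacy delete would also have to purge that history.
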
  15. Global Data Governance
      - Accelerate privacy operations with data you already have: Automate business processes, data mapping, and PI discovery and classification for privacy workflows.
      - Operationalize policies in a central location: Govern privacy policies to ensure policies are effectively managed across the enterprise. Define and document workflows, traceability views, and business process registers.
      - Scale compliance across multiple regulations: Use a platform designed and built with privacy in mind that is easily extensible to support new regulations.
      - AWS Config, AWS Lake Formation
  16. Global Data Catalog
      Meta-metadata store:
      - Does this data exist? Where is it?
      - What is the source of truth of this data?
      - Do I have access?
      - Who is the owner?
      - Who are the users of this data?
      - Are there existing assets I can reuse?
      - Can I trust this data?
      * There are no established leaders in open source
  17. The Core of MLOps and Reproducible Experimentation Pipelines
      Diagram labels: Model Code, ML Pipeline Code, Infrastructure as a Code, Versioned Dataset, Feature Store, Automated Pipeline Execution, Orchestration: Idempotent Execution, Pipeline Metadata, Model Artifacts, ML Metrics, Prediction Service, Production Metrics & Alerts, Alerts, Reports, Feedback Loop for Production Data
  18. Feature Store
  19. Feature Store Value Proposition
      A data management layer for machine learning features.
      1. Better ROI from feature engineering through reduction of cost per model: facilitates collaboration, sharing, and reuse of features
      2. Faster time to market for new models through increased productivity of ML Engineers: decoupled storage implementation and feature serving API
  20. Feature Store: Canonical Use Cases
      - Personalization & Recommendation Engines
      - Dynamic Pricing Optimization
      - Supply Chain Optimization
      - Logistics and Transportation Optimization
      - Fraud Detection
      - Predictive Maintenance
      - Demand Forecasting
      * All the use cases where ML models need a stateful, ever-changing representation of the system
  21. Feature Store: Concepts
      - Online Feature Store: Online applications look up a feature vector that is sent to an ML model for predictions
      - ML-specific Metadata: Enables feature discoverability and reuse
      - ML-specific API and SDK: High-level operations for fetching training feature sets and for online access
      - Materialized Versioned Datasets: Maintains versions of the feature sets used to train ML models
      Diagram labels: Raw Data, Feature Engineering, Feature Store, Training, Serving, Discovery
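
The online lookup and the training-time lookup described on this slide can be sketched as two read paths over the same ingested data. This is a minimal illustration, not any product's SDK: `MiniFeatureStore`, its method names, and the integer timestamps are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class MiniFeatureStore:
    """Hypothetical in-memory feature store with an online and an offline view."""

    def __init__(self) -> None:
        # offline: (entity_id, feature) -> list of (event_time, value), kept sorted
        self._offline: Dict[Tuple[str, str], List[Tuple[int, float]]] = defaultdict(list)
        # online: entity_id -> {feature: latest value}
        self._online: Dict[str, Dict[str, float]] = defaultdict(dict)

    def ingest(self, entity_id: str, feature: str, event_time: int, value: float) -> None:
        """Write one value; the online view is derived from the offline history."""
        history = self._offline[(entity_id, feature)]
        history.append((event_time, value))
        history.sort()
        self._online[entity_id][feature] = history[-1][1]

    def get_online_features(self, entity_id: str, features: List[str]) -> List[Optional[float]]:
        """Feature vector for serving: latest known value per feature."""
        row = self._online.get(entity_id, {})
        return [row.get(f) for f in features]

    def get_training_features(
        self, entity_id: str, features: List[str], as_of: int
    ) -> List[Optional[float]]:
        """Feature vector as it looked at time `as_of` (nothing from the future)."""
        vector: List[Optional[float]] = []
        for f in features:
            history = self._offline.get((entity_id, f), [])
            # index of the first entry with event_time > as_of
            idx = bisect_right(history, (as_of, float("inf")))
            vector.append(history[idx - 1][1] if idx else None)
        return vector


store = MiniFeatureStore()
store.ingest("user_1", "orders_7d", event_time=100, value=3.0)
store.ingest("user_1", "orders_7d", event_time=200, value=5.0)
store.ingest("user_1", "avg_basket", event_time=150, value=42.5)

features = ["orders_7d", "avg_basket"]
online_vector = store.get_online_features("user_1", features)
training_vector = store.get_training_features("user_1", features, as_of=120)
```

Because both read paths are fed by the same `ingest` call, a model trained on `get_training_features` sees values computed the same way as the ones it later receives from `get_online_features`.
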
  22. Feature Store: Market
      Platform | License | Supported Platforms
      Feast (now backed by Tecton) | Apache V2 | AWS (in roadmap), GCP
      Uber Michelangelo | In-house product | N/A
      Hopsworks | AGPL-V3 | AWS, GCP, On-Premises
      Tecton | Enterprise | AWS, GCP & Azure (2021)
      Airbnb Zipline | In-house product | N/A
      Comcast | In-house product | N/A
      Netflix Metaflow | In-house product | N/A
      Twitter | In-house product | N/A
      Facebook FBLearner | In-house product | N/A
      Pinterest Galaxy | In-house product | N/A
  23. Feast
      Pros:
      - Battle-tested with GoJek, Farfetch, Postmates, and Zulily
      - Integrated with Kubeflow
      - Good community
      Cons (to be addressed in the roadmap):
      - GCP only
      - Infrastructure-heavy
      - Lacks composability
      - No data versioning
      * Now backed by Tecton
      * https://blog.feast.dev/post/a-state-of-feast
      Diagram labels: Ingestion API, Ingestion, Feature Registry, Offline Store (BigQuery), Online Store (Redis), Historical Serving, Online Serving, Training, Serving, Discovery
  24. Hopsworks
      Pros:
      - Integrates with most Python libraries for ingestion and training
      - Supports an offline store with time travel
      - AWS / GCP / Azure / On-Prem ready
      Cons:
      - Hard to use outside of the HopsML infrastructure
      - The online store might not fit all latency requirements
      * Online serving is part of the Enterprise version
      Diagram labels: Ingestion API, Spark, Pandas, Feature Registry, Offline Store (Hudi/Hive), Online Store (MySQL), Historical Serving, Online Serving, Training, Serving, Discovery
  25. Modern Data Infrastructure
      Diagram labels: Raw Data, Event Data, Stream Processing, Batch Processing, Hot Storage, Cold Storage, BI Tools, API, Workflow Automation, Data Catalog, Data Quality, Data Security
  26. Data Infrastructure
      Diagram labels: Feature Store, Raw Data, Event Data, Stream Feature Processing, Batch Feature Processing, Hot Storage, Cold Storage, Training, Serving, Workflow Automation, Data Catalog, Data Quality, Data Security
  27. Lessons Learned
      1. Start by designing a consistent ACID Data Lake before investing in a Feature Store
      2. The value of existing open source products does not justify the investment in integration and the dependencies they bring
      3. A Feature Store must not bring new infrastructure and data storage solutions. It has to be a lightweight API and SDK integrated into your existing data infrastructure.
      4. Data Catalog, Data Governance, and Data Quality components are horizontal for the whole Data Infrastructure, including the Feature Store
      5. There are no mature open source or cloud solutions for Global Data Catalog and Data Quality monitoring
  28. Data Infrastructure with Feature Store
      Diagram labels: Raw Data, Event Data, Stream Processing, Batch Processing, Hot Storage, Cold Storage, BI Tools, API, Workflow Automation, Training, Serving, Feature Store API, Data Catalog, Data Quality, Data Security
  29. Reference Architecture
      Diagram labels: Raw Data, Event Data, Stream Processing, Batch Processing, Hot Storage, Cold Storage, BI Tools, API, Workflow Automation, Training, Serving, Feature Store API, Data Catalog, Data Quality, Data Security
  30. Reference Architecture: Components
      Diagram labels: Cold Storage, Hot Storage, Data Catalog, Data Quality, Great Expectations, DEEQU, Feature Store API, ?, Glue Metadata, ?, ?
  31. Recommended Strategy
      Recommendations for going forward with Feature Store:
      1. Make sure your existing Data Infrastructure covers 90% of Feature Store requirements (Streaming Ingestion, Consistency, Catalog, Versioning)
      2. Build a lightweight in-house Feature Store API on top of your existing storage solutions
      3. Collaborate with the community and cloud vendors to maintain compatibility with standards and the state-of-the-art ecosystem
      4. Be ready to migrate to a managed service or an open source alternative as the market matures
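
One way to read the second recommendation is as a thin facade that owns no storage and only routes calls to stores you already run. The sketch below assumes a hypothetical `KeyValueBackend` interface and uses an in-memory stand-in so it runs as-is; in practice the backends would wrap your existing hot and cold storage.

```python
from typing import Dict, List, Optional, Protocol


class KeyValueBackend(Protocol):
    """What the API needs from an existing store (hypothetical interface)."""

    def put(self, key: str, value: Dict[str, float]) -> None: ...

    def get(self, key: str) -> Optional[Dict[str, float]]: ...


class InMemoryBackend:
    """Stand-in for an existing hot or cold store, used so the sketch runs."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, float]] = {}

    def put(self, key: str, value: Dict[str, float]) -> None:
        self._data[key] = dict(value)

    def get(self, key: str) -> Optional[Dict[str, float]]:
        return self._data.get(key)


class FeatureStoreAPI:
    """Thin facade: it owns no storage and only routes to injected backends."""

    def __init__(self, hot: KeyValueBackend, cold: KeyValueBackend) -> None:
        self._hot = hot
        self._cold = cold

    def write(self, feature_group: str, entity_id: str, values: Dict[str, float]) -> None:
        key = f"{feature_group}/{entity_id}"
        self._cold.put(key, values)  # durable copy, e.g. for training
        self._hot.put(key, values)   # low-latency copy for serving

    def read_online(
        self, feature_group: str, entity_id: str, names: List[str]
    ) -> List[Optional[float]]:
        row = self._hot.get(f"{feature_group}/{entity_id}") or {}
        return [row.get(n) for n in names]


hot, cold = InMemoryBackend(), InMemoryBackend()
api = FeatureStoreAPI(hot=hot, cold=cold)
api.write("user_stats", "user_1", {"orders_7d": 5.0})
vector = api.read_online("user_stats", "user_1", ["orders_7d", "avg_basket"])
```

Keeping the backends behind a small interface is also what makes the fourth recommendation practical: migrating to a managed service later means writing a new backend, not changing callers.
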
  32. Feature Store on AWS. Gandhi Raketla, Senior Solutions Architect. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
  33. AWS Feature Storage Capabilities
      - Reuse: Use the existing feature store pipeline developed by data engineers to re-compute and cache features in a feature store
      - Store: Store the metadata of features, such as a description, documentation, and statistical measures, in the feature store
      - Discover: Make the metadata searchable through an API for ML practitioners
      - Govern: Add a data management layer on top of the feature store for governance and access control
      - Consume: Allow ML practitioners to query and consume features through an API, to export the features for training or real-time inference
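
The Store and Discover capabilities can be pictured as a small metadata registry with a search call. The following is a hedged sketch only: `FeatureMetadata`, `FeatureRegistry`, and the example feature names are hypothetical and do not correspond to an AWS API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureMetadata:
    """Metadata kept per feature: description, owner, summary statistics."""

    name: str
    description: str
    owner: str
    stats: Dict[str, float] = field(default_factory=dict)


class FeatureRegistry:
    """Hypothetical registry that makes feature metadata searchable."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureMetadata] = {}

    def register(self, meta: FeatureMetadata) -> None:
        self._features[meta.name] = meta

    def search(self, text: str) -> List[str]:
        """Case-insensitive substring match on name and description."""
        needle = text.lower()
        return sorted(
            m.name
            for m in self._features.values()
            if needle in m.name.lower() or needle in m.description.lower()
        )


registry = FeatureRegistry()
registry.register(
    FeatureMetadata("orders_7d", "Number of orders in the last 7 days", "growth-team", {"mean": 2.4})
)
registry.register(
    FeatureMetadata("avg_basket", "Average basket value per order", "pricing-team")
)
matches = registry.search("order")
```
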
  34. Components of Feature Store
      - Storage: S3, DynamoDB, Redis, Aurora
      - Catalog: Glue Crawler, Glue ETL, Glue Catalog
      - Query/API: Athena, Lambda
  35. Storage
  36. Amazon DynamoDB: Fast and flexible key-value database service for any scale
      - Performance at scale: Consistent, single-digit millisecond response times at any scale; build applications with virtually unlimited throughput
      - Serverless architecture: No hardware provisioning, software patching, or upgrades; scales up or down automatically; continuously backs up your data
      - Global replication: Build global applications with fast access to local data by easily replicating tables across multiple AWS Regions
      - Enterprise security: Encrypts all data by default and fully integrates with AWS Identity and Access Management for robust security
  37. Amazon ElastiCache: Managed, Redis- or Memcached-compatible in-memory data store
      - Consistent high performance: In-memory data store and cache for sub-millisecond response times
      - Fully managed: AWS manages all hardware and software setup, configuration, and monitoring
      - Unlimited scale: Read scaling with replicas; write and memory scaling with sharding; nondisruptive scaling
  38. Amazon Aurora: MySQL- and PostgreSQL-compatible relational database built for the cloud
      - Performance & scalability: 5x the throughput of standard MySQL and 3x that of standard PostgreSQL; scale out up to 15 read replicas
      - Availability & durability: Fault-tolerant, self-healing storage; 6 copies of data across 3 AZs; continuous backup to Amazon S3
      - Highly secure: Network isolation, encryption at rest and in transit
      - Fully managed: Managed by Amazon RDS; no server provisioning, software patching, setup, configuration, or backups on your part
  39. Catalog
  40. AWS Glue: Components
      - Data Catalog: Hive Metastore compatible with enhanced functionality; crawlers automatically extract metadata and create tables; integrated with Amazon Athena and Amazon Redshift Spectrum
      - Job Authoring: Auto-generates ETL code; built on open frameworks (Python and Spark); developer-centric (editing, debugging, sharing)
      - Job Execution: Runs jobs on a serverless Spark platform; provides flexible scheduling; handles dependency resolution, monitoring, and alerting
  41. Query
  42. Amazon Athena: Serverless, interactive query service (Analytics)
      - Serverless: Zero infrastructure, zero administration; integrated with QuickSight
      - Easy: Query instantly; zero setup cost; point to S3 and start querying
      - SQL: ANSI SQL; JDBC/ODBC drivers; multiple formats, compression types, and complex joins and data types
      - Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression; use S3 storage
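
Since the query layer speaks SQL, a point-in-time feature lookup can be expressed as an ordinary query. The sketch below runs against Python's built-in `sqlite3` purely as a local stand-in so that it is executable; the `features` table and its columns are hypothetical, the `?` placeholder is sqlite3's parameter style, and the statement may need adjusting before it would run on Athena.

```python
import sqlite3

# Latest value per entity as of a cutoff time, written in plain SQL.
POINT_IN_TIME_SQL = """
SELECT f.entity_id, f.value
FROM features AS f
WHERE f.event_time = (
    SELECT MAX(g.event_time)
    FROM features AS g
    WHERE g.entity_id = f.entity_id AND g.event_time <= ?
)
ORDER BY f.entity_id
"""

conn = sqlite3.connect(":memory:")  # local stand-in, not Athena
conn.execute("CREATE TABLE features (entity_id TEXT, event_time INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [("user_1", 100, 3.0), ("user_1", 200, 5.0), ("user_2", 150, 1.0)],
)

# Feature values as they were known at two different points in time.
as_of_120 = conn.execute(POINT_IN_TIME_SQL, (120,)).fetchall()
as_of_250 = conn.execute(POINT_IN_TIME_SQL, (250,)).fetchall()
```

Entities with no value before the cutoff simply drop out of the result, so `as_of_120` contains only `user_1`.
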
  43. Questions, details? We would be happy to answer! 125 University Avenue, Suite 290, Palo Alto, California 94301. provectus.com
