Enabling a Data Mesh Architecture with Data Virtualization

  1. #DenodoDataFest A Data Mesh Enabled by Data Virtualization: Creating a Self-Service Platform. Pablo Alvarez-Yanez, Global Director of Product Management, Denodo
  2. Agenda 1. What is a Data Mesh 2. What is Data Virtualization (DV) 3. How can DV Enable a Data Mesh 4. Implementation Strategies 5. Why a Data Lake alone is not Enough
  3. What is a Data Mesh
  4. What is a Data Mesh ▪ The Data Mesh is a new architectural paradigm for data management ▪ Proposed by the consultant Zhamak Dehghani in 2019 ▪ It moves from a centralized data infrastructure managed by a single team to a distributed organization ▪ Several autonomous units (domains) are in charge of managing and exposing their own “Data Products” to the rest of the organization ▪ Data Products should be easily discoverable, understandable and accessible to the rest of the organization
  5. What Challenges is a Data Mesh Trying to Address? 1. Lack of domain expertise in centralized data teams ▪ Centralized data teams are disconnected from the business ▪ They need to deal with data and business needs they do not always understand 2. Lack of flexibility of centralized data repositories ▪ Data infrastructure of big organizations is very diverse and changes frequently ▪ Modern analytics needs may be too diverse to be addressed by a single platform: one size never fits all 3. Slow data provisioning and response to changes ▪ Requires extracting, ingesting and synchronizing data in the centralized platform ▪ Centralized IT becomes a bottleneck
  6. How? • Organizational units (domains) are responsible for managing and exposing their own data • Domains understand better how the data they own should be processed and used • Gives them autonomy to use the best tools to deal with their data, and to evolve them when needed • Results in shorter and fewer iterations until business needs are met • Removes dependency on fully centralized data infrastructures • Removes bottlenecks and accelerates changes • Introduces new concepts to address risks like creating data silos, duplicated effort and lack of unified governance • These will be explored in the following slides
  7. Data as a Product ▪ To ensure that domains do not become isolated data silos, the data exposed by the different domains must be: ▪ Easily discoverable ▪ Understandable ▪ Secured ▪ Usable by other domains ▪ The level of trust and quality of each dataset needs to be clear ▪ The processes and pipelines that generate the product (e.g. cleansing and deduplication) are internal implementation details and hidden from consumers
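To make the product contract concrete, here is a minimal sketch of the metadata a domain might publish with each data product; the descriptor fields (name, schema, quality tier, endpoints) and all values are illustrative assumptions, not a fixed standard.

```python
from dataclasses import dataclass

@dataclass
class DataProductDescriptor:
    """Hypothetical metadata a domain publishes alongside a data product.

    Consumers see only this contract; the pipelines that produce the data
    (cleansing, deduplication, ...) stay internal to the owning domain.
    """
    name: str               # e.g. "customer_360"
    domain: str             # owning domain, e.g. "marketing"
    description: str        # business-friendly explanation
    schema: dict[str, str]  # column name -> type: the public contract
    endpoints: list[str]    # how the product can be consumed
    quality_tier: str       # declared level of trust, e.g. "gold"
    owner_contact: str      # who answers questions about the product

product = DataProductDescriptor(
    name="customer_360",
    domain="marketing",
    description="Deduplicated view of all customers across channels",
    schema={"customer_id": "varchar", "full_name": "varchar", "segment": "varchar"},
    endpoints=["sql://denodo/marketing/customer_360",
               "https://api.example.com/rest/marketing/customer_360"],
    quality_tier="gold",
    owner_contact="marketing-data@example.com",
)
```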
  8. Self-serve Data Platform ▪ Building, securing, deploying, monitoring and managing data products can be complex ▪ Not all domains will have the resources to build this infrastructure ▪ Possible duplication of effort across domains ▪ Self-serve: while operated by a global data infrastructure team, the platform allows the domains to create and manage the data products themselves ▪ The platform should be able to automate or simplify tasks such as: ▪ Data integration and transformation ▪ Security policies and identity management ▪ Exposure of data APIs ▪ Publication and documentation in a global catalog
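As an illustration of the self-serve idea, the sketch below shows a domain registering a data product through a hypothetical platform REST API; the URL, route and response field are assumptions, and the point is that the platform, not the domain, then wires up security, the data API and the catalog entry.

```python
import requests  # assumes the 'requests' library is installed

PLATFORM_URL = "https://dataplatform.example.com/api"  # hypothetical endpoint

def publish_data_product(descriptor: dict, token: str) -> str:
    """Sketch of a domain self-serving a new data product: the platform
    handles API exposure, security wiring and the catalog entry."""
    resp = requests.post(
        f"{PLATFORM_URL}/products",  # hypothetical route
        json=descriptor,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["catalog_url"]  # assumed response field
```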
  9. Federated computational governance ▪ Data products created by the different domains need to interoperate with each other and be combined to solve new needs ▪ e.g. to be joined, aggregated, correlated, etc. ▪ This requires agreement about the semantics of common entities (e.g. customer, product), about the formats of field types (e.g. SSNs, entity identifiers,...), about addressability of data APIs, etc. ▪ Managed globally and, when possible, automatically enforced ▪ This is why the word ‘computational’ is used in naming this concept ▪ Security must be enforced globally according to the applicable regulations and policies
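A small sketch of what ‘computational’ enforcement can look like: a check, run automatically by the platform, that validates records against globally agreed field formats. The patterns and field names below are illustrative, not a published standard.

```python
import re

# Globally agreed formats for common field types; the patterns here are
# illustrative choices made for this sketch.
GLOBAL_FIELD_FORMATS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "customer_id": re.compile(r"^CUST-\d{8}$"),
}

def validate_record(record: dict) -> list[str]:
    """Return governance violations for one record; an empty list means the
    record complies with the cross-domain format agreements."""
    violations = []
    for field_name, pattern in GLOBAL_FIELD_FORMATS.items():
        value = record.get(field_name)
        if value is not None and not pattern.match(value):
            violations.append(f"{field_name}={value!r} violates the global format")
    return violations

print(validate_record({"ssn": "123-45-6789", "customer_id": "CUST-00000042"}))  # []
```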
  10. Enabling a Data Mesh with Data Virtualization
  11. Easy creation of Data Products ▪ A modern DV tool like Denodo provides access to any underlying data system and advanced data modeling capabilities ▪ This allows domains to quickly create data products from any data source, or by combining multiple data sources, and to expose them in a business-friendly form ▪ No coding is required to define and evolve data products ▪ Iterating through multiple versions of the data products is also much faster thanks to reduced data replication ▪ Data products are automatically accessible via multiple technologies ▪ SQL, REST, OData, GraphQL and MDX
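As a hedged illustration of that polyglot access, the snippet below consumes a data product over REST; the base URL, filter syntax and response layout are assumptions, since the exact paths depend on the installation, and SQL clients would reach the same view through JDBC/ODBC with no extra work by the domain.

```python
import requests

# Hypothetical URL: the exact path depends on how the installation
# publishes the view (e.g. as a REST, OData or GraphQL service).
PRODUCT_URL = "https://denodo.example.com/rest/marketing/views/customer_360"

def fetch_customers(segment: str) -> list[dict]:
    """Fetch rows from the data product over HTTP, one of the several
    access technologies the same view is published through."""
    resp = requests.get(
        PRODUCT_URL,
        params={"$filter": f"segment eq '{segment}'"},  # OData-style filter (assumed)
        auth=("consumer_user", "secret"),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["elements"]  # assumed response layout
```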
  12. Maintains the Autonomy of Domains ▪ Domains are not constrained by centralized, company-wide data sources (data lake, data warehouse). Instead, they are allowed to leverage their own data sources ▪ E.g. domain-specific SaaS applications or data marts ▪ They can also leverage centralized stores when those are the best option: ▪ E.g. use a centralized data lake for ML use cases ▪ The domains can also autonomously decide to evolve their data infrastructure to suit their specific needs ▪ E.g. migrate some function to a SaaS application
  13. Provides self-serve capabilities ▪ Discoverability and documentation ▪ Includes a Data Catalog which allows business users and other data consumers to quickly discover, understand and get access to the data products ▪ Automatically generates documentation for the data products using standard formats such as OpenAPI ▪ Includes data lineage and change impact analysis functionalities for all data products ▪ Performance and flexibility ▪ Includes caching and query acceleration capabilities out of the box, so even data sources not optimized for analytics can be used to create data products ▪ Provisioning ▪ Automatic scaling using cloud/container technologies. This means that, when needed, the infrastructure supporting certain data products can be scaled up/down while still sharing common metadata across domains
  14. Enables Federated Computational Governance ▪ The semantic layers built in the virtual layer can enforce standardized data models to represent the federated entities which need to be consistent across domains (e.g. customer, products) ▪ Can import models from modeling tools to define a contract that the developer of the data product must comply with ▪ Automatically enforces unified security policies, including data masking/redaction ▪ E.g. automatically mask SSN with *** except the last 4 digits, in all data products, except for users in the HR role ▪ Data products can also be easily combined and can be used as a basis to create new data products ▪ The layered structure of virtual models allows creating components which can be reused by multiple domains to create their data products ▪ For instance, there may be virtual views for generic information about company locations, products,... ▪ Having a unified data delivery layer also makes it easier to automatically check and enforce other policies such as naming conventions or API security standards
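The SSN masking example from this slide can be sketched as follows; in practice the policy would be defined centrally and enforced across all data products rather than coded per product, so this Python version is only illustrative.

```python
def mask_ssn(ssn: str, user_roles: set[str]) -> str:
    """Illustration of the policy above: show the full SSN only to users in
    the HR role; everyone else sees all but the last 4 digits redacted."""
    if "hr" in user_roles:
        return ssn
    return "***-**-" + ssn[-4:]

print(mask_ssn("123-45-6789", {"analyst"}))  # ***-**-6789
print(mask_ssn("123-45-6789", {"hr"}))       # 123-45-6789
```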
  15. Implementation Strategy
  16. A Data Mesh in a Virtualization Cluster [architecture diagram: domain schemas (Common Domain, Event Management, Human Resources) sit on top of sources such as SQL and operational systems, an EDW, data lakes, files and SaaS APIs, and expose data products (Event, Product, Customer, Location, Employee) via REST, GraphQL and OData] 1. Each domain is given a separate virtual schema. A common domain may be useful to centralize data products shared across domains 2. Domains connect their data sources 3. Metadata is mapped to relational views. No data is replicated 4. Domains can model their Data Products. Products can be used to define other products 5. For execution, Products can be served directly from their sources, or replicated to a central location, like a lake 6. A central team can set guidelines and governance to ensure interoperability 7. Products can be accessed via SQL, or exposed as an API. No coding is required 8. Infrastructure can easily scale out in a cluster
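The eight steps above can be summarized in a self-contained toy model; every structure and function name here is illustrative, since the real steps are performed through Denodo's modeling and administration tools.

```python
# Self-contained toy model of the provisioning flow; all names invented.
mesh = {"schemas": {}, "products": {}}

def create_schema(name):                       # step 1: one virtual schema per domain
    mesh["schemas"][name] = {"sources": [], "views": {}}

def connect_source(schema, source_uri):        # step 2: domains plug in their sources
    mesh["schemas"][schema]["sources"].append(source_uri)

def import_view(schema, view, columns):        # step 3: metadata only, no data copied
    mesh["schemas"][schema]["views"][view] = {"columns": columns}

def publish_product(schema, view, endpoints):  # steps 4-7: model, govern, expose
    mesh["products"][f"{schema}.{view}"] = {"endpoints": endpoints}

create_schema("human_resources")
connect_source("human_resources", "jdbc:postgresql://hr-db/people")
import_view("human_resources", "employee", ["id", "name", "department"])
publish_product("human_resources", "employee", ["sql", "rest"])
print(mesh["products"])  # {'human_resources.employee': {'endpoints': ['sql', 'rest']}}
```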
  17. Isn’t a Data Lake Enough?
  18. A Data Lake Based Data Mesh ▪ Data Lake vendors claim that you can build a Data Mesh using the infrastructure of a Data Lake / Lakehouse ▪ This approach tries to introduce self-service capabilities in this infrastructure for domains to create their own data products based on data in the lake ▪ Domains may also have independent clusters/buckets for their products
  19. Challenges of that approach ▪ Many domains have specialized analytic systems they would like to use ▪ E.g. domain-specific data marts ▪ The data lake may not be the right engine for every workload in every domain ▪ Domains are forced to ingest their data into the lake and go through the whole process of creating and managing the required ingestion pipelines, ELT transformations, etc. using the data lake technology ▪ Data needs to be synchronized, pipelines operated, etc. ▪ This can be a slow process and, in addition, it forces domains to bring staff with those complex and scarce skills into the team ▪ If the domains are not able to acquire those skills, then they need to rely on the centralized team and we are back to square one
  20. How does DV improve on that? ▪ With DV, domains have the flexibility to reuse their own domain-specific data sources and infrastructure ▪ The flexibility to use domain-specific infrastructure has several advantages: 1. It allows domains to reuse and adapt the work they have already done to present data in formats close to the actual business needs. This will typically be much faster 2. The domain probably already has the required skills for this infrastructure 3. Domains can choose best-of-breed data sources which are especially suited to their data and processes ▪ Some domains can still choose to go through the data lake process for their products, but it does not force all domains to do it for all their products ▪ The virtual layer offers built-in ways to ingest data into the lake and keep it in sync ▪ In-lake or off-lake is a choice, not an imposition
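The "in-lake or off-lake" point can be pictured as a per-product decision; the sketch below uses invented attributes and thresholds purely to illustrate that materialization becomes an option rather than an obligation.

```python
def serve_from(product: dict) -> str:
    """Toy per-product decision: serve directly from the domain source, or
    materialize into the central lake and let the virtual layer keep it in
    sync. The attributes and threshold are invented for illustration."""
    heavy = product["avg_rows_scanned"] > 50_000_000
    analytical_source = product["source_kind"] in {"data_mart", "edw", "lakehouse"}
    if heavy and not analytical_source:
        return "materialize_in_lake"    # replication as an optimization choice
    return "serve_from_domain_source"   # no replication needed

print(serve_from({"avg_rows_scanned": 80_000_000, "source_kind": "saas_api"}))
# -> materialize_in_lake
```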
  21. Additional Benefits of a DV approach 1. Reusability: DV platforms include strong capabilities to create and manage rich, layered semantic models which foster reuse and expose data to each type of consumer in the form most suitable for them 2. Polyglot consumption: DV allows data consumers to access data using any technology, not only SQL. For instance, self-describing REST, GraphQL and OData APIs can be created with a single click. Multidimensional access based on MDX is also possible 3. Top-down modelling: you can create ‘interface data views’ which set ‘schema contracts’ that developers of data products need to comply with. This helps implement the concept of federated computational governance 4. Data marketplace: a ready-to-use data catalog which can act as a data marketplace for the data products created by the different domains 5. Broad access: even in companies that have built a company-wide, centralized data lake, there is typically a lot of domain-specific data that is not in the lake. DV allows all of that data to be incorporated into the data products
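A ‘schema contract’ check for top-down modelling (item 3) might look like the following sketch; the contract contents and the compliance rule are assumptions made for illustration.

```python
# Illustrative contract fixed top-down by an 'interface data view';
# column names and types are invented for this sketch.
INTERFACE_CUSTOMER = {
    "customer_id": "varchar",
    "full_name": "varchar",
    "segment": "varchar",
}

def complies(product_schema: dict) -> bool:
    """A product complies if it provides every contracted column with the
    contracted type (extra columns are tolerated in this sketch)."""
    return all(product_schema.get(col) == typ
               for col, typ in INTERFACE_CUSTOMER.items())

print(complies({"customer_id": "varchar", "full_name": "varchar",
                "segment": "varchar", "country": "varchar"}))   # True
print(complies({"customer_id": "integer", "full_name": "varchar"}))  # False
```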
  22. Conclusions
  23. Conclusions 1. Data Mesh is a new paradigm for data management and analytics ▪ It shifts responsibilities towards domains and their data products ▪ Aiming to reduce bottlenecks, improve speed, and guarantee quality 2. Data lakes alone fail to provide all the pieces required for this shift 3. Data Virtualization tools like Denodo offer a solid foundation to implement this new paradigm ▪ A gentle learning curve so that domains can use it ▪ Can leverage domain infrastructure or direct domains towards a centralized repository ▪ Simple yet advanced graphical modeling tools to define new products ▪ Full governance and security controls
  24. © Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies. Thank You!