Watch full webinar here: https://bit.ly/3rwWhyv
The Data Mesh architectural design was first proposed in 2019 by Zhamak Dehghani, principal technology consultant at Thoughtworks, a technology company closely associated with the development of distributed agile methodology. A data mesh is a distributed, decentralized data infrastructure in which multiple autonomous domains manage and expose their own data, called “data products,” to the rest of the organization.
Organizations adopt data mesh architecture when they experience shortcomings in highly centralized architectures, such as the lack of domain-specific expertise in data teams, the inflexibility of centralized data repositories in meeting the specific needs of different departments within large organizations, and the slowness of centralized data infrastructures in provisioning data and responding to changes.
In this session, Pablo Alvarez, Global Director of Product Management at Denodo, explains how data virtualization is your best bet for implementing an effective data mesh architecture.
You will learn:
- How data mesh architecture not only enables better performance and agility, but also self-service data access
- The requirements for “data products” in the data mesh world, and how data virtualization supports them
- How data virtualization enables domains in a data mesh to be truly autonomous
- Why a data lake is not automatically a data mesh
- How to implement a simple, functional data mesh architecture using data virtualization
1. Enabling a Data Mesh Architecture with Data Virtualization
2. #DenodoDataFest
A Data Mesh Enabled by Data Virtualization
Creating a self-service platform
Global Director of Product Management, Denodo
Pablo Alvarez-Yanez
3. Agenda
1. What is a Data Mesh
2. What is Data Virtualization (DV)
3. How can DV Enable a Data Mesh
4. Implementation Strategies
5. Why a Data Lake alone is not Enough
5. What is a Data Mesh
▪ The Data Mesh is a new architectural paradigm for data
management
▪ Proposed by the consultant Zhamak Dehghani in 2019
▪ It moves from a centralized data infrastructure managed by a
single team to a distributed organization
▪ Several autonomous units (domains) are in charge of
managing and exposing their own “Data Products” to the rest
of the organization
▪ Data Products should be easily discoverable, understandable
and accessible to the rest of the organization
6. What Challenges is a Data Mesh Trying to Address?
1. Lack of domain expertise in centralized data teams
▪ Centralized data teams are disconnected from the business
▪ They need to deal with data and business needs they do not always
understand
2. Lack of flexibility of centralized data repositories
▪ Data infrastructure of big organizations is very diverse and changes
frequently
▪ Modern analytics needs may be too diverse to be addressed by a single
platform: one size never fits all.
3. Slow data provisioning and response to changes
▪ Requires extracting, ingesting and synchronizing data in the centralized
platform
▪ Centralized IT becomes a bottleneck
7. How?
• Organizational units (domains) are responsible for managing and
exposing their own data
• Domains understand better how the data they own should be processed
and used
• Gives them autonomy to use the best tools to deal with their data, and
to evolve them when needed
• Results in shorter and fewer iterations until business needs are met
• Removes dependency on fully centralized data infrastructures
• Removes bottlenecks and accelerates changes
• Introduces new concepts to address risks like creating data silos,
duplicated effort and lack of unified governance
• Will be explored in the following slides
8. Data as a Product
▪ To ensure that domains do not become isolated data silos,
the data exposed by the different domains must be:
▪ Easily discoverable
▪ Understandable
▪ Secured
▪ Usable by other domains
▪ The level of trust and quality of each dataset needs to be
clear
▪ The processes and pipelines used to generate the product (e.g.
cleansing and deduplication) are internal implementation
details, hidden from consumers
9. Self-serve Data Platform
▪ Building, securing, deploying, monitoring and managing data
products can be complex
▪ Not all domains will have resources to build this infrastructure
▪ Possible duplication of effort across domains
▪ Self-Serve: while operated by a global data infrastructure team, it
allows the domains to create and manage the data products
themselves
▪ The platform should be able to automate or simplify tasks such as:
▪ Data integration and transformation
▪ Security policies and identity management
▪ Exposure of data APIs
▪ Publish and document in a global catalog
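The deck does not show Denodo's actual catalog APIs, so as a minimal, hypothetical sketch of what "publish and document in a global catalog" could look like, a self-serve platform might register a small descriptor per data product (all names, fields, and URLs here are illustrative):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DataProduct:
    """Minimal descriptor a self-serve platform might register in a global catalog."""
    name: str
    domain: str
    description: str
    endpoints: list = field(default_factory=list)  # e.g. SQL, REST, GraphQL access points
    owner: str = ""

def register(catalog: dict, product: DataProduct) -> None:
    # Publish the product under a globally unique key: <domain>.<name>
    catalog[f"{product.domain}.{product.name}"] = asdict(product)

catalog = {}
register(catalog, DataProduct(
    name="customer_360",
    domain="sales",
    description="Unified customer view combining CRM and billing data",
    endpoints=["sql://denodo/sales/customer_360",
               "https://api.example.com/sales/customer_360"],
    owner="sales-data-team",
))
```

Keying products by domain plus name keeps discovery global while ownership stays with each domain.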
10. Federated Computational Governance
▪ Data products created by the different domains need to
interoperate with each other and be combined to solve new needs
▪ e.g. to be joined, aggregated, correlated, etc.
▪ This requires agreement about the semantics of common entities
(e.g. customer, product), about the formats of field types (e.g. SSNs,
entity identifiers,...), about addressability of data APIs, etc.
▪ Managed globally and, when possible, automatically enforced
▪ This is why the word ‘computational’ is used in naming this concept
▪ Security must be enforced globally according to the applicable
regulations and policies.
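The agreements above about field formats can be enforced computationally, which is the point of the word "computational". A minimal Python sketch, with hypothetical format rules, of checking records against globally agreed field types:

```python
import re

# Hypothetical, globally agreed format rules for shared field types
FORMAT_RULES = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "customer_id": re.compile(r"^CUST-\d{8}$"),
}

def validate_record(record: dict) -> list:
    """Return the names of fields that violate the global format rules."""
    return [
        name for name, pattern in FORMAT_RULES.items()
        if name in record and not pattern.match(str(record[name]))
    ]

# "CUST-1" violates the agreed customer-identifier format
violations = validate_record({"ssn": "123-45-6789", "customer_id": "CUST-1"})
```

A real platform would run such checks automatically when a data product is published or changed, rather than per record in application code.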
12. Easy Creation of Data Products
▪ A modern DV tool like Denodo provides access to any
underlying data system and offers advanced data
modeling capabilities
▪ This allows domains to quickly create data products from
any data source, or from a combination of sources, and
to expose them in a business-friendly form
▪ No coding is required to define and evolve data products
▪ Iterating through multiple versions of the Data Products
is also much faster thanks to reduced data replication
▪ Data products are automatically accessible via multiple
technologies
▪ SQL, REST, OData, GraphQL and MDX.
13. Maintains the Autonomy of Domains
▪ Domains are not conditioned by centralized, company-wide data sources (data lake,
data warehouse). Instead, they are allowed to leverage their own data sources
▪ E.g. Domain-specific SaaS applications or data marts
▪ They can also leverage centralized stores when they are the best option:
▪ E.g. use centralized data lake for ML use cases
▪ The domains can also autonomously decide to evolve their data infrastructure to
suit their specific needs
▪ E.g. migrate some function to a SaaS application
14. Provides Self-serve Capabilities
▪ Discoverability and documentation
▪ Includes a Data Catalog which allows business users and other data consumers to quickly discover,
understand and get access to the data products.
▪ Automatically generates documentation for the data products using standard formats such as
OpenAPI
▪ Includes data lineage and change impact analysis functionalities for all data products
▪ Performance and Flexibility
▪ Includes caching and query acceleration capabilities out of the box, so even data sources not optimized for
analytics can be used to create data products.
▪ Provisioning
▪ Automatic scaling using cloud/container technologies: when needed, the
infrastructure supporting certain data products can be scaled up or down while still sharing common
metadata across domains.
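Real DV caching involves configurable refresh and invalidation policies; as a rough illustration of the idea only (serving repeated queries from a cache instead of hitting a slow operational source each time), a simple memoization sketch:

```python
import functools

@functools.lru_cache(maxsize=128)
def run_query(sql: str) -> tuple:
    # Stand-in for dispatching the query to a slow underlying source (hypothetical);
    # a real DV engine would execute it against the mapped data source.
    return ("row1", "row2")

first = run_query("SELECT * FROM sales")
second = run_query("SELECT * FROM sales")  # identical query: served from cache
```

The second call never reaches the "source": `run_query.cache_info().hits` goes up instead. Query acceleration in a DV platform generalizes this with persisted summaries and partial-result reuse.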
15. Enables Federated Computational Governance
▪ The semantic layers built in the virtual layer can enforce standardized data models to represent the
federated entities which need to be consistent across domains (e.g. customer, products).
▪ Can import models from modeling tools to define a contract that the developer of the data product must
comply with
▪ Automatically enforces unified security policies, including data masking/redaction
▪ E.g. automatically mask SSN with *** except last 4 digits, in all data products except for users in the HR role
▪ Data products can also be easily combined and can be used as a basis to create new data products.
▪ The layered structure of virtual models allows creating components which can be reused by multiple domains
to create their data products.
▪ For instance, there may be virtual views for generic information about company locations, products,...
▪ Having a unified data delivery layer also makes it easier to automatically check and enforce other
policies such as naming conventions or API security standards
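The masking policy given as an example on this slide (mask SSNs except the last four digits, except for users in the HR role) can be sketched in a few lines; the role name and rule are illustrative, and in a DV platform such policies are configured declaratively rather than coded:

```python
def mask_ssn(ssn: str, role: str) -> str:
    """Mask an SSN except the last 4 digits for every role except HR.

    Mirrors the policy example from the slide; role names are illustrative.
    """
    if role == "HR":
        return ssn  # HR users see the full value
    return "***-**-" + ssn[-4:]

masked = mask_ssn("123-45-6789", role="analyst")  # "***-**-6789"
```

Because the policy lives in the unified delivery layer, it applies to every data product automatically, regardless of which source the SSN comes from.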
17. A Data Mesh in a Virtualization Cluster
(Diagram: source systems, including operational databases, an EDW, data lakes, files, and SaaS APIs, feed domain schemas such as Common Domain, Event Management, and Human Resources; data products like Event, Product, Customer, Location, and Employee are exposed via SQL, REST, GraphQL, and OData.)
1. Each domain is given a separate virtual schema. A common domain may be useful to centralize data products shared across domains.
2. Domains connect their data sources.
3. Metadata is mapped to relational views. No data is replicated.
4. Domains can model their data products. Products can be used to define other products.
5. For execution, products can be served directly from their sources, or replicated to a central location, like a lake.
6. A central team can set guidelines and governance to ensure interoperability.
7. Products can be accessed via SQL, or exposed as an API. No coding is required.
8. Infrastructure can easily scale out in a cluster.
19. A Data Lake-Based Data Mesh
▪ Data Lake vendors claim that you can build a Data Mesh using the
infrastructure of a Data Lake / Lakehouse
▪ This approach tries to introduce self-service capabilities in this
infrastructure for domains to create their own data products based on
data in the lake
▪ Domains may also have independent clusters/buckets for their products
20. Challenges of That Approach
▪ Many domains have specialized analytic systems they would like to use
▪ e.g. domain-specific data marts
▪ The data lake may not be the right engine for every workload in every domain
▪ Domains are forced to ingest their data in the lake and go through all the process of
creating and managing the required ingestion pipelines, ELT transformations, etc. using the
data lake technology
▪ Data needs to be synchronized, pipelines operated, etc.
▪ This can be a slow process and, in addition, it forces domains to bring staff with those
complex and scarce skills onto the team
▪ If the domains cannot acquire those skills, they must rely on the centralized team, and
we are back to square one
21. How Does DV Improve on That?
▪ With DV, domains have the flexibility to reuse their own domain-specific data sources and
infrastructure
▪ The flexibility to use domain specific infrastructure has several advantages:
1. It allows domains to reuse and adapt the work they have already done to present data in
formats close to the actual business needs. This will typically be much faster
2. The domain probably has the required skills for this infrastructure
3. Domains can choose best-of-breed data sources which are especially suited for their data
and processes
▪ Some domains can still choose to go through the data lake process for their products, but it
does not force all domains to do it for all their products
▪ The virtual layer offers built-in ways to ingest data into the lake and keep it in sync
▪ In-lake or off-lake is a choice, not an imposition
22. Additional Benefits of a DV Approach
1. Reusability: DV platforms include strong capabilities to create and manage rich, layered semantic
models which foster reuse and expose data to each type of consumer in the form most suitable for
them
2. Polyglot consumption: DV allows data consumers to access data using any technology, not only
SQL. For instance, self-describing REST, GraphQL and OData APIs can be created with a single
click. Multidimensional access based on MDX is also possible
3. Top-down modelling: you can create ‘interface data views’ which set ‘schema contracts’ that
developers of data products must comply with.
1. This helps to implement the concept of federated computational governance.
4. Data marketplace: Ready-to-use data catalog which can act as a data marketplace for the data
products created by the different domains
5. Broad access: Even in companies that have built a company-wide, centralized data lake, there is
typically a lot of domain-specific data that is not in the lake. DV allows incorporating all that
company-global data in the data products
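The ‘schema contract’ idea in point 3 can be illustrated with a small check that a data product's rows match a declared interface; the contract fields and types here are hypothetical:

```python
# Hypothetical 'interface data view' contract: expected columns and Python types
CONTRACT = {"customer_id": str, "name": str, "lifetime_value": float}

def complies(rows: list, contract: dict = CONTRACT) -> bool:
    """Check every row exposes exactly the contracted columns with the right types."""
    return all(
        set(row) == set(contract)
        and all(isinstance(row[col], typ) for col, typ in contract.items())
        for row in rows
    )

ok = complies([{"customer_id": "C1", "name": "Acme", "lifetime_value": 1250.0}])
bad = complies([{"customer_id": "C1", "name": "Acme"}])  # missing a contracted column
```

In a DV platform the contract would be an interface view imported from a modeling tool, and conformance would be verified at design time rather than by inspecting rows at runtime.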
24. Conclusions
1. Data Mesh is a new paradigm for data management and analytics
▪ It shifts responsibilities towards domains and their data products
▪ Trying to reduce bottlenecks, improve speed, and guarantee quality
2. Data lakes alone fail to provide all the pieces required for this shift
3. Data Virtualization tools like Denodo offer a solid foundation to implement this
new paradigm
▪ Easy learning curve so that domains can use it
▪ Can leverage domain infrastructure or direct domains towards a centralized repository
▪ Simple yet advanced graphical modeling tools to define new products
▪ Full governance and security controls