Learn tips and tricks for handling Data Modelling in your Big Data environment. Mark will show how modelling adds value to the business and how to make your Big Data landscape transparent across the organization.
You will see the latest modelling techniques for Big Data and different types of modelling notation. You will also learn how to integrate Data Modelling into your BI environment.
2. Introduction
• My name is Mark Barringer (mark.Barringer@embarcadero.com)
• Product Manager (Data Architecture Tools), Embarcadero
• I live in the beautiful city of Winchester in southern England
3. Agenda and Objectives…
• Data Challenges: what are the modern-day challenges for the Data Architect?
• Big Data Landscape: how to understand and view the Big Data landscape.
• Adding Value to the Business: making the Big Data landscape transparent to the business, and how to add real value.
• Data Modelling Techniques: a look at some of the latest Data Modelling tips and tricks applied to the Big Data environment.
4. Challenges facing Data Architecture
Federation
Data Democratisation
Platform Fragmentation
Data Lineage
Latency
Delivery
Obfuscation
11. Challenges facing Data Architecture
Federation: The application of a single view over multiple repositories.
Data Democratisation: The expectation of the business to have more control over the data assets.
Platform Fragmentation: The proliferation of non-RDBMS solutions to store data.
Data Lineage: The expectation to understand actions performed on the data.
Latency: The trend towards lower end-to-end latency of data (from creation to reporting).
Delivery: Model development in step with development teams.
Obfuscation: The expectation that the business can view and understand data models.
12. Why Data Modelling is Essential…
• Modelling the Business
• Understand the Landscape
• Self-Documenting Business
• Heterogeneous & Big Data Physical Models
• Data Modelling Techniques
• Agile Development
13. Why Data Modelling is Essential…: Business Modelling
• Meaningful abstracted view of the business
• Data-centric perspective
• 'Anchor point' for other models
• Key to successful communication
• Develop credibility and relevance with the business
• Establish Business Glossaries with consistent definitions
• Build a solid foundation for Compliance, Data Governance and Master Data Management
• Improve visibility and collaboration with ER/Studio
15. Who are the Data customers and collaborators?
• DA (Data Architect)
• Technical Collaborators (DBA, ETL, SA): metadata consumers
• Data Analysts (Finance, Credit, Mktg.): data consumers
• Business Users: information producers
16. Benefit of Relating Metadata to Models
• Expand the depth of information by accessing the underlying framework
• Models and terms seamlessly integrate with one another
17. Why Data Modelling is Essential…: Understand the Landscape
• Create Landscape Inventories
• Reverse Engineer
• Overcome Information Obscurity
• Eliminate Data Silos
• Contain much of the detailed metadata
• Useful for impact and gap analysis
• Platform agnostic
• Map to concepts in the Conceptual Data Model and physical objects
19. Big Data & NoSQL: The Challenge
• Capture new data sources and increase the information management footprint
• Understanding semi-structured and unstructured data
• "Raw data is the single source of the truth"
• The 7 Vs:
  – Velocity, Volume, Variety
  – Veracity (conformity of facts: data in doubt)
  – Variable, Virtual, Value
• Reverse & Forward Engineering (JSON, BSON)
• Forward & Reverse Engineer DDL
• "We will hopefully find what we didn't know about that we didn't know that we didn't know about"
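The reverse-engineering bullet above (JSON, BSON) can be made concrete with a small sketch. This is not ER/Studio's actual algorithm, just a minimal illustration in Python of inferring a schema from one parsed JSON document; the `infer_schema` function and the sample document are invented for this example:

```python
import json

def infer_schema(value):
    """Recursively build a simple type descriptor for a parsed JSON value."""
    if isinstance(value, dict):
        # Nested object: describe each key in turn
        return {key: infer_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        # Array: describe it by the schema of its first element
        return ["array", infer_schema(value[0])] if value else ["array", "empty"]
    return type(value).__name__  # 'str', 'int', 'float', 'bool', 'NoneType'

doc = json.loads('{"name": "Ada", "orders": [{"sku": "A1", "qty": 2}]}')
schema = infer_schema(doc)
# schema == {'name': 'str', 'orders': ['array', {'sku': 'str', 'qty': 'int'}]}
```

A real tool would merge schemas across many documents, since any single document may omit optional fields.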
20. Eliminate Data Silos: Inventory existing databases
• What type of data do you own, and where can it be found?
• Map your data landscape using data models as the foundation.
  – Each model represents a different database system
  – Link like data elements together for traceability
21. Why Data Modelling is Essential…: Physical Modelling for Big Data
• Accurately model all types of data at rest within the organization, not just RDBMS-resident data
• Document physical metadata (tablespaces, partitions, etc.)
• Introduce non-RDBMS data stores, e.g. NoSQL, JSON, Hive
• Build many physical models based on business decomposition
• Reverse Engineering
22. Traditional (RDBMS) Prescriptive Data Modelling
Flow: MODEL (and DESIGN) → LOAD → EXPLORE/QUERY the data.
"Schema on write": good for Known Unknowns (repetition).
23. Big Data (NoSQL/Hadoop) Descriptive Data Modelling
Flow: LOAD → QUERY → MODEL, exploring over the NoSQL store. But fast and agile!
"Schema on read": good for Unknown Unknowns (exploration).
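The contrast between these two slides can be sketched in a few lines of Python. The CSV data, function names and column layout here are invented for illustration; the point is only where the schema is applied:

```python
import csv, io

RAW = "2024-01-05,alice,42\n2024-01-06,bob,oops\n"  # second amount is malformed

# Schema on write (RDBMS style): types are enforced at load time,
# so bad rows are rejected before any data lands in the store.
def load_validated(raw):
    return [(date, user, int(amount))  # raises ValueError on 'oops'
            for date, user, amount in csv.reader(io.StringIO(raw))]

# Schema on read (NoSQL/Hadoop style): the raw text is stored as-is,
# and a schema is applied only when a query scans it.
def query_amounts(raw):
    for date, user, amount in csv.reader(io.StringIO(raw)):
        try:
            yield user, int(amount)
        except ValueError:
            pass  # malformed values only surface at read time

results = list(query_amounts(RAW))  # [('alice', 42)]
```

Loading the same raw text through `load_validated` fails up front, which is exactly the trade-off the slides describe: repetition favours schema on write, exploration favours schema on read.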
24. Data Modelling in Big Data
A NoSQL database holds Customer and Product documents and key-value pairs. Two levels of model describe it:
• Conceptual/business data model: understanding
• Logical/physical data model: architecture/design
The data may then transfer into a structured relational database (i.e., a data warehouse/data mart), using the models.
25. Why Data Modelling is Essential…: Timely Design
• Ensure changes to the physical data models are in step with, and relevant to, the development methodology used in the organization.
• Where modelling meets development.
• Create credibility and relevance with the development teams.
• User Stories, Tasks and Change Management.
26. Agile Change Management
• Enable the "Agile Data Modeler"
  – Incremental rather than waterfall
• Need more granularity than named versions of a model or submodel
• Change numbers at Repository check-in
• Can be associated with user stories and tasks
27. Why Data Modelling is Essential…: Data Modelling Techniques
• Sub-Modelling
• Business Decomposition
• Visual Data Lineage
• Impact Analysis / Where Used
• Naming Standards
• Data Source Mapping
• Universal Mapping
• Augmented Metadata
• Glossary Integration
28. Big Data Notation Enhancement
• Physical Model: objects instead of tables
• Nested Objects: an "is contained in" relationship type
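As a hedged sketch of what the "is contained in" notation captures, the following Python walks a document and emits a containment edge for each nested object or array of nested objects. The function and the sample `patron` document are invented for illustration:

```python
def containment_edges(obj, parent):
    """Yield (container, contained, is_array) for each nested object."""
    for key, value in obj.items():
        if isinstance(value, dict):
            yield (parent, key, False)           # single nested object
            yield from containment_edges(value, key)
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            yield (parent, key, True)            # array of nested objects
            yield from containment_edges(value[0], key)

patron = {"name": "Ada", "address": [{"street": "High St", "city": "Winchester"}]}
edges = list(containment_edges(patron, "patron"))
# edges == [('patron', 'address', True)]
```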
34. Technique: Glossary Integration
• Associate Data Architect objects with Business Glossary terms:
  – Model, submodel
  – Entity, Table
  – Attribute, Column
  – Domain
  – View
• Push terms to the glossary
35. Why Data Modelling is Essential…: Business Metadata
• Provide the business with the ability to centrally manage its own metadata, in terms of definitions, rules and relationships, in a structured and curated manner.
• Facilitate the binding of business elements to technical elements within the models and other documentation.
• Data Dictionary
• Self-Service Discovery and Reporting
36. Providing Business Context
A taxonomy of searchable terms mapped to unique concepts. Glossaries (R&D, Clinical, Supply Chain) hold business terms such as "Patient Recruitment Data" (mapped to an entity) and "Batch Supply Data" (mapped to an attribute), which link to tables and columns in physical models and data sources, logical diagrams, conceptual & process diagrams, and discussion threads.
37. Why Data Modelling is Essential…
Holistic, integrated modelling can present the same metadata to different audiences in the most appropriate format.
The single most important challenge to overcome is that of communication and collaboration, and to build trust in data.
Without the ability to communicate effectively to a wide variety of audiences, even the most diligently documented organisation will be unable to benefit from its documentation.
39. Win a FitBit Charge HR
Leave a business card at the Barnsten / Embarcadero stand.
Raymond Horsten (r.horsten@barnsten.com)
Mark Barringer (mark.barringer@embarcadero.com)
41. Building Trust in Data: Collaboration
Syndication, governance and collaboration: technical metadata and business metadata are held in a central Metadata Repository (Team Server), serving data modelling, web, architecture and business audiences, with integrated tooling for SDLC & Information Management across the enterprise data landscape.
Editor's notes
Data Modelling as it was...
Conceptual Data Model: this is a concept that data modellers rave about but few others understand. Many organisations don't have one and are quite happy; some have one but don't use it and consider it a waste of money. This leaves a (proportionally) very small group of organisations that have a CDM and use it.
Most organisations have a large, monolithic Enterprise Data Warehouse (EDW); this acts as a focus for most of the modelling activity in an organisation.
It can be argued that the EDW is a de facto logical data model of the organisation.
Whilst the EDW serves a very useful purpose in most organisations, it does have limitations, many of which have been brought into sharp relief in recent years.
It is usually seen as an IT-function black hole in terms of resources and requirements.
DA = Data Architect
DBA = Database Administrator
ETL = Extract, Transform, Load developer
SA = System Analyst
As business analysts and data analysts seek to become stronger bridges between business and IT, they become power users of data management tools and would need access to business definitions in data management tools. ER/Studio Team Server helps to expand the circle of data comprehension.
When new versions of DBPS come out, they will be well integrated with ER/Studio Team Server. So DBAs would be interested in ER/Studio Team Server as well
Introduce the first benefit to the industry; pose pain points; introduce Rob to demo and then field questions/input.
So why can't organizations make more effective use of information? In short, it's information obscurity. You see, enterprise data isn't just BIG, it's complex. Our enterprise customers have hundreds of systems and hundreds of thousands of data elements. If Sales says a customer is anyone they're calling on, Finance says it's anyone who's paid us money, and Support says it's anyone with paid-up maintenance, who's really a customer? If customer data is in a hundred different systems, if it's escaped the data center on mobile devices or migrated to the cloud, where do I go to find the right data, and how do I interpret it? It's no surprise that most organizations can't leverage all of their data; in many cases, users can't even find it.
Transition: many organizations are turning to data governance as a way forward.
Big data & NoSQL provide huge potential to explore, understand and gain value from data sets that are either too large, complex or diverse to fit into traditional database management systems.
They enable you to capture new types of semi-structured or unstructured data sources in their raw format, with the goal of providing the raw data as the single source of the truth.
Another reason why Big Data & NoSQL platforms have become so popular is the low price and high performance they can provide in comparison with traditional databases, especially when having to store huge amounts of data.
However, this comes with the additional challenge of managing and understanding all of this data. You may be aware of the 3 Vs (Velocity, Volume and Variety), but recently 4 more have been added:
Veracity: regarding the certainty of the data.
Variable: data can have variable schemas, and variable ways of interpreting the same data, e.g. from the customer perspective or the vendor perspective. (This leads to schema on read, so that you have a different emphasis when reading.)
Virtual: virtualization of the data source.
Value: most important is extracting value from the data.
Managing this big data effectively will hopefully lead us to uncover the Unknown Unknowns about the data. Or, put differently, "We will hopefully find what we didn't know about that we didn't know that we didn't know about."
The Unknown Unknowns are potentially big opportunities that fall far outside the day-to-day of your business. These include things like:
Market segments you haven't discovered yet
Features that people love
Product innovations
The "why?" behind customer behaviours
But in order to manage all of this data, you will first need to understand the schema governing or describing the data. I would like to look briefly at some of the schema differences when modelling Big Data & NoSQL platforms vs traditional databases.
---------
Hadoop and MongoDB are the enterprise Big Data leaders:
Hadoop: cost-efficient staging and ETL of very large data sets.
MongoDB: customer/user profile management for large-scale web sites.
Greenplum: innovative MPP analytic database technology, now part of EMC Pivotal Labs.
In a big data environment, data is usually stored in a NoSQL database, meaning a non-relational data store in which no relationships exist. Instead, categories are built on top of the unstructured data, and the data is analyzed using various tools (for example, Hive in the Hadoop environment). NoSQL ("not only SQL") environments such as Hadoop are "schema on read" versus "schema on write". For example, Hadoop uses HDFS for its file structure, and it is "schema on read", meaning that we don't need to define the data structure before loading the data (which we would need to do if it were a traditional "schema on write" store such as a relational data warehouse).
The pros of this approach:
It is not necessary to define the structure up front, so there is great flexibility in how the data (structured and especially semi-structured) can be stored, queried and used.
It promotes experimentation.
Getting things "wrong" carries a very low cost (since it was only experimental).
It is an agile approach, since it speeds up the time to have the data available, versus having to first model, develop ETL, perform data quality checks, etc.
The cons of this approach:
Expensive, because compute resources need to be high.
It is not self-documenting.
You have to create the jobs that apply the schema on read.
Data modeling may take place after data is analyzed in order to:
Understand the data
Provide a design for transfer into a relational data store, if needed and decided upon
Investigative computing is an environment where we can experiment. Once we have decided that some of the information is useful and can be used in production, we may decide to transfer it into a structured environment (EDW), and this is where data modeling is needed.
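The transfer from the NoSQL store into a structured environment described above can be sketched briefly. The `flatten_orders` function and its field names are invented, not from any tool; the point is that child objects become rows keyed back to their parent:

```python
def flatten_orders(customer_doc):
    """Split a nested customer document into rows for two relational tables."""
    customer_row = {"customer_id": customer_doc["id"], "name": customer_doc["name"]}
    order_rows = [  # each nested order becomes a row referencing the customer
        {"customer_id": customer_doc["id"], "sku": o["sku"], "qty": o["qty"]}
        for o in customer_doc.get("orders", [])
    ]
    return customer_row, order_rows

doc = {"id": 7, "name": "Ada",
       "orders": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]}
cust, orders = flatten_orders(doc)
```

The logical/physical model is what tells you which nested paths become tables and which become columns.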
Many organizations have struggled making the transition from traditional "waterfall" data modeling to more responsive and iterative agile approaches. An important aspect of this is granular change management, enabling checkout/checkin of only those objects required for a specific modification, rather than a full model or sub-model. Just as important is knowing "why" the changes were made. Therefore, object checkin/checkout can now be associated with a specific task or "user story", which is a practice agile developers have been using for years. Knowing the "why" is also extremely important from a data governance perspective.
Because of Big Data, we have had to enhance our notations to accommodate new types of physical models.
We are now using Objects instead of Tables in the Physical Models.
The big thing with the Big Data stores is that we can have nested objects in those structures.
We have introduced a new relationship type that only shows on Big Data platforms: the Is Contained In relationship type.
And we'll see shortly in the demo that we can handle nested objects, and nested arrays of objects, using these notations.
In the diagram, the "diamond on one end, with cardinality on the other end" corresponds to the Is Contained In relationship type. We have borrowed that notation convention from UML, so those who are familiar with that notation should have an easy transition to our tool.
This is the JSON code for a couple of the collections in MongoDB. You may be familiar with JSON: it contains objects, called collections in MongoDB, which usually contain nested objects. These objects typically contain key-value pairs.
++ Show SLIDE: Containment Relationship: Array of Nested Objects
* On the LEFT side you see PATRON: if you look down you'll see ADDRESS there. In the diagram, the relationship line from PATRON to ADDRESS has the star; we can see that there are 2 addresses referenced in there.
=> You will also notice that after ADDRESS there is a square bracket "[", and that kicks off the ARRAY of NESTED OBJECTS. When reverse engineering, that "[" indicates to the tool that this is an ARRAY of NESTED OBJECTS.
* On the RIGHT side you see BOOK: notice that CHECKOUT has the "[" after the colon (:). Although there is only 1 instance of checkout there, the syntax indicates that this is an ARRAY.
=> Moving to the next slide, on single nested objects.
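The PATRON and BOOK collections above can be reconstructed roughly as follows; the field values are invented, but the shape matches what the notes describe, with the "[" marking an array of nested objects even when it holds a single element:

```python
import json

patron = json.loads("""
{"name": "Ada", "address": [
    {"street": "1 High St", "city": "Winchester"},
    {"street": "2 Low Rd",  "city": "Andover"}
]}
""")
book = json.loads('{"title": "Moby Dick", "checkout": [{"by": "Ada", "date": "2016-01-05"}]}')

# Two addresses: clearly an array of nested objects.
assert isinstance(patron["address"], list) and len(patron["address"]) == 2
# One checkout, but the "[" still marks it as an array on reverse engineering.
assert isinstance(book["checkout"], list) and len(book["checkout"]) == 1
```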
In Hive you can reverse engineer and forward engineer the DDL; it looks like another DB platform. You just click on the DDL tab, and we'll see what this looks like in the tool in a few minutes.
That was a quick overview, but it is much more fun to see it in action, so let's move over to the demo.
---
Hive is a SQL-like query language in Hadoop. Even though the data is not stored in tables (it is normally text, e.g. comma-delimited), Hive applies a schema to it.
Note that attribute names in Hive would not make a data modeller happy (e.g. n1 int, mars2 int, prep int, etc.). It is almost like a non-materialized view on top of a text file.
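The "non-materialized view on top of a text file" idea can be sketched directly: the file stays as untyped, comma-delimited text, and the column names and types (taken from the n1/mars2/prep example above) are applied only when the lines are scanned. This illustrates the concept, not Hive's implementation:

```python
# Hive-style schema on read: the schema lives beside the data, not in it.
SCHEMA = [("n1", int), ("mars2", int), ("prep", int)]

def scan(lines, schema=SCHEMA):
    """Apply the column schema to raw comma-delimited lines at read time."""
    for line in lines:
        fields = line.strip().split(",")
        yield {name: cast(raw) for (name, cast), raw in zip(schema, fields)}

rows = list(scan(["1,2,3", "4,5,6"]))
# rows[0] == {'n1': 1, 'mars2': 2, 'prep': 3}
```

Changing SCHEMA re-types the same file without rewriting it, which is why exploration is cheap under schema on read.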
Naming Standards Automation: Currently we invoke our naming standards manually, applying them to one submodel at a time. Automatic naming standards will allow us to bind a naming-standards template to data model objects such as entities/tables and attributes/columns. The typical use case would be to have the physical name change in place as we are editing the logical name (that's how ERwin does it, in a far less elegant fashion, through their macro formulas). We will also be able to apply physical-to-logical mapping (the reverse direction) if that is desired. Standards can be attached to individual objects, or defaulted at the sub-model level.
We are adding the capability for real-time integration between the Data Architect tool and the glossaries/terms in Team Server. If a defined glossary term is used in naming or defining major model objects, those terms are automatically highlighted, with the capability to hyperlink and/or copy the term definition to the model object. Mousing over the highlighted term will show the definition from the Team Server glossary.
We will be extending this capability further later this year to include glossary integration with process artifacts in the Business Architect tool as well.
We have a broad portfolio of products spanning Data Modeling, Database Management and Application Development.
We often refer to the Design, Develop and Deliver areas.
On the left-hand side of the diagram we can see the Data Architecture products that we will look at today, together with the Database Administrator and Database Developer products at the top and right, which we will cover in the DB PowerStudio refresher next week. At the centre, we have the Team Server Core, which ties all of these together.
On the right of the slide, you can see some of the many database platforms that we support, and this list is constantly expanding.
Metadata Governance and Syndication
Applying the principles of governance to collaborative metadata authoring.
Unleashing metadata by delivering it into core SDLC and Information Management workflows.
Value Propositions:
Data Architect → Peer Architects: reuse, consistency, coordination.
Data Architect → Business Analyst, Developer: review and approval, coordination.
Steward, Data SME, Governance Team → Metadata: collaborative authoring, business review and approval.
Information Management Professionals → Metadata: discoverability, comprehension, quality, access to policy.
Data Analyst → Metadata: discoverability, definitions, lineage, security and sensitivity advisories.