All Things Open 2016 talk discussing technologies used to augment traditional data warehousing. Those technologies are:
* data vault
* anchor modeling
* linked data
* NoSQL
* data virtualization
* textual disambiguation
2. About Alex
Principal Consultant (Data and Analytics), CSpring Inc.
Business Analytics Adjunct Professor, Wake Tech
MS in Business Intelligence
Passionate about developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don’t know yet!)
Twitter: @OpenDataAlex LinkedIn: alexmeadows
GitHub: OpenDataAlex Email: ameadows@cspring.com
5. Agenda
(Brief) history of data warehousing
The challenges
Three paths
− Traditional
− NoSQL
− Hybrid
Q&A
Please feel free to ask questions throughout the presentation!
6. Why Data Warehouses?
Started being discussed in the 1970s
While databases existed, they were not relational/normalized
− Network/hierarchical in nature
− Designed for the query, not for the data model
Reporting was hard
− System/application queries were not the same as management reporting queries
7. Bill Inmon
Data warehouse: a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process
12-14. Traditional Model – Challenges
How long to get new data sources online?
How to handle business logic changes?
How can I get my data integrated faster?
What about all that “unstructured” data?
What about data scientists?
15. A New Use Case
Traditional DW doesn’t meet the demands of the data science workforce
Only gets to the ‘what happened’ and ‘why it happened’
17. Data Vault
Hybrid between 3NF and star schema
Created by Dan Linstedt
Persistent data layer – keep everything
Bring data over as needed
− Once an object is touched, bring all of it over
Can be a hybrid between relational databases and Hadoop
Massively parallel loading, eventual consistency (with Hadoop)
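To make the pattern concrete, here is a minimal SQL sketch of the student/class example in data vault form. All table and column names are illustrative, not from the talk: business entities become hubs, relationships become links, and descriptive attributes live in versioned satellites.

-- Hub: one row per business entity, keyed by a hash of the business key
CREATE TABLE hub_student (
  student_hkey   CHAR(32)    NOT NULL PRIMARY KEY,  -- hash of the business key
  student_id     VARCHAR(20) NOT NULL,              -- business key from the source
  load_date      TIMESTAMP   NOT NULL,
  record_source  VARCHAR(50) NOT NULL
);

CREATE TABLE hub_class (
  class_hkey     CHAR(32)    NOT NULL PRIMARY KEY,
  class_id       VARCHAR(20) NOT NULL,
  load_date      TIMESTAMP   NOT NULL,
  record_source  VARCHAR(50) NOT NULL
);

-- Link: many-to-many relationship between two hubs
CREATE TABLE lnk_enrollment (
  enrollment_hkey CHAR(32)    NOT NULL PRIMARY KEY,
  student_hkey    CHAR(32)    NOT NULL REFERENCES hub_student (student_hkey),
  class_hkey      CHAR(32)    NOT NULL REFERENCES hub_class (class_hkey),
  load_date       TIMESTAMP   NOT NULL,
  record_source   VARCHAR(50) NOT NULL
);

-- Satellite: descriptive attributes, versioned by load_date as changes occur
CREATE TABLE sat_student_detail (
  student_hkey   CHAR(32)    NOT NULL REFERENCES hub_student (student_hkey),
  load_date      TIMESTAMP   NOT NULL,
  first_name     VARCHAR(50),
  last_name      VARCHAR(50),
  record_source  VARCHAR(50) NOT NULL,
  PRIMARY KEY (student_hkey, load_date)
);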
18. (Diagram: the example student/teacher/class model used through the rest of the talk)
19. (Diagram: the same model in data vault form – hubs, links, and satellites)
20. Pros and Cons
PRO
− Easily leverage existing infrastructure
− Faster iterations between source and solution, especially as objects are brought over
− Can offload historical data into Hadoop
− Shallow learning curve – simple to pick up
CON
− Table joins
− Inter-dependencies between objects
− Documentation not widely available (outside of the commercial website and book)
22. Anchor Modeling
Store data how it is and how it was
− Structural changes and content changes
Created by Lars Rönnbäck
Persistent data layer – keep everything, including how the data was structured
Highly normalized (6NF)
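As a hedged sketch of the 6NF style (the table and column names follow anchor modeling's mnemonic convention but are my own illustration, not from the talk), each entity becomes a bare anchor and each attribute gets its own historized table:

-- Anchor: one row per entity instance, surrogate key only
CREATE TABLE ST_Student (
  ST_ID INTEGER NOT NULL PRIMARY KEY
);

-- Attribute: one table per attribute; a change in value is a new row,
-- so both current and historical values are preserved
CREATE TABLE ST_NAM_Student_Name (
  ST_ID            INTEGER     NOT NULL REFERENCES ST_Student (ST_ID),
  ST_NAM_Name      VARCHAR(50) NOT NULL,
  ST_NAM_ValidFrom DATE        NOT NULL,
  PRIMARY KEY (ST_ID, ST_NAM_ValidFrom)
);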
25. Pros and Cons
PRO
− Stores data and data structure temporally
− Designed to be agile
− Reduces storage
CON
− Joins
− High normalization makes for difficult usage
Views mask this complexity
− Some data stores aren’t able to handle this normalization level
− BI tools aren’t designed for this type of modeling
27. Linked Data Stores (Triple Stores)
Store data with semantic information
Created by Tim Berners-Lee
Removes ambiguity in data
Standardizes data querying (SPARQL)
Can interface with all other linked data sources
− Public sources are referenced and integrated by calling them
− Private sources work the same way, provided permissions allow
Graph data stores are a specialized type of triple store
− Store data on the edges
33. Pros and Cons
PRO
− Clearly defined business logic
− Fast iterations on the ontology
− Single, unified querying language
− Can join datasets via PREFIX with no additional work
CON
− BI tools still playing catch-up
− Tool ecosystem is small (but awesome!)
− Few organizations have adopted it (but this is changing)
34. Other NoSQL
Columnar
− Designed with queries in mind
− Some are tuned for star schema performance
Document Stores
− Designed with data/queries in mind
− Key-value stores
Object Stores
− Data stored as objects
− Merger of database and programming
Others
− New types are still being created
− Watch out for flavors of the month
36. Data Virtualization
Integration is logical – not physical
Doesn’t matter what type of data is being integrated
− NoSQL
− Relational
Allows more traditionally designed tools to access more modern data stores
Allows for easier, more iterative workflows
Business logic lives in the integration layer
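As a hedged illustration (the source and schema names here are hypothetical, not from any particular virtualization product), the integration layer typically lets a BI tool join a document store and a relational table with ordinary SQL:

-- The virtualization layer exposes a document collection and a relational
-- table as if both were SQL tables; the view is the virtual integration point.
CREATE VIEW v_student_enrollment AS
SELECT s.student_id,
       s.first_name,
       e.class_id,
       e.enrolled_on
FROM   mongo_src.students AS s   -- translated from a document store
JOIN   dw.enrollment      AS e   -- ordinary relational table
       ON e.student_id = s.student_id;

Because the business logic lives in the view definition rather than in a physical ETL job, a change here takes effect immediately for every consumer of the view.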
40. Pros and Cons
PRO
− Easily leverage existing infrastructure
− Faster iterations between source and solution
− Integration between NoSQL and RDBMS simplified
− Can keep the data warehouse and augment as needed
− Uses SQL
− Self-documenting
CON
− Joining can be intensive
− Large memory and compute requirements
− Heavy loads on source systems
Can offload to virtualization shards
41. Textual Disambiguation
Take ‘unstructured’ data and interpret context
Store disambiguated data in an RDBMS (9th normal form)
Augment the traditional data warehouse with new unstructured data
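One way to picture the output is a landing table like the following. This is a hypothetical sketch, not Inmon's actual schema: each token from the source text is stored alongside the context the disambiguation engine resolved for it, so plain SQL can query what used to be free text.

-- Hypothetical landing table for disambiguated text
CREATE TABLE disambiguated_text (
  document_id INTEGER      NOT NULL,
  byte_offset INTEGER      NOT NULL,  -- position of the token in the source document
  word        VARCHAR(100) NOT NULL,  -- the raw token, e.g. 'cold'
  context     VARCHAR(100) NOT NULL,  -- resolved meaning, e.g. 'symptom' vs. 'temperature'
  PRIMARY KEY (document_id, byte_offset)
);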
43. Pros and Cons
PRO
− Easily leverage existing infrastructure
− Closes the gap between unstructured data and traditional data
− Clear understanding and interpretation of unstructured data
CON
− Full language context required
− Slang, acronyms, etc. can be a problem
− Time to delivery varies
Multiple language barriers
Defining context
− Non-agile
Hard to break data down into smaller components
44. Conclusion
Business Intelligence has to move forward
− Remove legacy tools that haven’t evolved past reporting
− Tweak the platform to support agile, incremental change
Businesses are already demanding more
− Faster turnaround
− More access
− Deeper insights
Is your team ready to make the move?
Editor's Notes
So here’s a bit about me. There are three things I’m going to ask of you, the first being – please feel free to reach out! I love talking and learning about what folks are using out in the wild and sharing. If you want to know more or chat more about any topic within data science/business intelligence just message me via one of the above methods.
The second thing I’ll ask is to be aware that some of these solutions may fix your particular problems; you’ll iterate on them, find them super awesome, and maybe be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or the super awesome things being done – these solutions are designed to be seamless and make users’ lives easier.
The final ask before we get fully started is please don’t be the pointy-haired boss! We’re covering a lot of topics at a very high level and a lot of nuances aren’t being discussed (it’s only a 40 minute presentation after all). Please dig further and ask plenty of questions.
By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of what technologies and methodologies are helping to address the needs of more data savvy customer bases.
The concept of data warehouses started in the 1970s and fully came into its own during the late 80s and well into the 90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged piecemeal or stored again based on the specific query requirements.
Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.
His approach, now considered a top-down design, integrates data from various OLTP systems, ties it together in a 3NF data model, and makes those data sets available for reporting. This ties in with the other process that came along a bit after Inmon – the star schema.
Another gentleman named Ralph Kimball took the data warehousing concept a step further. Considered a bottom-up approach, the data from the data warehouse is now transformed – or conformed – to match the reporting and analysis needs of the business. While many arguments were had and many organizations went pure conformed dimensional model for their data warehouse, the correct way to model is with both a 3NF backend and a star schema on top.
With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc., the data is moved into a 3NF model. From the 3NF model, star schemas are built on top to handle all the reporting/analytics requirements. This model has worked very well, but several problems have emerged with it. While I don’t have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
Speed of integrating data is a huge problem. Working to cleanse, conform, and process all those different sources into a single warehouse is one thing. Getting the business to agree on the logic and formulas that populate the star schema is another challenge, triggering many iterations on the integration layer.
It’s also a challenge to bring new sources online. Because of the nature of an Inmon data warehouse, it’s typical to bring the entire source over so that history is tracked across the entire source. In addition, how will logical changes be managed both from the source to warehouse and the warehouse to star schema? Without the 3NF layer, the star schema can’t be reloaded without losing all the history that was collected.
Then of course the other question is how to handle the big yellow elephant?
On top of all those other problems, we also have to address a whole new customer base – data scientists! They need faster and broader access to data than any customer base before. Yet they can’t just access data from the data warehouse, because the data is too clean to be of real use.
There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two – what happened and why it happened. Where it starts to fail is in the predictive analytics space, where again data scientists want data that is not cleansed and conformed but is still easy to access. Then there is prescriptive analytics – applying the predictions found and making automated decisions based on them.
Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/
Of the newer architectures, Data Vault is one of the easiest to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed, as opposed to bringing everything from the source all at once.
The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile.
Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/
Here is our basic example that we’ll be using through the rest of this presentation. It’s a simple student/teacher/class model that, while not modeled 100% ‘correct’, will provide a good example going forward.
Here is that same model in data vault form. Business entities become hub tables. Relationships between hubs get stored in many-to-many relationship tables called links. Off both hubs and links are dimension-like tables called satellites that store all the relevant information for their related hub or link. Satellites version data as changes occur.
There’s not a large amount of information publicly available outside the book, shown above. The original series of articles can be found on TDAN. There is also certification through the Learn Data Vault website.
Another of the new modeling techniques is anchor modeling. In this model, data is stored in a highly normalized format that focuses on both the actual data and the context and model over time. As the data model changes, new structures can be made in the anchor model.
There is only one open modeling tool that supports anchor modeling, and that’s on the anchor modeling website directly. Some commercial tools do provide support, but there aren’t many. That said, this is one of the many examples from the website. Each entity becomes an anchor, and data about the entity is tied together. This model also removes duplication of data. For instance, if a teacher and a student both had the name ‘Mary’, it would only be stored once and referenced by both anchors.
There’s not a lot of documentation out in the wild, aside from the website and a number of presentations and white papers.
Linked data (also known as triple stores) was created by Tim Berners-Lee around the same time as the web was created. Linked Data removes the ambiguity of typical data stores by translating the data model into a clear vocabulary. The other bonus is that there is only one single, unified querying language. When it comes to other linked data sources, it’s easy to join data sets together by adding a new prefix to a query.
Graph data stores are a subtype of triple store in which data is stored in a network graph – think six degrees of Kevin Bacon.
Again, using the example from before.
Using that model, here we have an example of triples. A triple is made up of three parts: subject, predicate, and object. For instance, a student has a first name of Arnold. Another would be that Arnold is a Student and that a Student is a subclass of Person.
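Since the original slide was an image, here is a hedged reconstruction of those triples in Turtle (the school vocabulary and example namespace are assumptions of mine):

@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix school: <http://example.org/school#> .

school:Arnold  school:firstName  "Arnold" .        # subject, predicate, object
school:Arnold  rdf:type          school:Student .  # Arnold is a Student
school:Student rdfs:subClassOf   school:Person .   # a Student is a subclass of Person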
RDF is another way of formatting the data in triples. Now there are other formats, but RDF/XML is one of the more common transport mechanisms since most tools can read XML. The same kinds of triples mentioned in the previous slide can be seen here.
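A minimal RDF/XML rendering of the same data might look like this sketch (again with an assumed example namespace):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:school="http://example.org/school#">
  <!-- The typed node element encodes the "Arnold is a Student" rdf:type triple -->
  <school:Student rdf:about="http://example.org/school#Arnold">
    <school:firstName>Arnold</school:firstName>
  </school:Student>
</rdf:RDF>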
SPARQL is similar enough to SQL to be familiar but different enough to require some tutorials ;) Here we are looking at our school data (as noted by the school prefix) and retrieving all students’ first names that are in Third Grade. The WHERE clause has three triple statements to bring the result set back. Each triple is denoted by a period.
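The query described above might look like the following sketch; the prefix URI and property names (school:gradeLevel, school:firstName) are my assumptions, since the original slide was an image:

PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX school: <http://example.org/school#>

SELECT ?firstName
WHERE {
  ?student rdf:type          school:Student .    # triple 1: it is a student
  ?student school:gradeLevel school:ThirdGrade . # triple 2: in Third Grade
  ?student school:firstName  ?firstName .        # triple 3: fetch the first name
}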
There are a few books on linked data, but these are two of the better ones of the bunch. The Manning publication is a great overview of Linked Data, while the Semantic Web book focuses on building web ontologies (the vocabularies we discussed earlier).
There are many other types of NoSQL databases, but not enough time to cover them here. They can still be useful in augmenting traditional data warehouses.
Data Virtualization is a great way to bridge the gap between NoSQL and SQL based tools. This allows for traditional business intelligence tools to access data stores that they wouldn’t normally be able to. The cool thing about virtualization tools is that business logic lives in this integration layer – allowing for faster changes to the process that builds the data endpoint.
With all the various sources, the virtualization tool will have one or many translation layers. These translation layers interpret the data between the source system and SQL. Between the initial translation layer and the final virtual data marts are any number of rules layers. These rules layers act in a similar manner to ETL (data integration), but they live inside the virtualization tool. From there, data marts can be created virtually as well. Changes can be made quickly at any layer and immediately impact the layers above it.
With data virtualization, traditional tools can continue to access data marts, both virtual and real. In addition, tools that can access the source systems can go either into the virtualized layers or access the systems directly, depending on the use case/need.
The final method we’ll be discussing is some of Inmon’s latest work – Textual Disambiguation. At its core, the methodology takes unstructured data, interprets its language components, and defines its textual context. From there, the data can be stored in an even more normalized form than we discussed with anchor modeling, augmenting the traditional warehouse with a veritable cornucopia of new information that can be queried using SQL.