All Things Open 2016 talk discussing technologies used to augment traditional data warehousing. Those technologies are:
* data vault
* anchor modeling
* linked data
* NoSQL
* data virtualization
* textual disambiguation
2. About Alex
Principal Consultant (Data and Analytics), CSpring Inc.
Business Analytics Adjunct Professor, Wake Tech
MS in Business Intelligence
Passionate about developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don’t know yet!)
Twitter: @OpenDataAlex LinkedIn: alexmeadows
GitHub: OpenDataAlex Email: ameadows@cspring.com
5. Agenda
(Brief) history of data warehousing
The challenges
Three paths
− Traditional
− NoSQL
− Hybrid
Q&A
Please feel free to ask questions throughout the presentation!
6. Why Data Warehouses?
Started being discussed in the 1970s
While databases existed, they were not relational/normalized
− Network/hierarchical in nature
− Designed for the query, not for the data model
Reporting was hard
− System/application queries were not the same as management reporting queries
7. Bill Inmon
Data warehouse: a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process
12-14. Traditional Model – Challenges
How long to get new data sources online?
How to handle business logic changes?
How can I get my data integrated faster?
What about all that “unstructured” data?
What about data scientists?
15. A New Use Case
Traditional DW doesn’t meet the demands of the data science workforce
Only gets to the ‘what happened’ and ‘why it happened’
17. Data Vault
Hybrid between 3NF and star schema
Created by Dan Linstedt
Persistent data layer – keep everything
Bring data over as needed
− Once an object is touched, bring all of it over
Can be a hybrid between relational databases and Hadoop
Massively parallel loading, eventual consistency (with Hadoop)
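To make the pattern concrete, here is a minimal SQL sketch of the student/class example in data vault form. All table and column names are illustrative, not from the talk: business entities become hubs, relationships become links, and descriptive attributes live in versioned satellites.

-- Hub: one row per business entity, keyed by a hash of the business key
CREATE TABLE hub_student (
  student_hkey   CHAR(32)    NOT NULL PRIMARY KEY,  -- hash of the business key
  student_id     VARCHAR(20) NOT NULL,              -- business key from the source
  load_date      TIMESTAMP   NOT NULL,
  record_source  VARCHAR(50) NOT NULL
);

CREATE TABLE hub_class (
  class_hkey     CHAR(32)    NOT NULL PRIMARY KEY,
  class_id       VARCHAR(20) NOT NULL,
  load_date      TIMESTAMP   NOT NULL,
  record_source  VARCHAR(50) NOT NULL
);

-- Link: many-to-many relationship between two hubs
CREATE TABLE lnk_enrollment (
  enrollment_hkey CHAR(32)    NOT NULL PRIMARY KEY,
  student_hkey    CHAR(32)    NOT NULL REFERENCES hub_student (student_hkey),
  class_hkey      CHAR(32)    NOT NULL REFERENCES hub_class (class_hkey),
  load_date       TIMESTAMP   NOT NULL,
  record_source   VARCHAR(50) NOT NULL
);

-- Satellite: descriptive attributes, versioned by load_date as changes occur
CREATE TABLE sat_student_detail (
  student_hkey   CHAR(32)    NOT NULL REFERENCES hub_student (student_hkey),
  load_date      TIMESTAMP   NOT NULL,
  first_name     VARCHAR(50),
  last_name      VARCHAR(50),
  record_source  VARCHAR(50) NOT NULL,
  PRIMARY KEY (student_hkey, load_date)
);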
18. (Diagram: the example student/teacher/class model used through the rest of the talk)
19. (Diagram: the same model in data vault form – hubs, links, and satellites)
20. Pros and Cons
PRO
− Easily leverage existing infrastructure
− Faster iterations between source and solution, especially as objects are brought over
− Can offload historical data into Hadoop
− Shallow learning curve – simple to pick up
CON
− Table joins
− Inter-dependencies between objects
− Documentation not widely available (outside of the commercial website and book)
22. Anchor Modeling
Store data how it is and how it was
− Structural changes and content changes
Created by Lars Rönnbäck
Persistent data layer – keep everything, including how the data was structured
Highly normalized (6NF)
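As a hedged sketch of the 6NF style (the table and column names follow anchor modeling's mnemonic convention but are my own illustration, not from the talk), each entity becomes a bare anchor and each attribute gets its own historized table:

-- Anchor: one row per entity instance, surrogate key only
CREATE TABLE ST_Student (
  ST_ID INTEGER NOT NULL PRIMARY KEY
);

-- Attribute: one table per attribute; a change in value is a new row,
-- so both current and historical values are preserved
CREATE TABLE ST_NAM_Student_Name (
  ST_ID            INTEGER     NOT NULL REFERENCES ST_Student (ST_ID),
  ST_NAM_Name      VARCHAR(50) NOT NULL,
  ST_NAM_ValidFrom DATE        NOT NULL,
  PRIMARY KEY (ST_ID, ST_NAM_ValidFrom)
);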
25. Pros and Cons
PRO
− Stores data and data structure temporally
− Designed to be agile
− Reduces storage
CON
− Joins
− High normalization makes for difficult usage
Views mask this complexity
− Some data stores aren’t able to handle this normalization level
− BI tools aren’t designed for this type of modeling
27. Linked Data Stores (Triple Stores)
Store data with semantic information
Created by Tim Berners-Lee
Removes ambiguity in data
Standardizes data querying (SPARQL)
Can interface with all other linked data sources
− Public sources are referenced and integrated by calling them
− Private sources work the same way, provided permissions allow
Graph data stores are a specialized type of triple store
− Store data on the edges
33. Pros and Cons
PRO
− Clearly defined business logic
− Fast iterations on the ontology
− Single, unified querying language
− Can join datasets via PREFIX with no additional work
CON
− BI tools still playing catch-up
− Tool ecosystem is small (but awesome!)
− Few organizations have adopted it (but this is changing)
34. Other NoSQL
Columnar
− Designed with queries in mind
− Some are tuned for star schema performance
Document Stores
− Designed with data/queries in mind
− Key-value stores
Object Stores
− Data stored as objects
− Merger of database and programming
Others
− New types are still being created
− Watch out for flavors of the month
36. Data Virtualization
Integration is logical – not physical
Doesn’t matter what type of data is being integrated
− NoSQL
− Relational
Allows more traditionally designed tools to access more modern data stores
Allows for easier, more iterative workflows
Business logic lives in the integration layer
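As a hedged illustration (the source and schema names here are hypothetical, not from any particular virtualization product), the integration layer typically lets a BI tool join a document store and a relational table with ordinary SQL:

-- The virtualization layer exposes a document collection and a relational
-- table as if both were SQL tables; the view is the virtual integration point.
CREATE VIEW v_student_enrollment AS
SELECT s.student_id,
       s.first_name,
       e.class_id,
       e.enrolled_on
FROM   mongo_src.students AS s   -- translated from a document store
JOIN   dw.enrollment      AS e   -- ordinary relational table
       ON e.student_id = s.student_id;

Because the business logic lives in the view definition rather than in a physical ETL job, a change here takes effect immediately for every consumer of the view.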
40. Pros and Cons
PRO
− Easily leverage existing infrastructure
− Faster iterations between source and solution
− Integration between NoSQL and RDBMS simplified
− Can keep the data warehouse and augment as needed
− Uses SQL
− Self-documenting
CON
− Joining can be intensive
− Large memory and compute requirements
− Heavy loads on source systems
Can offload to virtualization shards
41. Textual Disambiguation
Take ‘unstructured’ data and interpret context
Store disambiguated data in an RDBMS (9th normal form)
Augment the traditional data warehouse with new unstructured data
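One way to picture the output is a landing table like the following. This is a hypothetical sketch, not Inmon's actual schema: each token from the source text is stored alongside the context the disambiguation engine resolved for it, so plain SQL can query what used to be free text.

-- Hypothetical landing table for disambiguated text
CREATE TABLE disambiguated_text (
  document_id INTEGER      NOT NULL,
  byte_offset INTEGER      NOT NULL,  -- position of the token in the source document
  word        VARCHAR(100) NOT NULL,  -- the raw token, e.g. 'cold'
  context     VARCHAR(100) NOT NULL,  -- resolved meaning, e.g. 'symptom' vs. 'temperature'
  PRIMARY KEY (document_id, byte_offset)
);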
43. Pros and Cons
PRO
− Easily leverage existing infrastructure
− Closes the gap between unstructured data and traditional data
− Clear understanding and interpretation of unstructured data
CON
− Full language context required
− Slang, acronyms, etc. can be a problem
− Time to delivery varies
Multiple language barriers
Defining context
− Non-agile
Hard to break data down into smaller components
44. Conclusion
Business Intelligence has to move forward
− Remove legacy tools that haven’t evolved past reporting
− Tweak the platform to support agile, incremental change
Businesses are already demanding more
− Faster turnaround
− More access
− Deeper insights
Is your team ready to make the move?
Editor's Notes
So here’s a bit about me. There are three things I’m going to ask of you, the first being – please feel free to reach out! I love talking and learning about what folks are using out in the wild and sharing. If you want to know more or chat more about any topic within data science/business intelligence just message me via one of the above methods.
The second thing I’ll ask is to be aware that some of these solutions may fix your particular problems; you’ll iterate on them, find them super awesome, and maybe be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or the super awesome things being done – these solutions are designed to be seamless and make users’ lives easier.
The final ask before we get fully started is please don’t be the pointy-haired boss! We’re covering a lot of topics at a very high level and a lot of nuances aren’t being discussed (it’s only a 40 minute presentation after all). Please dig further and ask plenty of questions.
By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of what technologies and methodologies are helping to address the needs of more data savvy customer bases.
The concept of data warehouses started in the 1970s and fully came into its own during the late 80s and well into the 90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged piecemeal or stored again based on the specific query requirements.
Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.
His approach, now considered a top-down design, integrates data from various OLTP systems, ties it together in a 3NF data model, and makes those data sets available for reporting. This ties in with the other process that came along a bit after Inmon – the star schema.
Another gentleman named Ralph Kimball took the data warehousing concept a step further. Considered a bottom-up approach, the data from the data warehouse is now transformed – or conformed – to match the reporting and analysis needs of the business. While many arguments were had and many organizations went pure conformed dimensional model for their data warehouse, the correct way to model is with both a 3NF backend and a star schema on top.
With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc., the data is moved into a 3NF model. From the 3NF model, star schemas are built on top to handle all the reporting/analytics requirements. This model has worked very well, but several problems have emerged with it. While I don’t have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
Speed of integrating data is a huge problem. Working to cleanse, conform, and process all those different sources into a single warehouse is one thing. Getting the business to agree on the logic and formulas that populate the star schema is another challenge, triggering many iterations on the integration layer.
It’s also a challenge to bring new sources online. Because of the nature of an Inmon data warehouse, it’s typical to bring the entire source over so that history is tracked across the entire source. In addition, how will logical changes be managed both from the source to warehouse and the warehouse to star schema? Without the 3NF layer, the star schema can’t be reloaded without losing all the history that was collected.
Then of course the other question is how to handle the big yellow elephant?
On top of all those other problems, we also have to address a whole new customer base – data scientists! They need faster and broader access to data than any customer base before. Yet they can’t just access data from the data warehouse, because the data is too clean to be of real use.
There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two – what happened and why it happened. Where it starts to fail is in the predictive analytics space, where again data scientists want data that is not cleansed and conformed but is still easy to access. Then there is prescriptive analytics – applying the predictions found and making automated decisions based on them.
Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/
Of the newer architectures, Data Vault is one of the easiest to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed, as opposed to bringing everything from the source all at once.
The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile.
Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/
Here is our basic example that we’ll be using through the rest of this presentation. It’s a simple student/teacher/class model that, while not modeled 100% ‘correct’, will provide a good example going forward.
Here is that same model in data vault form. Business entities become hub tables. Relationships between hubs get stored in many-to-many relationship tables called links. Off both hubs and links are dimension-like tables called satellites that store all the relevant information for their related hub or link. Satellites version data as changes occur.
There’s not a large amount of information publicly available outside the book, shown above. The original series of articles can be found on TDAN. There is also certification through the Learn Data Vault website.
Another of the new modeling techniques is anchor modeling. In this model, data is stored in a highly normalized format that focuses on both the actual data and the context and model over time. As the data model changes, new structures can be made in the anchor model.
There is only one open modeling tool that supports anchor modeling, and that’s on the anchor modeling website directly. Some commercial tools do provide support, but there aren’t many. That said, this is one of the many examples from the website. Each entity becomes an anchor, and data about the entity is tied together. This model also removes duplication of data. For instance, if a teacher and a student both had the name ‘Mary’, it would only be stored once and referenced by both anchors.
There’s not a lot of documentation out in the wild, aside from the website and a number of presentations and white papers.
Linked data (also known as triple stores) was created by Tim Berners-Lee around the same time as the web was created. Linked Data removes the ambiguity of typical data stores by translating the data model into a clear vocabulary. The other bonus is that there is only one single, unified querying language. When it comes to other linked data sources, it’s easy to join data sets together by adding a new prefix to a query.
Graph data stores are a subtype of triple store in which data is stored in a network graph – think six degrees of Kevin Bacon.
Again, using the example from before.
Using that model, here we have an example of triples. A triple is made up of three parts: subject, predicate, and object. For instance, a student has a first name of Arnold. Another would be that Arnold is a Student and that a Student is a subclass of Person.
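Since the original slide was an image, here is a hedged reconstruction of those triples in Turtle (the school vocabulary and example namespace are assumptions of mine):

@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix school: <http://example.org/school#> .

school:Arnold  school:firstName  "Arnold" .        # subject, predicate, object
school:Arnold  rdf:type          school:Student .  # Arnold is a Student
school:Student rdfs:subClassOf   school:Person .   # a Student is a subclass of Person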
RDF is another way of formatting the data in triples. Now there are other formats, but RDF/XML is one of the more common transport mechanisms since most tools can read XML. The same kinds of triples mentioned in the previous slide can be seen here.
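A minimal RDF/XML rendering of the same data might look like this sketch (again with an assumed example namespace):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:school="http://example.org/school#">
  <!-- The typed node element encodes the "Arnold is a Student" rdf:type triple -->
  <school:Student rdf:about="http://example.org/school#Arnold">
    <school:firstName>Arnold</school:firstName>
  </school:Student>
</rdf:RDF>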
SPARQL is similar enough to SQL to be familiar but different enough to require some tutorials ;) Here we are looking at our school data (as noted by the school prefix) and retrieving all students’ first names that are in Third Grade. The WHERE clause has three triple statements to bring the result set back. Each triple is denoted by a period.
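The query described above might look like the following sketch; the prefix URI and property names (school:gradeLevel, school:firstName) are my assumptions, since the original slide was an image:

PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX school: <http://example.org/school#>

SELECT ?firstName
WHERE {
  ?student rdf:type          school:Student .    # triple 1: it is a student
  ?student school:gradeLevel school:ThirdGrade . # triple 2: in Third Grade
  ?student school:firstName  ?firstName .        # triple 3: fetch the first name
}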
There are a few books on linked data, but these are two of the better ones of the bunch. The Manning publication is a great overview of Linked Data, while the Semantic Web book focuses on building web ontologies (the vocabularies we discussed earlier).
There are many other types of NoSQL databases, but not enough time to cover them here. They can still be useful in augmenting traditional data warehouses.
Data Virtualization is a great way to bridge the gap between NoSQL and SQL based tools. This allows for traditional business intelligence tools to access data stores that they wouldn’t normally be able to. The cool thing about virtualization tools is that business logic lives in this integration layer – allowing for faster changes to the process that builds the data endpoint.
With all the various sources, the virtualization tool will have one or many translation layers. These translation layers interpret the data between the source system and SQL. Between the initial translation layer and the final virtual data marts are any number of rules layers. These rules layers act in a similar manner to ETL (data integration), but they live inside the virtualization tool. From there, data marts can be created virtually as well. Changes can be made quickly at any layer and immediately impact the layers above it.
With data virtualization, traditional tools can continue to access data marts, both virtual and real. In addition, tools that can access the source systems can go either into the virtualized layers or access the systems directly, depending on the use case/need.
The final method we’ll be discussing is some of Inmon’s latest work – Textual Disambiguation. At its core, the methodology takes unstructured data, interprets its language components, and defines its textual context. From there, the data can be stored in an even more normalized form than we discussed with anchor modeling, augmenting the traditional warehouse with a veritable cornucopia of new information that can be queried using SQL.