Building Next Generation
Data Warehouses
All Things Open 2016
Alex Meadows

About Alex

Principal Consultant (Data and Analytics), CSpring Inc.

Business Analytics Adjunct Professor, Wake Tech

MS in Business Intelligence

Passionate about developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don’t know yet!)

Twitter: @OpenDataAlex LinkedIn: alexmeadows
GitHub: OpenDataAlex Email: ameadows@cspring.com
Agenda

(Brief) History of why data warehousing

The challenges

Three paths
− Traditional
− NoSQL
− Hybrid

Q&A


Please feel free to ask questions throughout the presentation!
Why Data Warehouses?

Started being discussed in the 1970s

While databases existed, they were not relational/normalized
− Network/hierarchical in nature
− Designed for the query, not for the data model

Reporting was hard
− System/application queries were not the same as management reporting queries
Bill Inmon
Data warehouse: a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process
Bill Inmon

Bottom-up design

Integration of source systems

Third Normal Form
Ralph Kimball

Make data accessible

Top-down approach

Dimensional models (star schema)
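
A star schema can be sketched in a few lines. The example below is only an illustration, assuming hypothetical table names (`dim_product`, `dim_date`, `fact_sales`) and made-up rows that are not part of the talk; it uses Python's built-in `sqlite3` module to show one fact table joined to a dimension for a reporting query.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20160101, 2016, 1), (20160201, 2016, 2);
INSERT INTO fact_sales VALUES (1, 20160101, 10.0), (1, 20160201, 15.0), (2, 20160101, 5.0);
""")

# A typical reporting query: aggregate the facts, slice by a dimension attribute.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

The fact table holds only keys and measures, so every report is a join from facts out to whichever dimensions the question needs.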
Traditional Model
Traditional Model – Challenges

How can I get my data integrated faster?

How long to get new data sources online?

How to handle business logic changes?

What about all that “unstructured” data?

What about data scientists?
A New Use Case

Traditional DW doesn’t meet the demands of the data science workforce

Only gets to the ‘what happened’ and ‘why’.
Traditional
Iterations On Existing Architecture
Data Vault

Hybrid between 3NF and star schema

Created by Dan Linstedt

Persistent data layer – keep everything

Bring data over as needed
− Once touching an object, bring it all over

Can be hybrid between relational databases and Hadoop

Massive parallel loading, eventual consistency (with Hadoop)
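
As a rough sketch of the "keep everything" idea, the Python below models one hub (business keys) and one satellite (descriptive history), two of the Data Vault building blocks. The names `hash_key`, `hub_customer`, `sat_customer`, and `load_customer`, and the choice of MD5, are assumptions made for illustration and are not taken from the talk.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic surrogate key from one or more business keys."""
    normalized = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

hub_customer = {}  # hash key -> business key plus load metadata (one row per key)
sat_customer = []  # append-only history of descriptive attributes

def load_customer(customer_id, attributes, source, load_ts):
    """Insert the hub row once; append a satellite row only when attributes change."""
    hk = hash_key(customer_id)
    hub_customer.setdefault(
        hk, {"customer_id": customer_id, "record_source": source, "load_ts": load_ts}
    )
    history = [row for row in sat_customer if row["hub_key"] == hk]
    if not history or history[-1]["attributes"] != attributes:
        sat_customer.append(
            {"hub_key": hk, "load_ts": load_ts, "record_source": source,
             "attributes": dict(attributes)}
        )

load_customer("C-100", {"city": "Raleigh"}, "crm", "2016-01-01")
load_customer("C-100", {"city": "Durham"}, "crm", "2016-06-01")
load_customer("C-100", {"city": "Durham"}, "crm", "2016-07-01")  # unchanged: no new row
print(len(hub_customer), len(sat_customer))
```

Because nothing is updated in place, each load is an independent insert, which is what makes parallel loading practical.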
Pros and Cons

PRO
− Easily leverage existing infrastructure
− Faster iterations between source and solution, especially as objects are brought over
− Can offload historical data into Hadoop
− Learning curve: simple to pick up

CON
− Table joins
− Inter-dependencies between objects
− Documentation not widely available (outside of commercial website and book)

1.0 documentation found at: TDAN Article

2.0 documentation ->

Certification/training: http://learndatavault.com/
Anchor Modeling

Store data how it is and how it was
− Structural changes and content changes

Created by Lars Rönnbäck

Persistent data layer – keep everything, including how the data was structured

Highly normalized (6NF)
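
To make "how it is and how it was" concrete, here is a minimal sketch using Python's built-in `sqlite3`. It assumes a hypothetical anchor table with one table per attribute, each row stamped with when it changed; the table names, the `customer_as_of` helper, and the sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The anchor holds identity only; every attribute lives in its own table.
CREATE TABLE customer_anchor (customer_id INTEGER PRIMARY KEY);
CREATE TABLE customer_name (
    customer_id INTEGER, name TEXT, changed_at TEXT,
    PRIMARY KEY (customer_id, changed_at)
);
CREATE TABLE customer_city (
    customer_id INTEGER, city TEXT, changed_at TEXT,
    PRIMARY KEY (customer_id, changed_at)
);
INSERT INTO customer_anchor VALUES (1);
INSERT INTO customer_name VALUES (1, 'Ann Lee', '2015-01-01'), (1, 'Ann Smith', '2016-06-01');
INSERT INTO customer_city VALUES (1, 'Raleigh', '2015-01-01');
""")

def customer_as_of(customer_id, as_of):
    """Return (name, city) as they were known on the given ISO date."""
    return conn.execute("""
        SELECT
          (SELECT n.name FROM customer_name n
            WHERE n.customer_id = a.customer_id AND n.changed_at <= :as_of
            ORDER BY n.changed_at DESC LIMIT 1),
          (SELECT c.city FROM customer_city c
            WHERE c.customer_id = a.customer_id AND c.changed_at <= :as_of
            ORDER BY c.changed_at DESC LIMIT 1)
        FROM customer_anchor a
        WHERE a.customer_id = :id
    """, {"id": customer_id, "as_of": as_of}).fetchone()

print(customer_as_of(1, "2016-01-01"))
print(customer_as_of(1, "2016-12-31"))
```

Adding a new attribute means adding a new table rather than altering an existing one, which is where the agility comes from; the cost is the extra lookups visible in the query.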

Documentation: Anchor Modeling Website

Quite a few presentations, no formal texts outside academic papers
Pros and Cons

PRO
− Stores data and data structure temporally
− Designed to be agile
− Reduces storage

CON
− Joins
− High normalization makes for difficult usage (views mask this complexity)
− Some data stores aren’t able to handle this normalization level
− BI tools aren’t designed for this type of modeling
NoSQL
Volume, Velocity, Variety, Veracity
Linked Data Stores (Triple Stores)

Store data with semantic information

Created by Tim Berners-Lee

Removes/eliminates ambiguity in data

Standardizes data querying (SPARQL)

Can interface with all other linked data sources
− Public sources referenced and integrated by calling them
− Private sources work the same way, provided permissions allow

Graph data stores are a specialized type of triple store
− Store data on edges
[Figure: example RDF graph. Nodes: Valerie, Arnold, Student, Teacher, Person, Class, Third Grade. Edge labels: hasFirstName, is a, isSubClassOf, enrolledIn, teaches.]
RDF/XML
SPARQL
PREFIX school: <http://my.school.vocabulary>
SELECT ?s ?name
WHERE {
  ?s school:isEnrolledIn ?class .
  ?s school:hasFirstName ?name .
  ?class school:hasCourseName "Third Grade" .
}

Result:
?s                  ?name
school:Student#493  Arnold
school:Student#494  Carlos
school:Student#495  Phoebe
school:Student#496  Ralphie
school:Student#497  Wanda
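
The pattern matching behind a query like the one above can be sketched in plain Python. This is a toy illustration, not how a real triple store is implemented: the `triples` list, the identifiers such as `student:493`, and the `match` and `query` helpers are all hypothetical, and only two of the students are included.

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("student:493", "isEnrolledIn", "class:3"),
    ("student:493", "hasFirstName", "Arnold"),
    ("student:494", "isEnrolledIn", "class:3"),
    ("student:494", "hasFirstName", "Carlos"),
    ("student:900", "isEnrolledIn", "class:4"),
    ("student:900", "hasFirstName", "Janet"),
    ("class:3", "hasCourseName", "Third Grade"),
    ("class:4", "hasCourseName", "Fourth Grade"),
]

def match(pattern, bindings):
    """Yield extended variable bindings for every triple that fits the pattern."""
    for triple in triples:
        candidate = dict(bindings)
        fits = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must keep the value it was bound to earlier.
                if candidate.setdefault(term, value) != value:
                    fits = False
                    break
            elif term != value:
                fits = False
                break
        if fits:
            yield candidate

def query(patterns):
    """Join the patterns one at a time, carrying bindings forward."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for b in match(pattern, s)]
    return solutions

results = query([
    ("?s", "isEnrolledIn", "?class"),
    ("?s", "hasFirstName", "?name"),
    ("?class", "hasCourseName", "Third Grade"),
])
rows = sorted((r["?s"], r["?name"]) for r in results)
print(rows)
```

Shared variable names (`?s`, `?class`) are what join the patterns together, which is the same role they play in the WHERE clause.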
Pros and Cons

PRO
− Clearly defined business logic
− Fast iterations on ontology
− Single, unified querying language
− Can join datasets via PREFIX with no additional work

CON
− BI tools still playing catch-up
− Tool ecosystem is small (but awesome!)
− Few organizations have adopted (but this is changing)
Other NoSQL

Columnar
− Designed with queries in mind
− Some are tuned for star schema performance

Document Stores
− Designed with data/queries in mind
− Key-value stores

Object Stores
− Data stored as objects
− Merger of database and programming

Others
− New types are still being created
− Watch out for flavors of the month
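
One way to picture "designed with queries in mind" for columnar stores is to lay the same records out by row and by column. The sketch below uses made-up order records and plain Python structures as stand-ins for the storage layout; it does not reflect any particular product.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "East", "amount": 10.0},
    {"order_id": 2, "region": "West", "amount": 20.0},
    {"order_id": 3, "region": "East", "amount": 5.0},
]

# Column-oriented layout: each attribute is stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# An aggregate only has to scan the columns it actually uses.
total_amount = sum(columns["amount"])
east_total = sum(
    amount
    for region, amount in zip(columns["region"], columns["amount"])
    if region == "East"
)
print(total_amount, east_total)
```

Analytical queries that touch a few attributes across many records favor the column layout, while fetching one whole record favors the row layout.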
Hybrid
Data Virtualization

Integration is logical – not physical

Doesn’t matter what type of data is being integrated*
− NoSQL
− Relational

Allows for more traditionally designed tools to access more modern data stores

Allows for easier, more iterative workflows

Business logic lives in the integration layer
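
A minimal sketch of logical integration, assuming two hypothetical sources: a relational `customer` table (in-memory SQLite) and a list of JSON documents standing in for a document store. The `revenue_by_customer` function plays the part of a virtual view; nothing is copied into a warehouse, and the join logic lives in the integration layer.

```python
import json
import sqlite3

# Source 1: a relational system.
relational = sqlite3.connect(":memory:")
relational.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
""")

# Source 2: a stand-in for a document store holding JSON orders.
document_store = [
    json.dumps({"order_id": "A1", "customer_id": 1, "total": 120.0}),
    json.dumps({"order_id": "A2", "customer_id": 1, "total": 30.0}),
    json.dumps({"order_id": "B1", "customer_id": 2, "total": 75.5}),
]

def revenue_by_customer():
    """Virtual view: read both sources at query time and join them in memory."""
    totals = {}
    for raw in document_store:
        doc = json.loads(raw)
        totals[doc["customer_id"]] = totals.get(doc["customer_id"], 0.0) + doc["total"]
    cursor = relational.execute("SELECT customer_id, name FROM customer ORDER BY name")
    return [(name, totals.get(customer_id, 0.0)) for customer_id, name in cursor]

print(revenue_by_customer())
```

Every call goes back to the sources, which keeps the data current but also explains the memory, compute, and source-load costs of this approach.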
Logical Layering
Pros and Cons

PRO
− Easily leverage existing infrastructure
− Faster iterations between source and solution
− Integration between NoSQL and RDBMS simplified
− Can keep data warehouse and augment as needed
− Uses SQL
− Self-documenting

CON
− Joining can be intensive
− Large memory, compute requirements
− Heavy loads on source systems (can offload to virtualization shards)
Textual Disambiguation

Take ‘unstructured’ data and interpret context

Store disambiguated data in RDBMS (9th normal form)

Augment traditional data warehouse with new unstructured data.
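
The idea of interpreting context can be illustrated with a deliberately naive sketch: pick the meaning of an ambiguous abbreviation from the words around it, then emit rows that could be loaded into a relational table. The `SENSES` dictionary, its cue words, and the `disambiguate` and `to_rows` helpers are hypothetical; real textual disambiguation needs far richer language context.

```python
import re

# Hypothetical sense inventory: abbreviation -> candidate meanings and context cues.
SENSES = {
    "ha": {
        "heart attack": {"cardiology", "chest", "cardiac"},
        "headache": {"migraine", "aspirin", "head"},
    }
}

def disambiguate(term, text):
    """Choose the sense whose cue words overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & words) for sense, cues in SENSES[term].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def to_rows(doc_id, text):
    """Produce (document, term, meaning) rows ready for a relational table."""
    rows = []
    for term in SENSES:
        if re.search(rf"\b{re.escape(term)}\b", text.lower()):
            rows.append((doc_id, term, disambiguate(term, text)))
    return rows

print(to_rows("note-1", "Patient admitted to cardiology with HA and chest pain"))
print(to_rows("note-2", "Pt reports HA, took aspirin"))
```

When no cue word appears, the sketch returns `None` instead of guessing, which mirrors the point that full language context is required.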
Pros and Cons

PRO
− Easily leverage existing infrastructure
− Closes the gap between unstructured data and traditional data
− Clear understanding and interpretation of unstructured data

CON
− Full language context required
− Slang, acronyms, etc. can be a problem
− Time to delivery varies (multiple language barrier, defining context)
− Non-agile: hard to break data down into smaller components
Conclusion

Business Intelligence has to move forward
− Remove legacy tools that haven’t evolved past reporting
− Tweak platform to support agile, incremental change

Businesses are already demanding more
− Faster turnaround
− More access
− Deeper insights

Is your team ready to make the move?


Editor's Notes

  1. So here’s a bit about me. There are three things I’m going to ask of you, the first being – please feel free to reach out! I love talking and learning about what folks are using out in the wild and sharing. If you want to know more or chat more about any topic within data science/business intelligence just message me via one of the above methods.
  2. The second thing I’ll ask is to be aware that some of these solutions may fix your particular problems. You’ll iterate on them, find them super-awesome, and maybe be able to give back by talking about your experiences at a conference or in a trade paper. Note that the business side might not realize the size of the undertaking or the super-awesome things being done; these solutions are designed to be seamless and make users’ lives easier.
  3. The final ask before we get fully started is please don’t be the pointy-haired boss! We’re covering a lot of topics at a very high level and a lot of nuances aren’t being discussed (it’s only a 40-minute presentation after all). Please dig further and ask plenty of questions.
  4. By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of what technologies and methodologies are helping to address the needs of more data savvy customer bases.
  5. The concept of data warehousing started in the 1970s and fully came into its own during the late 80s and well into the 90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged piecemeal or stored again based on the specific query requirements.
  6. Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.
  7. His approach, now considered a bottom-up design, integrates data from various OLTP systems, ties it together in a 3NF data model, and makes those data sets available for reporting. This ties in with the other approach that came along a bit after Inmon: the star schema.
  8. Another gentleman named Ralph Kimball took the data warehousing concept a step further. Considered a top-down approach, the data from the data warehouse is now transformed – or conformed – to match the reporting and analysis needs of the business. While many arguments were had and many organizations went pure conformed dimensional model for their data warehouse, the correct way to model is with both a 3NF backend and a star schema on top.
  9. With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc., the data is moved into a 3NF model. From the 3NF model, star schemas are built on top to handle all the reporting/analytics requirements. This model has worked very well, but several problems have come out of it. While I don’t have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
  10. Speed of integrating data is a huge problem. Working to cleanse, conform, and process all those different sources into a single warehouse is one thing. Getting the business to agree on the logic and formulas that populate the star schema is also a challenge, triggering many iterations on the integration layer.
  11. It’s also a challenge to bring new sources online. Because of the nature of an Inmon data warehouse, it’s typical to bring the entire source over so that history is tracked across the entire source. In addition, how will logical changes be managed both from the source to warehouse and the warehouse to star schema? Without the 3NF layer, the star schema can’t be reloaded without losing all the history that was collected.
  12. Then of course the other question is how to handle the big yellow elephant?
  13. On top of all those other problems, we also have to address a whole new customer base: data scientists! They need access to data faster and more broadly than any customer base before. Yet they can’t just access data from the data warehouse, because the data is too clean to be of real use.
  14. There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two: what happened and why it happened. Where it starts to fail is in the predictive analytics space where, again, data scientists want data that is not cleansed and conformed but is still easy to access. Then there is prescriptive analytics: applying the predictions found and making automated decisions based on them. Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/
  15. Of the newer architectures, Data Vault is one of the easier to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed, as opposed to bringing everything from the source all at once. The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile. Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/
  16. Here is our basic example that we’ll be using through the rest of this presentation. It’s a simple student/teacher/class model that, while not modeled 100% ‘correct’, will provide a good example going forward.
  17. Here is that same model in data vault form. Business entities become hub tables. Relationships between hubs get stored in many-to-many relationship tables called links. Off both hubs and links are dimension-like tables called satellites that store all relevant information about their related hub or link. Satellites version data as changes occur.
  18. There’s not a large amount of information publicly available outside the book, shown above. The original series of articles can be found on TDAN. There is also certification through the learn data vault website.
  19. Another of the new modeling techniques is anchor modeling. In this model, data is stored in a highly normalized format that focuses not only on the actual data but also on the context and the model over time. As the data model changes, new structures can be made in the anchor model.
  20. There is only one open modeling tool that supports anchor modeling, and that’s on the anchor modeling website directly. Some commercial tools do provide support, but there aren’t many. That said, this is one of the many examples from the website. Each entity becomes an anchor and data about the entity is tied together. This model also removes duplication of data. For instance, if a teacher and a student both had the name ‘Mary’, it would only be stored once and be referenced to both anchors.
  21. There’s not a lot of documentation out in the wild, with the exception of the website and many presentations and white papers.
  22. Linked data (also known as triple stores) was created by Tim Berners-Lee around the same time as the web was created. Linked Data removes the ambiguity of typical data stores by translating the data model into a clear vocabulary. The other bonus is that there is a single, unified querying language. When it comes to other linked data sources, it’s easy to join data sets together by adding a new prefix to a query. Graph data stores are a subtype of triple store in which data is stored in a network graph (think six degrees of Kevin Bacon).
  23. Again, using the example from before.
  24. Using that model, here we have an example of triples. A triple is made up of three parts: subject, predicate, and object. For instance, a student has a first name of Arnold. Another would be that Arnold is a Student and that a Student is a subclass of Person.
  25. RDF is another way of formatting the data in triples. Now there are other formats, but RDF/XML is one of the more common transport mechanisms since most tools can read XML. The same kinds of triples mentioned in the previous slide can be seen here.
  26. SPARQL is similar enough to SQL to be familiar but different enough to require some tutorials ;) Here we are looking at our school data (as noted by the school prefix) and retrieving the first names of all students who are in Third Grade. The WHERE clause has three triple statements to bring the result set back. Each triple is terminated by a period.
  27. There are a few books on linked data but these are two of the better of the bunch. The Manning publication is a great overview of Linked Data while the Semantic Web book focuses on building web ontologies (the vocabulary like we discussed earlier).
  28. There are many other types of NoSQL databases, but not enough time to cover them here. They can still be useful in augmenting traditional data warehouses.
  29. Data Virtualization is a great way to bridge the gap between NoSQL and SQL based tools. This allows for traditional business intelligence tools to access data stores that they wouldn’t normally be able to. The cool thing about virtualization tools is that business logic lives in this integration layer – allowing for faster changes to the process that builds the data endpoint.
  30. With all the various sources, the virtualization tool will have one or many translation layers. These translation layers interpret the data between the source system and SQL. Between the initial translation layer and the final virtual data marts are any number of rules layers. These rules layers act in a similar manner to ETL (data integration) but they are inside the virtualization tool. From there, data marts can be created virtually as well. At any of those layers changes can be made quickly and will immediately impact the layers above the one where changes are made.
  31. With data virtualization, traditional tools can continue to access data marts, both virtual and real. In addition, tools that can access the source systems can go either into the virtualized layers or access the systems directly, depending on the use case/need.
  32. The final method we’ll be discussing is some of Inmon’s latest work: Textual Disambiguation. At its core, the methodology takes unstructured data, interprets it into its language components, and defines its textual context. From there, the data can be stored in an even more highly normalized form than we discussed with anchor modeling and can augment the traditional warehouse with a veritable cornucopia of new information that can be queried using SQL.
  33. Image Source: http://www.datatransformed.com.au/textual%20etl.htm#.WArA3RIrLRZ
  34. Image Source: https://pixabay.com/p-1014060/?no_redirect
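
The subject/predicate/object triples described in notes 24 and 26 can be sketched in a few lines of Python. This is only a minimal illustration, not a real triple store and not the SPARQL from the slides: the `school:` identifiers, the sample rows, and the helper functions `match` and `third_grade_first_names` are all assumptions invented for this example. It shows how three triple patterns (is a Student, is in Third Grade, has a first name) combine to answer the "third grade students" question.

```python
from typing import Optional

# A triple is (subject, predicate, object), as described in note 24.
Triple = tuple[str, str, str]

# Hypothetical sample data loosely following the student/teacher/class example.
TRIPLES: list[Triple] = [
    ("school:Arnold", "rdf:type", "school:Student"),
    ("school:Arnold", "school:firstName", "Arnold"),
    ("school:Arnold", "school:inGrade", "school:ThirdGrade"),
    ("school:Mary", "rdf:type", "school:Student"),
    ("school:Mary", "school:firstName", "Mary"),
    ("school:Mary", "school:inGrade", "school:FourthGrade"),
    ("school:Student", "rdfs:subClassOf", "school:Person"),
]


def match(
    triples: list[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> list[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def third_grade_first_names(triples: list[Triple]) -> list[str]:
    """Combine three triple patterns, similar in spirit to a SPARQL WHERE clause."""
    # Pattern 1: ?s rdf:type school:Student
    students = {s for (s, _, _) in match(triples, predicate="rdf:type", obj="school:Student")}
    # Pattern 2: ?s school:inGrade school:ThirdGrade
    in_third = {s for (s, _, _) in match(triples, predicate="school:inGrade", obj="school:ThirdGrade")}
    # Pattern 3: ?s school:firstName ?name, kept only for subjects matching both patterns above
    wanted = students & in_third
    return sorted(o for (s, _, o) in match(triples, predicate="school:firstName") if s in wanted)


print(third_grade_first_names(TRIPLES))  # prints ['Arnold']
```

A real deployment would load the data into a triple store and use SPARQL rather than hand-written set intersections; the sketch is only meant to make the shape of a triple and of a pattern match concrete.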