The document discusses agile approaches to data warehousing and big data technologies. It describes traditional data warehousing as brittle and costly to modify. An agile approach uses reusable ETL modules and a hyper-normalized integration layer to adapt flexibly to changing requirements. Big data technologies such as Hadoop, NoSQL databases, and cloud-based data warehouses are also discussed as flexible and cost-effective options for large and evolving data and analytics needs.
3. Traditional EDW Model
• Brittle to changing requirements
Source: Agile Data Warehousing for the Enterprise
4. Two traditional approaches
Traditional integration layer – model it in 3NF or higher. ETL loads data into the integration layer before transforming it to populate the star schemas of the presentation layer (a minimal star-schema sketch follows below).
A conformed dimensional data warehouse skips the integration layer and loads the company's data directly into star schemas.
Cons:
Both approaches lead to data warehouses that are very difficult to modify once the data is loaded.
Brittle in the face of changing requirements.
Costly redesign and data conversion.
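To make the structure concrete, here is a minimal star-schema sketch of the kind loaded into the presentation layer. Table and column names are hypothetical, and SQLite is used purely to keep the example runnable; it is not the schema discussed in the slides.

```python
# Minimal star-schema sketch: one dimension, one fact, a typical aggregate query.
# All names are hypothetical; SQLite is used only for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    region TEXT
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    sale_date TEXT,
    amount REAL
);
""")

cur.execute("INSERT INTO dim_customer VALUES (1, 'Acme Corp', 'EMEA')")
cur.execute("INSERT INTO fact_sales VALUES (100, 1, '2016-01-15', 250.0)")

# Typical presentation-layer query: join the fact to its dimension and aggregate.
cur.execute("""
SELECT d.region, SUM(f.amount)
FROM fact_sales f
JOIN dim_customer d ON d.customer_key = f.customer_key
GROUP BY d.region
""")
print(cur.fetchall())   # [('EMEA', 250.0)]
conn.close()
```

Once facts and dimensions like these are loaded at scale, changing their grain or keys forces the costly redesign and data conversion the slide describes.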
5. Agile Data Engineering
Do not need to have the entire data design model up front.
Development adapts to changing business requirements.
Do not need to re-engineer the existing schema when new entities and relationships arise.
Simple, reusable ETL modules.
9. HNF model
18 tables. The model has been hyper-normalized.
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
10. HNF
• Parameter-driven data transforms using ETL scripts. One ETL module for all business-key tables, one (the yellow module in the diagram) for the linking tables, and one that takes all other attributes from the source and sends them to the target tables (see the sketch below).
• Easily adapts to changing business requirements even after loading billions of records.
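A sketch of what a single parameter-driven module for all business-key tables could look like. The "hub_" naming, table and column names are my assumption for illustration, and SQLite stands in for the integration layer only to keep the example runnable.

```python
# One reusable, parameter-driven ETL module: the same function loads the
# business-key table for any source entity; only the parameters change.
import sqlite3

def load_business_keys(conn, source_table, business_key_col, target_table):
    """Copy distinct business keys from any source table into its
    hyper-normalized key table, skipping keys already loaded."""
    conn.execute(f"""
        INSERT INTO {target_table} (business_key)
        SELECT DISTINCT s.{business_key_col}
        FROM {source_table} s
        WHERE s.{business_key_col} NOT IN (SELECT business_key FROM {target_table})
    """)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_customer (cust_no TEXT, name TEXT);
CREATE TABLE src_product  (sku TEXT, descr TEXT);
CREATE TABLE hub_customer (customer_id INTEGER PRIMARY KEY, business_key TEXT UNIQUE);
CREATE TABLE hub_product  (product_id  INTEGER PRIMARY KEY, business_key TEXT UNIQUE);
""")
conn.executemany("INSERT INTO src_customer VALUES (?, ?)",
                 [("C001", "Acme"), ("C002", "Globex")])
conn.executemany("INSERT INTO src_product VALUES (?, ?)",
                 [("P-9", "Widget")])

# The same module is reused for every business-key table.
load_business_keys(conn, "src_customer", "cust_no", "hub_customer")
load_business_keys(conn, "src_product",  "sku",     "hub_product")
print(conn.execute("SELECT * FROM hub_customer").fetchall())
```

Because the module is driven by parameters rather than hand-coded per table, adding a new entity means adding a call, not writing new ETL.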
12. HNF
• Caveat: data retrieval gets complex. The SQL needed to pull data back out can involve many outer joins and correlated subqueries (a query sketch follows this list). But does it matter that much? Remember that HNF is used for the integration layer, not so much for the presentation and semantic layers.
• Data is stored into the integration layer from the source systems using only 3 reusable ETL modules.
• Build the DW a slice at a time and adapt to new business requirements.
• http://www.anchormodeling.com/
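The retrieval caveat is easy to see once each attribute lives in its own table: reassembling a flat row needs one outer join per attribute. The tables below are hypothetical and deliberately tiny; SQLite is used only to keep the sketch runnable.

```python
# Why retrieval from a hyper-normalized layer needs outer joins:
# every attribute is in its own table, and some may not be loaded yet.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer       (customer_id INTEGER PRIMARY KEY);
CREATE TABLE customer_name  (customer_id INTEGER, name  TEXT, valid_from TEXT);
CREATE TABLE customer_email (customer_id INTEGER, email TEXT, valid_from TEXT);
INSERT INTO customer      VALUES (1);
INSERT INTO customer_name VALUES (1, 'Acme Corp', '2016-01-01');
-- no email loaded yet, hence the need for an outer join
""")

# Re-assembling a flat "customer" row for the presentation layer.
rows = conn.execute("""
    SELECT c.customer_id, n.name, e.email
    FROM customer c
    LEFT OUTER JOIN customer_name  n ON n.customer_id = c.customer_id
    LEFT OUTER JOIN customer_email e ON e.customer_id = c.customer_id
""").fetchall()
print(rows)   # [(1, 'Acme Corp', None)]
```

With dozens of attributes and history tables the join list grows quickly, which is why this shape is kept in the integration layer and flattened before it reaches users.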
13. Hyper Generalized Form
• Computer-generate the warehouse presentation and semantic layers: a labor-saving approach.
• Logical and physical data modeling is eliminated.
• Can operate at the business level.
• Builds on the notion of special-purpose tables.
• Need to acquire an automated data warehouse tool that can generate the entire data warehouse infrastructure.
• The entire dataset is represented as 6 tables (a purely illustrative sketch follows).
• Generate the EDW and ETL schemas for all layers.
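Purely to give a feel for "the entire dataset represented as 6 tables": the sketch below uses a generic things/links/attributes style. These six table names are my assumption for illustration, not the actual schema of any hyper-generalization tool, and SQLite is used only to keep it runnable.

```python
# Illustrative hyper-generalized layout: all entities, relationships and
# attribute values are rows in a handful of generic tables, so new entity
# types need no schema change. Names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE thing_type      (type_id INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE thing           (thing_id INTEGER PRIMARY KEY, type_id INTEGER);
CREATE TABLE link_type       (link_type_id INTEGER PRIMARY KEY, link_name TEXT);
CREATE TABLE link            (link_id INTEGER PRIMARY KEY, link_type_id INTEGER,
                              from_thing INTEGER, to_thing INTEGER);
CREATE TABLE attribute       (attr_id INTEGER PRIMARY KEY, type_id INTEGER, attr_name TEXT);
CREATE TABLE attribute_value (thing_id INTEGER, attr_id INTEGER, value TEXT);
""")

# "Customer 1 is named Acme and placed order 10", expressed entirely as rows.
conn.executescript("""
INSERT INTO thing_type VALUES (1, 'Customer'), (2, 'Order');
INSERT INTO thing      VALUES (1, 1), (10, 2);
INSERT INTO attribute  VALUES (100, 1, 'name');
INSERT INTO attribute_value VALUES (1, 100, 'Acme');
INSERT INTO link_type  VALUES (5, 'placed');
INSERT INTO link       VALUES (1000, 5, 1, 10);
""")
print(conn.execute("""
    SELECT t.thing_id, av.value
    FROM thing t JOIN attribute_value av ON av.thing_id = t.thing_id
    WHERE t.type_id = 1
""").fetchall())   # [(1, 'Acme')]
```

An automation tool would generate conventional presentation-layer tables from rows like these, which is where the labor saving comes from.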
20. Big Data Technologies
Power an iterative discovery and engineering process.
Read and transform massive amounts of data on cheap commodity hardware using massively parallel processing.
Schema on read: no need to impose structure on every piece of information gathered (see the sketch after this list).
Hadoop with more SQL-like features, or a traditional EDW with big data packages – which is more useful?
Complex event analysis
Real-time analytics of high-volume data streams
Complex event processing
Data mining software
Text analytics
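A short schema-on-read sketch with PySpark: structure is inferred when the data is read, not imposed when it is stored. The file path and field names are hypothetical, and a local Spark installation (pip install pyspark) is assumed.

```python
# Schema on read: no table definition exists up front; Spark infers one
# from the raw JSON at query time. Path and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

events = spark.read.json("hdfs:///raw/clickstream/2016/01/*.json")  # hypothetical path
events.printSchema()                      # structure discovered at read time

# Ad-hoc analysis over whatever fields happen to be present.
events.filter(events.event_type == "purchase") \
      .groupBy("country") \
      .count() \
      .show()

spark.stop()
```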
23. Hadoop
Reasons to use Hadoop:
Building a data warehouse for the future. Gear up your skills for Hadoop and Big Data as the data size grows larger. Major distributions such as Hortonworks, Cloudera and MapR have enterprise data hub editions that can be deployed.
There is a complaint that Hadoop is not suitable for quick interactive querying, but Cloudera's Impala and Hortonworks' Stinger initiative have made interactive querying much faster.
The Hortonworks platform provides indexing and search features using Apache Solr, which can make search and querying faster.
Hortonworks backs Apache Zeppelin, which brings data visualization and collaboration features to Hadoop and Spark.
Provides Apache Sqoop to load data from RDBMSs (a rough Python-side equivalent is sketched after this list).
Pig and MapReduce for ETL.
Hourly, weekly and monthly workflow schedules.
Apache Flume to load web log data.
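Sqoop itself is a command-line tool; as a rough Python-side equivalent of the same "pull a table out of an RDBMS and land it on HDFS" step, a PySpark JDBC read might look like the sketch below. The connection details, credentials, table name and paths are all hypothetical, and the appropriate JDBC driver is assumed to be on the Spark classpath.

```python
# Not Sqoop itself: a PySpark sketch of the same RDBMS-to-HDFS ingestion step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-ingest").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://erp-db:5432/erp")   # hypothetical source
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Land the extract in the data lake in a columnar format for later ETL
# with Pig, MapReduce or Spark.
orders.write.mode("overwrite").parquet("hdfs:///raw/erp/orders")

spark.stop()
```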
25. NoSQL Databases
Advantages:
Schema-less reads
Auto-sharding
Cloud computing (AWS)
Replication
No separate application or expensive add-ons
Integrated caching
In-memory caching for high throughput and low latency
Open source – e.g. Cassandra, HBase
Document stores
Graph stores – Neo4j and Giraph
Key-value stores – Riak, Berkeley DB, Redis. Complex information is stored as BLOBs in the value column (see the sketch after this list).
Wide-column stores – Cassandra and HBase
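A key-value sketch of the "complex information as a BLOB in the value" point, using Redis via the redis-py client (pip install redis). A locally running Redis server is assumed, and the key name and document are hypothetical.

```python
# Key-value store usage: arbitrary structures serialized into the value.
import pickle
import redis

r = redis.Redis(host="localhost", port=6379)

# Any Python structure can be pickled into the value as a BLOB.
customer = {"id": "C001", "name": "Acme", "orders": [10, 11, 17]}
r.set("customer:C001", pickle.dumps(customer))

# Reads are by key only; there is no ad-hoc querying over value contents.
restored = pickle.loads(r.get("customer:C001"))
print(restored["orders"])   # [10, 11, 17]
```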
26. Why Implement NoSQL?
Big data keeps getting bigger, and new sources of data keep emerging.
More users are going online.
Open source – can be downloaded, implemented and scaled at little cost.
A viable alternative to expensive proprietary software.
Increases the speed and agility of development.
When requirements change, the data model can change with them.
27. 5 considerations to evaluate NoSQL
Data model
Document model – MongoDB, CouchDB
Natural mapping of the document model to object-oriented programming
Query on any field (see the sketch after this list)
Graph databases – traversing relationships is the key operation
Social networks and supply chains
Columnar and wide-column databases
Query model
Consistency model
APIs
Commercial support and community strength
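A document-model sketch of "query on any field", using MongoDB via pymongo (pip install pymongo). A locally running MongoDB is assumed; database, collection and field names are hypothetical.

```python
# Document model: the stored document mirrors the application object and
# can be queried on any field, including nested ones.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["demo_dw"]["orders"]

# No table schema is declared; the document is inserted as-is.
orders.insert_one({
    "order_id": 10,
    "customer": {"id": "C001", "name": "Acme"},
    "lines": [{"sku": "P-9", "qty": 3}],
})

# Query on any field, including nested ones; add an index if it gets hot.
print(orders.find_one({"customer.name": "Acme"}))
orders.create_index("customer.id")

client.close()
```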
28. DWaaS
Amazon Redshift
Cost-effective: about $1,000 per terabyte per year
Columnar storage – fast access and parallelized queries
MPP (massively parallel processing) DW architecture
Cheap, simple, secure and compatible with a SQL interface (a connection sketch follows this list)
Automates provisioning, configuring and monitoring of a cloud data warehouse
Integrations with Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce and Amazon Kinesis
Security is built in
Managed through the AWS Management Console
Network isolation using a virtual private cloud (VPC)
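Because Redshift exposes a PostgreSQL-compatible SQL interface, an ordinary PostgreSQL client such as psycopg2 can query it. The cluster endpoint, credentials and table name below are hypothetical.

```python
# Querying Redshift through its PostgreSQL-compatible SQL interface.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="dw",
    user="analyst",
    password="***",
)
cur = conn.cursor()

# Ordinary SQL, executed on the MPP, columnar storage engine.
# (Bulk loads are typically done with Redshift's COPY command from Amazon S3.)
cur.execute("SELECT region, SUM(amount) FROM fact_sales GROUP BY region")
print(cur.fetchall())

conn.close()
```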
29. Presentation/Visualization
Tableau
o Easy-to-use drag-and-drop interface
o No code required
o Connects to Hadoop, cloud and SQL databases
o Offers free training
o Trend analysis, regression and correlation analysis
o In-memory data analysis
o Data blending
o Clutter-free GUI
QlikView
o Faster in-memory computation
30. Analytics and forecasting
• R, Python, Apache Spark – for predictive modeling and forecasting (a scikit-learn sketch follows this list)
• Connect the data warehouse with R, Python and Spark.
• R libraries – rpart, randomForest, ROCR, mboost
• Python – scikit-learn, NumPy, pandas, SciPy
• Spark – spark.ml and MLlib
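A minimal predictive-modeling sketch with scikit-learn of the kind listed above. In practice the feature matrix would come from a warehouse query (for example via pandas); here the data is synthetic so the example runs on its own, and the "churn"-style label is purely illustrative.

```python
# Train and evaluate a random forest on a synthetic stand-in for
# warehouse-derived features and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                   # stand-in for warehouse features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # stand-in binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```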
Editor's notes
Interesting questions:
Which flavor of HNF is best for each use case?
What does a physical HNF model look like?
What are the best platforms on which to model an HNF schema for performance?
How do we fold in data governance?
Where should columns that hold applied business rules (derived columns) be placed?
How do we merge an HNF warehouse with a 3NF EDW?
Can an HNF warehouse support self-service BI?
How do HNF's advantages compare to hyper-generalization?
Ceregenics
Points of comparison between HNF and HGF:
What do the physical models look like?
How do you calculate and store derived values?
Performance and platform considerations.
Merging a new model style into existing EDWs.
The source data is converted into an integration layer of 6 tables that contain all the information. This can be conveniently projected into data marts and presentation layers.
Convert a drawing into thing types, link types, …
The latest productivity tools for data analytics, such as data virtualization, data warehouse automation and big data management systems, offer the team a new type of application development cycle that dramatically reduces the labor required to design, build and deploy each incremental version of the EDW.
Data virtualization makes data from multiple databases accessible through a single virtualization layer.