The document discusses the modern data warehouse and the key trends driving the shift away from traditional data warehouses. It describes how a modern data warehouse incorporates Hadoop, traditional data warehouses, and other data stores across multiple locations, with data from cloud, mobile devices, sensors, and the IoT. Modern data warehouses use a massively parallel processing (MPP) architecture for distributed computing and scale-out. The Hadoop ecosystem, including components such as HDFS, YARN, Hive, Spark, and ZooKeeper, provides functionality for storage, processing, and analytics. Major vendors like Oracle provide technical innovations on Hadoop for finding, exploring, transforming, discovering, and sharing data. The document concludes with an overview of descriptive, predictive, and prescriptive analytics in a big data value assessment.
2. AGENDA
History and Milestones
Traditional Data Warehouse
Key trends breaking the traditional data warehouse
Modern Data Warehouse
Massively parallel processing (MPP) architecture
Hadoop Ecosystem
Technical Innovation on Hadoop
Big Data Value Assessment
Rolta AdvizeX Confidential & Proprietary 9/11/2016
3. History and Milestones
1970s: Relational model invented
1984: DB2 released, RDBMS declared mainstream
1990: RDBMS takes over
4. The Traditional Data Warehouse
Central repository for all internal data in a company.
Overall relational schema.
Predictable data structure and quality optimize processing and reporting.
Data is stored in disk-block format.
The fundamental operation is reading a row.
Indexing via B-trees.
Dynamic row-level locking.
Data transfer usually happens at end of day (EOD).
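The row-oriented access pattern listed above can be sketched with Python's built-in sqlite3 module, which stores tables and indexes as B-trees. This is only an illustration under assumptions: the `sales` table, its columns, and the sample rows are hypothetical, and SQLite stands in for whatever warehouse engine is actually used.

```python
import sqlite3


def build_demo_warehouse():
    """Create an in-memory table with a fixed relational schema (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sales (id, region, amount) VALUES (?, ?, ?)",
        [(1, "east", 120.0), (2, "west", 80.5), (3, "east", 42.0)],
    )
    # SQLite implements this secondary index as a B-tree.
    conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
    conn.commit()
    return conn


def read_row(conn, sale_id):
    """The fundamental operation: fetch one complete row by its key."""
    return conn.execute(
        "SELECT id, region, amount FROM sales WHERE id = ?", (sale_id,)
    ).fetchone()


def total_for_region(conn, region):
    """A reporting query that can use the region index."""
    row = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]
```

Calling `read_row(build_demo_warehouse(), 2)` returns the whole row as one tuple, which is the unit a row store is built around.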
6. Key Related Business and IT Trends
Emerging Technologies are disruptive by nature and play a
key role in driving digital business and the related business
trends.
Business Ecosystems enable each of the business trends,
and organizations are aggressively searching for ways to
leverage the role they play in the business ecosystem.
Business Moments provide opportunities to capture value
by setting in motion a series of events and actions involving a
network of people, businesses and things that spans or
crosses multiple industries and business ecosystems.
Digital Economics seeks to harvest value from across the
business ecosystem by identifying business moments of
opportunity and exploiting the economics of connections.
This early-stage trend will have increasing importance as
business models evolve to leverage algorithmic business.
Algorithmic Business propels organizations to leverage
business algorithms to drive value in the business
ecosystem. In this early-stage trend, we are starting to see
organizations transforming data with algorithms to drive
intelligent actions, particularly with the IoT.
9. Modern Data Warehouse
Incorporates Hadoop, traditional data warehouses, and other data stores.
Includes multiple repositories that may reside in different locations.
Includes data from cloud, mobile devices, sensors, and the Internet of Things.
Includes structured, semi-structured, unstructured, and raw data.
Runs on inexpensive commodity hardware in cluster mode.
10. Massively parallel processing (MPP) architecture
Massively parallel processing (MPP) architecture enables powerful distributed computing and scalability.
Resources can be added for near-linear scale-out to the largest data warehousing projects.
MPP uses a “shared-nothing” architecture: there are multiple physical nodes, each running its own instance with its own processors, memory, and storage.
This can yield performance many times faster than traditional architectures.
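The shared-nothing idea can be sketched in a few lines: rows are hash-partitioned so that each node owns a disjoint slice, every node aggregates only its own data, and a coordinator merges the partial results. This is a minimal single-process sketch, not how any particular MPP product works; the function names are hypothetical, the "nodes" are plain Python lists, and a real system would run the per-node step in parallel on separate machines.

```python
from collections import defaultdict
from zlib import crc32


def node_for_key(key, num_nodes):
    """Pick the node that owns a key, using a stable hash (crc32)."""
    return crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes):
    """Hash-partition (key, value) rows so each node holds a disjoint slice."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[node_for_key(key, num_nodes)].append((key, value))
    return nodes


def local_sum(node_rows):
    """Work done independently on one node: aggregate only its own rows."""
    totals = defaultdict(float)
    for key, value in node_rows:
        totals[key] += value
    return dict(totals)


def mpp_sum(rows, num_nodes):
    """Coordinator step: merge the per-node partial results."""
    merged = {}
    for node_rows in distribute(rows, num_nodes):
        # All rows for a key land on one node, so keys never overlap here.
        merged.update(local_sum(node_rows))
    return merged
```

Because each key lives on exactly one node, the result is the same for any node count; adding nodes only changes how much data each one has to scan.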
11. Apache Hadoop Ecosystem
The Hadoop ecosystem components are Apache Software Foundation projects.
The components are categorized into file system and data store, serialization, job execution, and others, as shown in the image.
12. Hadoop / BDD Ecosystem
Technology | Purpose
Hadoop Distributed File System (HDFS) | Distributed file system that provides high-throughput access to application data. Data is split into blocks and distributed across multiple nodes in the cluster.
Hadoop YARN | Framework for job scheduling/monitoring and cluster resource management.
Hive | Facilitates ad hoc queries over data stored in HDFS. Uses HiveQL, a SQL-like language. Provides a relational view of data stored in HDFS.
HCatalog | Built on the Hive Metastore; provides a table and storage management layer for Hadoop.
Spark | Powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
Pig | High-level platform for creating MapReduce programs. BDD uses Pig to manipulate data prior to ingesting via data processing.
13. Hadoop / BDD Ecosystem
Technology | Purpose
Oozie | Workflow scheduler system for managing Apache Hadoop jobs. BDD uses Oozie for workflow management (sampling, profiling, enrichment).
Sqoop | Tool for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
Flume | Tool for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.
ZooKeeper | Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Hue | Set of web applications for interacting with a CDH cluster.
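The HDFS entry in the table says that data is split into blocks and distributed across the nodes of the cluster. The arithmetic behind that can be sketched as follows, assuming the default 128 MB block size and replication factor of 3 used by Hadoop 2.x and later. The function names and node names are hypothetical, and the round-robin placement is a deliberate simplification: real HDFS placement is rack-aware and decided by the NameNode.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed default HDFS block size (128 MB)
REPLICATION = 3                 # assumed default HDFS replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: real HDFS placement is rack-aware.
    """
    copies = min(replication, len(nodes))
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(copies)]
        for block in range(num_blocks)
    }
```

Under these assumptions a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each block is stored on three different nodes, so losing a single node does not lose data.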
15. Oracle BDD Technical Innovation on Hadoop
Key Features and Functionality:
Find
• Access a rich, interactive catalog of all data in Hadoop
• Use familiar search and guided navigation to find information quickly
• See data set summaries, user annotations and recommendations
• Provision personal and enterprise data to Hadoop via self-service
Explore
• Visualize all attributes by type
• Sort attributes by information potential
• Assess attribute statistics, data quality and outliers
• Use a scratch pad to uncover correlations between attributes
Transform
• Get the data ready for analytics via intuitive, user-driven data wrangling
• Leverage an extensive library of data transformations and enrichments
• Preview results, undo, commit and replay transforms
• Test on sample data in memory then apply to full data set in Hadoop
Discover
• Join and blend data for deeper perspectives
• Compose project pages via drag and drop
• Use powerful search and guided navigation to ask questions
• See new patterns in rich, interactive data visualizations
Share
• Share projects, bookmarks and snapshots with others
• Build galleries and tell Big Data stories
• Collaborate and iterate as a team
• Publish blended data to HDFS for leverage in other tools
17. Big Data Value Assessment
Descriptive analytics examines past performance by mining historical data for the reasons behind past success or failure. This is traditional BI work.
Predictive analytics answers the question of what will happen. Historical performance data is combined with rules, algorithms, and external data to determine the probable future outcome of an event or the likelihood of a situation occurring.
Prescriptive analytics not only anticipates what will happen and when, but also why it will happen, and it recommends actions to take in response.
[Figure: analytics levels from Basic Analytics to Advanced Analytics: Descriptive, Predictive, Prescriptive]
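The three levels of analytics can be contrasted on a single series of historical values. The sketch below is illustrative only: the function names are hypothetical, the predictive step is a plain least-squares trend line rather than any specific vendor algorithm, and the prescriptive rule (compare the forecast with a capacity threshold) is an assumed stand-in for real decision logic. It needs at least two historical points.

```python
from statistics import mean


def describe(history):
    """Descriptive: summarize what already happened."""
    return {"mean": mean(history), "min": min(history), "max": max(history)}


def fit_trend(history):
    """Least-squares line through (0, y0), (1, y1), ...; returns (slope, intercept)."""
    n = len(history)
    x_bar = (n - 1) / 2
    y_bar = mean(history)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(history))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_next(history):
    """Predictive: extrapolate the fitted trend one period ahead."""
    slope, intercept = fit_trend(history)
    return slope * len(history) + intercept


def prescribe(history, capacity):
    """Prescriptive: turn the forecast into a recommended action (hypothetical rule)."""
    return "add capacity" if predict_next(history) > capacity else "hold"
```

For example, `describe` reports on the past, `predict_next` estimates the next period, and `prescribe` maps that estimate to an action, which mirrors the step from basic to advanced analytics.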
18. Thank You!!!
Stephen Alex
BI & Big Data Architect
(732) 485-0011(m)