8. RDBMS
• Schema: required on write
• Speed: reads are fast
• Governance: standard and structured
• Processing: limited or no data processing
• Data Types: structured
9. Hadoop
• Schema: required on read
• Speed: writes are fast
• Governance: loosely structured
• Processing: processing coupled with data
• Data Types: multi-structured and unstructured
10. RDBMS vs. Hadoop

| Dimension | RDBMS | Hadoop |
| --- | --- | --- |
| Schema | Required on write | Required on read |
| Speed | Reads are fast | Writes are fast |
| Governance | Standard and structured | Loosely structured |
| Processing | Limited or no data processing | Processing coupled with data |
| Data Types | Structured | Multi-structured and unstructured |
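The schema row is the crux of this comparison. A minimal Python sketch of the difference (the table name and fields here are hypothetical, purely for illustration): an RDBMS demands a matching schema before a single row can be written, while a Hadoop-style store keeps raw records and projects a schema onto them only when they are read.

```python
import sqlite3

# Schema on write (RDBMS): the table structure must exist, and be matched,
# before any row can be stored.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (driver TEXT, speed REAL)")   # schema declared first
db.execute("INSERT INTO trips VALUES (?, ?)", ("U1", 62.5))  # write must conform

# Schema on read (Hadoop-style): raw records are written as-is ...
raw_lines = ["U1|62.5|2014-01-01", "U2|55.0|2014-01-01"]     # stand-in for an HDFS file

# ... and a schema is imposed only at query time.
def read_with_schema(line):
    driver, speed, day = line.split("|")                     # schema applied on read
    return {"driver": driver, "speed": float(speed), "day": day}

records = [read_with_schema(l) for l in raw_lines]
print(records[0]["speed"])  # 62.5
```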
11. Differences between IT Systems and Hadoop

| Attribute | IT Systems | Hadoop |
| --- | --- | --- |
| Data Size | Gigabytes | Petabytes/Zettabytes |
| Access | Batch & Interactive | Batch |
| CRUD | Read & Write Many Times | Write Once, Read Many Times |
| Structure | Static | Dynamic |
| Integrity | Normalization | De-Normalization |
| Scalability | Non-Linear | Linear |
12. A Scenario to Understand Big Data
• A Trucking Company collects… Using…???
13. A Scenario to Understand Big Data…
• GPS
• Speed
• Acceleration
• Stopping
• Normal
• Too Quick
• Driving Too Close to Other Vehicles
15. Hadoop EcoSystem Utilization
• Flume to get raw sensor data
• Sqoop to transport data to HDFS about
  • Driver
  • Vehicle
• HCatalog to hold all schema definitions
• Hive to analyze Gas Mileage
• Pig to compute a Risk Factor for each Truck Driver based on his/her related events (a sketch follows this list)
• Spark to create Data Sets by applying Machine Learning
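As a rough illustration of the Pig step above, here is the same risk-factor computation in plain Python (the event types, weights, and formula are assumptions for illustration, not taken from the deck):

```python
from collections import Counter

# Hypothetical sensor events: (driver, event_type)
events = [
    ("D1", "normal"), ("D1", "too_quick_stop"), ("D1", "too_close"),
    ("D2", "normal"), ("D2", "normal"),
]

# Hypothetical weights per abnormal event type.
RISK_WEIGHTS = {"too_quick_stop": 3.0, "too_close": 2.0}

def risk_factor(driver_events):
    """Sum of weighted abnormal events; 'normal' events contribute nothing."""
    counts = Counter(evt for _, evt in driver_events)
    return sum(RISK_WEIGHTS.get(evt, 0.0) * n for evt, n in counts.items())

for d in sorted({d for d, _ in events}):
    print(d, risk_factor([e for e in events if e[0] == d]))
# D1 5.0, D2 0.0
```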
17. Data Acquisition
• Input:
  • Multiple user event feeds (browsing activities, search, etc.) per time period

| User | Time | Event | Source |
| --- | --- | --- | --- |
| U1 | T0 | Visited bank site | Server logs |
| U1 | T1 | Searched for “Credit Cards” | Search logs |
| U1 | T2 | Browsed banking services | Web server logs |
| U1 | T3 | Saw a link sent by e-mail | Link advertising logs |
| U1 | T4 | Used OLTP | Web server logs |
| U1 | T5 | Clicked on an ad for “some insurance” | Ad logs, click server logs |
18. Data Acquisition for the Landing Zone
[Diagram: incoming event feeds of user events pass through map operations – project relevant event attributes, filter irrelevant events, tag and transform (categorization, topic, …) – and the resulting Normalized Events (NE) are written to HDFS.]
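A minimal Hadoop Streaming mapper in Python for this normalization step (a sketch: the input layout, the relevant-event whitelist, and the tagging rule are all assumptions for illustration):

```python
#!/usr/bin/env python
# Streaming mapper: raw event feed -> normalized events (NE), keyed by user.
import sys

RELEVANT = {"search", "page_view", "ad_click"}        # assumed whitelist; drop the rest

def tag(event_type, payload):
    """Assumed tagging rule: categorize searches mentioning credit cards."""
    if event_type == "search" and "credit card" in payload.lower():
        return "Category: Credit Card"
    return "Category: Other"

for line in sys.stdin:
    try:
        user, ts, event_type, payload, source = line.rstrip("\n").split("\t")
    except ValueError:
        continue                                      # malformed record: filter out
    if event_type not in RELEVANT:
        continue                                      # irrelevant event: filter out
    # Project only the relevant attributes and emit a tagged, normalized event.
    print("\t".join([user, ts, event_type, tag(event_type, payload)]))
```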
19. Data Acquisition for the Landing Zone
• Output:
  • Single normalized feed containing all events for all users per time period

| User | Time | Event | Tag |
| --- | --- | --- | --- |
| U1 | T0 | Content browsing | Web clicks by a Bank’s user |
| U2 | T2 | Search query | Category: Credit Card |
| … | … | … | … |
| U23 | T23 | OLTP usage | Drop event |
| U36 | T36 | Bank’s site page click | Category: Some product |
20. Feature and Target Generation for the Discovery Zone
• Features:
  • Summaries of user activities over a time window
  • Aggregates, moving averages, rates, etc. over moving time windows (see the sketch after this list)
  • Support online updates to existing features
• Targets:
  • Constructed in the offline model-training phase
  • Typically user actions in a future time period indicating interest:
    • Clicks/click-throughs on financial product offerings and content
    • Site and page visits
    • Conversion events
      • Deposits, withdrawals, quote requests, etc.
      • Sign-ups to newsletters, registrations, etc.
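A small Python sketch of one such feature (hypothetical: a click count over a moving one-hour window; the class name and window length are illustrative, not from the deck). It supports the online updates called for above: each new event is folded in and expired events fall out of the window.

```python
from collections import deque

# Hypothetical per-user feature state: a moving window of (timestamp, value).
class WindowedFeature:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()          # (ts, value), oldest first

    def update(self, ts, value):
        """Online update: add one event, evict everything outside the window."""
        self.events.append((ts, value))
        while self.events and self.events[0][0] < ts - self.window:
            self.events.popleft()

    def aggregate(self):
        return sum(v for _, v in self.events)

    def moving_average(self):
        return self.aggregate() / len(self.events) if self.events else 0.0

    def rate(self):
        """Events per second over the window."""
        return len(self.events) / self.window

# Usage: one click-count feature over a 1-hour window.
f = WindowedFeature(window_seconds=3600)
for ts, clicks in [(0, 1), (1200, 2), (4000, 1)]:
    f.update(ts, clicks)               # the ts=0 event falls out of the window
print(f.aggregate(), f.moving_average(), f.rate())   # 3, 1.5, ~0.00056
```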
21. Feature Generation for the Discovery Zone
[Diagram: normalized events NE 1–NE 9 stored in HDFS are read by map tasks (Map 1, Map 2, Map 3), each emitting <user, event> pairs; the shuffle gathers all events for U1 at Reduce 1 and all events for U2 at Reduce 2, where they are aggregated into the Feature Set: summaries over user event history, aggregates within a window, time- and event-weighted averages, event rates, …]
22. Modeling Workflow within the Discovery Zone
[Diagram, Training Phase: Data Acquisition → user event history → Feature Generation and Target Generation → features and targets → Model Training → weights.]
[Diagram, Evaluation Phase: the same Data Acquisition, Feature Generation, and Target Generation pipeline feeds Model Scoring, which applies the trained weights to produce evaluation scores.]
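To make the two phases concrete, here is a compact sketch in Python/NumPy (the data is synthetic and logistic regression is just one plausible model choice; the deck does not prescribe one): the training phase learns weights from features and targets, and the evaluation phase applies those weights to freshly generated features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for Discovery Zone output:
# rows = users, columns = generated features; targets = future interest (0/1).
X_train = rng.normal(size=(1000, 5))
y_train = (X_train @ np.array([0.8, -0.4, 0.0, 1.2, 0.3]) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training phase: learn the weights with plain gradient descent.
w = np.zeros(X_train.shape[1])
for _ in range(500):
    grad = X_train.T @ (sigmoid(X_train @ w) - y_train) / len(y_train)
    w -= 0.5 * grad

# Evaluation/scoring phase: the same feature pipeline produces X_eval,
# and model scoring is just an application of the learned weights.
X_eval = rng.normal(size=(200, 5))
scores = sigmoid(X_eval @ w)           # scores handed to the serving systems
print(scores[:3])
```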
23. Batch Scoring for Discovery Results
[Diagram: Data Acquisition → user event history → Feature Generation → features → Model Scoring (using the trained weights) → scores delivered to the online serving systems.]
24. Discovery Zone Pipeline System Estimation

| Component | Data Processed | Time Estimate |
| --- | --- | --- |
| Data Acquisition | ~1 TB per time period | 2–3 hours |
| Feature and Target Generation | ~1 TB × size of feature window | 4–6 hours |
| Model Training | ~50–100 GB | 1–2 hours for 100s of models |
| Scoring | ~500 GB | 1 hour |
25. Requirements Extraction Process
• A two-step process is used for requirements extraction:
1) Extract specific requirements and map them to the reference architecture based on each application’s characteristics, such as:
  a) data sources (data size, file formats, rate of growth, at rest or in motion, etc.)
  b) data lifecycle management (curation, conversion, quality check, pre-analytic processing, etc.)
  c) data transformation (data fusion/mashup, analytics)
  d) capability infrastructure (software tools, platform tools, hardware resources such as storage and networking)
  e) data usage (processed results in text, table, visual, and other formats)
  f) all architecture components informed by the goals and the use-case description
  g) Security & Privacy has a direct mapping
2) Aggregate all specific requirements into high-level generalized requirements that are vendor-neutral and technology-agnostic.
26. Big Data Edge over the Data Warehouse
[Diagram: source data flows from the cloud into the warehouse environment and fans out to Data Marts and multiple Analytical Databases, passing through extensive processes and costs:]
• Business Intelligence
• Data Analyses
• Data Cleansing
• Entity-Relationship Modeling
• Dimensional Modeling
• Database Design & Implementation
• Database Population through ETL/ELT
• Downstream Applications Linkage – Metadata
• Maintaining the Processes
27. Metadata Management – HCatalog
[Diagram: the BI reference architecture of slide 28 with the Hadoop pieces overlaid: Sqoop handles transport/messaging from the sources into a single-source HDFS Data Lake, MapReduce/Pig performs Load/Apply into the Analytical Data Marts, and HCatalog provides the metadata layer. HCatalog & Pig can work with most ETL tools on the market.]
28. Reference Architecture
[Diagram: the classical BI reference architecture.]
• Access: Web Browser, Portals, Devices (e.g., mobile), Web Services, Collaboration
• Business Applications: Query & Reporting, Data Mining, Modeling, Scorecard, Visualization, Embedded Analytics
• Data Repositories: Operational Data Stores, Data Warehouse, Data Marts, Staging Areas, Metadata
• Data Integration: Extraction, Transformation, Load/Apply, Synchronization, Transport/Messaging, Information Integrity
• Data Flow and Workflow
• Data Sources: Enterprise, Unstructured, Informational, External (Supplier, Orders, Product, Promotions, Customer, Location, Invoice, ePOS, Other)
• Cross-cutting layers: Metadata Management; Security and Data Privacy; System Management and Administration; Network Connectivity, Protocols & Access Middleware; Hardware & Software Platforms
29. Reference Architecture – Current BI vs. Proposed BI
[Diagram: the reference architecture from slide 28, annotated capability by capability with the Hadoop replacements described below.]

Metadata / Transport & Messaging
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data-processing systems to understand the structure and location of the data stored within Apache Hadoop.

Extraction
Extraction is an application used to transfer data, usually from relational databases to a flat file, which can then be transported to the landing area of a Data Warehouse and ingested into the BI/DW environment.
Sqoop – A command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Exports can be used to put data from Hadoop into a relational database.
[Diagram: Current BI extracts from the source database to flat files shipped over sftp to the target; Proposed BI moves data from source to target with Sqoop.]

Transformation
MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.
Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data-analysis programs, paired with the MapReduce framework for executing those programs.
[Diagram: Current BI runs complex ETL across Landing → Staging → DW → DM; Proposed BI lands data in HDFS and feeds the DMs through MapReduce/Pig.]

Load / Apply
[Diagram: Current BI loads Staging → DW → DM through complex ETL; Proposed BI replaces the load chain with MapReduce/Pig from HDFS.]

Synchronization
Synchronization – The ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. In this scenario, in order to retain information integrity, one has to put a synchronization checks-and-corrections mechanism in place.
HDFS as a Single Source – In the proposed solution, HDFS acts as a single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data are reconciled with the assistance of HCatalog and proper data governance.
[Diagram: Current BI must synchronize Source → Staging/Landing → DW → DMs; Proposed BI serves the DMs from the single HDFS source.]

Information Integrity
Current – There is currently no special approach to data quality other than what is embedded in the ETL processes and logic. There are tools and approaches to implement QA & QC.
Hadoop – A more focused approach: while HDFS is used as one big “Data Lake”, QA and QC are applied at the Data Mart level, where the actual transformations occur, reducing the overall effort. QA & QC become an integral part of Data Governance, augmented by the use of HCatalog.
30. HCatalog Metadata Management
[Diagram: in the reference architecture, the Data Repositories layer (Operational Data Stores, Data Warehouse, Data Marts, Staging Areas, Metadata) is replaced by HDFS, with HCatalog as the metadata repository.]
HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data-processing systems to understand the structure and location of the data stored within Apache Hadoop.
Hadoop Distributed File System (HDFS) – A reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
31. HCatalog Metadata Management
[Diagram: the reference architecture with HDFS as the single-source Data Lake feeding the Analytical Data Marts, HCatalog for metadata, Sqoop for data integration, and MapReduce/Pig for Load/Apply. HCatalog & Pig can also work with Informatica.]
32. Reference Architecture – Capability Comparison

| Capability | Current BI | Proposed BI | Expected Change |
| --- | --- | --- | --- |
| Data Sources | Source applications | Source applications | No |
| Data Integration: extraction from source DB | Export | Sqoop | One-to-one change |
| Transport/Messaging | SFTP | SFTP | No |
| Staging Area: transformations/load | Complex ETL code | None required | Eliminated |
| Extract from staging | Complex ETL code | None required | Eliminated |
| Transformation for DW | Complex ETL code | None required | Eliminated |
| Load to DW | Complex ETL, RDBMS | None required | Eliminated |
| Extract from DW, transformation and load to DM | Complex ETL code & process to feed DM | MapReduce/Pig | Simplified transformations from HDFS to DM |
| Data Quality, Balance & Controls | Embedded ETL code | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | Yes |
34. Map Operation
• MAP: input data → <key, value> pairs
[Diagram: the data collection is split (split 1 … split n) to supply multiple processors; each Map task scans its split and emits one <KEY, VALUE> pair per word, e.g. web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …]
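A word-count mapper matching this slide, written in Python for Hadoop Streaming (a sketch: Streaming lets any executable that reads stdin and writes stdout act as the map task):

```python
#!/usr/bin/env python
# mapper.py -- word-count map task for Hadoop Streaming.
# Each mapper receives one split of the data collection on stdin and
# emits one <KEY, VALUE> pair per word: "web\t1", "weed\t1", ...
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

Hadoop runs one copy of this script per split in parallel, then sorts all emitted pairs by key before handing them to the reduce stage on the next slide.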
35. Reduce Operation
• MAP: input data → <key, value> pairs
• REDUCE: <key, value> pairs → <result>
[Diagram: as on slide 34, the data collection is split (split 1 … split n) across multiple Map tasks; their <key, value> outputs are then routed by key to Reduce tasks, which combine them into results.]
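The matching Streaming reducer (also a sketch; production jobs would usually add a combiner): because the framework delivers mapper output sorted by key, all counts for a given word arrive consecutively, so one pass with a running total produces the <key, result> pairs.

```python
#!/usr/bin/env python
# reducer.py -- word-count reduce task for Hadoop Streaming.
# Input arrives sorted by key, so all "<word>\t1" lines for one word
# are adjacent; sum them and emit "<word>\t<count>".
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")   # flush the previous key
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")           # flush the final key
```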
38. Web References
• Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, December 2004. http://labs.google.com/papers/mapreduce.html
• Michael Rys, “Scalable SQL”, ACM Queue, April 19, 2011. http://queue.acm.org/detail.cfm?id=1971597
• Denise Miura, “A Practical Guide to NoSQL”, March 17, 2011. http://blogs.marklogic.com/2011/03/17/a-practical-guide-to-nosql/