1. Big Data Architectural Series:
Creating a Next-Generation Big Data Architecture
facebook.com/perficient twitter.com/Perficient linkedin.com/company/perficient
2. About Perficient
Perficient is a leading information technology consulting firm serving clients throughout
North America.
We help clients implement business-driven technology solutions that integrate business
processes, improve worker productivity, increase customer loyalty and create a more agile
enterprise to better respond to new business opportunities.
3. Perficient Profile
• Founded in 1997
• Public, NASDAQ: PRFT
• 2013 revenue $373 million
• Major market locations: Allentown, Atlanta, Boston, Charlotte, Chicago, Cincinnati,
Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Lafayette,
Minneapolis, New York City, Northern California, Oxford (UK), Philadelphia,
Southern California, St. Louis, Toronto, Washington, D.C.
• Global delivery centers in China and India
• >2,200 colleagues
• Dedicated solution practices
• ~90% repeat business rate
• Alliance partnerships with major technology vendors
• Multiple vendor/industry technology and growth awards
4. Our Solutions Expertise
BUSINESS SOLUTIONS
• Business Intelligence
• Business Process Management
• Customer Experience and CRM
• Enterprise Performance Management
• Enterprise Resource Planning
• Experience Design (XD)
• Management Consulting
TECHNOLOGY SOLUTIONS
• Business Integration/SOA
• Cloud Services
• Commerce
• Content Management
• Custom Application Development
• Education
• Information Management
• Mobile Platforms
• Platform Integration
• Portal & Social
5. Our Speaker
Bill Busch
Sr. Solutions Architect, Enterprise Information Solutions, Perficient
• Leads Perficient's enterprise data practice
• Specializes in business-enabling BI solutions that enable the agile
enterprise
• Responsible for executive data strategy, roadmap development, and
the delivery of high-impact solutions that enable organizations to
leverage enterprise data
• Bill has over 15 years of experience in executive leadership, business
intelligence, data warehousing, data governance, master data
management, information/data architecture and analytics
6. Perficient’s Big Data Architectural Series
Business Case
Next Generation Architecture (Today’s Webinar)
Future Topics
• Data Integration
• Stream Processing
• NoSQL
• SQL on Hadoop
• Data Quality
• Governance
• Use Cases & Case Studies
9. Three Views of Big Data
• “Big Data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.”
• Convergence of structured, unstructured, and dark data
• Big Data is the evolution of data, creating data management issues similar to those
that IT has struggled to address for the last 20+ years.
11. Common Big Data Business Use Cases
• Improve Strategic Decision Making
• Customer Experience Analysis
• Operational Optimization
• Risk and Fraud Reduction
• Data Monetization
• Security Event Detection and Analysis
• IT Cost Management
12. Expanding Data Ecosystem
• Customer Intelligence
• Operations
• Risk & Fraud
• Data Monetization
• Strategic Development
• Security Intelligence
• IT Optimization
Data Ecosystem
Structured Data (5-20% of Total)
Point-of-Sale
Text Messages
Contracts & Regulatory
Preferences & Emotions
Security Access
Weather
Machine Data
Automobile
Mobile
Communications
Geospatial
Social
14. The Promise
Data Architecture Simplification
Data Integration
Data Hub
Analytics
Stream Processing
Data Warehouse
Operational Data
Hadoop Cluster
15. The Reality
Maturity Limits the Use Cases
• Realize the potential of Hadoop
• Multi-tenancy is in its infancy
• Hadoop 2.0 and YARN
• Most third-party applications are just
moving to YARN
• Hive (and other SQL on Hadoop
solutions) maturing
• Robust enterprise functionality is
evolving
• Security
• High Availability
16. Different Types of “Open Source Hadoop”
• Apache Projects Only
• Proprietary Value Add & Re-Development
• Apache Projects + Proprietary Add-ons
• Packaged and Online Solutions: IBM BigInsights, Oracle Big Data Appliance, HDInsight,
and many others
Choosing a Hadoop Distribution
• Company Philosophy
• Current Relationships
• Acceptable Risk
• Specialized Functionality
17. Quick Primer on YARN
What is YARN?
• Yet Another Resource Negotiator
• Sometimes referred to as MapReduce 2.0
• Data operating system
• Fault tolerance
Why is this important?
• Enables multi-tenancy on Hadoop
• Moves processing to the data
*Image provided by Hortonworks
22. Analytical Processing
Analytical Process: 1. Source, 2. Wrangle Data, 3. Model & Tune, 4. Operationalize
Architectural Capabilities
• Data Ingestion
• Metadata Management
• Data Access
• Data Preparation Tools
• Data Discovery & Visualization
• Data Wrangling Tools
• Business Glossary & Search
• Data Access
• Data Discovery & Visualization
• Analytical Tools
• Analytical Sandbox
• Business Created Reporting
• Model Execution & Management
• Knowledge Management (Portal)
24. Data Access
• There are many methods of accessing Big Data
• Direct HDFS
• NoSQL / Connector
• Hive / SQL on Hadoop
• Align tool to access methods and file types
• Data Preparation
• Analytics
Diagram labels (Hadoop Cluster): Source Files/Data, Data Preparation Tool, Tidy Data,
Analytics Tool, Analytical Result. Key: Read Access, Write Access.
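A minimal sketch of "align tool to access methods and file types", assuming a simple in-house catalog of tools. The tool names, dataset name, and supported formats below are hypothetical placeholders, not statements about any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    access_methods: frozenset  # e.g. {"hdfs", "hive", "nosql"}
    file_types: frozenset      # e.g. {"csv", "avro"}


@dataclass(frozen=True)
class Dataset:
    name: str
    access_method: str  # how the data is exposed on the cluster
    file_type: str      # how the data is stored


def compatible_tools(dataset, tools):
    """Return the names of tools that can read the dataset as it is stored."""
    return [
        tool.name
        for tool in tools
        if dataset.access_method in tool.access_methods
        and dataset.file_type in tool.file_types
    ]


# Hypothetical catalog: one data preparation tool and one analytics tool.
TOOLS = [
    Tool("prep_tool", frozenset({"hdfs", "hive"}), frozenset({"csv", "avro"})),
    Tool("analytics_tool", frozenset({"hive"}), frozenset({"orc", "parquet"})),
]

clickstream = Dataset("clickstream_raw", access_method="hdfs", file_type="avro")
print(compatible_tools(clickstream, TOOLS))  # ['prep_tool']
```

An empty result is the useful signal here: it means either the data has to be converted or exposed differently, or a different tool is needed for that role.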
26. Data Warehouse Roles
• Two models for splitting processing
• Hot – Cold
• Data Warehouse Layer
• Push high user loads to traditional data warehouses
• Fully investigate DW-Hadoop connector functionality
• Leverage opportunity to use in-memory database solutions
Diagram: Data Warehouse Layer Approach (Hadoop Cluster, Traditional DW/DM)
Diagram: Hot – Cold Data Warehouse (Cold Data, Hot Data; Hadoop Cluster, Traditional DW/DM)
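The hot/cold split can be sketched as a routing rule on data age. This is an illustration only: it assumes cold data lives on the Hadoop cluster and hot data on the traditional DW/DM, and the 90-day threshold and store names are hypothetical.

```python
from datetime import date

HOT_WINDOW_DAYS = 90  # assumption: data newer than this is considered "hot"


def target_store(partition_date, today, hot_window_days=HOT_WINDOW_DAYS):
    """Pick the store that should serve a partition, based on its age in days."""
    age_days = (today - partition_date).days
    if age_days <= hot_window_days:
        return "traditional_dw"   # hot data: high user concurrency
    return "hadoop_cluster"       # cold data: large, infrequently queried history


today = date(2014, 6, 30)
print(target_store(date(2014, 6, 1), today))  # traditional_dw
print(target_store(date(2013, 1, 1), today))  # hadoop_cluster
```

In practice the threshold would come from observed query patterns rather than a fixed constant.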
27. Data Warehouse
Organize Your Data
• Types of data stored on cluster
• Analytical sandboxes
• Team
• Individual
• Quotas
• Potential to replace information lifecycle management solutions
• No right answer – clearly define usage
Diagram labels (Hadoop Cluster): Raw Data, Processed Data, Sandbox Zone, Archived Data;
Consolidated Data; Streaming Queues; Deltas (Incremental); Common Data (Dimensions,
Master Data); Improved / Modeled Data; Published, Analytical and Aggregates
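One way to "clearly define usage" is to encode the zones and sandbox quotas as a small convention module. The sketch below assumes a hypothetical `/data` root, zone names taken from the diagram, and made-up quota sizes. It only builds paths and checks numbers; it does not call HDFS.

```python
from pathlib import PurePosixPath

ZONES = ("raw", "processed", "sandbox", "archive")

# Hypothetical sandbox quotas in gigabytes, by sandbox kind.
SANDBOX_QUOTA_GB = {"team": 500, "individual": 50}


def zone_path(zone, *parts, root="/data"):
    """Build the directory path for a dataset inside one of the agreed zones."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return str(PurePosixPath(root, zone, *parts))


def sandbox_path(kind, owner):
    """Build the path of a team or individual analytical sandbox."""
    if kind not in SANDBOX_QUOTA_GB:
        raise ValueError(f"unknown sandbox kind: {kind}")
    return zone_path("sandbox", kind, owner)


def within_quota(kind, used_gb):
    """Return True when a sandbox of this kind is at or under its quota."""
    return used_gb <= SANDBOX_QUOTA_GB[kind]


print(zone_path("raw", "pos", "deltas"))   # /data/raw/pos/deltas
print(sandbox_path("team", "marketing"))   # /data/sandbox/team/marketing
print(within_quota("individual", 60))      # False
```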
29. Stream and Event Processing
• Dedicated vs. Shared Model
• Persistence of messages, logs, etc.
• Long-term storage
• Queuing
• Pre-load (HDFS) vs. Post-load
processing
• Micro-Batch vs. One-at-a-Time
• Programming language support
• Processing guarantee
• At most once
• At least once
• Exactly once
Let business requirements drive need for streaming solutions. It is acceptable to use more
than one solution as long as the roles / purposes of each are clearly defined.
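The three processing guarantees can be illustrated with a small simulation. This is a sketch, not any streaming product's API: the message shape and helper names are hypothetical. It shows that at-most-once can lose data, at-least-once can double count, and at-least-once plus an idempotent (de-duplicating) consumer gives an exactly-once result.

```python
def deliver_at_most_once(messages, lost_ids):
    """Simulate at-most-once delivery: lost messages are never retried."""
    for msg in messages:
        if msg["id"] not in lost_ids:
            yield msg


def deliver_at_least_once(messages, redeliver_ids):
    """Simulate at-least-once delivery: some messages arrive twice
    (for example, a retry after a lost acknowledgement)."""
    for msg in messages:
        yield msg
        if msg["id"] in redeliver_ids:
            yield msg


def total_naive(stream):
    """Sum amounts without any duplicate handling."""
    return sum(msg["amount"] for msg in stream)


def total_idempotent(stream):
    """Sum amounts, ignoring any message id that was already processed."""
    seen = set()
    total = 0
    for msg in stream:
        if msg["id"] in seen:
            continue
        seen.add(msg["id"])
        total += msg["amount"]
    return total


messages = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 5},
]

print(total_naive(deliver_at_most_once(messages, {2})))        # 15 (message 2 lost)
print(total_naive(deliver_at_least_once(messages, {2})))       # 55 (message 2 counted twice)
print(total_idempotent(deliver_at_least_once(messages, {2})))  # 35 (correct total)
```

Whether the extra bookkeeping is worth it depends on the business requirement: a dashboard may tolerate at-most-once, while a financial total usually cannot.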
31. The Data Integration Challenge
The Challenge
• Volume, variety, and velocity create unique challenges for data integration
• 10,000+ unique entities (or file groups) may have to be managed
• Batch windows are still the same or shrinking
Key Point: Hadoop and Hadoop-related technologies can address these challenges.
However, they must be architected and governed properly.
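A back-of-the-envelope calculation shows why entity counts and batch windows interact. Only the 10,000-entity figure comes from the slide; the 30-second average load time and the 4-hour and 2-hour windows are assumptions chosen for illustration.

```python
import math


def required_parallelism(entities, avg_seconds_per_entity, window_hours):
    """Minimum number of concurrent loads needed to finish inside the window,
    assuming loads are independent and take the average time."""
    total_seconds = entities * avg_seconds_per_entity
    window_seconds = window_hours * 3600
    return math.ceil(total_seconds / window_seconds)


# 10,000 entities x 30 s = 300,000 s of work.
print(required_parallelism(10_000, 30, 4))  # 21 concurrent loads for a 4-hour window
print(required_parallelism(10_000, 30, 2))  # 42 if the window shrinks to 2 hours
```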
32. Data Factory & Integration
Approaches to Big Data Integration
Hadoop Distributed Tools
• Leverages tools included in the Hadoop distribution and programming languages
• Sqoop, Flume, Spark, Java, MapReduce are examples
• Tools can be implemented in many different modes
• Hand-coded/scripted
• Runtime configured
• Generated
Data Integration Packages
• Leverage commercial data integration packages to move and transform data
• IBM InfoSphere BigInsights, Informatica are examples
• Key questions: where is processing taking place, and does the tool use the YARN
resource manager?
Hybrid (Both Hadoop and Data Integration Package)
• Based on use case, leverages both Hadoop and COTS tools to move and transform data
33. Define Pipelines and Stages
Diagram: data flows from sources through pipeline stages to access and distribution targets.
• Sources: Cloud Sources, RDBMS, File Hub, Object DBMS, Log Data, Stream/Message Bus
• Stages: Extract; HDFS Load & Formatting; Scraping & Normalization; Cleansing,
Aggregation, Transformation; Data Distribution
• Tools shown across the stages: Sqoop, FTP, Packaged Tool, ETL Tool, Kafka, Storm, MCF,
Package, Custom, Message Bus
• Data Access & Distribution: RDBMS/DW/IMDB, Hive, HBase, File Extracts, NoSQL,
Stream Output
34. Big Data Integration Framework
Key Guidance:
• In lieu of using an ETL product, consider building a Big Data Integration framework
• Apache Falcon provides pipeline management
• Focus is on making all components run-time configurable with metadata
• Can offer significant cost savings over the long run
Typical Services
Diagram labels: Pipeline Master (ex. Falcon); Load Utility; Metadata Collection; Metadata;
Pipeline Config Files; Metadata Config Files; Sqoop, Flume, HDFS Shell
Pipeline Utilities: Parser (Delimiter), Data Standardization, HIVE Publishing, MF Coding
Converters, File Joiner & Transport, Logging, Checksum, Retention, Replication, Late
Arriving Data, Exception Handling, DB Copy, Archival, Audit
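A minimal sketch of "run-time configurable with metadata": the pipeline steps are chosen by a configuration structure rather than hard-coded. The step names, config layout, and sample data are hypothetical, and this does not use Apache Falcon or any Hadoop API. It only shows the pattern with a delimiter parser, a standardization step, and a checksum for audit.

```python
import csv
import hashlib
import io


def parse_delimited(text, delimiter=","):
    """Parser utility: turn delimited text with a header row into dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def standardize(rows, uppercase_fields=()):
    """Data standardization utility: trim whitespace, upper-case chosen fields."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        for field in uppercase_fields:
            row[field] = row[field].upper()
        cleaned.append(row)
    return cleaned


def checksum(text):
    """Checksum utility: SHA-256 of the raw input, kept for audit."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Registry that maps step names used in config files to utilities.
STEP_REGISTRY = {"parse": parse_delimited, "standardize": standardize}


def run_pipeline(text, config):
    """Run the steps named in the config, in order, and return data plus audit info."""
    audit = {"checksum": checksum(text)}
    data = text
    for step in config["steps"]:
        func = STEP_REGISTRY[step["name"]]
        data = func(data, **step.get("params", {}))
    audit["rows"] = len(data)
    return data, audit


# Hypothetical pipeline config; in practice this would be loaded from a file.
CONFIG = {
    "steps": [
        {"name": "parse", "params": {"delimiter": "|"}},
        {"name": "standardize", "params": {"uppercase_fields": ["state"]}},
    ]
}

raw = "id|state\n1| mo \n2|il\n"
rows, audit = run_pipeline(raw, CONFIG)
print(rows)           # [{'id': '1', 'state': 'MO'}, {'id': '2', 'state': 'IL'}]
print(audit["rows"])  # 2
```

Onboarding a new file group then means adding a config entry rather than writing new code, which is where the long-run cost savings would come from.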
HDFS Shell
36. SQL on Hadoop
• SQL on Hadoop is changing
• Historically focused on read
functionality for analytics
• New breed of SQL on Hadoop
• BI and operational
reporting
• Transaction Processing
*Image Provided by Splice Machine
40. Architectural Scenarios
Architecture Role: Analytics | Data Warehouse | Stream Processing | Data Factory | Transactional Data Store*
Business Use Case
Strategic Decision Making: P s
Customer Experience: P s P s
Operational Optimization: P s s s
Risk and Fraud Reduction: P s P
Data Monetization: s s P
Security Event Detection and Analysis: P s s s
IT Cost Management: P s P P
P = Primary Use Case, s = Secondary Use Case
* Capability is just emerging within the Hadoop ecosystem. Consider this use case for
isolated business cases and early adopters.
41. Integrating Hadoop into the Enterprise
• Determine Business Use Cases
• Understand Current Tools & Architecture
• Align Business Use Case Priorities
• Build Roadmap
• Specify Solution Architecture
• Update & Maintain Roadmap
• Implement Roadmap
42. Final Thoughts
Do
• Match the business use case to the big data role
• Clearly define a roadmap
• Establish clear architectural standards to drive
• Consistency
• Re-use of resources
• Do your homework when defining a solution architecture
Don’t
• Select an initial use case that relies on immature
Hadoop functionality
• Leverage tools that move data off the cluster for processing and then store it back
on the cluster
• Assume all Hadoop technologies integrate well together