What's the origin of Big Data? What are the real-life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organization?
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
1. Big Data in Practice: A Pragmatic Approach to Adoption and Value Creation
Raj Nair
Data Practitioner and Consultant
2. Application Services
• Enterprise Resource Planning (ERP)
• eCommerce / eBusiness
• Enterprise App Dev and ECM
• Legacy Support, Systems Integration and Conversion
Info Management
• Business Intelligence and Analytics
• Dashboards, Scorecards, Reporting
• MDM & Data Modeling
• Data Marts, ODS, ETL, Data Mining
IT Infrastructure
• IT Professional Services
• Network Administration & Support
• DB Admin & Maintenance
• Hosting and Application Support
Process & Governance
• SDLC – Agile, TDD, TFD, Iterative
• Requirements Analysis, PMP, Change Management and Automated QA
• Training & Knowledge Transition and Technical Documentation
3. Content NOT FOR DISTRIBUTION: Property
of Raj Nair
Object Technology Solutions Inc. (OTSI) is a leading Information
Technology (IT) Services and Solutions company founded in 1999.
Its clientele includes Fortune 500 companies, and it provides IT solutions
in the areas of SDLC, Information Management, Business Intelligence, ERP,
eCommerce (B2B, B2C), Mobile, Enterprise Solutions, Middleware and
Infrastructure.
Technology Expertise and Experience
SAP: Business Objects, ERP
Microsoft: SharePoint, .Net, SQL Server, Project Server
IBM: WebSphere, Cognos, Rational Suite
HP: Testing tools, PPM
Data: Oracle, DB2, SQL Server, Teradata
OS: Windows, Unix (AIX, Linux, HP-UX), etc.
Open Source, Java
Certified Diversity Supplier in KS, MO and IL
4. Agenda
1. Big Data – The Original Use Case
2. Mainstream Big Data
3. Real World Use Cases and Applications
4. Practical Adoption: Opportunity Identification
5. Big Data 2.0 – What’s on the Horizon?
6. Conclusion
5. An Open Source Engine
The Year was 2002 ….
Doug Cutting, Mike Cafarella
10. Yes, But… We are not Google
Sears: Dynamic pricing
AT&T: Quantifying customer impact from failed cell towers
Nokia: Holistic view of how users interact with apps across the world
Zions Bancorp: Analyze 130 data sources for fraud
Cerner: Detecting health risks
11. Every Day Big Data
Reaching scale-up limits on your server
Represents tools, technologies, frameworks
for storage and processing at scale
Represents Opportunity
14. Big Data 1.0 – The Hadoop Ecosystem
Software library
Framework for large scale distributed processing
Ability to scale to thousands of computers
15. Design Principles
Classic Hadoop MapReduce – Batch Processing
- Large Data Sets
- Moving computation is cheaper than moving data
- Hardware Failure, redundancy
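To make the batch model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word count as the example. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for illustration. On a real cluster the map step would run on the nodes that already hold each block of data, which is the point of "moving computation is cheaper than moving data".

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: each string stands in for one line of a large file.
sample = ["big data is big", "data moves less than code"]
counts = word_count(sample)
```

Each map call sees only one line and each reduce call sees only one key, so both steps could be spread across many machines without changing the logic.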
16. This not “That”
Is | Is Not
A software framework (storage/compute) | A database management system; an appliance
Batch processing | For real-time or interaction
Write once, read many | Delete and update, or “ACID”
Unassuming of data formats | Imposing any schemas
Open source | Lock-in
Made for commodity servers with local disks | Meant to be run in virtualized environments
17. What is this you call data?
Unlearn current notion of “Data”
Native Data Source
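One way to picture data kept in its native form is "schema on read": records are stored exactly as they arrive, and each reader decides which fields it cares about. The sketch below is a plain-Python illustration under that assumption; `RAW_LINES`, `read_with_schema` and the field names are hypothetical and not part of any Hadoop API.

```python
import json

# Raw records kept exactly as they arrived; nothing is reshaped on write.
RAW_LINES = [
    '{"user": "a1", "action": "click", "ms": 120}',
    '{"user": "b2", "action": "view"}',
    'not json at all',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, default to None."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the line stays in storage; this reader just skips it
        if not isinstance(record, dict):
            continue  # valid JSON, but not a key/value record
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different "views" over the same untouched raw data.
action_view = read_with_schema(RAW_LINES, ["user", "action"])
latency_view = read_with_schema(RAW_LINES, ["user", "ms"])
```

Nothing was dropped or reshaped when the data was stored, so a reader written later can still ask for fields nobody planned for.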
18. HDFS: Storage and Archival
MapReduce: Programming Library
Crunch: Data Pipeline processing
HBase: Real time access (low latency)
Pig: M/R Abstraction
Hive: Data Warehouse
Sqoop: Data Transfer
Flume: Data Streaming (High Latency)
Categories: Data Processing, Workload Management, Data Movement
19. Component | Purpose | Use it for
HDFS | Distributed Storage | Raw data storage and archival
Flume | Data Movement | Continuous streaming into HDFS
Sqoop | Data Movement | Data transfer from RDBMS to HDFS/HBase
HBase | Workload Mgmt | Near real-time read/write access to large data sets
Hive | Workload Mgmt | Analytical queries; data warehouse
MapReduce | Data Processing | Low level custom code for data processing
Crunch | Data Processing (Java) | Coding M/R pipelines, aggregations
Pig | Data Processing | Scripting language; similar to Crunch
20. A Powerful Paradigm
Hadoop – Separate Layers: Storage Layer, Metadata, Processing Engine, Query Engine
Multiple Query Engines
Data in Native format
Oracle, SQL Server, DB2: Storage and Query bundled in each product
Tightly integrated proprietary stacks, cannot free your data
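The separation of layers can be sketched in a few lines: storage holds raw files, metadata describes them, and more than one query engine reads the same data. This is a toy illustration only; `STORAGE`, `METADATA`, `scan`, `filter_engine` and `aggregate_engine` are hypothetical names, and the sales figures are made up.

```python
# Storage layer: raw files in their native (here: CSV text) format.
STORAGE = {"sales.csv": ["east,100", "west,250", "east,50"]}

# Metadata layer: describes how to interpret a file, kept apart from the data.
METADATA = {"sales.csv": {"columns": ["region", "amount"], "types": [str, int]}}


def scan(path):
    """Processing layer: turn raw lines into typed rows using the metadata."""
    meta = METADATA[path]
    for line in STORAGE[path]:
        values = line.split(",")
        yield {c: t(v) for c, t, v in zip(meta["columns"], meta["types"], values)}


def filter_engine(path, predicate):
    """One query engine: row-level filtering."""
    return [row for row in scan(path) if predicate(row)]


def aggregate_engine(path, key, value):
    """A second query engine: grouped totals over the same stored data."""
    totals = {}
    for row in scan(path):
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


big_sales = filter_engine("sales.csv", lambda r: r["amount"] >= 100)
totals = aggregate_engine("sales.csv", "region", "amount")
```

Because neither engine owns the storage, a third engine could be added later without reloading or converting the data.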
27. From Source to Business Value
Stages: Sourcing, Staging, Prep, Distribution
Work involved: shoe-horning, relational fit, loading, archiving / purging, biz rules, validations, scrubbing, mapping, transforms, tuning, data stores
Costs: Minutes/Hours, Subset of Data, Hours, Reliability
Missed SLAs = Biz Frustration
28. From Source to Business Value
Significantly more
data sources
Highly scalable,
significantly performant
data processing
New business value,
Faster time to value
36. Practical Adoption
Big Data Technologies don’t solve all
problems
Leveraging existing investments
Complexities of existing systems
37. Proof of Concept
Use your own data – realistic results
Focus on very specific pain points
Know what you are going to measure
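"Know what you are going to measure" can be as simple as timing the same job on the current system and on the candidate, then reporting one number. The harness below is a minimal sketch under that assumption; `time_job`, `compare`, the stand-in workload and the 120 s / 30 s figures are hypothetical placeholders, not measurements.

```python
import time


def time_job(job, data, repeats=3):
    """Run job(data) several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        job(data)
        best = min(best, time.perf_counter() - start)
    return best


def compare(baseline_seconds, candidate_seconds):
    """Summarize a POC measurement as a speedup factor over the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds


# Stand-in workload; in a real POC this would be your own data and your own job.
records = list(range(100_000))
elapsed = time_job(lambda rows: sum(r * 2 for r in rows), records)

# Placeholder timings to show the calculation, not real results.
speedup = compare(baseline_seconds=120.0, candidate_seconds=30.0)
```

Agreeing on the metric (here, elapsed time for one specific job) before the POC starts keeps the result from being argued about afterwards.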
43. Exploratory BI / Analysis
Data Storage
Makes Data exploration practically cheaper and faster
Use existing visualization tools (Tableau or other)
Check for integration with R
44. Data Architecture
• Single most important factor
• Don’t miss technology trends
But ….
It’s more about the battle plan
46. What about that RDBMS?
Too many new data types
Extreme demands for loading & query access
Dynamic / just in time schemas
SQL is great, but why limit to relational?
Still great for transactional workloads
50. In memory and Real Time
Spark: Up to 100x faster than M/R for in-memory workloads
Storm: Event processing
Apache Drill: Low latency ad hoc queries; interactive queries at scale
53. Where can I get Hadoop?
Distributors
Open Source Apache Project
And these guys…
Cloud
54. Conclusion
The Power & Paradigm of Distributed Computing
“Nativity” of Data – Unlearn old notions
Identify, understand your data processing pipeline
POC with a measurable, specific use case
Data Architecture – key to sustainable scalability
Stay informed