Data Analytics In The Cloud Soa World
1. Open Source SOA in the Cloud: Data Analytics in the Cloud
Tom Plunkett TomPlunkett@vt.edu
Michael Sick michael.sick@serenesoftware.com
SOA World 2009
2. Overview
Introductions
• Who are we?
• Baselines & definitions
Opportunity
• Targeted Use Cases
• Technical convergence & opportunities
• Commercial opportunities & drivers
Technology & Standards
• State of current technology
• Commercial & FOSS solutions
• Hadoop Focus
Challenges
• Challenges to Meet Target Use Cases
• Economic challenges & the role of “free”
• Wide scale challenges in Cloud and data analytics
Questions
• Questions
• Contacts
This work is licensed under a Creative Commons Attribution 3.0 United States License. Tom Plunkett & Michael Sick
3. Data Analytics in the Cloud: Introductions
4. Tom Plunkett
Extensive Federal Government Experience
Java and SOA Certifications
Patents
Teach OOP and Java for Virginia Tech
5. Michael Sick
Commercial & Federal Enterprise Architect
Owner: Serene Software Inc. – EA Services Firm
Clients include: BAE, USAF, Raytheon, BearingPoint,
McGraw-Hill, Sun Microsystems, Badcock Furniture
Fascinated by technology, 15 years running
6. Serene Software
• Serene is a boutique consulting company focusing on
delivery of Enterprise Architecture services and solutions
• Service Areas
– Cloud Computing
– IT Governance
– IT Strategy
– IT Cost Containment
– Service Oriented Architectures (SOA)
– IT Solution Selection
– IT Audit & Analysis
• Experience includes: BAE, USAF, Raytheon, BearingPoint,
McGraw-Hill, Sun Microsystems, Badcock Furniture, …
• Founded in 2003 (privately held, no debt) and
headquartered in Jacksonville, FL
7. Draft NIST Definition of Cloud Computing
A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction
Essential Characteristics
• On-demand self-service
• Ubiquitous network access
• Location independent resource pooling
• Rapid elasticity
• Measured Service
Delivery Models
• Cloud Software as a Service (SaaS)
• Cloud Platform as a Service (PaaS)
• Cloud Infrastructure as a Service (IaaS)
Deployment Models
• Private cloud
• Community cloud
• Public cloud
• Hybrid cloud
Source: Draft NIST Definition of Cloud Computing, 06/2009
8. OSI Open Source Definition
Free Redistribution
Source Code
Derived Works
Integrity of The Author's Source Code
No Discrimination Against Persons or Groups
No Discrimination Against Fields of Endeavor
Distribution of License
License Must Not Be Specific to a Product
License Must Not Restrict Other Software
License Must Be Technology-Neutral
Source: http://www.opensource.org/docs/osd
9. The Open Group SOA Definition
Service-Oriented Architecture (SOA) is an architectural
style that supports service orientation
Service orientation is a way of thinking in terms of services
and service-based development and the outcomes of services
Source: http://www.opengroup.org/projects/soa/doc.tpl?gdid=10632
10. Data Clouds & Data Grids – What's the difference?
Data Clouds and Data Grids are often used interchangeably; we make the following distinctions
Data Grids
• Grid computing system optimized to share large amounts of distributed data
• Focus on technical capabilities
• Often combined with computational grid computing systems
• Data often moved to compute grid for use
• Often oriented towards highly structured scientific data computing applications
Data Clouds
• Focuses on perception of infinite storage, computing capacity
• Focus on cost, virtualization & flexible capacity
• Enables scale-up/scale-down economics
• Data moved rarely, locality is a key feature
• Clouds thus far focusing on column oriented, massively scalable data stores
Sources: Wikipedia & [Grossman 1]
11. Definition: Mashups
Web available resource that combines data/functions
from two or more external resources
Idea of mashup efforts is to reduce the cost of
producing and consuming resources
Integration should be fast, easy
Often focuses on widely available formats/protocols
like RSS or Atom over HTTP
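As a minimal sketch of the definition above, assuming the two external resources are RSS 2.0 feeds: the feed contents, the example.com style URLs, and the function names parse_items and mashup are all hypothetical, and the feeds are inline strings rather than documents fetched over HTTP so the example stays self-contained.

```python
import xml.etree.ElementTree as ET

# Two hypothetical RSS 2.0 documents standing in for external resources.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Cloud pricing update</title><link>http://a.example/1</link></item>
<item><title>Hadoop release notes</title><link>http://a.example/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Analytics case study</title><link>http://b.example/9</link></item>
</channel></rss>"""


def parse_items(rss_text):
    """Return (source title, item title, link) tuples from one RSS document."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title")
    return [
        (source, item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]


def mashup(*feeds):
    """Combine the items of several feeds into one list sorted by item title."""
    combined = []
    for feed in feeds:
        combined.extend(parse_items(feed))
    return sorted(combined, key=lambda entry: entry[1])


merged = mashup(FEED_A, FEED_B)
for source, title, link in merged:
    print(f"{title} ({source}): {link}")
```

Because both inputs share a widely available format, the combining step is a few lines; that is the low integration cost the slide refers to.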
12. Data Analytics in the Cloud: Opportunities
13. Use Case: Cloud Data Analytical Tools for Intelligence Community Field Analyst
Problem Statement: Analytical tools are obsolete on deployment, yet field analysts need timely, configurable data analytics. How does cloud-based DA meet the needs of IC analysts?
Customer Problem
• Traditional business intelligence tools require years to develop
• Field Analysts confront situations which are rapidly changing
• Petabytes of data require analysis
Cloud Analytical Tools Solution
• Recomposable Cloud Computing Data Analytical Tools
– Apache Hadoop
– Mashups
– Service-Oriented Architecture
Customer Value
• Enabling field analysts to quickly build the analytical tool they need to analyze petabytes of data
14. Why the “Buzzword” Soup? Convergence of Capabilities
[Diagram labels: Free Open Source Software (FOSS), Virtualization, Cloud Computing, SaaS, Data Analytics, Mashups]
Convergence of capabilities: new opportunities in breadth and depth of DA services
• Big Data: Cloud disk and data storage engines make petabyte environments available to new clients
• Value Based Billing: Heavy use of FOSS in the cloud reduces costs directly & indirectly
• Capacity Scaling: Scaling up/down of capacity in pay-go fashion makes DA available to wider audience
• Composable UIs: Capability to assemble DA results into various interfaces
15. Early Data Analytic Cloud Consumers/Providers
Cloud DA Opportunities: profile types and example companies
Internet Scale Service Providers
• Big Internet Companies: Yahoo, Amazon – can build DA on inf. services
• SaaS Companies: Force.com – DA & Warehousing to SBA’s
• Social Platforms: Facebook – sell DA access to anon. user info
Large data-centric Traditional Co’s
• Insurers: BCBS – private clouds across consortium services
• Healthcare & Biotech: Kaiser Permanente – common DA services
• Rating Agencies: S & P – open DA cloud to customers
Government Organizations
• Intelligence Community: CIA – private org-wide Cloud services
• Defense Managed Services: DISA – offer DA to .mil clients
• Healthcare: SSA – offer DA to fraud prevention analysts services
DAaaS Providers
• DAaaS Infrastructure: Cloudera – managed Hadoop instances
• SMB DAaaS Provider: ?? – managed DAaaS, simplified, low cost
16. Data Analytics in the Cloud: Technology & Standards
17. Google MapReduce
Algorithm for computing distributed problems using a
divide and conquer approach with a cluster of nodes
Master node Maps input into smaller sub-problems and
distributes the work to the cluster. A worker node may further
map the work for a further cluster of nodes. The worker nodes
then process the smaller problems, and return the answers back
to the master node
Master node then Reduces the set of answers into the answer to the
original problem
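The divide and conquer flow described above can be sketched in a few lines, assuming a word-count problem and a single machine: threads stand in for the worker nodes, and the names split_input, map_chunk, reduce_counts, and word_count are hypothetical. This is an illustration of the idea only, not Google's implementation or the Hadoop API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_input(lines, n_chunks):
    """Master step: divide the input into smaller sub-problems."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Worker step: count the words in one sub-problem."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Master step: merge two partial answers into one."""
    return left + right


def word_count(lines, n_workers=3):
    chunks = split_input(lines, n_workers)
    # Threads stand in for the worker nodes of a cluster (an assumption).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce the set of partial answers into the answer to the original problem.
    return reduce(reduce_counts, partials, Counter())


documents = [
    "the cloud stores data",
    "the cluster processes data",
    "data analytics in the cloud",
]
totals = word_count(documents)
print(totals.most_common(3))
```

Each worker sees only its own chunk, so the map step can run in parallel; only the small partial counts travel back to be reduced.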
18. Apache Hadoop
Open Source implementation of the MapReduce algorithms
Hadoop can store and process petabytes of data
Subprojects include HBase, Chukwa, Hive, Pig, and ZooKeeper
Yahoo (more than 100,000 CPUs in >25,000 computers
running Hadoop) and other companies make extensive use of Hadoop
19. As-Is Hadoop Simplified Reference Architecture
[Diagram components: Chukwa, HBase, Apache Hadoop, Zookeeper, Pig, Hive; diagram labels: Structured Data, Unstructured Data, ETL, Business Intelligence]
20. Apache Hadoop Sub-projects
Hadoop sub-projects: capabilities and example companies
Chukwa (example company: Yahoo)
• Data collection system for monitoring and analyzing large distributed systems
HBase (example company: Yahoo)
• Similar to Google’s BigTable
• Distributed database for structured data
• Multi-dimensional sorted map
Hive (example company: Facebook)
• Data warehouse infrastructure for large datasets
• Hive QL query language
Pig (example company: Yahoo)
• High-level language for data analysis
• Compiler for Map-Reduce programs
Zookeeper (example company: Yahoo)
• Configuration, Naming, Distributed Synchronization, and group services
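The "multi-dimensional sorted map" listed for HBase can be pictured with a toy in-memory structure. The sketch below assumes cells are keyed by row, column, and timestamp and kept in sorted order; the class SortedCellMap, its methods, and the "info:name" column label are hypothetical illustrations and not the HBase client API. Storing the timestamp negated so that newer versions sort first is a choice made for this sketch.

```python
import bisect


class SortedCellMap:
    """Toy multi-dimensional sorted map: (row, column, timestamp) -> value."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for a row and column, or None."""
        start = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if start < len(self._keys):
            key = self._keys[start]
            if key[:2] == (row, column):
                return self._values[key]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        start = bisect.bisect_left(self._keys, (start_row,))
        for key in self._keys[start:]:
            row, column, neg_ts = key
            if row >= stop_row:
                break
            yield row, column, -neg_ts, self._values[key]


table = SortedCellMap()
table.put("user2", "info:name", 1, "Bo")
table.put("user1", "info:name", 1, "Al")
table.put("user1", "info:name", 2, "Alice")
table.put("user3", "info:name", 1, "Cy")
print(table.get("user1", "info:name"))
print(list(table.scan("user1", "user3")))
```

Keeping the keys sorted is what makes a range scan over adjacent rows cheap: the scan finds its starting point by binary search and then reads forward until it passes the stop row.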
21. Data Analytics in the Cloud: Challenges
22. To-Be Simplified Hadoop Architecture
[Diagram components: Apache Hadoop, HBase, Pig, Hive, Chukwa, Zookeeper, REST API, SOAP API, Query Language, Algorithm Library; diagram labels: Structured Data, Unstructured Data, ETL, Business Intelligence]
23. Key Challenges
Emerging challenges, by area
Infrastructure
• Hardware: Speed of rack interconnects, multi-core
• Parallelization: Core platform, Data Analytic components
• Node Affinity: Make use of super nodes, XML I/O, en/de-crypt
Adoption
• Cost: “brutally efficient” pricing, FOSS advantages
• Cost Models: Accurate, open models of CapEx, OpEx costs
• Migration Pain: Full warehouse migration, ETL
Administration
• Ease of Admin.: Parallel current RDBMS, Warehouse admin
• Debugging: Distributed debugging, integration w/ Provider
• Flexible Provisioning: Multi-level provisioning – co., dept, individual
• System Reporting: Reporting, audit trails, view to DA system
Input & Analysis
• ETL Integration: Interface, metadata optimized for ETL loading
• Intuitive APIs: Declarative & programmatic cross language
• Product Integration: BI, Applications (SAP, Oracle Financial, Lawson)
Output
• Data Visualization: Viewing & drill down of very large data sets
• Intuitive APIs: Declarative & programmatic cross language
• Mashups/Dynamics: Easy discovery of data & functions & workflows
24. Solutions: Projected & In-Progress
Emerging challenges and their solutions, by area
Infrastructure
• Hardware: Interconnect $$ dropping, hardware maturing
• Parallelization: Platforms advance, market for components
• Node Affinity: Discovery of capability, affinity into Hadoop, …
Adoption
• Cost: FOSS’s game to lose, small diff * a lot = a lot
• Cost Models: Industry standard ROI/IRR models for CC
• Migration Pain: Migration toolkits for traditional DW products
Administration
• Ease of Admin.: Integrated & extended admin packages
• Debugging: Commercial distributed debugging
• Flexible Provisioning: Multi-level provisioning – co., dept, individual
• System Reporting: Reporting, audit trails, view to DA system
Input & Analysis
• ETL Integration: ETL interface, support of popular packages
• Intuitive APIs: SQL like interface in core, language bindings
• Product Integration: 3rd party adaptors, IWay et al
Output
• Data Visualization: Modeling, meta-data, traceability, and new UIs
• Intuitive APIs: SQL like interface in core, language bindings
• Mashups/Dynamics: Generic datatypes, discovery services
25. Data Analytics in the Cloud: Questions
26. Questions & Contact Information
Michael A. Sick
Principal Architect / Partner
888.777.1847
michael.sick@serenesoftware.com
Address: Serene Software, 116 19th Ave. North, Suite 503, Jacksonville Beach, FL
URL: www.serenesoftware.com
Tom Plunkett
Cloud Computing Architect
888.777.1847
TomPlunkett@vt.edu
Address: Serene Software, 116 19th Ave. North, Suite 503, Jacksonville Beach, FL
URL: www.serenesoftware.com