1. The Elephant In The Room
Big Data Analytics In the Cloud
Bill Peer, Principal, Infosys Labs
UP 2012
Cloud Computing Conference
San Francisco, California – December 12, 2012
2. What’s on the agenda?
• Definitions
• Big Data and Analytic Technologies
• Architecture Stuff
• Summary
Appendix: References
3. Definitions – Big Data
• Big Data – data processing scenarios wherein the volume,
variety, and/or velocity of the data is such that
conventional RDBMS and/or Data Warehouse
technologies alone do not suffice for the need
Bill’s Stake In The Ground:
• Volume - Greater than 100 GB
• Variety - Structured and Unstructured (forms, video, blogs, photos, …)
• Velocity - 10 GB per hour
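The three thresholds above can be read as a simple checklist. The sketch below is one hypothetical way to encode them; the `Workload` type, the function name, and the choice to treat velocity as "at least 10 GB per hour" are illustrative assumptions, not part of the talk.

```python
from dataclasses import dataclass

# Threshold values come from the slide; how they are compared is an assumption.
VOLUME_GB = 100
VELOCITY_GB_PER_HOUR = 10


@dataclass
class Workload:
    """Hypothetical description of a data processing scenario."""
    volume_gb: float
    velocity_gb_per_hour: float
    has_structured: bool
    has_unstructured: bool


def big_data_dimensions(w: Workload) -> list[str]:
    """Return the dimensions on which the workload crosses a threshold."""
    hits = []
    if w.volume_gb > VOLUME_GB:
        hits.append("volume")
    if w.has_structured and w.has_unstructured:
        hits.append("variety")
    if w.velocity_gb_per_hour >= VELOCITY_GB_PER_HOUR:
        hits.append("velocity")
    return hits


# A 250 GB mixed data set arriving slowly qualifies on two of the three dimensions.
example = big_data_dimensions(Workload(250, 2, True, True))
```

Because the definition says "and/or", any non-empty result would count as a Big Data scenario under this reading.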
4. Definitions – Analytics
• Analytics – discovery of meaningful patterns in data
Bill’s Two Uses:
• Decision Support (to help make a choice)
-Business Intelligence
-Operational Intelligence
• Value Creation (to add worth)
-Algorithm Discovery
-Analytics as a Service
7. Big Data and Analytic Technology : 3 to Know
Hadoop
• Based on Google Paper published in 2004 (MapReduce)
• Can be segmented into 2 key capabilities: MapReduce and HDFS
• Designed to work in a distributed, fault-prone environment
MapReduce – Processing Orchestration Framework (great if a problem can be easily divided)
HDFS – Hadoop File System (reliable, independent of persistence mechanism, by way of multi-node replication)
Job Based!
-Pig Latin - Language to explore data
-Hive QL - SQL-like calls
-Mahout - Machine Learning collection
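To make "great if a problem can be easily divided" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word counting. It is plain Python, not the Hadoop API; the function names and the sample input are assumptions chosen for illustration, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Each string stands in for one input split that could be mapped independently.
splits = ["big data in the cloud", "the elephant in the room"]
counts = word_count(splits)
```

The problem divides easily because each split is mapped without looking at any other split, and each key is reduced without looking at any other key.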
8. Big Data and Analytic Technology : 3 to Know
DRILL
• Based on Google Paper published in 2010 (Dremel)
• Provides analysis of large-scale datasets
• Designed to work in a distributed environment
Query Languages - Google BigQuery
Low-Latency Distributed Execution – Columnar centric storage
Apache Incubator Phase
“[Dremel] is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.” src: Google Dremel Paper
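As a rough illustration of why columnar-centric storage suits aggregation queries, the sketch below pivots a few row records into per-field lists and then aggregates while reading only the two columns the query needs. It is a toy in plain Python with made-up field names and data; it does not reproduce Dremel's nested-record encoding or Drill's execution engine.

```python
# Row-oriented layout: every record carries all of its fields.
rows = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]


def to_columns(records):
    """Pivot row records into one list per field (a column-oriented layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_by(columns, key_field, value_field):
    """Sum value_field grouped by key_field, touching only those two columns."""
    totals = {}
    for key, value in zip(columns[key_field], columns[value_field]):
        totals[key] = totals.get(key, 0) + value
    return totals


columns = to_columns(rows)
totals = sum_by(columns, "country", "bytes")
```

The `user` column is never read by the query, which is the property that lets a columnar store skip most of the data on wide tables.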
9. Big Data and Analytic Technology : 3 to Know
Storm
• Event Streaming platform used by Twitter
• Allows for continuous real-time data spelunking
• Designed to work in a distributed environment leveraging clusters
Resident Queries - Requests for event patterns of interest are continuously watched for
Topology Centric - You create graphs of computation
Event Streaming is Different - Storm can be used effectively to build a Complex Event Processing (CEP) capability by an enterprise. As with other CEP type frameworks, it requires a shift to an uncommon perspective to be effective.
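The shift in perspective is that the query stays resident and the data flows past it. The sketch below mimics that idea with Python generators: a source, a filter that acts as the resident query, and a running counter, wired into a small graph. The spout and bolt names borrow Storm's vocabulary, but this is not the Storm API, and the event fields are invented for the example.

```python
from collections import Counter


def event_spout(events):
    """Spout: the source of the stream (a finite list stands in for a live feed)."""
    for event in events:
        yield event


def filter_bolt(stream, pattern):
    """Bolt: the resident query; forwards only events matching the pattern of interest."""
    for event in stream:
        if pattern(event):
            yield event


def count_bolt(stream, key):
    """Bolt: keep a running count per key and emit the updated count for each event."""
    counts = Counter()
    for event in stream:
        counts[event[key]] += 1
        yield event[key], counts[event[key]]


events = [
    {"type": "login_failed", "user": "alice"},
    {"type": "login_ok", "user": "bob"},
    {"type": "login_failed", "user": "alice"},
    {"type": "login_failed", "user": "carol"},
]

# Wire the graph of computation: spout -> filter -> count.
topology = count_bolt(
    filter_bolt(event_spout(events), lambda e: e["type"] == "login_failed"),
    key="user",
)
alerts = list(topology)
```

Each event produces a result as soon as it arrives, rather than waiting for a batch job to finish, which is the contrast with the job-based model.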
10. A Cloud Centric Big Data NRT Architecture
[Figure: architecture diagram with CEP and Interactive Query components, divided into three zones labeled Not Cloud, In Cloud, Not Cloud]
*Architecture graphic is a modified version of WSO2’s BAM picture
11. Big, Big Data Analytic Architecture Consideration
• Data Transfer Speed
• Where is your data? Is it where you will be processing?
• Transferring 1 TB of data takes roughly:
• 300 hours over a 10 Mbps network
• 30 hours over a 100 Mbps network
• 3 hours over a 1 Gbps network
• 20 minutes over a 10 Gbps network
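The transfer times can be checked with a short calculation. The sketch below assumes a decimal terabyte (10^12 bytes) and adds a hypothetical `efficiency` parameter for the usable fraction of line rate; neither assumption comes from the slide. At full line rate the result is about 222 hours at 10 Mbps and about 13 minutes at 10 Gbps, so the slide's rounder figures are consistent with a link running at roughly three quarters of its nominal speed.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link at the given usable fraction of line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 3600


ONE_TB = 10**12  # decimal terabyte; an assumption, the slide does not specify

links = {"10 Mbps": 10e6, "100 Mbps": 100e6, "1 Gbps": 1e9, "10 Gbps": 10e9}

# Idealized times with no protocol overhead or contention.
ideal = {name: transfer_hours(ONE_TB, rate) for name, rate in links.items()}

# Same links at an assumed 75% usable throughput.
derated = {name: transfer_hours(ONE_TB, rate, efficiency=0.75)
           for name, rate in links.items()}
```

Either way, the conclusion of the slide holds: where the data already lives matters more than which engine processes it.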
13. Summary
• “approaches for near-real time Business Intelligence and Analytics”
• “Info. on technologies ranging from Hadoop to Dremel to Event Streaming”
• “applicability and limitations of these when in the Cloud”
• “high-level architectures that must be considered will be shared”
• “entertained, energized, and enlightened”
• “realistic frame of reference to bring back to their organization”
• “Journey to the Clouds”
• “Dumbo can really fly”
14. Feedback Forms
Please extract from your wallet
One of the feedback forms to the right
Add any commentary you have in the
White space, and hand to the
Presenter after the session
Thank you for attending!
See you in the Clouds!
15. References
• Big Data Spectrum, Infosys
http://www.infosys.com/cloud/resource-center/Pages/big-data-spectrum.aspx
• Dremel: Interactive Analysis of Web-Scale Datasets, Melnik et al., Google
http://research.google.com/pubs/pub36632.html
• DrillProposal, Apache
http://wiki.apache.org/incubator/DrillProposal
• Storm Rationale
https://github.com/nathanmarz/storm/wiki/Rationale
• WSO2 BAM, WSO2
http://wso2.com/products/business-activity-monitor/