2. Our Goal for Today
1. Evolution of digital data over the decades
2. Why do we process data – and how?
3. How has all this changed in the last decade?
4. What is Big Data and how to handle it?
5. Who needs to understand Big Data?
6. What are the Big Data related opportunities?
7. Discussions and Q&A
Copyright ©2013. MindMap IT Solution (P) Ltd. All rights reserved.
4. Bits, Bytes, and Beyond
Name      | Value           | Example
Bit       | A bit!          |
Byte      | 8 bits          | 1 character
Kilobyte  | 1024 (1K) bytes | About 150 words
Megabyte  | 1K kilobytes    | A small book
Gigabyte  | 1K megabytes    | 20 GB = all of Beethoven's works
Terabyte  | 1K gigabytes    | 1,000 copies of the Encyclopaedia Britannica
Petabyte  | 1K terabytes    | 500 billion pages of standard printed text
Exabyte   | 1K petabytes    | 5 EB = all words ever spoken by mankind
Zettabyte | 1K exabytes     | 1 ZB = the entire planet's digital content
Yottabyte | 1K zettabytes   | 1 YB = would take 11 trillion years to download!
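The powers-of-two arithmetic behind the table can be sketched in a few lines (a minimal illustration, assuming the 1K = 1024 convention used above):

```python
# Each unit in the table above is 1024 (1K) times the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes (1K = 1024 convention)."""
    return value * 1024 ** UNITS.index(unit)

print(to_bytes(1, "kilobyte"))   # 1024
print(to_bytes(1, "petabyte"))   # 1125899906842624, roughly 10**15 bytes
```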
5. History of Data Storage Capacity
1956  | Hard drive from IBM : 5 MB
1963  | Audio tape : 663 KB
1970  | Floppy disk : 80 KB
1976  | Floppy disk : 110 KB
1981  | Floppy disk : 1.4 MB
1982  | CD : 700 MB
1995  | DVD : 4.7 GB
2003  | Blu-ray : 25 GB
Today | Hard disks : multi-terabyte; WWW & cloud
6. Cost Per Gigabyte
YEAR | COST / GB
1980 | $3,000,000
1990 | $8,000
2000 | $30
2010 | $0.08
7. Prior to the 80’s
E-commerce did not exist.
Data entry, storage, and processing were sequential processes – displaced in time.
Data was processed by monolithic applications running on mainframes.
Batch processing was the norm.
Data processing was used in non-time-critical areas such as payroll and accounting.
Only large enterprises and institutions could afford data processing.
Data processing could only support long-term analysis and decision-making processes – such as planning.
8. Prior to the 80’s…
Data was largely STRUCTURED
11. Data Processing in the 80’s and Before
Data creation was a controlled process.
Rate of data creation was known and manageable.
Data creation and processing : Co-located.
12. Database Systems of the 80’s and Prior
Navigational
Relational
13. In the 90’s
Better connectivity allowed data to be collected from
distributed, but finite sources.
Data created was directly captured and stored online.
Online Transaction Processing (OLTP) systems emerged.
Data processing could now support operational decision making, since data capture and processing could be done in real time.
14. In the 90’s Cont...
Data creation was still a controlled step and data was
structured.
The volume of data generated was manageable.
Data processing was still centralized.
Relational Databases ruled the world of data processing.
17. Early Years of Internet
Internet enabled
e-commerce
B2B Transactions
B2C Transactions
Banking and Finance
Travel and Hospitality
Retail
Health Care
18. Early Years of Internet Cont...
Volume of online transactions rapidly increased.
Database systems had to separate online processing from
analysis to cope with the transaction volume.
Data Warehousing emerged.
Distributed databases also made their appearance.
19. Early Years of Internet Cont...
In the early days, the processed data was still structured
since it dealt with e-commerce transactions.
The need was for systems that focused on transactions:
validation and recording.
Consequently, transaction and analysis systems had to be
separated.
ETL (Extract Transform Load) processes managed data conversion from one form to another (transaction → analysis).
20. In the New Millennium
Rapid adoption of the Internet.
Explosion of e-commerce : Especially B2C.
The Internet enabled customers to seek out the best deal.
Businesses had to proactively entice customers.
• To consume their products and services.
• At the point of purchase.
Data processing moved from playing a supportive role to a
“Business Critical” role.
• Nature of certain businesses completely changed.
21. Then Came SOCIAL NETWORKING and MOBILITY
22. Impact of Social Networking
Success of B2C business transactions now depends on the ability to analyze customers' past and current behaviour in real time!
Social Networking has become a source of valuable information to
understand customer choice and behaviour.
Social Networking = Unstructured data
Social Networking = Extremely large data generation rates
Social Networking = Highly distributed
27. Very High Data Creation Rates
Year | Data Estimate
2002 | 5 billion GB
2006 | 161 billion GB
2010 | 1,277 billion GB
2015 | 7,910 billion GB
28. The Situation Today…
Every two days now we create as much information as we did
from the dawn of civilization up until 2003.
- Eric Schmidt, GOOGLE
Structured Data constitutes only 5% of the
total “Data Deluge”.
29. Business Processes – Then and Now
Then                              | Now
Anticipate product / service need | Anticipate product / service need
Marketing                         | Marketing
Sales                             | Sales
Transaction                       | Transaction
Analysis                          | Analysis
Refinement                        | Refinement
30. Who Needs Rapid Data Analysis
Banking and Finance
Credit / Debit / ATM card transactions
• Collaboration between banks
• Fraud detection
• Real-time analysis of CCTV to detect and prevent ATM
attacks
Credit / Loan approval
• Credit analysis based on credit history as well as social
network traces
31. B2C ecommerce Sites (Online Stores)
32. B2C – Product Comparison Sites
33. Data Analysis in Elections
The 2012 US elections
"Data-driven decision making played a huge role in creating a second term for the 44th President and will be one of the more closely studied elements of the 2012 cycle."
– Time, Nov 10, 2012
Obama election head office, Chicago
34. Crime Investigation / Prevention / Surveillance
Processing of email / chat / phone call traces
• Accessed by Govt. agencies
Processing of Facebook / Twitter posts / Chats
• Sentiment analysis for crime prevention
35. Common to All These Situations…
• UNSTRUCTURED data.
• Very large data sets – dynamic and rapidly increasing by the minute.
  o Terabytes of data (BIG DATA)
• Highly dispersed and distributed data generation.
• Impossible to move such data to a central location for processing.
• At the same time, very critical to process data and generate results in real time.
36. Characteristics of New Age Data Processing
Systems
Ability to handle unstructured data.
Ability to handle rapidly increasing volumes of data.
Ability to operate on distributed data sets.
Scalable.
Reliable/Fault tolerant.
Reasonable costs - one time & operational.
These requirements have led to increasing interest in BIG DATA and the development of newer data storage & analysis techniques.
37. Growing Interest in Big Data
40. Data Models and Database Systems Over the Years
41. History of Data Models and Database Systems
MAP REDUCE,
COLUMNAR DATABASES
& NO-SQL DATABASES
42. How to Tackle Big Data – In Simple Words
1. Break down the problem into manageable chunks.
2. Spread the data and its processing over a number of nodes – typically cheap computers.
3. Manage the process to ensure that nothing gets lost.
4. Re-assemble the answer from the various parts to get your
query answered.
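The four steps above can be sketched with the classic word-count example (a minimal, single-process illustration of the idea – not an actual distributed implementation):

```python
from collections import defaultdict

# Steps 1 & 2: break the input into chunks and "map" each chunk.
# On a real cluster, each chunk would be processed on a different node.
def map_chunk(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

# Step 3: shuffle – group intermediate pairs by key so nothing gets lost.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Step 4: reduce – re-assemble the answer from the various parts.
def reduce_groups(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data is big", "data about data"]
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
counts = reduce_groups(shuffle(mapped))
print(counts["data"])  # 3
print(counts["big"])   # 2
```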
43. Map – Reduce : Technique to Handle BIG DATA
44. Map – Reduce : Technique to Handle BIG DATA Cont...
45. The Map – Reduce Technique
Advantages | Drawbacks
Can handle both structured and unstructured data. | Not very easy to set up and use.
Can scale up with data size. | Raw Map-Reduce requires programming to set up.
Open source implementations available: reasonable costs. | Basic Map-Reduce is suitable largely for batch processing. (Real-time techniques have now been implemented to overcome this drawback.)
46. Hadoop
Based on the Map-Reduce distributed processing architecture.
A task is mapped to a set of servers for processing.
Results from the servers are then reduced down to a single set.
Hadoop operates on the HDFS distributed file system.
- HDFS ensures data redundancy.
Hadoop has in-built task management functionality to ensure
reliability.
Interfaces available with other components: open source and commercial.
Highly scalable and cost effective.
47. HDFS
HDFS
Hadoop Distributed File System
Goals (Ref: Hortonworks)
• Store Petabytes of data.
• Keep per node costs down to afford more nodes (scalability).
• Commodity x86 servers, Open Source software.
• Support computation in each server.
• Handle failures: Failures treated like noise – inevitable.
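The replication idea behind HDFS can be sketched as a toy simulation (illustrative only, not the real HDFS API; the block size here is tiny for readability – real HDFS uses 64/128 MB blocks and defaults to 3 replicas):

```python
import itertools

BLOCK_SIZE = 4    # toy block size; real HDFS uses 64/128 MB
REPLICATION = 3   # HDFS default replication factor

def store(data, nodes):
    """Split data into blocks and place each block on REPLICATION nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for idx, block in enumerate(blocks):
        # Three consecutive nodes from the cycle hold copies of this block.
        placement[idx] = {next(node_cycle): block for _ in range(REPLICATION)}
    return placement

def read(placement, dead_nodes=()):
    """Re-assemble the file, skipping failed nodes (failures are 'noise')."""
    out = []
    for idx in sorted(placement):
        replicas = {n: b for n, b in placement[idx].items() if n not in dead_nodes}
        out.append(next(iter(replicas.values())))  # any surviving replica will do
    return "".join(out)

nodes = ["node1", "node2", "node3", "node4"]
placement = store("hello big data world!", nodes)
print(read(placement, dead_nodes={"node2"}))  # "hello big data world!"
```

Even with node2 down, every block still has surviving replicas, so the read succeeds – which is the point of treating failure as inevitable.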
49. Big Data Analysis – The Big Picture!
50. Components Relevant to Hadoop
HBase
Database to store data and speed up queries.
Hive
Warehouse implementation to support Analytics, Query and
Visualization.
51. HBase
HBase is a Columnar, NoSQL database system.
HBASE | RDBMS
Column oriented | Row oriented
Flexible schema, add columns on the fly | Fixed schema
Good with sparse tables (partially filled) | Not optimized for sparse tables
No query language | SQL
Wide tables | Narrow tables
Joins using Map-Reduce | Optimized for joins
Tight integration with Map-Reduce | Not integrated (usually) with MR
Horizontal scalability – just add hardware | Hard to scale and size down
Good for semi-structured & structured data | Good only for structured data
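The "sparse table, flexible schema" point can be illustrated with a toy column store (a sketch of the storage idea only, not HBase's actual API; the row keys and column names are hypothetical):

```python
# A toy column-oriented store: each row keeps only the columns it has,
# so sparse (partially filled) tables cost nothing for missing cells,
# and new columns can be added on the fly with no schema change.
class ToyColumnStore:
    def __init__(self):
        self.rows = {}  # row_key -> {column: value}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)  # None if cell absent

store = ToyColumnStore()
store.put("user1", "info:name", "Asha")
store.put("user1", "info:email", "asha@example.com")  # hypothetical data
store.put("user2", "info:name", "Ravi")               # no email column: fine

print(store.get("user2", "info:email"))  # None – the cell is simply not stored
print(store.get("user1", "info:name"))   # Asha
```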
52. Hive
• Hadoop can get difficult to configure and use!
• Hive sits between Hadoop and the users of Hadoop.
• It provides a familiar – TABLE-like – environment for dealing with Hadoop.
• It allows data to be:
  o Read from Hadoop / HDFS
  o Written into Hadoop / HDFS
  o Queried from Hadoop / HDFS using the familiar SQL-like syntax
• In the background, Hive converts all queries into efficient MAP – REDUCE tasks.
• Hive is a Data Warehouse system for Hadoop.
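Conceptually, a SQL-like query such as "SELECT word, COUNT(*) ... GROUP BY word" is translated into a map step (emit the grouping key for each row) and a reduce step (aggregate per key). A minimal sketch of that correspondence (illustrative only – not Hive's actual query planner; the rows are hypothetical):

```python
from collections import Counter

# The table-like data Hive would read from HDFS (hypothetical rows).
rows = [{"word": "big"}, {"word": "data"}, {"word": "big"}]

# Map phase: emit the GROUP BY key for each row.
keys = (row["word"] for row in rows)

# Reduce phase: COUNT(*) per key.
result = Counter(keys)

print(result["big"])   # 2
print(result["data"])  # 1
```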
53. HBase v/s Hive
HBase | Hive
Typically used for unstructured data and sparse tables. | Typically used as a Data Warehouse.
Allows low-latency random data access. | Main purpose is analysis and ad-hoc querying.
Main purpose is continuous operations such as accepting data feeds and committing them to HDFS. | Deals with structured data resulting from analysis of data stored in HDFS.
54. Pioneers of Big Data
eBay     | In excess of 2,500 computing cores
Yahoo    | In excess of 4,000 nodes
Facebook | More than 23,000 nodes
Google   | ?? (24 PB of data / day)
LinkedIn | ??
Source: Slide by Ian Brown
55. Big Data Solution Suppliers
Informatica
EMC
Oracle
IBM
Microsoft
Teradata
Amazon
Cloudera
Apache
Google
56. Who Uses Big Data (2011)
57. Case Study : redBus.in
• redBus.in : Internet-based bus ticket booking
• Handles more than 10,000 routes
• Goal
  o To capture each and every event happening on their website and correlate them
  o To identify whether booking failures were due to absence of supply or due to server problems
  o To understand which routes needed more buses
• Volume of data : 500 GB
• Expected response time : Less than 1 minute
• Tool / service used : BigQuery from Google
58. Case Study : Seagate
• Seagate : Has manufactured more than 2 billion hard drives
• They maintain data comprising:
  o Information related to the 2 billion hard drives
  o Manufacturing information
  o Supplier information
  o Customer information
• 400 GB of data added per day to the warehouse
• Used Big Data techniques to analyze test data
• Impact : Overall improvement in quality due to sharp identification of process and supplier issues
• Tools used : Not known
59. Case Study : Macy’s
• They want to prevent an overload of irrelevant promotions going to their customers.
• They are sending fewer, more focused messages to individual clients about products and special offerings that have a high likelihood of appealing to that person.
• They are combining point-of-sale information with
  o online browsing behaviour
  o responses to emails
  o social media activity
  o and more …
• To get a 360-degree view of each customer.
• The result: fewer, more meaningful interactions with customers that drive greater loyalty, greater revenues, and lower churn.
60. Other Applications of Big Data
Epidemic prediction
Weather prediction
Scientific experiments generating very large amounts of data, such as the Super Collider
Astronomy
Search for extraterrestrial intelligence
61. Big Data Challenges
Hadoop and Big Data technologies are time-consuming to set up and use.
Building and running Hadoop jobs is non-trivial.
Running and analyzing queries and results does not leverage
existing skills.
Requires special teams to initiate in an organization – along
with associated costs.
62. Who Should Know About Big Data
Decision Makers
To understand its capabilities and how to use it for Business gains.
Data Scientists
To be able to understand and apply the right techniques to solve
Big Data problems.
Big Data Applications Developers
To know the building blocks, and nuts and bolts of putting
together a Big Data processing system.
Big Data Analysts
IT Staff
63. Big Data Macro Trends
• Information generation is growing 2 times faster than storage capacity.
• Growth in data collection : 60% CAGR.
• Information Management industry:
  o Sized at $100 billion
  o Growing at 10% CAGR
• Big Data sources are becoming more varied.
  o Mobile phones, sensors, etc.
• Total Internet traffic will exceed 667 exabytes by 2013.
• Third-party data availability is on the rise.
• Hadoop is the fastest-growing Big Data technology: downloads have increased more than 400% in the last two years.
64. Big Data Market Size Projection
65. Future of Big Data
66. Career Opportunities
Direct Opportunities
• Internet products and services companies
• Manufacturing companies
• Banking and finance
• Pharma
• Govt departments

Indirect Opportunities
• Handling outsourced Big Data analysis and development projects for the above organizations.
67. Structured v/s Unstructured Data
Unstructured Data
1. Web server and search engine logs ("data exhaust")
2. Logs from other types of servers (e.g., telecom switches and gateways)
3. E-Commerce / Web Commerce records
4. Social Media / Gaming messages
5. Multimedia – voice, video, images
6. Sensor data / M2M communications

Structured Data
• Customer databases
• Legacy BI / CRM / ERP systems
• Inventory and Supply Chain
68. Structured v/s Unstructured
Aspect | Structured | Unstructured
Representation | Discrete (rows and columns). | Binary large objects: less-defined boundaries, less easily addressable; small discrete objects representing information for a very specific purpose (e.g., an SMTP mail message).
Storage / Persistence | DBMS or file formats (e.g., VSAM). | Unmanaged, file structure, or content repository.
Metadata Focus | Syntax (e.g., location and format). | Semantics (descriptive and other markup).
Integration Tools | ETL or ELT; Enterprise Information Integration via BizTalk; batch processing. | Batch processing, manual data entry, custom solutions involving a lot of code.
Standards | SQL (and its multiple variations), ADO.Net, ODBC; many RDBMS support XML as another option. | Open XML, SMTP, SMS, CSV, and Information and Content Exchange.
69. Evolution of Data Transfer Rates
Medium | Transfer Rate
Modems | 56 kilobits / second
T-1 line | 1.544 megabits / second
Ethernet | 10 megabits / second
Fast Ethernet (LAN) | 100 megabits / second
Gigabit Ethernet | 1 gigabit / second
T-3 | 44.736 megabits / second
Optical fibres | Up to 20 gigabits / second (dedicated)
Next Internet backbone | 2.4 gigabits / second
77. Hadoop – Value Adding Projects/Products
Hadoop
1. HBase
2. Cassandra
3. Mongo
4. CouchDB
78. Projects/Products Adding Value to Hadoop
HBase
The standard Hadoop database: an open-source, distributed, versioned, column-oriented store providing Bigtable-like capabilities over Hadoop. HBase includes base classes for backing Hadoop MapReduce jobs; query predicate push-down; optimizations for real-time queries; a Thrift gateway and a RESTful web service to support XML, Protobuf, and binary data encoding; an extensible JRuby-based (JIRB) shell; and support for the Hadoop metrics subsystem. Like Hadoop, HBase is an Apache project, hosted at http://hbase.apache.org/

Cassandra
Apache Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. The Cassandra project lives at http://cassandra.apache.org/
A good example of using Cassandra together with Hadoop is the DataStax Brisk platform – learn more at http://www.datastax.com/
79. Projects/Products adding value to Hadoop Cont…
Mongo
An open-source, scalable, high-performance, schema-free, document-oriented database written in C++. The MongoDB project is hosted at http://www.mongodb.org/. To use Mongo and Hadoop together, check out https://github.com/mongodb/mongo-hadoop

CouchDB
Apache CouchDB is a document-oriented database supporting queries and indexing in a MapReduce fashion using JavaScript. CouchDB provides APIs that can be accessed via HTTP requests to support web applications. Learn more at http://couchdb.apache.org/
80. Big Data Applications: Additional Ideas
Balance Sheet Analysis
Manufacturing Data Analysis
Production Systems Diagnostics and Pattern Identification
81. Thank you for your attention!
Please ask questions, if any!