H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Big Data
1.
2. How much data?
800 Terabytes, 2001
60 Exabytes, 2006
500 Exabytes, 2009
2.7 Zettabytes, 2012
35 Zettabytes by 2020
How much data is generated in one day?
7 TB, Twitter
15 TB, Facebook (2009)
20 PB, Google (2008)
6.5 PB, eBay (2009)
3. A visualization created by IBM of Wikipedia edits. At multiple terabytes in size, the text and images of Wikipedia are a classic example of big data.
4. Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
› Social Network, Semantic Web (RDF)
Streaming Data
› You can only scan the data once
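The single-scan constraint means summary statistics must be maintained incrementally as each element arrives. As an illustrative sketch (not from the talk), Welford's one-pass algorithm keeps a running mean and variance without ever storing the stream:

```python
class RunningStats:
    """One-pass (single-scan) mean and variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        # Each element is seen exactly once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.variance)  # mean ≈ 5.0, variance ≈ 4.0
```

The same pattern (constant memory, one update per element) underlies most streaming analytics, regardless of the statistic being tracked.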
5. Aggregation and Statistics
› Data warehouse and OLAP
Indexing, Searching, and Querying
› Keyword based search
› Pattern matching (XML/RDF)
Knowledge discovery
› Data Mining
› Statistical Modeling
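To make the aggregation idea concrete, here is a minimal sketch of an OLAP-style rollup using only the Python standard library; the transaction records and field names are hypothetical, chosen purely for illustration:

```python
from collections import defaultdict

# Hypothetical transaction records (illustrative data only).
transactions = [
    {"region": "East", "product": "A", "amount": 120.0},
    {"region": "East", "product": "B", "amount": 80.0},
    {"region": "West", "product": "A", "amount": 200.0},
    {"region": "West", "product": "A", "amount": 50.0},
]

# OLAP-style aggregation: total amount rolled up by (region, product).
totals = defaultdict(float)
for t in transactions:
    totals[(t["region"], t["product"])] += t["amount"]

print(dict(totals))
# {('East', 'A'): 120.0, ('East', 'B'): 80.0, ('West', 'A'): 250.0}
```

A data warehouse does the same grouping at scale, precomputing such rollups along many dimensions so that queries answer from aggregates rather than raw rows.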
6. There is no consensus on how to define Big Data.
“Big data exceeds the reach of commonly used hardware
environments and software tools to capture, manage, and
process it within a tolerable elapsed time for its user population.”
- Teradata Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and
analyze.” - The McKinsey Global Institute, 2011
7.
8. Variety
Structured and unstructured data: clinical notes, audio transcription, imaging, click streams
Velocity
Often time-sensitive; data must be analyzed as it’s streaming in to maximize its value (e.g. patient monitoring)
Volume
Electronic medical records, images, digital pathology, email, web communications
9. 1) Automatically generated by a machine
(e.g. Sensor embedded in an engine)
2) Typically an entirely new source of data
(e.g. Use of the internet)
3) Not designed to be friendly
(e.g. Text streams)
4) May not have much value
› Need to focus on the important part
10. The Past: most new data sources were considered big and difficult
The Present and The Future: just the next wave of new, bigger data
11. (1) The “big” part
(2) The “data” part
(3) Both
(4) Neither
The answer is choice (4).
What is important is what organizations do with Big Data.
12. Decoding the human genome originally took 10 years; now it can be achieved in less than a week.
Tobias Preis used Google Trends data to demonstrate that Internet users from countries with a higher per capita GDP are more likely to search for information about the future than about the past. The findings suggest there may be a link between online behaviour and real-world economic indicators.
An analysis by Tobias Preis and H. Eugene Stanley of Google search volume for 98 terms of varying financial relevance, published in Scientific Reports, suggests that increases in search volume for financially relevant terms tend to precede large losses in financial markets.
13. 1. Big data can unlock significant value by making
information transparent and usable at much higher
frequency.
2. As organizations create and store more
transactional data in digital form, they can collect
more accurate and detailed performance
information, and therefore expose variability and
boost performance.
3. Big data allows ever-narrower segmentation of
customers and therefore much more precisely
tailored products or services.
4. Sophisticated analytics can substantially improve
decision-making.
5. Big data can be used to improve the development
of the next generation of products and services.
17. 1. The use of big data will become a key basis of
competition and growth for individual firms. All
companies need to take big data seriously.
2. The use of big data will underpin new waves of
productivity growth (growth of 60% is possible) and
consumer surplus.
3. The computer and electronic products and
information sectors, as well as finance and
insurance, and government are poised to gain
substantially from the use of big data.
4. By 2018, the United States alone could face a
shortage of 140,000 to 190,000 people with deep
analytical skills as well as 1.5 million managers
and analysts with the know-how to use the
analysis of big data to make effective decisions.
18. Organizations will be overwhelmed by the volume
› Need the right people and to solve the right problems
Costs escalate too fast
› It isn’t necessary to capture 100% of the data
Sources of big data may be private
› Self-regulation
› Legal regulation
19. Very strong assumptions are made about
mathematical properties that may not at all reflect
what is really going on at the level of micro-processes.
Even as companies invest eight- and nine-figure sums
to derive insight from information streaming in from
suppliers and customers, less than 40% of employees
have sufficiently mature processes and skills to do so.
The decisions based on the analysis of Big Data are
inevitably “informed by the world as it was in the
past, or, at best, as it currently is”.
If the system dynamics of the future change, the
past can say little about the future. For this, it would
be necessary to have a thorough understanding of
the system dynamics, which implies theory.
20. The biggest value in big data can be driven by
combining big data with other corporate data.
Big data + Other data › Create a synergy effect
21.
22. Banking data was very hard to handle even a decade ago.
The meaning of “BIG” will change
› Big data will continue to evolve