1. Beyond the buzz – what does “Big Data” mean to your organization?
Attila Barta, Ph.D.
Head of Architecture at Private Client Group and BMO Insurance
2. BIG DATA WORLD CANADA 2013
Introduction to this presentation
•This presentation covers the following topics:
There is more to Big Data than Hadoop.
To understand the Big Data buzz, one has to go to the beginnings and understand the forces
that brought Big Data to life.
Is Big Data another buzzword like Semantic Web, Web 2.0 or Cloud?
Where are Canadian companies on Big Data in comparison with the world?
What a reference Big Data architecture looks like.
Big Data at BMO Financial Group.
The road ahead, what needs to be done.
•Note: this presentation reflects the opinions of the author alone and by no means those of BMO Financial Group.
3. Big Data – How we got here
•In a 2001 research report[1] Gartner analyst Doug Laney defined data growth challenges and opportunities as
being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and
variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs"
model for describing Big Data[2]. (source Wikipedia).
•What was happening in 2001? Three major trends:
Sloan Digital Sky Survey began collecting astronomical data in 2000 at a rate of 200GB/night – volume
Sensor networks (web of things) and streaming databases (Message Oriented Middleware) – velocity
Semi-structured databases, XML native databases beside object-oriented, relational databases – variety
•What happened after 2001?
Rise of search engines and portals - Yahoo and Google:
• Problem: how to store and query large amounts of (semi-structured) data cheaply and in real time.
• Answer: Hadoop on commodity Linux farms.
Memory got cheaper – in-memory data grids.
Rise of Social Media – petabytes in pictures, unstructured and semi-structured data.
Increased computational power and large memory – visual analytics.
4. Big Data – Definitions and Examples
•In 2012, Gartner updated its definition as follows: "Big data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable enhanced decision making, insight discovery
and process optimization"[3].
•In 2012, IDC defined Big Data technologies as “a new generation of technologies and architectures designed
to extract value economically from very large volumes of a wide variety of data by enabling high-velocity
capture, discovery, and/or analysis”[4].
•In 2012, Forrester characterized Big Data as “increases in data volume, velocity, variety, and variability”[5].
•Big Data Characteristics:
1. Data Volume: data size in order of petabytes.
• Example: Facebook on June 13, 2012 announced that they had reached 100 PB of data. On
November 8, 2012 they announced that their warehouse grows by half a PB per day.
2. Data Velocity: real time processing of streaming data, including real time analytics.
• Example: a jet engine generates 20TB data/hour that has to be processed near real time.
3. Data Variety: structured, semi-structured, text, images, video, audio, etc.
• Example: 80% of enterprise data is unstructured. YouTube - 500TB of video uploaded per year
4. Data Variability: data flows can be inconsistent with periodic peaks.
• Example: blogs commenting on the new BlackBerry 10; stock market data that reacts to market events.
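To put the volume and velocity examples above on a common scale, the quoted rates can be converted to sustained throughput. This is a rough back-of-envelope sketch; it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a perfectly even flow, neither of which is stated in the sources. The function name is illustrative only.

```python
# Back-of-envelope conversion of the quoted data rates to sustained throughput.
# Assumption: decimal units (1 GB = 10**9, 1 TB = 10**12, 1 PB = 10**15 bytes).

GB = 10**9
TB = 10**12
PB = 10**15


def sustained_gb_per_second(total_bytes: float, period_seconds: float) -> float:
    """Average rate in GB/s needed to keep up with total_bytes per period."""
    return total_bytes / period_seconds / GB


# Jet engine example: 20 TB of data per hour.
jet_engine = sustained_gb_per_second(20 * TB, 3600)

# Facebook example: warehouse growth of half a PB per day.
warehouse_growth = sustained_gb_per_second(0.5 * PB, 24 * 3600)

print(f"Jet engine stream: {jet_engine:.2f} GB/s")        # 5.56 GB/s
print(f"Warehouse growth:  {warehouse_growth:.2f} GB/s")  # 5.79 GB/s
```

Both examples land in the same range of roughly 5 to 6 GB/s sustained, which is why the first is treated as a velocity problem (it must be processed as it arrives) and the second as a volume problem (it must be stored and queried later).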
5. Big Data – In Canada, where are we?
•In December 2012 IDC published a study of Big Data in Canada [4] by surveying 75 businesses with over
250MM in revenue. The conclusions of the survey are sobering:
Less than one tenth of the respondents were familiar with Hadoop (the Big Data framework), and only slightly
more were familiar with in-memory data grids and in-memory analytics.
Only half of Canadian organizations already work with Big Data, in comparison with more than three quarters
worldwide.
The majority of Canadian companies use mainly internally produced data with less than a quarter of
Canadian organizations using data from non-traditional sources such as social media web data, RFID tags
and GPS.
Big Data strategies are delegated to mid-level management, while world-class companies integrate
technology decisions at the executive level.
6. Big Data – What are we missing in Canada?
•McKinsey Global Institute published “Big Data: The next frontier for innovation, competition and productivity”
in May 2011. In the sectors that they examined they estimated opportunities of hundreds of billions per year in
savings or new businesses by unleashing the potential of Big Data [6].
•Big Data immediate business opportunities:
Transparent omni-channel information environment – an evolution of multi-channel characterized by a
seamless approach to the consumer experience through all available interaction channels.
Sentiment analysis – data from social media enable organizations to perceive and analyze client
sentiment in order to better tailor marketing campaigns, products and services.
Predictive models – determine likelihood to churn from real-time data streams and take pre-emptive
actions for customer retention.
Social technologies – not only understand holistically the client (the 360-degree view), but understand the
client's network of family, friends and peers in order to build the client 720-degree view.
Location data – better understand behaviour, better offers based on location.
Operational improvement: RFID and sensor networks allow retailers to get insights into demand and
better manage inventory and supply chains.
7. Big Data – Reference Architecture
•Typical architectures for Big Data address the following capabilities:
1.Real-time complex event processing (including sense and response).
2.Massive volumes of data (petabytes) relational and non-relational (e.g. social media, location, RFID).
3.Parallel processing/fast loading, typically based on Hadoop.
4.High-performance query systems based on in-memory data architectures.
5.Advanced analytics, e.g. visual analytics, columnar databases.
[Figure: Big Data reference architecture, a variant of the Forrester architecture [5].
Layers shown: Event Mgmt.; Query (SQL, non-SQL); Processing; Advanced Analytics; Stream Processing (Data Stream);
Data Management (Relational dbms, Non-relational dbms, Distributed File System, In-Memory Data Grid);
Infrastructure Services (Virtual Infrastructure Workload Management).
Annotations: shared-nothing hardware, massively parallel; commodity, own or rent; massive load via parallel processing.]
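The first capability, real-time complex event processing with sense and response, can be illustrated with a small sliding-window rule. This is only a sketch of the general idea, not how any particular product implements it; the event names, the threshold and the window length are made-up assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp in seconds, client id, event type).
Event = Tuple[float, str, str]


def detect_repeated_events(
    events: Iterable[Event],
    event_type: str,
    threshold: int,
    window_seconds: float,
) -> List[Tuple[float, str]]:
    """Return (timestamp, client_id) whenever a client produces `threshold`
    events of `event_type` within `window_seconds`.

    Assumes events arrive in timestamp order.
    """
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    responses: List[Tuple[float, str]] = []
    for timestamp, client_id, kind in events:
        if kind != event_type:
            continue
        window = recent[client_id]
        window.append(timestamp)
        # Sense: drop events that have fallen out of the time window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        # Respond: the pattern matched, so emit one response and reset.
        if len(window) >= threshold:
            responses.append((timestamp, client_id))
            window.clear()
    return responses


# Illustrative stream: client c1 views a product page three times in a minute.
stream = [
    (0, "c1", "product_page_view"),
    (10, "c2", "product_page_view"),
    (20, "c1", "product_page_view"),
    (25, "c1", "login"),
    (50, "c1", "product_page_view"),
    (400, "c2", "product_page_view"),
]
leads = detect_repeated_events(stream, "product_page_view", threshold=3, window_seconds=60)
print(leads)  # [(50, 'c1')]
```

Keeping only a short per-client window in memory is what lets this kind of rule run against a live stream; it is also why event management and in-memory data grids tend to appear together in the architecture.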
8. Big Data – at BMO Financial Group
[Figure: the reference architecture annotated with technologies at BMO Financial Group.
Layers shown: Event Mgmt.; Query (SQL, non-SQL); Processing; Advanced Analytics; Stream Processing;
Data Management (Relational dbms, Non-relational dbms, Distributed File System, In-Memory Data Grid);
Infrastructure Services (Virtual Infrastructure Workload Management).
Labels shown: Client Omni-Channel Interactions; Tableau, SAS; Spotfire, HANA; Tibco BusinessEvents;
Tibco ActiveSpaces, HANA; Sybase IQ; PaaS, IaaS.]
•Big Data is a work in progress at BMO Financial Group, with some areas more advanced than others:
Event management and in-memory data grids are state of the art.
Advanced analytics are in transition to mature.
Infrastructure virtualization is in progress.
Hadoop infrastructure is not in scope yet.
Non-relational capability is in its infancy.
Legend: Operational; Proof of Concept.
Note: the vendor list is by no means exhaustive; these are some of the technologies in use or in PoC.
9. Big Data – Capabilities at BMO Financial Group
•How the reference Big Data capabilities are reflected at BMO Financial Group:
1.Real-time complex event processing (including sense and response):
• Built a state-of-the-art omni-channel sense and response capability based on a Tibco stack.
• Deployed a real-time inbound lead management capability in 2011 that generated a significant increase
in up-sell and cross-sell – major new revenue for the Retail Bank.
2.Massive volumes of data (petabytes) relational and non-relational (e.g. social media, location, RFID):
• Data volumes manageable within the current infrastructure.
• Location data is currently available and there are plans to harvest it.
• There are plans to use social media data for sentiment analysis.
3.Parallel processing/fast loading, typically based on Hadoop:
• Not planned; the current ETL investment is performing well.
4.High-performance query architecture based on in-memory data architectures:
• Running a state-of-the-art in-memory data grid for real-time event processing as well as for the client
360-degree view.
• Currently evaluating in-memory data grids for real time risk management as well as several regulatory
requirements, like Anti-Money-Laundering and Client Risk Management.
5.Advanced analytics, e.g. visual analytics, columnar databases:
• Several advanced analytics tools are in use, such as Tableau and Sybase IQ, while Tibco Spotfire, HANA
and others are currently being evaluated.
10. Big Data – Impact on Enterprise Information Management
•Is the traditional MDM redundant?
By no means; while there are in-memory MDM implementations, it makes more sense to keep the current
investment and load only subsets of MDM data into in-memory databases, e.g. the client 360-degree view or any
other data elements needed for event management, sense and response or other capabilities.
•What will happen with the current EDW?
Not much; transactional data will still be an important source for BI. However, the full power of parallel
query processing and the parallelism built into hardware should be harnessed.
EDWs should be augmented with social data, location data, either directly or via service providers in order
to provide the foundation for sentiment analysis and predictive modeling.
•Are ETL tools done?
Depends. This is the sweet spot where vendors are pitching Hadoop. However, is your enterprise ready for
Hadoop? Are you ready to move to commodity hardware? Do you have the skills for both commodity
hardware and Hadoop?
•Time to retire current BI tools (e.g. Cognos, Business Objects, etc.)?
Definitely not; continue to use the current management reports and dashboards.
Educate business on the new visual analytic tools and let them decide the way forward.
Educate business on the new BI capabilities enabled by in-memory databases.
•However, be aware of the new competitor that is building its Information Management from scratch and, with
the proper Big Data technology, might compromise your established business advantage!
11. Big Data – Organizational challenges
•What needs to be done:
In Big Data initiatives business leaders have to take the initiative. The new role of the CIO team is to
educate business in Big Data and its opportunities rather than to define and lead initiatives.
CIOs have to take a holistic approach to Big Data by considering all Big Data capabilities and define
strategies accordingly, instead of focusing on some capabilities like fast ETL loading for which Hadoop is a
quick fix.
Adapt the Information Management Strategy to include behaviour-oriented data, like social data, as well as
location and sensor data.
Change the BI strategy towards commoditization and massive parallel processing.
Big Data requires a new skill set for handling Hadoop environments as well as in-memory data and advanced
analytics. McKinsey predicts a shortage of more than a hundred thousand Big Data professionals in
the US alone [6].
•Last but not least:
Big Data is an evolution of many technologies that have been around for the last decade or so. Although it has
the potential to be a technology disruptor, Big Data is rather an important augmentation to the current
technologies, and if used properly it can provide significant business benefits as well as competitive advantage.
12. Thank you for your time! Questions?
attila.barta@bmo.com
13. Appendix
1. References
2. Hadoop – a Definition
14. References
1. Laney, Douglas "3D Data Management: Controlling Data Volume, Velocity and Variety", Gartner, 2001.
2. Beyer, Mark "Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of
Data", Gartner, 2011.
3. Laney, Douglas "The Importance of 'Big Data': A Definition", Gartner, 2012.
4. Wallis, Nigel "Big Data in Canada: Challenging Complacency for Competitive Advantage", IDC, 2012.
5. Gogia, Sanchit "The Big Deal About Big Data For Customer Engagement", Forrester, 2012.
6. Manyika, James et al. "Big Data: The next frontier for innovation, competition and productivity", McKinsey
Global Institute, 2011.
15. Hadoop – a Definition
•Apache Hadoop is an open-source software framework that supports data-intensive distributed applications,
licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity
hardware. The Hadoop framework transparently provides both reliability and data motion to applications.
•Hadoop implements a computational paradigm named MapReduce, where the application is divided into many
small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition,
it provides a distributed file system that stores data on the compute nodes, providing very high aggregate
bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node
failures are automatically handled by the framework. It enables applications to work with thousands of
computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce
and Google File System (GFS) papers.
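The MapReduce paradigm described above can be sketched in a few lines of plain Python. This runs in a single process and does not use Hadoop or any Hadoop API; the function names are illustrative. It only shows the three logical steps: map each fragment independently, group the intermediate pairs by key, and reduce each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(fragment: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one fragment of input into (key, value) pairs.

    Each fragment is processed independently, so in a cluster this step
    could run (or be re-run after a failure) on any node.
    """
    for word in fragment.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key so each key is reduced in one place."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(fragments: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for fragment in fragments for pair in map_phase(fragment))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


fragments = ["big data is big", "data about data"]
counts = word_count(fragments)
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because neither the map nor the reduce step depends on shared state, a framework is free to spread the fragments and the keys across many machines and to repeat any piece of work that was lost with a failed node.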
•The entire Apache Hadoop “platform” is now commonly considered to consist of the Hadoop kernel,
MapReduce and Hadoop Distributed File System (HDFS), as well as a number of related projects –
including Apache Hive, Apache HBase, and others.
•Hadoop is written in the Java programming language and is a top-level Apache project being built and used
by a global community of contributors. Hadoop and its related projects (Hive, HBase, Zookeeper, and so on)
have many contributors from across the ecosystem. Though Java code is most common, any programming
language can be used with "streaming" to implement the "map" and "reduce" parts of the system.
Source: Wikipedia
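As an illustration of the "streaming" approach mentioned above, a mapper and a reducer can be written as ordinary text filters that exchange tab-separated key/value lines, with the reducer receiving its input sorted by key. The sketch below simulates that pipeline locally with plain Python; the function names and sample input are assumptions, and in an actual streaming job each function would be its own script reading standard input and writing standard output.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each word.

    Relies on the input being sorted by key, so that all lines for one
    word are adjacent and can be grouped in a single pass.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of the pipeline: map, sort by key, reduce.
sample = ["Big Data is big", "data about data"]
result = list(reducer(sorted(mapper(sample))))
for line in result:
    print(line)
```

Since the contract is just lines of text, the same two steps could be written in any language that can read and write text streams, which is the point the definition makes.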