In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples and see how a flexible next-generation Hadoop architecture gives you a head start on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
WANdisco Background
• WANdisco: Wide Area Network Distributed Computing
– Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
• Leader in tools for software engineers – Subversion
– Apache Software Foundation sponsor
• Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
• US patent for active-active replication technology granted, November 2012
• Global locations
– San Ramon (CA)
– Chengdu (China)
– Tokyo (Japan)
– Boston (MA)
– Sheffield (UK)
– Belfast (UK)
Non-Stop Hadoop
Non-Intrusive Plugin
to Hortonworks HDP
Provides Continuous Availability
In the LAN / Across the WAN
Active/Active
3 Problems for Sharing Data Across Clusters
LAN / WAN
Enterprise-Ready Hadoop
Characteristics of Mission-Critical Financial Applications
• Require Continuous Availability
– SLAs, regulatory compliance
• Require HDFS to be Deployed Globally
– Share data between data centers
– Data is consistent, not eventual
• Ease Administrative Burden
– Reduce operational complexity
– Simplify disaster recovery
– Lower RTO/RPO
• Allow Maximum Utilization of Resources
– Within the data center
– Across data centers
Breaking Away from Active/Passive
What’s in a NameNode
Single Standby
• Inefficient utilization of resources
– Journal Nodes
– ZooKeeper Nodes
– Standby Node
• Performance bottleneck
• Still tied to the beeper
• Limited to LAN scope
Active / Active
• All resources utilized
– Only NameNode configuration
– Scale as the cluster grows
– All NameNodes active
• Load balancing
• Set resiliency (# of active NN)
• Global consistency
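The active/active column rests on one idea: every NameNode applies the same metadata operations in the same globally agreed order. The sketch below is a deliberately simplified majority-quorum illustration of that idea — it is not WANdisco's actual (Paxos-based) coordination engine, and every class and name in it is hypothetical.

```python
# Toy sketch of majority-quorum coordination among active NameNodes.
# Illustration only: real active/active replication (e.g. Paxos-based)
# handles leader election, retries, and partitions, which are omitted.

class NameNode:
    def __init__(self, name):
        self.name = name
        self.namespace = {}          # path -> metadata: the replicated state

    def apply(self, op):
        path, meta = op
        self.namespace[path] = meta

class Coordinator:
    """Orders metadata operations; commits once a majority agrees."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.log = []                # globally agreed order of operations

    def propose(self, op, acks):
        # An op commits only if a majority of NameNodes acknowledge it.
        if acks >= len(self.nodes) // 2 + 1:
            self.log.append(op)
            for node in self.nodes:  # every active node applies the same order
                node.apply(op)
            return True
        return False                 # minority partition: op rejected

nodes = [NameNode(f"nn{i}") for i in range(3)]
coord = Coordinator(nodes)
coord.propose(("/data/trades", {"replication": 3}), acks=3)
coord.propose(("/data/quotes", {"replication": 2}), acks=2)   # majority of 3
assert all(n.namespace == nodes[0].namespace for n in nodes)  # all consistent
```

Because commitment requires only a majority, any single NameNode can fail without blocking writes — which is what lets "resiliency (# of active NN)" be a tunable setting rather than a fixed standby pair.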
Breaking Away from Active/Passive
What’s in a Data Center
Standby Data Center
• Idle Resource
– Single data center ingest
– Disaster recovery only
• One-way synchronization
– DistCp
• Error-prone
– Clusters can diverge over time
• Difficult to scale > 2 data centers
– Complexity of sharing data increases
Active / Active
• DR resource available
– Ingest at all data centers
– Run jobs in both data centers
• Replication is multi-directional
– Active/active
• Absolute consistency
– Single HDFS spans locations
• ‘N’ data center support
– Global HDFS allows appropriate data to be shared
Use Case: Disaster Recovery
• Data is as current as possible (no periodic syncs)
• Doesn’t require monitoring and consistency checking
• Virtually zero downtime to recover from regional data center failure
• Regulatory compliance
Use Case: Multi-Data Center
Ingest and multi-tenant workloads
• Ingest and analyze anywhere
• Analyze everywhere
– Fraud detection
– Equity trading information
– New business
– Etc.
• Backup data center(s) can be used for work
– No idle resources
Use Case: Heterogeneous Hardware
In-memory analytics
• Mixed hardware profiles
– Memory, disk, CPU
– Isolate memory-hungry processing (Storm/Spark) from regular jobs
• Share data, not processing
– Isolate lower-priority (dev/test) work
Data Reservoir Use Cases
(diagram: feeder sites flowing into a central data ocean, with accounting and banking marts drawn off it)
• Data Marts
– Restrict access to relevant data
– Create quick clusters
• Feeder Sites (Data Tributaries)
– Ingest only
Regulatory Compliance
(diagram: compliance, regulation, guidelines)
• Basel III
– Consistency of data
• Data Privacy Directive
– Data sovereignty
– Data doesn’t leave country of origin
Multi-Data Center Hadoop Today
What's wrong with the status quo
• Periodic synchronization (DistCp)
• Parallel data ingest (load balancer, streaming)
Multi-Data Center Hadoop Today
Hacks currently in use
Periodic Synchronization: DistCp
• Runs as MapReduce
• DR data center is read-only
• Over time, Hadoop clusters become inconsistent
• Manual and labor-intensive process to reconcile differences
• Inefficient use of the network
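The staleness problem with periodic synchronization can be made concrete with a toy timeline. This sketch is purely illustrative (it does not use any Hadoop API): files written to the primary after the last DistCp-style sync are simply absent from the DR copy if the primary fails.

```python
# Toy model of periodic (DistCp-style) sync: the DR site only catches up
# at each sync boundary, so anything written since the last sync is lost
# if the primary fails. All names and numbers here are hypothetical.

def files_lost_on_failure(writes, sync_interval, failure_time):
    """writes: list of (timestamp, filename) ingested at the primary."""
    last_sync = (failure_time // sync_interval) * sync_interval
    return [f for t, f in writes if last_sync < t <= failure_time]

writes = [(1, "a"), (5, "b"), (9, "c"), (13, "d")]
# Sync every 10 time units; the primary fails at t=13.
lost = files_lost_on_failure(writes, sync_interval=10, failure_time=13)
assert lost == ["d"]   # everything since the t=10 sync is gone

# Continuous replication is the sync_interval -> 0 limit: nothing is lost.
assert files_lost_on_failure(writes, sync_interval=1, failure_time=13) == []
```

In RPO terms: the recovery point of a periodically synced DR site is bounded only by the sync interval, whereas continuous replication drives it toward zero.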
Multi-Data Center Hadoop Today
Hacks currently in use
Parallel Data Ingest: Load Balancer, Flume
• Hiccups in either of the Hadoop clusters cause the two file systems to diverge
• Potential to run out of buffer when the WAN is down
• Requires constant attention and sys-admin hours to keep running
• Data created on the cluster is not replicated
• Streaming technologies (like Flume) can redirect data in flight, but they cover only streaming ingest
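The divergence failure mode of parallel ingest can be shown with a small simulation. This is a hedged sketch, not real ingest code: a "load balancer" writes every event to both clusters, one cluster misses a single write during a hiccup, and nothing in the design ever flags the gap.

```python
# Toy model of the "parallel ingest" hack: every event is written to both
# clusters independently. A single dropped write during a hiccup and the
# two file systems silently diverge. Names are illustrative only.

def dual_ingest(events, cluster_a_up, cluster_b_up):
    """cluster_*_up(i) -> bool: was the cluster reachable for write i?"""
    a, b = set(), set()
    for i, event in enumerate(events):
        if cluster_a_up(i):
            a.add(event)
        if cluster_b_up(i):
            b.add(event)
    return a, b

events = ["e1", "e2", "e3", "e4"]
always = lambda i: True
hiccup_at_2 = lambda i: i != 2        # cluster B misses one write

a, b = dual_ingest(events, always, hiccup_at_2)
assert a != b                          # the clusters have diverged
assert a - b == {"e3"}                 # ...and nothing flags the gap
```

Detecting this in practice requires exactly the kind of periodic consistency checking the slides call "manual and labor-intensive," which is the argument for coordinated active/active replication instead.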
Hortonworks’ approach is quite clear: we are focused on delivering enterprise-grade Hadoop as a reliable data platform that will enable your transition to a modern data architecture. To this end, we work solely within the broad open source community, with a focus on innovation at the core of Apache Hadoop with YARN as a foundation, and then within all the related projects that deliver on the key requirements for the enterprise such as governance, security and operations.
However, I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.
What we now know of as Hadoop really started in 2005, when a team at Yahoo was directed to build out a large-scale data storage and processing technology that would allow them to improve their most critical application, Search.
Their challenge was essentially two-fold. First they needed to capture and archive the contents of the internet, and then process the data so that users could search through it effectively and efficiently. Clearly traditional approaches were both technically (due to the size of the data) and commercially (due to the cost) impractical.
The result was the Apache Hadoop project, which delivered large-scale storage (HDFS) and processing (MapReduce). Yahoo soon committed to this open source approach as they understood that rather than locking a few guys in a room to work on it, they could work within the Apache Software Foundation so that others would pick it up, progress it, and contribute back to the community, thereby greatly accelerating progress for all.
And this is exactly what happened: all of the leading consumer web companies began to use and advance it, to the point that by 2011, Hadoop underpinned every click at Yahoo, and their infrastructure had reached 35,000 nodes.
Soon, mainstream IT started to look closely at Hadoop as a way to address the architectural challenge faced by the explosion of data that every organization was experiencing as mobile, social and machine generated data began to accelerate.
It was at this point, with the objective of facilitating broader market adoption, that the core Hadoop team left to form Hortonworks – with the blessing of Yahoo – and with a singular goal: to progress Hadoop into Enterprise Hadoop – a complete open source data platform that enables a modern data architecture.
Since our inception just three years ago, we have grown to more than 450 employees and have partnered closely with the leaders in the datacenter, all of whom share this vision: to enable a modern data architecture with Hadoop in order to allow their customers to address the architectural challenge they all face due to exploding data volumes.
[note: if useful as a talk track, Doug Cutting was hired by the team at Yahoo in the early days based on some prototype work he’d done and he left to form Cloudera in 2008 well BEFORE Hadoop was running at scale inside of Yahoo]
In the past few years, we have seen phenomenal momentum behind Hadoop and Hortonworks: we began shipping the Hortonworks Data Platform in Q3 of 2012 (only 8 quarters ago), and since, we have partnered with more than 350 customers as their Hadoop provider of choice, 2/3 of which are in the Fortune 1000.
The most interesting aspect for me: the number of early adopters that began their Hadoop journey with an alternative distribution (Cloudera had been in the market three years before Hortonworks was formed) and have since migrated over to partner with Hortonworks for Hadoop.
These are the very largest users on the planet, who, having gotten past their initial forays with the tech, now really understand what they want from their Hadoop vendor and have migrated en masse to Hortonworks.
Why? I’ll talk about that a bit more detail, but at a high level:
- Open leadership: Hortonworks engineers literally wrote most of the code that needs to be supported and are leading the innovation in the community
- Enterprise Rigor: we apply enterprise software rigor to the build, test and release process from the work done in the open source community
- Ecosystem endorsement: the Hortonworks Data Platform is deeply integrated with existing datacenter investments allowing users to reuse existing skills. In fact HDP is uniquely sold by many of the major vendors in the ecosystem.
Not only is Hortonworks the fastest-growing Hadoop company, it is also a leader…
We contribute more lines of code to Apache Hadoop than any other company. Our engineers are architects who lead innovation in the open community. Our customers turn to us for these reasons… and the analyst community agrees.
Our momentum is represented in the latest Forrester Wave, wherein Hortonworks was ranked the #1 overall offering – this based on HDP 1.3, a product released in May 2013. As a reference, we are currently on 2.1.
Not surprisingly given our leadership position in the community we received the very highest rating for Vision and Execution, an acknowledgement that our engineering team is driving the majority of the innovation in Hadoop.
And we were also acknowledged for our deep strategic partnerships: in fact, HDP was represented in the Wave 3 separate times, as we ARE the Hadoop offering for Microsoft and Teradata, both of whom were ranked in the Wave.
But most importantly: we received the maximum possible score for our support services. This is ultimately the most important decision criterion… can your partner support your critical deployment? Not surprisingly, the people who built the tech are in the best position to support it. It is for this reason that others who package the work done at the Apache Software Foundation (MapR, Cloudera, Pivotal, IBM, etc.) are not able to provide the same level of support as Hortonworks.
Finally, there is only ONE Apache Hadoop. Every other package of Hadoop is a vendor derivation of the platform.
At Hortonworks, everything we package in HDP is from the very latest components at the Apache Software Foundation. This ensures that our customers have access to the very latest innovation from the community, to which we then apply enterprise software rigor in the build, test and release process to create HDP.
HDP “IS” Apache Hadoop – it is not a vendor derivative that has been forked and modified, it IS Apache Hadoop, no additions, no hold-backs.
When comparing Hadoop vendors’ offerings, it is critical to understand this picture, as it makes clear where vendors diverge from the community approach and ultimately lock customers out of community innovation.
Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
They want a single platform that enables batch, interactive and real-time workloads
So, where does Hadoop fit in the data center? This picture here is a very simple depiction of the typical data architecture in any organization.
- There are sources of data: ERP, CRM, other digital sources
- That data is then stored in a data system: a data warehouse, MPP system, etc
- Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application
This has been the foundation of the data center for years. We have had some challenges with this architecture all along; however, we are seeing increased pressure to modify and improve this basic blueprint because
A) this approach created silos of data and it was difficult to either share the data or get a single view of it
B) these systems are costly to scale
C) and they are also coupled to a very static schema. Changes to a data model are difficult if not impossible. This limits flexibility and insight.
Finally, the emergence of NEW types of data as we digitize the world around us such as clickstream, machine sensor, etc, are growing at exponential rates. We are all becoming data driven organizations.
In fact, the sheer volume of data is projected to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
In response, many organizations have turned to Apache Hadoop. Originally created by the team at Yahoo, it introduced a scale-out approach to the storage and processing of data that could scale linearly in an extremely cost-effective manner.
However traditional Hadoop has had its own limitations:
- Architecturally, Hadoop in this sense was a purely batch system: load data into HDFS and then utilize MapReduce to run a batch lookup. While useful, it limited the kinds of applications that could be built
- It required a dedicated cluster per use case: the lack of a central resource manager meant that a given application would monopolize all of the resources of the cluster until that particular job was completed.
- Traditional Hadoop was also not well suited to integration with existing environments: integration tended to be custom for each application
- Finally, it lacked the enterprise capabilities that mainstream IT requires
Rather than enabling a modern data architecture, in some cases it created yet more silos.
[Some vendors want to return to this, with a single engine]
This all changed with the introduction of Hadoop 2 and YARN in October 2013. It changed everything.
First proposed in MR-279 by Arun Murthy in 2009, YARN was architected by Arun and the team at Hortonworks, who led its development as the core change in Hadoop 2. Our view was that to truly enable Hadoop as a component of a broad data architecture, YARN was the fundamental requirement, as it turns Hadoop from a single-application data system into a multi-application data system. This is foundational to our approach of innovating from the core outwards to build Enterprise Hadoop.
With YARN it is now possible to land all data in one cluster and then access it in multiple ways: from batch to interactive to real-time.
Today, YARN, at the core of Hadoop is the center of our focus on innovation in and around Hadoop. It is clearly the enabling technology that has started a transition to a data lake within organizations.
Simply stated… Hortonworks Architected & led development of YARN in order to enable the Modern Data Architecture
YARN is the element that enables the modern data architecture, as it turns Hadoop into a truly multi-purpose data platform with batch, interactive and real-time workloads all running in a single cluster.
It enables users to:
- Create a central cluster into which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time.
- It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure.
It is the architectural center of Hadoop
- it provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated
- It is the integration point into which all data processing engines integrate – from the open source community but also from the commercial vendor ecosystem
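The "data operating system" idea above — one central authority granting cluster resources to many applications instead of one application owning the whole cluster — can be sketched in a few lines. This is a deliberate simplification with hypothetical names, not the real YARN ResourceManager, whose scheduling (queues, capacities, preemption) is far richer.

```python
# Toy sketch of the YARN idea: a central resource manager hands out
# containers from a shared pool, so batch, interactive, and real-time
# apps coexist on one cluster instead of each monopolizing its own.

class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers
        self.allocations = {}

    def request(self, app, containers):
        # No app can take more than what is currently free in the pool.
        granted = min(containers, self.free)
        self.free -= granted
        self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app):
        # When an app finishes, its containers return to the shared pool.
        self.free += self.allocations.pop(app, 0)

rm = ResourceManager(total_containers=10)
assert rm.request("batch-etl", 6) == 6        # batch job gets 6 containers
assert rm.request("interactive-sql", 6) == 4  # capped at what remains
rm.release("batch-etl")
assert rm.free == 6                           # capacity returns to the pool
```

Contrast this with pre-YARN Hadoop: without the central pool, "batch-etl" would have required its own dedicated cluster, and its capacity would sit idle between runs.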
Hadoop has evolved over the years to not only provide linear scale compute and storage, but it also needed explicit functions to make it a complete data platform. These new projects spun up around Hadoop to meet some of the complex requirements of the modern enterprise
A good way to look at the evolution of Hadoop is through this picture.
- When Hadoop began it was simply a data management layer (HDFS) and a single data access engine (MapReduce). Over the past several years the range of components in the Hadoop ecosystem has exploded:
- Data Access - The emergence of multiple access engines spanning SQL, NoSQL, Scripting, Streaming and more. YARN ensures that they all can be part of Hadoop seamlessly.
- Security - To address the key requirements of authorization, access, audit/accounting and data protection
- Operations - Tools to manage the platform
- Governance and integration - Tools to load and manage data according to policy
These are all the core requirements of any data platform and over time the Hadoop community has expanded to include all of these capabilities. The reason that there are 5 categories?
Because each addresses the requirements of each different persona that engages with a data platform.
Developers (Data Access)
Administrators (Security, Operations)
Data Architects (Governance and Integration)
Capturing new data and providing the ability to process streams of this data is allowing organizations to shift from taking a REACTIVE, post transaction approach to more of a PROACTIVE, pre decision approach to interactions with their customers, suppliers and employees.
Again, no matter the vertical, this transition is happening.
For instance… read.
Ultimately, most organizations that adopt Hadoop create a data lake. A data lake provides a single data repository on shared infrastructure and serves the needs of multiple business applications, all running on a single set of data. This visionary architecture was not possible until October of 2013, when Hortonworks and the community delivered YARN GA.
With a YARN-based architecture serving as the data operating system for Hadoop 2, HDP takes Hadoop beyond single-use, batch processing to a fully functional, multi-use platform that enables batch, interactive, and real-time data processing. Leading organizations can now use YARN and HDP to process data and derive value from multiple business cases and realize their vision of a data lake.
The value in delivering multiple access methods on a single set of data extends beyond data science. It allows a business to set an architecture where it can deliver multiple value points all across a single set of data to create an enterprise capability previously only imagined. For instance, an organization can analyze real-time clickstream data using Apache Storm to pick off events that need attention, run an Apache Pig script to update product catalog recommendations, and then deliver this information via low-latency access through Apache HBase to millions of web visitors—all in real time, and all on a single set of data.
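The Storm + Pig + HBase example above boils down to three access patterns over one data set: streaming alerting, batch aggregation, and low-latency per-key serving. The sketch below illustrates that shape only — a Python list stands in for HDFS, and the three plain functions stand in for the real engines; nothing here uses actual Storm, Pig, or HBase APIs.

```python
# Toy sketch of multiple "engines" over one shared data set, mirroring
# the Storm + Pig + HBase example. The shared list stands in for HDFS.

clickstream = [
    {"user": "u1", "page": "/checkout", "error": False},
    {"user": "u2", "page": "/product/42", "error": True},
    {"user": "u1", "page": "/product/42", "error": False},
]

def stream_alerts(events):
    # Storm-like role: pick off events that need immediate attention.
    return [e for e in events if e["error"]]

def batch_recommend(events):
    # Pig-like role: batch aggregation over the full data set.
    views = {}
    for e in events:
        if e["page"].startswith("/product/"):
            views[e["page"]] = views.get(e["page"], 0) + 1
    return views

def serve_user(events, user):
    # HBase-like role: low-latency lookup keyed by user.
    return [e["page"] for e in events if e["user"] == user]

assert stream_alerts(clickstream) == [clickstream[1]]
assert batch_recommend(clickstream) == {"/product/42": 2}
assert serve_user(clickstream, "u1") == ["/checkout", "/product/42"]
```

The point of the architecture is that all three consumers read the same copy of the data: no exports, no per-application silo, no synchronization between them.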
The modern data architecture simply does not work unless it integrates with the systems and tools you already deploy. HDP enables your existing data platforms to expand the data they have under management through integration. The goal of HDP is to augment, not replace, these existing systems, as we clearly understand that you need to reuse skills.
Further, through our work within the Hadoop community to deliver YARN, we have opened up Hadoop and unlocked innovation: data center ISVs can extend their applications so that they run natively IN Hadoop as just another workload operating on the single set of data in the lake. They can now function as first-class citizens alongside any other workload in Hadoop.
Hundreds of organizations have turned to Hortonworks because Hadoop is ultimately a platform decision. It is typically the first step towards re-architecting your back end data systems and not to be considered lightly.
These organizations that have already been successful with Hadoop have required not just a stable, reliable and complete Hadoop solution, but more importantly a connection with the architects, builders and operators of this open source technology. They saw this in Hortonworks.
And as with any platform decision, it is imperative that Hadoop integrates with the tools and systems that are already resident in your data center. We forge deep relationships with our hundreds of partners so that you can not only ensure integration but also effectively reapply existing systems and skillsets toward your big data challenges.
At Hortonworks, we hold true to these foundational beliefs and have partnered with hundreds of organizations, from some of the largest and earliest big data adopters to the most conservative and data-rich companies on the planet. We ensure that your Hadoop journey is successful, and more companies are turning to Hortonworks today than to any other offering on the marketplace. We invite you to join our community.
• Maximize resource utilization
– No idle standby
• Isolate dev and test clusters
– Share data, not resources
– Carve off hardware for a specific group
– Prevents a bad MapReduce job from bringing down the cluster
• Guarantee consistency and availability of data
– Data is instantly available
• Optimized hardware profiles for job-specific tasks
– Batch
– Real-time
– NoSQL (HBase)
• Set replication factors per sub-cluster
• Use at LAN or WAN scope
• Resilient to NameNode failures