In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples and see how a flexible next-generation Hadoop architecture gives you a head start on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
WANdisco Background
• WANdisco: Wide Area Network Distributed Computing
– Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
• Leader in tools for software engineers – Subversion
– Apache Software Foundation sponsor
• Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
• US patent for active-active replication technology granted, November 2012
• Global locations
– San Ramon (CA)
– Chengdu (China)
– Tokyo (Japan)
– Boston (MA)
– Sheffield (UK)
– Belfast (UK)
Non-Stop Hadoop
Non-Intrusive Plugin
to Hortonworks HDP
Provides Continuous Availability
In the LAN / Across the WAN
Active/Active
3 Problems for Sharing Data Across Clusters
LAN / WAN
Enterprise-Ready Hadoop
Characteristics of Mission-Critical Financial Applications
• Require Continuous Availability
– SLAs, regulatory compliance
• Require HDFS to be Deployed Globally
– Share data between data centers
– Data is consistent, not eventual
• Ease Administrative Burden
– Reduce operational complexity
– Simplify disaster recovery
– Lower RTO/RPO
• Allow Maximum Utilization of Resources
– Within the data center
– Across data centers
Breaking Away from Active/Passive
What’s in a NameNode
Single Standby
• Inefficient utilization of resources
– Journal Nodes
– ZooKeeper Nodes
– Standby Node
• Performance bottleneck
• Still tied to the beeper
• Limited to LAN scope
Active / Active
• All resources utilized
– Only NameNode configuration
– Scale as the cluster grows
– All NameNodes active
• Load balancing
• Set resiliency (# of active NN)
• Global consistency
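The active/active column rests on one idea: every NameNode applies the same metadata operations in the same globally agreed order. The sketch below is a deliberately simplified majority-quorum illustration of that idea — it is not WANdisco's actual (Paxos-based) coordination engine, and every class and name in it is hypothetical.

```python
# Toy sketch of majority-quorum coordination among active NameNodes.
# Illustration only: real active/active replication (e.g. Paxos-based)
# handles leader election, retries, and partitions, which are omitted.

class NameNode:
    def __init__(self, name):
        self.name = name
        self.namespace = {}          # path -> metadata: the replicated state

    def apply(self, op):
        path, meta = op
        self.namespace[path] = meta

class Coordinator:
    """Orders metadata operations; commits once a majority agrees."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.log = []                # globally agreed order of operations

    def propose(self, op, acks):
        # An op commits only if a majority of NameNodes acknowledge it.
        if acks >= len(self.nodes) // 2 + 1:
            self.log.append(op)
            for node in self.nodes:  # every active node applies the same order
                node.apply(op)
            return True
        return False                 # minority partition: op rejected

nodes = [NameNode(f"nn{i}") for i in range(3)]
coord = Coordinator(nodes)
coord.propose(("/data/trades", {"replication": 3}), acks=3)
coord.propose(("/data/quotes", {"replication": 2}), acks=2)   # majority of 3
assert all(n.namespace == nodes[0].namespace for n in nodes)  # all consistent
```

Because commitment requires only a majority, any single NameNode can fail without blocking writes — which is what lets "resiliency (# of active NN)" be a tunable setting rather than a fixed standby pair.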
Breaking Away from Active/Passive
What’s in a Data Center
Standby Data Center
• Idle Resource
– Single data center ingest
– Disaster recovery only
• One-way synchronization
– DistCp
• Error-prone
– Clusters can diverge over time
• Difficult to scale > 2 data centers
– Complexity of sharing data increases
Active / Active
• DR resource available
– Ingest at all data centers
– Run jobs in both data centers
• Replication is multi-directional
– Active/active
• Absolute consistency
– Single HDFS spans locations
• ‘N’ data center support
– Global HDFS allows appropriate data to be shared
Use Case: Disaster Recovery
• Data is as current as possible (no periodic syncs)
• Doesn’t require monitoring and consistency checking
• Virtually zero downtime to recover from regional data center failure
• Regulatory compliance
Use Case: Multi-Data Center
Ingest and multi-tenant workloads
• Ingest and analyze anywhere
• Analyze everywhere
– Fraud detection
– Equity trading information
– New business
– Etc.
• Backup data center(s) can be used for work
– No idle resources
Use Case: Heterogeneous Hardware
In-memory analytics
• Mixed hardware profiles
– Memory, disk, CPU
– Isolate memory-hungry processing (Storm/Spark) from regular jobs
• Share data, not processing
– Isolate lower-priority (dev/test) work
Data Reservoir Use Cases
(diagram: feeder sites flowing into a central data ocean, with accounting and banking marts drawn off it)
• Data Marts
– Restrict access to relevant data
– Create quick clusters
• Feeder Sites (Data Tributaries)
– Ingest only
Regulatory Compliance
(diagram: compliance, regulation, guidelines)
• Basel III
– Consistency of data
• Data Privacy Directive
– Data sovereignty
– Data doesn’t leave country of origin
Multi-Data Center Hadoop Today
What's wrong with the status quo
• Periodic synchronization (DistCp)
• Parallel data ingest (load balancer, streaming)
Multi-Data Center Hadoop Today
Hacks currently in use
Periodic Synchronization: DistCp
• Runs as MapReduce
• DR data center is read-only
• Over time, Hadoop clusters become inconsistent
• Manual and labor-intensive process to reconcile differences
• Inefficient use of the network
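The staleness problem with periodic synchronization can be made concrete with a toy timeline. This sketch is purely illustrative (it does not use any Hadoop API): files written to the primary after the last DistCp-style sync are simply absent from the DR copy if the primary fails.

```python
# Toy model of periodic (DistCp-style) sync: the DR site only catches up
# at each sync boundary, so anything written since the last sync is lost
# if the primary fails. All names and numbers here are hypothetical.

def files_lost_on_failure(writes, sync_interval, failure_time):
    """writes: list of (timestamp, filename) ingested at the primary."""
    last_sync = (failure_time // sync_interval) * sync_interval
    return [f for t, f in writes if last_sync < t <= failure_time]

writes = [(1, "a"), (5, "b"), (9, "c"), (13, "d")]
# Sync every 10 time units; the primary fails at t=13.
lost = files_lost_on_failure(writes, sync_interval=10, failure_time=13)
assert lost == ["d"]   # everything since the t=10 sync is gone

# Continuous replication is the sync_interval -> 0 limit: nothing is lost.
assert files_lost_on_failure(writes, sync_interval=1, failure_time=13) == []
```

In RPO terms: the recovery point of a periodically synced DR site is bounded only by the sync interval, whereas continuous replication drives it toward zero.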
Multi-Data Center Hadoop Today
Hacks currently in use
Parallel Data Ingest: Load Balancer, Flume
• Hiccups in either of the Hadoop clusters cause the two file systems to diverge
• Potential to run out of buffer when the WAN is down
• Requires constant attention and sys-admin hours to keep running
• Data created on the cluster is not replicated
• Streaming technologies (like Flume) can redirect data in flight, but they cover only streaming ingest
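The divergence failure mode of parallel ingest can be shown with a small simulation. This is a hedged sketch, not real ingest code: a "load balancer" writes every event to both clusters, one cluster misses a single write during a hiccup, and nothing in the design ever flags the gap.

```python
# Toy model of the "parallel ingest" hack: every event is written to both
# clusters independently. A single dropped write during a hiccup and the
# two file systems silently diverge. Names are illustrative only.

def dual_ingest(events, cluster_a_up, cluster_b_up):
    """cluster_*_up(i) -> bool: was the cluster reachable for write i?"""
    a, b = set(), set()
    for i, event in enumerate(events):
        if cluster_a_up(i):
            a.add(event)
        if cluster_b_up(i):
            b.add(event)
    return a, b

events = ["e1", "e2", "e3", "e4"]
always = lambda i: True
hiccup_at_2 = lambda i: i != 2        # cluster B misses one write

a, b = dual_ingest(events, always, hiccup_at_2)
assert a != b                          # the clusters have diverged
assert a - b == {"e3"}                 # ...and nothing flags the gap
```

Detecting this in practice requires exactly the kind of periodic consistency checking the slides call "manual and labor-intensive," which is the argument for coordinated active/active replication instead.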
Hortonworks’ approach is quite clear: we are focused on delivering enterprise-grade Hadoop as a reliable data platform that will enable your transition to a modern data architecture. To this end, we work solely within the broad open source community, with a focus on innovation at the core of Apache Hadoop with YARN as a foundation, and then within all the related projects that deliver on the key requirements for the enterprise such as governance, security and operations.
However, I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.
What we now know of as Hadoop really started in 2005, when a team at Yahoo was directed to build out a large-scale data storage and processing technology that would allow them to improve their most critical application, Search.
Their challenge was essentially two-fold. First they needed to capture and archive the contents of the internet, and then process the data so that users could search through it effectively and efficiently. Clearly traditional approaches were both technically (due to the size of the data) and commercially (due to the cost) impractical.
The result was the Apache Hadoop project, which delivered large-scale storage (HDFS) and processing (MapReduce). Yahoo soon committed to this open source approach as they understood that rather than locking a few guys in a room to work on it, they could work within the Apache Software Foundation so that others would pick it up, progress it, and contribute back to the community, thereby greatly accelerating progress for all.
And this is exactly what happened: all of the leading consumer web companies began to use and advance it, to the point that by 2011, Hadoop underpinned every click at Yahoo, and their infrastructure had reached 35,000 nodes.
Soon, mainstream IT started to look closely at Hadoop as a way to address the architectural challenge faced by the explosion of data that every organization was experiencing as mobile, social and machine generated data began to accelerate.
It was at this point, with the objective of facilitating broader market adoption, that the core Hadoop team left to form Hortonworks – with the blessing of Yahoo – and with a singular goal: to progress Hadoop into Enterprise Hadoop – a complete open source data platform that enables a modern data architecture.
Since our inception just three years ago, we have grown to more than 450 employees and have partnered closely with the leaders in the datacenter, all of whom share this vision: to enable a modern data architecture with Hadoop in order to allow their customers to address the architectural challenge they all face due to exploding data volumes.
[note: if useful as a talk track, Doug Cutting was hired by the team at Yahoo in the early days based on some prototype work he’d done and he left to form Cloudera in 2008 well BEFORE Hadoop was running at scale inside of Yahoo]
In the past few years, we have seen phenomenal momentum behind Hadoop and Hortonworks: we began shipping the Hortonworks Data Platform in Q3 of 2012 (only 8 quarters ago), and since, we have partnered with more than 350 customers as their Hadoop provider of choice, 2/3 of which are in the Fortune 1000.
The most interesting aspect for me: the number of early adopters that began their Hadoop journey with an alternative distribution (Cloudera had been in the market three years before Hortonworks was formed) and have since migrated over to partner with Hortonworks for Hadoop.
These are the very largest users on the planet, who, having gotten past their initial forays with the tech, now really understand what they want from their Hadoop vendor and have migrated en masse to Hortonworks.
Why? I’ll talk about that a bit more detail, but at a high level:
- Open leadership: Hortonworks engineers literally wrote most of the code that needs to be supported and are leading the innovation in the community
- Enterprise Rigor: we apply enterprise software rigor to the build, test and release process from the work done in the open source community
- Ecosystem endorsement: the Hortonworks Data Platform is deeply integrated with existing datacenter investments allowing users to reuse existing skills. In fact HDP is uniquely sold by many of the major vendors in the ecosystem.
Not only is Hortonworks the fastest-growing Hadoop company, it is also a leader…
We contribute more lines of code to Apache Hadoop than any other company. Our engineers are architects who lead innovation in the open community. Our customers turn to us for these reasons… and the analyst community agrees.
Our momentum is represented in the latest Forrester Wave, wherein Hortonworks was ranked the #1 overall offering – this based on HDP 1.3, a product released in May 2013. As a reference, we are currently on 2.1.
Not surprisingly given our leadership position in the community we received the very highest rating for Vision and Execution, an acknowledgement that our engineering team is driving the majority of the innovation in Hadoop.
And we were also acknowledged for our deep strategic partnerships: in fact, HDP was represented in the Wave 3 separate times, as we ARE the Hadoop offering for Microsoft and Teradata, both of whom were ranked in the Wave.
But most importantly: we received the maximum possible score for our support services. This is ultimately the most important decision criterion… can your partner support your critical deployment? Not surprisingly, the people who built the tech are in the best position to support it. It is for this reason that others who package the work done at the Apache Software Foundation (MapR, Cloudera, Pivotal, IBM, etc.) are not able to provide the same level of support as Hortonworks.
Finally, there is only ONE Apache Hadoop. Every other package of Hadoop is a vendor derivation of the platform.
At Hortonworks, everything we package in HDP is from the very latest components at the Apache Software Foundation. This ensures that our customers have access to the very latest innovation from the community, to which we then apply enterprise software rigor in the build, test and release process to create HDP.
HDP “IS” Apache Hadoop – it is not a vendor derivative that has been forked and modified, it IS Apache Hadoop, no additions, no hold-backs.
When comparing Hadoop vendors’ offerings, it is critical to understand this picture, as it makes clear where vendors diverge from the community approach and ultimately lock customers out of community innovation.
Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
They want a single platform that enables batch, interactive and real-time workloads
So, where does Hadoop fit in the data center? This picture here is a very simple depiction of the typical data architecture in any organization.
- There are sources of data: ERP, CRM, other digital sources
- That data is then stored in a data system: a data warehouse, MPP system, etc
- Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application
This has been the foundation of the data center for years. We have had some challenges with this architecture all along; however, we are seeing increased pressure to modify and improve this basic blueprint because
A) this approach created silos of data and it was difficult to either share the data or get a single view of it
B) these systems are costly to scale
C) and they are also coupled to a very static schema. Changes to a data model are difficult if not impossible. This limits flexibility and insight.
Finally, the emergence of NEW types of data as we digitize the world around us such as clickstream, machine sensor, etc, are growing at exponential rates. We are all becoming data driven organizations.
In fact, the sheer volume of data is projected to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
In response, many organizations have turned to Apache Hadoop. Originally created by the team at Yahoo, it introduced a scale-out approach to the storage and processing of data that could scale linearly in an extremely cost-effective manner.
However traditional Hadoop has had its own limitations:
- Architecturally, Hadoop in this sense was a purely batch system: load data into HDFS and then utilize MapReduce to run a batch lookup. While useful, it limited the kinds of applications that could be built
- It required a dedicated cluster per use case: the lack of a central resource manager meant that a given application would monopolize all of the resources of the cluster until that particular job was completed.
- Traditional Hadoop was also not well suited to integration with existing environments: integration tended to be custom for each application
- Finally, it lacked the enterprise capabilities that mainstream IT requires
Rather than enabling a modern data architecture, in some cases it created yet more silos.
[Some vendors want to return to this, with a single engine]
This all changed with the introduction of Hadoop 2 and YARN in October 2013. It changed everything.
First proposed in MR-279 by Arun Murthy in 2009, YARN was architected by Arun and the team at Hortonworks, who led its development as the core change in Hadoop 2. Our view was that to truly enable Hadoop as a component of a broad data architecture, YARN was the fundamental requirement, as it turns Hadoop from a single-application data system into a multi-application data system. This is foundational to our approach of innovating from the core outwards to build Enterprise Hadoop.
With YARN it is now possible to land all data in one cluster and then access it in multiple ways: from batch to interactive to real-time.
Today, YARN, at the core of Hadoop is the center of our focus on innovation in and around Hadoop. It is clearly the enabling technology that has started a transition to a data lake within organizations.
Simply stated… Hortonworks Architected & led development of YARN in order to enable the Modern Data Architecture
YARN is the element that enables the modern data architecture, as it turns Hadoop into a truly multi-purpose data platform with batch, interactive and real-time workloads all running in a single cluster.
It enables users to:
- Create a central cluster into which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time.
- It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure.
It is the architectural center of Hadoop
- it provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated
- It is the integration point into which all data processing engines integrate – from the open source community but also from the commercial vendor ecosystem
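The "data operating system" idea above — one central authority granting cluster resources to many applications instead of one application owning the whole cluster — can be sketched in a few lines. This is a deliberate simplification with hypothetical names, not the real YARN ResourceManager, whose scheduling (queues, capacities, preemption) is far richer.

```python
# Toy sketch of the YARN idea: a central resource manager hands out
# containers from a shared pool, so batch, interactive, and real-time
# apps coexist on one cluster instead of each monopolizing its own.

class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers
        self.allocations = {}

    def request(self, app, containers):
        # No app can take more than what is currently free in the pool.
        granted = min(containers, self.free)
        self.free -= granted
        self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app):
        # When an app finishes, its containers return to the shared pool.
        self.free += self.allocations.pop(app, 0)

rm = ResourceManager(total_containers=10)
assert rm.request("batch-etl", 6) == 6        # batch job gets 6 containers
assert rm.request("interactive-sql", 6) == 4  # capped at what remains
rm.release("batch-etl")
assert rm.free == 6                           # capacity returns to the pool
```

Contrast this with pre-YARN Hadoop: without the central pool, "batch-etl" would have required its own dedicated cluster, and its capacity would sit idle between runs.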
Hadoop has evolved over the years to not only provide linear scale compute and storage, but it also needed explicit functions to make it a complete data platform. These new projects spun up around Hadoop to meet some of the complex requirements of the modern enterprise
A good way to look at the evolution of Hadoop is through this picture.
- When Hadoop began it was simply a data management layer (HDFS) and a single data access engine (MapReduce). Over the past several years the range of components in the Hadoop ecosystem has exploded:
- Data Access - The emergence of multiple access engines spanning SQL, NoSQL, Scripting, Streaming and more. YARN ensures that they all can be part of Hadoop seamlessly.
- Security - To address the key requirements of authorization, access, audit/accounting and data protection
- Operations - Tools to manage the platform
- Governance and integration - Tools to load and manage data according to policy
These are all the core requirements of any data platform and over time the Hadoop community has expanded to include all of these capabilities. The reason that there are 5 categories?
Because each addresses the requirements of each different persona that engages with a data platform.
Developers (Data Access)
Administrators (Security, Operations)
Data Architects (Governance and Integration)
Capturing new data and providing the ability to process streams of this data is allowing organizations to shift from taking a REACTIVE, post transaction approach to more of a PROACTIVE, pre decision approach to interactions with their customers, suppliers and employees.
Again, no matter the vertical, this transition is happening.
For instance… read.
Ultimately, most organizations that adopt Hadoop create a data lake. A data lake provides a single data repository on shared infrastructure and serves the needs of multiple business applications, all running on a single set of data. This visionary architecture was not possible until October of 2013, when Hortonworks and the community delivered YARN GA.
With a YARN-based architecture serving as the data operating system for Hadoop 2, HDP takes Hadoop beyond single-use, batch processing to a fully functional, multi-use platform that enables batch, interactive, and real-time data processing. Leading organizations can now use YARN and HDP to process data and derive value from multiple business cases and realize their vision of a data lake.
The value in delivering multiple access methods on a single set of data extends beyond data science. It allows a business to set an architecture where it can deliver multiple value points all across a single set of data to create an enterprise capability previously only imagined. For instance, an organization can analyze real-time clickstream data using Apache Storm to pick off events that need attention, run an Apache Pig script to update product catalog recommendations, and then deliver this information via low-latency access through Apache HBase to millions of web visitors—all in real time, and all on a single set of data.
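The Storm + Pig + HBase example above boils down to three access patterns over one data set: streaming alerting, batch aggregation, and low-latency per-key serving. The sketch below illustrates that shape only — a Python list stands in for HDFS, and the three plain functions stand in for the real engines; nothing here uses actual Storm, Pig, or HBase APIs.

```python
# Toy sketch of multiple "engines" over one shared data set, mirroring
# the Storm + Pig + HBase example. The shared list stands in for HDFS.

clickstream = [
    {"user": "u1", "page": "/checkout", "error": False},
    {"user": "u2", "page": "/product/42", "error": True},
    {"user": "u1", "page": "/product/42", "error": False},
]

def stream_alerts(events):
    # Storm-like role: pick off events that need immediate attention.
    return [e for e in events if e["error"]]

def batch_recommend(events):
    # Pig-like role: batch aggregation over the full data set.
    views = {}
    for e in events:
        if e["page"].startswith("/product/"):
            views[e["page"]] = views.get(e["page"], 0) + 1
    return views

def serve_user(events, user):
    # HBase-like role: low-latency lookup keyed by user.
    return [e["page"] for e in events if e["user"] == user]

assert stream_alerts(clickstream) == [clickstream[1]]
assert batch_recommend(clickstream) == {"/product/42": 2}
assert serve_user(clickstream, "u1") == ["/checkout", "/product/42"]
```

The point of the architecture is that all three consumers read the same copy of the data: no exports, no per-application silo, no synchronization between them.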
The modern data architecture simply does not work unless it integrates with the systems and tools you already deploy. HDP enables your existing data platforms to expand the data they have under management through integration. The goal of HDP is to augment, not replace, these existing systems, as we clearly understand that you need to reuse skills.
Further, through our work within the Hadoop community to deliver YARN, we have opened up Hadoop and unlocked innovation: data center ISVs can extend their applications so that they run natively IN Hadoop as just another workload operating on the single set of data in the lake. They can now function as first-class citizens alongside any other workload in Hadoop.
Hundreds of organizations have turned to Hortonworks because Hadoop is ultimately a platform decision. It is typically the first step towards re-architecting your back end data systems and not to be considered lightly.
These organizations that have already been successful with Hadoop have required not just a stable, reliable and complete Hadoop solution, but more importantly a connection with the architects, builders and operators of this open source technology. They saw this in Hortonworks.
And as with any platform decision, it is imperative that Hadoop integrates with the tools and systems that are already resident in your data center. We forge deep relationships with our hundreds of partners so that you can not only ensure integration but also effectively reapply existing systems and skillsets toward your big data challenges.
At Hortonworks, we hold true to these foundational beliefs and have partnered with hundreds of organizations, from some of the largest and earliest big data adopters to the most conservative and data-rich companies on the planet. We ensure that your Hadoop journey is successful, and more companies are turning to Hortonworks today than to any other offering on the marketplace. We invite you to join our community.
• Maximize resource utilization
– No idle standby
• Isolate dev and test clusters
– Share data, not resources
– Carve off hardware for a specific group
– Prevents a bad MapReduce job from bringing down the cluster
• Guarantee consistency and availability of data
– Data is instantly available
• Optimized hardware profiles for job-specific tasks
– Batch
– Real-time
– NoSQL (HBase)
• Set replication factors per sub-cluster
• Use at LAN or WAN scope
• Resilient to NameNode failures