This document discusses how the MapR Distribution for Hadoop can help enterprises integrate Hadoop into their IT environments. It covers three key trends driving adoption of Hadoop: 1) more data beats better algorithms, 2) big data is overwhelming traditional systems, 3) Hadoop is becoming the disruptive technology at the core of big data. It also discusses two realities: Hadoop is moving towards operational applications, and interoperability is key. The document outlines how MapR provides enterprise-grade functionality such as high availability, security, and integration with open standards to help Hadoop succeed in production environments. It shows how MapR enables both operational and analytical workloads on a single consolidated platform.
The MapR Distribution for Hadoop is globally recognized as the technology leader.
Forrester published a Wave for Big Data Hadoop Solutions in which it placed MapR as the highest-ranking product based on both current offering and roadmap.
Cloud: MapR has been selected by two of the companies most experienced with MapReduce technology, which is a testament to the technology advantages of MapR’s distribution. Amazon, whose Elastic MapReduce (EMR) service hosted over 2 million clusters in the past year, selected MapR to complement EMR as the only commercial Hadoop distribution offered, sold, and supported as a service by Amazon to its customers.
Google – the pioneer of MapReduce, and the company whose white paper on MapReduce inspired the creation of Hadoop – has also selected MapR to make our distribution available on Google Compute Engine.
Hadoop is making CIOs rethink their data architecture. It is a fundamental shift in the economics of data storage, processing, and analytics, and it is opening up entirely new business opportunities. Let’s talk about three key trends we are seeing, as well as the realities or implications for your business and your “readiness” to harness the power of big data and Hadoop.
The first trend is that industry leaders have shown how to use big data to compete and win in their markets. It’s no longer a nice-to-have – you need big data to compete.
Google pioneered MapReduce processing on commodity hardware and used it to catapult itself into the leading search engine even though it was 19th in the market.
Yahoo! leveraged these ideas to create Hadoop to keep up with Google, and many mainstream companies have followed with new data-driven applications such as “people you may know” (started by LinkedIn and now used by Facebook, Twitter, and every social application), product recommendation engines, contextual and personalized music services (Beats), measuring digital media effectiveness (comScore), serving more relevant and targeted ads (Comcast, Rubicon Project), fraud and risk detection, healthcare efficacy, and more.
What makes the difference? A lot of attention is given to data science and developing sophisticated new algorithms, but in many cases simply having more data beats better algorithms. (Make the point about collecting consumer interaction data as well as transaction data, as an example.)
In addition, competitive advantage is decided by very small percentages. Just a 1% improvement in fraud detection can mean hundreds of millions of dollars in savings. A ½% lift in advertising effectiveness means millions in new product sales and profitability. The same applies to customer churn, disease diagnosis, and more.
A second trend in enterprise architecture is big data overwhelming the existing workload-specific systems that are in production. (List the requirements for each of these on the side in text.)
People started with mainframes or operational systems which run ERP, finance, CRM and other mission-critical applications. They require… (pick out attributes you want to stress on the left)
You also have data warehouses, marts, data mining, and other analytical systems, which pull data from these operational and other systems to provide insights to the business for decision making.
The amount and variety of data has been overloading these systems. As you try to ingest new types of data, you reach a point where these systems are no longer cost-effective to scale to terabytes or petabytes of data.
Hadoop has become the de facto big data platform, allowing organizations to keep up with big data and feed data-driven applications and processes.
This chart shows the percentage growth of jobs from Indeed.com.
Compared to other popular technologies such as MongoDB and Cassandra, Hadoop is not only the fastest-growing big data technology, it’s one of the fastest-growing technologies, period.
Hadoop has the most robust ecosystem and momentum and is the big data platform of choice for industry-leading, data-driven companies
(Also of interest: Indeed.com – a subsidiary of a Japanese-owned company – is a customer of MapR; they harness and analyze all of this job-trends data using MapR.)
Hadoop is being used in lots of different use cases across a variety of industries
One way to think about this is by functional area of an organization (from left to right: CIO/chief data officer; CMO (marketing); CSO or CRO (chief security or risk officer); and the COO, head of quality, or IT operations).
We have many customers in each of these areas. Here are some example customers of MapR (give example snippets of each)
You can also put different use cases in each column that are relevant for your customer
The first reality is that as people put Hadoop into production to relieve the pressure on other systems in their enterprise architecture, it needs to be reliable. Hadoop needs to be held to the same enterprise standards as your Oracle, SAP, Teradata, NetApp storage, or any other enterprise system.
Many organizations are putting Hadoop into their data center to provide (list of use cases underneath) … it can do all of this and more, but
For Hadoop to act as a system of record, it must provide the same guarantees for SLAs, performance, data protection, and more.
Most importantly, Hadoop has the potential for both analytics AND operations. It can be used to optimize the data warehouse by providing batch data refining or storage. But done right, Hadoop can also handle many operational analytics and database operations/jobs.
Verizon Teradata example
Less than 10% of CDRs analyzed.
Push messaging. Starbucks or ESPN applications, and others.
MapR is the only software they pay for, and they have HBase committers on staff.
They consolidated 8 application clusters into 1 MapR cluster – one cluster with 8 sub-clusters running on different sets of nodes. Data placement control enables this.
They went from 12 CDH servers down to 6, just for HBase tables. (They won’t use M7 since they are HBase committers.)
They ran MinuteSort, a benchmark that measures how much data you can sort in one minute. The MinuteSort world record was set by Yahoo!, sorting 1.6 TB with 2,200 nodes. This MapR customer broke the record by sorting 1.65 TB with just 298 nodes. That’s about 1/7th the hardware, which translates into tremendous cost, space, and management savings.
MapR enables integration by providing industry-standard interfaces.
More third-party solutions work with MapR than with any other distribution.
No proprietary connectors needed.
NFS
All file-based applications can read and write data
Examples: Linux utilities, file browsers, Informatica UltraMessaging
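Because the cluster is exposed over standard NFS, any file-based program can read and write cluster data with ordinary POSIX I/O – no Hadoop client or HDFS API required. A minimal sketch in Python (the mount point and file names below are hypothetical; MapR clusters are conventionally mounted under /mapr/<cluster-name>):

```python
import os

def append_events(base_dir, filename, events):
    """Append newline-delimited events to a file on the NFS-mounted cluster.

    base_dir: where the cluster is mounted, e.g. /mapr/my.cluster.com/apps
    (hypothetical path). This is plain POSIX file I/O, which is exactly
    why unmodified tools and scripts work against the cluster.
    """
    path = os.path.join(base_dir, filename)
    with open(path, "a") as f:  # ordinary append, no special client
        for event in events:
            f.write(event + "\n")
    return path

def read_events(base_dir, filename):
    """Read the same file back with ordinary file I/O."""
    with open(os.path.join(base_dir, filename)) as f:
        return f.read().splitlines()
```

The same mount is what lets Linux utilities (cp, grep, tail) and file browsers operate on cluster data directly, which is the point of this slide.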
ODBC 3.52
All BI applications can leverage Hive
Examples: Excel, Crystal Reports, Tableau, MicroStrategy
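The ODBC path can be sketched the same way. Every BI tool in the list connects by building an ODBC connection string and issuing SQL through the driver; the sketch below does the same from Python. The driver name, host, and the `sales` table are assumptions for illustration – use whatever name your installed Hive ODBC driver registers:

```python
def hive_conn_str(host, port=10000, dsn=None):
    """Build an ODBC-style connection string for a Hive server.

    If a DSN is configured (as a BI tool like Tableau or Excel would use),
    it takes precedence; otherwise connect by driver name directly.
    "Hive ODBC Driver" is a placeholder driver name.
    """
    if dsn:
        return "DSN=%s" % dsn
    return "DRIVER={Hive ODBC Driver};HOST=%s;PORT=%d" % (host, port)

def top_products(conn_str, limit=10):
    """Run an aggregate query through an ODBC bridge (pyodbc here)."""
    import pyodbc  # third-party ODBC bridge; requires an installed Hive driver
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        cur = conn.cursor()
        cur.execute(
            "SELECT product, COUNT(*) AS n FROM sales "
            "GROUP BY product ORDER BY n DESC LIMIT %d" % limit
        )
        return cur.fetchall()
```

Because the interface is standard ODBC, the same DSN serves Excel, Crystal Reports, Tableau, and MicroStrategy without per-tool connectors.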
Linux PAM
Any authentication provider can be used
Examples: LDAP, Kerberos, 3rd party
Because only MapR can reliably run both operational and analytical applications on one platform/cluster, MapR enables a faster closed-loop process between operational applications and analytics. This means:
Interactive marketers and algorithms can update rules engines more quickly and provide more real-time targeting of offers and relevant content to consumers.
Fraud models are kept more up to date with the latest patterns to better detect anomalies and take action more quickly on bad actors
More important than our product is ensuring customer success.
MapR creates a new opportunity for enterprises: the opportunity to revolutionize the enterprise data architecture.
From… ‘redundant processing silos’ and ‘data science experiments’, where you need separate Hadoop clusters for streaming, HDFS/Hive, HBase, and more.
To… a ‘converged data & processing hub’ that provides a true production enterprise data hub.
This allows you to consolidate operational and analytical workloads – not only across Hadoop use cases and applications, but also to optimize your enterprise data architecture.