Many companies (from multinational enterprises all the way down to SMEs) are already experimenting with Hadoop as a reliable, low-cost data platform.
Whether as a replacement for the data warehouse, in parallel with the DWH, or as a "staging platform", the so-called data lake, Hadoop has many advantages in terms of efficiency and performance, and on top of that it is, at least initially, free of license costs. The cute elephant has the potential to repeat Linux's career in the data center.
For SAS, Hadoop is a real stroke of luck: not only as a low-cost, agile data store, but also as a compute platform for the distributed procedures and for the massively parallel in-memory engine "LASR".
This talk shows how SAS can use a Hadoop cluster and how other MPP databases (SAP HANA, Teradata, Pivotal) fit into this picture.
Paul Kent
Data volumes that are too large, too complex, or changing too quickly to be evaluated with manual, traditional data-processing methods.
Everyone talks about it, nobody knows how to do it, but everyone thinks everyone else is doing it, so everyone claims to be doing it too.
Created by Doug Cutting while he was working at Yahoo, Hadoop became a top-level Apache project in 2008 and is managed today by the non-profit Apache Software Foundation.
Hadoop was built for fast, low-cost, efficient, and data-protected file manipulation. It excels at massively parallelized file processing: the ability to handle huge amounts of data, of any kind, quickly.
It was not built for advanced analytics. Because of its very infrastructure (the nodes don't intercommunicate except through sorts and shuffles), iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple intermediate files between MapReduce phases and is very inefficient for advanced analytic computing. It has gotten better in recent years, with the addition of many third-party tools (some of them also open source), but it remains, essentially, a bulk file and data manipulation storage system.
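To make the iteration cost concrete, here is a minimal, purely illustrative Python sketch (not Hadoop or SAS code) of k-means expressed as classic MapReduce: each iteration is a complete map, shuffle/sort, and reduce pass over all the data, with intermediate results materialized between passes.

```python
# Illustrative sketch: why iterative algorithms are costly on classic
# MapReduce. Each k-means iteration is a complete map -> shuffle/sort ->
# reduce pass, and the intermediate results must be materialized (on a
# real cluster, written to HDFS) before the next pass can start.
from collections import defaultdict

def map_phase(points, centroids):
    # map: emit (nearest-centroid-id, point) pairs
    return [(min(range(len(centroids)), key=lambda i: abs(p - centroids[i])), p)
            for p in points]

def shuffle_phase(pairs):
    # shuffle/sort: group values by key (the only inter-node communication)
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, centroids):
    # reduce: new centroid = mean of the points assigned to it
    return [sum(v) / len(v) if (v := groups.get(i)) else centroids[i]
            for i in range(len(centroids))]

def kmeans_mapreduce(points, centroids, iterations):
    for _ in range(iterations):                  # one full MapReduce job per iteration
        pairs = map_phase(points, centroids)     # pass over ALL input data
        groups = shuffle_phase(pairs)            # network shuffle
        centroids = reduce_phase(groups, centroids)  # intermediate "file"
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans_mapreduce(points, [0.0, 5.0], iterations=5))  # two clusters near 1 and 9.5
```

On a real cluster those intermediate results land in HDFS between every pair of phases, which is exactly the overhead that in-memory engines avoid.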
Here are some of the more popular uses for the framework today:
Low-cost storage and active data archive.
Staging area for a data warehouse and analytics store.
Data lake.
Sandbox for discovery and analysis.
TDWI research (Q2 2014) sponsored by Cloudera, EMC Greenplum, Hortonworks, ParAccel, SAP, SAS, Tableau Software, and Teradata.
Hadoop is clearly poised to become a *complement* to BI, DW, DI, and analytics, NOT a replacement.
Interestingly, however, when asked whether Hadoop was currently in production, only 10% of respondents confirmed that their Hadoop deployment was actually used in production today. Why such a low number?
Well, while Hadoop is positioned first and foremost as an enabler of analytics, it does not have the actual analytics capabilities built in. Trying to develop those capabilities within the Hadoop ecosystem, using Hadoop components such as MapReduce, Hive, etc., results in staffing issues and a high cost of in-house development.
A privately held company established in 1976, SAS is the #1 world leader in advanced business analytics, with a 38% market share in 2013.
We have 14,000 Employees Worldwide, with our solutions deployed in over 135 countries.
SAS leads the world in analytics (latest reviews by IDC and Forrester) and sits in the Leaders quadrant in *17* of Gartner's Magic Quadrants (from Data Management to Business Intelligence to Advanced Analytics).
SAS reported revenues of US$3 billion in 2013, and is famous for its industry-leading 25% reinvestment in R&D.
Why? To provide High-performance Advanced Analytics, Business Intelligence and Data Visualization on a Low Cost, Distributed, Massive Scale.
"Big Data", defined as data whose volume, variety, or velocity is just too much for an organization's systems or processes to manage in a timely manner for business decisions, was introduced as a term in 2011. Before that time, virtually no one had heard these terms in this context.
Interestingly enough, Hadoop was around before the "Big Data" term was coined, and has been on a steady incline ever since.
Finally, if we look at the interest in analytics, we also find a steady incline for the last 10 years or so.
This is where we are now. What we call, the “Era of Abundance”. Lots of data, the processing power to handle it, and the Intelligence to do the right thing with it.
Finally, the summary:
Hadoop brings the variety of data, and
SAS brings the Big Data analytics technology.
Together, this is the recipe for new use cases!
Bringing all of the use cases together….
We are an integral part of the rapidly evolving Hadoop ecosystem
We integrate with Hadoop…
We complement Hadoop…
We go beyond what Hadoop can offer for each component of the Analytics Lifecycle…
Three years ago we told you about Hadoop and its limitations… now the market and the community have responded… SAS leads the way with in-memory and alternative parallel processing patterns… edge…
Moving from left to right… all four of the above-mentioned design patterns are covered… the goal is to meet SAS user needs today and in the future… and also to understand and meet the needs of a new generation of users (e.g. data scientists)…
SAS Data Loader for Hadoop is a new offering from SAS purpose built to solve the big data challenges that we talked about previously.
It has a web-based, wizard-driven user interface that minimizes the need for training and improves the productivity of business analysts and data scientists working with Hadoop.
Certified by Cloudera and Hortonworks
The FROM approach is the "traditional", established SAS approach, where Hadoop can be treated simply like any other data source. As noted on the slide, FROM is really bi-directional and can write back TO Hadoop using the same approach. It mainly represents taking data FROM the Hadoop cluster to process it in a SAS environment.
With the WITH approach, SAS introduces a number of concepts.
First, we now have the LASR Analytics Server. This is a core piece of our technology that allows massively parallel, distributed, in-memory processing of advanced analytics. LASR is a purpose-built analytics server that can run advanced analytics in a massively parallelized environment (meaning it leverages memory *and* processing from multiple servers). Since it was built for advanced analytics, it can produce results faster and with very few instructions, whereas the same results on Hadoop are traditionally produced using hundreds, even thousands, of lines of code.
Second, we also have the SAS Embedded Process, a lightweight, non-invasive technology that communicates with and leverages Hadoop technologies to lift data into memory in an optimized, extremely fast way. Notice the multiple arrows: if you have, say, 16 data nodes, you will be able to parallelize and lift the data into SAS's in-memory environment 16 times faster.
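As a rough illustration of that scaling argument (an assumption-laden Python sketch, not the Embedded Process itself), the snippet below lifts hypothetical per-node partitions into memory concurrently, one worker per "data node", so total load time approaches that of a single partition:

```python
# Illustrative sketch only: the scaling idea behind parallel data lifting,
# not the actual SAS Embedded Process. If the data is split across N data
# nodes, each partition can be read and lifted into memory concurrently.
from concurrent.futures import ThreadPoolExecutor

def lift_partition(partition):
    # stand-in for "read one node's block and push its rows into memory";
    # the doubling is hypothetical per-row decoding work
    return [row * 2 for row in partition]

def parallel_lift(partitions):
    # one worker per data node: all partitions stream into memory at once
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        lifted = pool.map(lift_partition, partitions)
    return [row for part in lifted for row in part]   # the in-memory table

# 4 "data nodes", each holding a slice of the table
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(parallel_lift(partitions))  # [2, 4, 6, 8, 10, 12, 14, 16]
```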
This WITH concept really means "BESIDES HADOOP", or ALONGSIDE it, as long as we're leveraging massive parallelization for both the data and the processing.
The ‘IN’ approach also leverages the lightweight SAS Embedded Process, but this time to run specialized SAS code (data quality, data transformation and manipulation, scoring) directly in the Hadoop cluster… effectively leveraging Hadoop's massive parallel processing and native resources such as MapReduce.
Not all SAS code can be executed this way. Strategic deployments such as scoring code, data transformation code, or data quality code can be applied in this manner.
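A hedged sketch of the general shape of in-cluster scoring: a map-only step in which every node scores its local rows against a fixed model. The coefficients, field layout, and record format below are invented for illustration; the real Embedded Process deploys SAS scoring logic, not a Python script.

```python
# Hypothetical map-only scoring step, Hadoop-streaming style: no shuffle,
# no reduce, so every node can score its rows fully independently, which
# is why scoring parallelizes so well inside the cluster.
import math

COEFFS = {"intercept": -2.0, "balance": 0.8, "days_late": 0.5}  # made-up model

def score_row(line):
    # each hypothetical input record: "customer_id,balance,days_late"
    cust, balance, days_late = line.split(",")
    z = (COEFFS["intercept"]
         + COEFFS["balance"] * float(balance)
         + COEFFS["days_late"] * float(days_late))
    prob = 1.0 / (1.0 + math.exp(-z))          # logistic score
    return f"{cust}\t{prob:.4f}"

def map_only_job(records):
    # scoring happens where the data lives; only the scores move
    return [score_row(r) for r in records]

for out in map_only_job(["c001,1.5,0", "c002,0.2,6"]):
    print(out)
```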
SAS has been doing this for a very long time… this is not new for us. Taking sophisticated scoring code and running it in place, inside a database. Now we’re extending this capability to Hadoop.
This is ideal when the data is so voluminous that lifting it all into memory would be prohibitive. We can explore the data to find what is relevant even before doing data transformations for modelling. Alternatively, we can also "model at scale": the idea of automatically segmenting the data (with tools such as SAS® Visual Statistics) and then building models by segment.
Another version of the ‘IN’ approach… where the in-memory solutions from SAS are deployed IN the Hadoop cluster, effectively sharing the cluster with HADOOP, and leveraging YARN to manage necessary resources.
The complete analytical life cycle is important to understand, as this is the reality most companies face:
- Data needs to be prepared specifically for analytics (a crucial step); then it needs to be explored in a highly efficient environment purpose-built for interactive visualization; then it needs to be modeled in a purpose-built advanced analytics environment. Finally, the final scoring can often happen where the bulk of the data resides, in Hadoop.
Through it all, key metadata act as glue, ensuring proper governance of the processes and data, tracking lineage and impact analysis, so that the user can know what may result from any changes at any point in the cycle.
The ultimate goal was to position the most adequate advertising to a given visiting customer on Rogers’ web site.
Traits are characteristics/parameters of each visit: for example, the time of a visit, the number of clicks, the target browser, the device used (iPad, Samsung, etc.). The 600 traits used in the final model were actually derived from a list of 75,000 original traits.
http://youtu.be/wTnkg16jHwg
The initial objective: stop the "one size fits all" email marketing approach, resulting in a 20% reduction in subscription churn. This led to generating more accurate, real-time decisions about customer preferences. The ability to gain customer insight across channels is a critical part of improving customer satisfaction and revenues, and Macys.com uses SAS to validate and guide the site's cross- and up-sell offer algorithms.
http://www.sas.com/en_us/customers/macys.html
This diagram shows two axes, degree of intelligence and the level of competitive advantage that can be achieved. I am going to propose that using data and applying analytics to data, can accelerate the loop of Intelligence and Experience that links strategy and operations.
Most would agree that in the area of collections and recoveries, historic intelligence is not a great predictor of the present, let alone the future.
Organisations start with data and may build data marts to allow them to access the data locked away in operational systems. Some bring in most of the data; others pick data sources based upon past experience, so complaints data and call-centre file notes are often omitted, and yet both could be really useful in segmentation and predictive modelling.
The data needs to be cleaned up as it is consolidated – garbage in garbage out!
Then there is a whole set of reports, queries, and alerts that tell you where you have been, or may also tell you where you are today, provided the information is available fast enough.
But it is when you start to apply analytics to the data that business intelligence and competitive advantage start to grow.
Exploring data is all about understanding more about the data, and the relationships between data sources, than you knew from experience or intuition. Yes, we may know that impairment on zero-rate credit-card balance transfers marks a high-risk group, but what other factors are key in determining the different segments to which we may wish to apply different collections strategies?
Forecasting is not about continuing the line on the graph, but about applying a range of forecasting analytical techniques to sets of data to work out what is most likely to occur in the future.
Prediction involves building models based upon past experience. These models may be very complex and predict a binary outcome or a probability. So, for example, we might explore the customer base to identify the factors most likely to lead to default, purchase, or churn. We could then build models based upon that data and predict which customers may impair and, if they did, which would respond best to pre-delinquency contact.
Finally, the pinnacle of the use of analytics is optimisation: deploying resources appropriately to achieve the greatest collection of debt within business constraints. So, by way of example: if we wanted to put all impaired customers through at least one collection strategy, pull the over-90-days debt down by 30%, tackle a problem with the silver-card customers, make only one call to each customer per week, and have enough call-centre staff that no caller waits longer than 5 seconds for an answer, what would be the right level of outbound mailing to generate the optimum level of collections whilst giving all objectives an appropriate level of attention?
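The flavour of that question can be shown with a deliberately tiny, hypothetical Python example: choose the outbound mailing volume that maximises expected collections under a single budget constraint, by brute-force search. All numbers are invented; real collections optimisation uses mathematical programming over many resources and constraints at once.

```python
# Toy sketch of constrained optimisation (hypothetical figures throughout):
# maximise expected collections over a discrete decision, subject to a
# business constraint, by exhaustively checking every feasible choice.
def expected_collections(mailings):
    # diminishing returns: each extra batch of mail recovers less debt
    return sum(100 / (1 + i) for i in range(mailings))

def best_mailing_plan(max_mailings, budget, cost_per_mailing):
    best = None
    for m in range(max_mailings + 1):
        if m * cost_per_mailing > budget:      # constraint: stay within budget
            continue
        value = expected_collections(m)
        if best is None or value > best[1]:
            best = (m, value)
    return best                                # (mailings, expected collections)

plan = best_mailing_plan(max_mailings=10, budget=120, cost_per_mailing=25)
print(plan)  # (4, ...): with a budget of 120 and cost 25, at most 4 mailings fit
```

With more decisions and constraints the feasible space explodes, which is why dedicated optimisation solvers, rather than enumeration, are used in practice.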
I will explore this in more detail later.
That’s the power of SAS Analytics.
According to Gartner (in a report issued February 2008): “SAS dominates in advanced analytic solutions. No other vendor in the Magic Quadrant has its range of capabilities or can point to the same number of advanced analytic deployments.”
Forrester Research (in a report issued July 2008) says that “SAS remains the best game in town for fully integrated high-end analytics from a single vendor.”
Why Hadoop is being considered (or has been implemented), and HOW it will actually be exploited to derive value, are sometimes two very different things, depending on the point of view.
These are the key value drivers regarding how SAS affects Hadoop.
Analysts, statisticians, data scientists, etc. will be very interested in increasing the ACCURACY of their analysis, mainly because they can now:
Run their analysis on more data (sometimes even all the data); and
Run more complex algorithms because of the massively parallel processing
SCALABILITY will generally be a concern of IT as well as the business side of things, though maybe not from the same angle. IT wants to make sure they do not paint themselves into a corner, and that whatever architecture they deploy will meet the needs of the business down the road, while the business just wants to be able to embrace all of the Big Data coming its way.
IT folks will likely be very focused on the GOVERNANCE of data: making sure it is properly secured, it is comprehensive and timely, etc.
Finally, the VALUE (ECONOMICS) of the project needs to be embraced and recognized by all. Economics can be derived by:
Increased self-service acquisition of Hadoop data by SAS analysts increases the ability to generate insight from Hadoop as a new and rich data source.
Better value from Hadoop data is enabled through the scale and accuracy of analytics possible with the SAS LASR Server (the 'WITH' approach). Insights are not bound by processing capability.
Better value from Hadoop data used for analytic insight (better quality, data shaping for analytics, and ease of score-code deployment in place), plus the ability to deploy in-memory capabilities in the Hadoop cluster.