The document discusses strategies and technologies for scalable analytics using modern data architectures like Hadoop. It describes how declining storage costs and increasing CPU speeds have enabled organizations to leverage huge amounts of data through platforms like Hadoop. The document also summarizes SAS's big data strategy, how its technologies integrate with Hadoop, and how organizations can use SAS solutions to extract insights from data through the entire analytics lifecycle including data preparation, modeling, visualization and more.
Currently, most organizations use one or more of these techniques to solve a department-specific business problem. That is valuable, but from an enterprise standpoint the value is marginalized rather than optimized. What we see is that departments such as Marketing, Finance or Risk use one technique to solve a problem, but not multiple techniques together. For instance, in a bank, the risk department will use econometric forecasting to manage the treasury portfolio and overall risk, or so they think. But they are not proactively looking at text (emails and chat) to see where the bank could be exposed. Or a retailer uses forecasting to predict what will sell, but does not look at online sentiment and customer segmentation to decide which product to offer to which customer and how to optimize distribution, all as part of the same process. These are examples of why these techniques should be pulled together for the greater good.
When we plug them all together, this is a simplified view of the classical situation at many customers today. They have data flowing left to right, normally in a batch job. Some advanced customers try to move data from left to right in more or less real time, or at least to minimize the time it takes. The data and analytic marts are updated when the EDW is refreshed. Most of that deals with structured data. The other sources are normally handled differently, outside of that flow, if at all. As you can see, data is stored in multiple places and in different formats, and when you add it all together this is why the data landscape of organizations is normally very expensive. What is clear is that with growing data volumes this is a space where more and more cost is going to be incurred. It therefore makes a lot of sense that Hadoop is being looked at, as it promises to get that cost back under control and bring more of this data into one common, managed, lower-cost data architecture!
In the first scenario, organizations are looking at Hadoop to handle new types of data that are not currently under the control of the EDW. This includes unstructured data, semi-structured data and data that is "not yet known to be useful". Companies like this because it does not impact existing warehouse or mart efforts, but it allows them to try to extract value from data they already have that may not be utilized today. In this setup all the "new data" can be brought together in one place irrespective of format, and then experimentation to extract value can start. Mostly, in this scenario, customers are leveraging the HDFS component of Hadoop together with side projects such as Apache Tika, which lets you tag unstructured data (much like we do with SAS Content Categorization, although not as advanced) so that you can search for documents by keyword and so on. In general, people see Hadoop in this world as a way to support innovative business strategies requiring new data and/or as a way to get existing unstructured and semi-structured data into one governed location at the lowest cost. Given the small number of users and the experimental nature of this space, it is often where you will find the most coders working on Hadoop.
In the second scenario, organizations use Hadoop to handle the new types of data as in the first scenario, but then feed new insights into the EDW for mass consumption. In this model Hadoop is used for unstructured data, semi-structured data and data not yet known to be valuable. Again the existing EDW process is not impacted, but extra data flows from Hadoop might be added when there is something valuable that needs exposing to the masses. Essentially, Hadoop complements the rest of the data strategy, and the EDW remains the single source for most users even if they access it via a downstream mart.
Earlier I mentioned that when something valuable is found in Hadoop it may be added to the EDW. In this scenario, useful things are often found by applying BI and analytics to the data held in Hadoop. Sometimes that work will run directly against Hadoop, and sometimes data will be transformed back out of Hadoop into an RDBMS or other valid store where people will then work, which is why I have drawn some marts there. The number of users working against Hadoop will be higher than in the first scenario, but it may still not be for the masses.
The real idea here is to contain the cost of a burgeoning EDW. Instead of throwing all the data directly into the warehouse, as we would have done in the past, data stays in Hadoop until it is found to be useful for the masses, which slows warehouse growth. Secondly, the hope is that this new environment will give the organization a very low-cost way to incubate innovative business strategies that often require massive volumes and varieties of data. Once proven, these might be supported in a more robust and costly EDW. At the end of the day, the name of the game is to move only what is valuable to the expensive EDW store going forward, without disrupting what is in place today.
Something to remember here is that this is the scenario being encouraged by vendors such as Teradata, IBM and Oracle, who now sell Hadoop appliances. Do not be surprised to hear this as a way forward from some customers in the future, since it guarantees continued EDW growth, which is something these vendors are very worried about because of the advent of Hadoop.
The last scenario is perhaps the most explosive and the one that requires changes on all sides of the IT landscape. Interestingly I have seen this strategy at a large financial institution as well as at some smaller customers. The idea here is to start to put ALL data into Hadoop first. From there a number of things can happen:
One
The Hadoop platform might just be used as a place to land, transform and cleanse data before building the more traditional EDW and data or analytic marts: if you like, a powerful staging layer. Effectively this allows you to offload all the data transformations from the EDW and mart processes to Hadoop, so that you can leverage the power of the whole cluster to prepare data faster than ever and then just copy it to where it needs to go for downstream applications to use. HSBC, who were spoken about in a previous session, gave an example of this usage scenario, having got an existing batch job to go from 3 hours to 10 minutes on Hadoop. Just think about Hadoop becoming the ETL engine for all data in an organization!
Two
In some cases the idea of downstream marts goes away, and all historical and detailed data is kept forever accessible in Hadoop, making it the EDW. This is the ultimate aim of the Hadoop vendors and why the EDW and data appliance vendors are so worried. The truth is that getting rid of some sort of EDW is not going to happen anytime soon, because the relative immaturity of parts of the Hadoop ecosystem makes it not really suitable for a number of regulatory-type workloads. So in my opinion it is very likely that we will see EDWs continue for some time, and that we will see marts built off Hadoop to keep things running today and continue to leverage previous investments.
At the end of the day you will hear the phrase "data lake". In some companies this means moving all their data to Hadoop, and in other companies it is the term used to describe their next-generation data landscape where Hadoop plays a role.
SAS enables the entire lifecycle around Hadoop.
In a BI/reporting context, this can mean a traditional, business-as-usual approach, using SAS/ACCESS to access Hadoop as a data store, just like we do with an RDBMS, and working with a SAS client such as Enterprise Guide. It can also mean a transformational approach, leveraging in-memory Hadoop architectures for unprecedented performance and interactive visualization capabilities with Visual Analytics.
In an analytics context, this can likewise mean a traditional, business-as-usual approach, using SAS/ACCESS to access Hadoop as a data store with SAS/STAT or Enterprise Miner. It can also mean a transformational approach, leveraging in-memory Hadoop architectures for unprecedented performance (via LASR server), advanced analytic exploration capabilities (via SAS In-Memory Statistics for Hadoop) and advanced analytic prototyping and visualization (via SAS Visual Statistics). Finally, it also means exploiting Hadoop for operational analytics by leveraging in-database technologies to score inside of Hadoop (using the SAS Scoring Accelerator for Hadoop and the SAS Code Accelerator for Hadoop).
We should also continue to emphasize how the SAS High-Performance Analytics Server products, in tandem with SAS Enterprise Miner and SAS Decision Manager, allow organizations to take analytically derived strategic insights and push them to decision points throughout the organization. This integrated ability to centralize and operationalize analytics remains one of SAS's key differentiators against different types of key competitors.
Updated: MARCH 2014 for Visual Analytics release of v6.4
Example:
Applying text analysis to Twitter streams, or to customer comments in call logs, can give quick insight into the "hot topics" discussed.
It's more than a simple word cloud that shows which topics are being discussed: the analytics applied behind the scenes determine which words are used most frequently, so you can determine which topics are the most "important" and warrant further understanding/exploration.
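As a rough illustration of the frequency counting behind such a view, here is a minimal Python sketch. It is not how SAS text analytics works internally; the function name `hot_topics`, the stop-word list and the sample comments are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for this example; real tools use a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "to", "and", "my", "i", "on", "so"}


def hot_topics(comments, top_n=3):
    """Return the top_n most frequent non-stop-word terms across all comments."""
    counts = Counter()
    for text in comments:
        # Lowercase the text and split it into alphabetic tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample comments, for illustration only.
comments = [
    "My card was declined at the store",
    "Card declined again, fees are too high",
    "Why are the fees so high on my card?",
]

print(hot_topics(comments, top_n=4))
```

Raw counts are only the starting point: weighting, stemming and grouping of related terms are what turn a word list into topics worth exploring.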
SAS Visual Statistics 6.4
Release Date/Month: July 2014
Contact: Tapan Patel
Irrespective of big data or large data, every analytics project should go through the iterative analytics (data to decision) lifecycle. It typically involves four steps: manage/prepare data, explore/visualize, model, and deploy and monitor.
The role of SAS Visual Statistics is to (primarily) address the data exploration/visualization and model development stages of the analytics lifecycle.
It allows customers to understand why certain events and outcomes happen and what the key relationships are. Users ask for more interaction with the data, demand drill-down and so on, in order to identify the root cause and use that information to build predictive models.
It allows customers to build and refine predictive models to assess a future outcome and explain what will happen. For example: is the transaction fraudulent or not, what is the future repayment risk, or how risky is the portfolio given certain future conditions? Users can dynamically see the impact of changing model properties/parameters and fine-tune the model to arrive at the desired results.
Of course SAS Visual Statistics also provides the capability to generate score code for deployment purposes.
Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters. Segmentation does not rely on predefined classes or examples. The records are grouped together on the basis of self-similarity. Clustering is often done as a prelude to some other form of data mining. For example, in market segmentation you cluster customers with similar buying habits and then find out which promotion would work best for each cluster.
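To make "grouped on the basis of self-similarity" concrete, here is a minimal k-means sketch in plain Python. K-means is just one common clustering technique, chosen here for illustration; the notes do not say which algorithm any SAS product uses. The customer data and the helper names `sq_dist` and `kmeans` are invented for the example, and seeding the centroids with the first k points is a deliberate simplification.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Return (centroids, assignments) after a fixed number of k-means passes."""
    # Simplification: seed centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    assignments = [
        min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
    ]
    return centroids, assignments


# Made-up customers: (store visits per month, average basket value).
customers = [(2, 20), (3, 25), (2, 22), (10, 90), (11, 95), (12, 100)]
centroids, assignments = kmeans(customers, k=2)
print(assignments)
```

No class labels are supplied anywhere: the two segments emerge purely from how close the records are to each other, which is the point of the paragraph above. In practice the inputs would usually be standardized first so that no single variable dominates the distance.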
Classification deals with prediction of discrete outcomes: Yes/No, Churn/No churn, Fraud/No fraud, or credit applicants as low, medium or high risk. Estimation is another form of classification task, one that deals with continuously valued outcomes (i.e. individual records can be rank ordered). Examples include estimating credit card balance, propensity to purchase, or the probability that someone will respond to a balance transfer notification.
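The difference between the two can be sketched in a few lines of Python. The logistic scoring function below stands in for any fitted model; its coefficients, the input variables, the risk thresholds and the three applicants are all hypothetical values chosen for illustration, not taken from any real model.

```python
import math

# Hypothetical coefficients; a real model would be fitted to historical data.
INTERCEPT = -3.0
COEF_LATE_PAYMENTS = 0.8
COEF_UTILIZATION = 2.0


def estimate_default_probability(late_payments, utilization):
    """Estimation: return a continuous score between 0 and 1."""
    z = INTERCEPT + COEF_LATE_PAYMENTS * late_payments + COEF_UTILIZATION * utilization
    return 1.0 / (1.0 + math.exp(-z))


def classify_risk(probability):
    """Classification: map the score to a discrete label (thresholds are assumptions)."""
    if probability < 0.2:
        return "low"
    if probability < 0.5:
        return "medium"
    return "high"


# Made-up applicants: name -> (late payments, credit utilization).
applicants = {"A": (0, 0.1), "B": (2, 0.5), "C": (5, 0.9)}

scores = {name: estimate_default_probability(*f) for name, f in applicants.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # riskiest first
labels = {name: classify_risk(p) for name, p in scores.items()}

print(ranked)
print(labels)
```

The continuous score lets you rank order every record, while the label collapses that ranking into the handful of classes a business process acts on.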
Prediction deals with classifying records according to some future behavior or estimated future value. Examples include predicting the size of the balance that will be transferred if a credit card prospect accepts a balance transfer offer, or predicting box office receipts.
SAS Visual Analytics provides the core capability for ad-hoc data discovery, data exploration and the basic data preparation required for model building.
Precision - You can leverage the most proven and state-of-the-art analytical algorithms, text analytics, forecasting, recommendation engine, and machine learning techniques to get the best business results
Scalable - As your data, users and problems get more complex, we can scale.
Speed - Memory and data efficient for a significant reduction of data latency to rapidly analyze large and complex data in Hadoop
Interactive - Multi-user interactive analytics environment for increased productivity