In this presentation, Nachum Shacham discusses the uses and qualities of Big Data and how they are applied at PayPal, where he works. He covers the ultimate goal of extracting business value, and how to unlock the true value of data through algorithms and sufficient data further down the long tail.
4. MIXED SIGNALS FROM THE PUNDITS
• Data Lake
• “Needle in a haystack”
• “All hay, no needles”
• “Yet another fad”
• “Noth’n new: we’ve been analyzing data for 30 years”
• “Store’em and they’ll come”
• “Don’t ever discard data”
• “$524.752MM ROI in 3 years”
• “Smart” …
• “Hadoop is free”
• “Just…”
5. USE YOUR OWN FILTER
• Sift facts from MBS
• Seek factual 1-liners
• See through metaphors
• Discount “Smart” (data, algorithms, systems)
• Be skeptical
6. UNLOCK THE VALUE IN BIG DATA
• Data Trumps Algorithms
• Sufficient data further down the long tail
• Wisdom of the crowd drives effective recommendations
• Combine signals from different media
7. BUSINESS VALUE IN BIG DATA
• Risk analysis
• Fraud detection and prevention
• Revenue optimization
• Online ads
• Identify influencers in the social graph
8. LET’S DIG INTO BIG DATA
• Define KPIs
• Explore
• Model & Measure
• Visualize signals
• Test
• Question test results
• Rinse and Repeat
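A minimal sketch in R of one pass through this loop, on hypothetical transaction data (the file name and the columns amount, n_items and is_fraud are invented for illustration; this is not PayPal's actual pipeline):

# Explore: load a sample extract and look at basic distributions
txns <- read.csv("transactions.csv")
summary(txns)

# Split before modeling so the Test step sees unseen rows
idx   <- sample(nrow(txns), floor(0.8 * nrow(txns)))
train <- txns[idx, ]
test  <- txns[-idx, ]

# Model & Measure: a simple logistic model against a fraud-rate KPI
fit <- glm(is_fraud ~ amount + n_items, data = train, family = binomial)

# Visualize signals
hist(train$amount, main = "Transaction amounts")

# Test, question the results, rinse and repeat
scores <- predict(fit, newdata = test, type = "response")
mean((scores > 0.5) == test$is_fraud)  # crude accuracy; question it before trusting it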
9. BIG-DATA ANALYTICS: FROM SEMI-STRUCTURED DATA TO BUSINESS SIGNALS
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
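To give a flavor of the "semi-structured data to business signals" step, here is a sketch of parsing such job-history records in R. It assumes one record per line in a file (the file name is hypothetical) and extracts the KEY="VALUE" pairs into named vectors:

# Turn semi-structured job-history lines into structured records
lines <- readLines("job_history.log")

parse_record <- function(line) {
  # Find every KEY="VALUE" pair on the line
  m     <- gregexpr('([A-Z_]+)="([^"]*)"', line)
  pairs <- regmatches(line, m)[[1]]
  keys  <- sub('=.*$', '', pairs)                     # part before the =
  vals  <- sub('^[A-Z_]+="', '', sub('"$', '', pairs))  # part inside the quotes
  setNames(vals, keys)
}

records <- lapply(lines, parse_record)
# e.g. records[[1]]["TASK_STATUS"] is "SUCCESS" for the first line above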
11. CLASSES OF ANALYTICS JOBS
[Figure: big data feeding three classes of analytics jobs — data organization for BI, a few large models, and many small models]
[Figure: stages of the modeling workflow — problem formulation, data manipulation, model building, cross validation, graphics — and their relation to MR]
14. BI-LINGUAL HADOOP STREAMING: LARGE SCALE PARALLEL PREDICTIVE MODELING
Mapper.py (text processing): read data files, process lines, set the sorting key and value, output <key, value>
Shuffle sort: the framework groups mapper output by key
Reducer.R (model per segment): collect the segment data marked by each key, process the segment data, output the processed segment data
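A minimal sketch of what the Reducer.R side might look like under Hadoop Streaming. It assumes the mapper emits tab-separated lines of segment_key, x, y (an invented layout) and fits one toy linear model per segment; the actual fields and models used at PayPal are not shown in the deck.

#!/usr/bin/env Rscript
# Streaming reducer sketch: Hadoop delivers mapper output sorted by key,
# so all rows for one segment arrive contiguously on stdin.
con <- file("stdin", open = "r")
current_key <- NULL
rows <- list()

fit_and_emit <- function(key, rows) {
  seg <- do.call(rbind, rows)                 # one segment's rows as a matrix
  df  <- data.frame(x = as.numeric(seg[, 2]), y = as.numeric(seg[, 3]))
  fit <- lm(y ~ x, data = df)                 # model per segment
  cat(key, coef(fit)[1], coef(fit)[2], sep = "\t"); cat("\n")
}

while (length(line <- readLines(con, n = 1)) > 0) {
  f <- strsplit(line, "\t")[[1]]
  if (!is.null(current_key) && f[1] != current_key) {
    fit_and_emit(current_key, rows)           # key changed: close out the segment
    rows <- list()
  }
  current_key <- f[1]
  rows[[length(rows) + 1]] <- f
}
if (!is.null(current_key)) fit_and_emit(current_key, rows)
close(con)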
18. MODELS BUILT ON LARGE DATASETS
Meta VERSION="1" .
Job JOBID="job_201112150932_52151" JOBNAME="DataFilter" USER="user1234" LAUNCH_TIME="1324801865576"
[Figure: from raw job-history records to models — word count representation, time interval data, time series, concurrency, percentiles; R statistical processing designed to avoid RAM limitations]
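One way to read the "avoid RAM limitations" point: stream the log through R in bounded chunks and keep only running aggregates in memory. A sketch building a word-count table this way (the file name is hypothetical):

# Memory stays flat: one chunk in RAM at a time, plus the running totals
con    <- file("job_history.log", open = "r")
counts <- numeric(0)

repeat {
  chunk <- readLines(con, n = 100000)           # one bounded chunk
  if (length(chunk) == 0) break
  words  <- unlist(strsplit(chunk, "[[:space:]]+"))
  words  <- words[nzchar(words)]
  tab    <- table(words)
  merged <- c(counts, setNames(as.numeric(tab), names(tab)))
  agg    <- tapply(merged, names(merged), sum)  # fold the chunk into the totals
  counts <- setNames(as.numeric(agg), names(agg))
}
close(con)

head(sort(counts, decreasing = TRUE), 10)       # top-10 tokens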
20. TERADATAR FUNCTIONS (SAMPLE)
Function       What it does
td.zscore      Z-score transformation
td.t.paired    Paired t-test
td.cor         Correlation matrix
td.f.oneway    One-way F-test
td.factanal    Factor analysis
td.freq        Frequency analysis
td.hist        Histograms
td.kmeans      K-means clustering
td.ks          Kolmogorov-Smirnov test
td.mode        Mode value of a column
td.tapply      Apply a function over a database column
td.summary     Like R's summary()
td.quantiles   Quantile values
td.rank        Rank
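The calling pattern composes like ordinary R while the computation runs in the warehouse. A sketch reusing the connection and table from the next slide; the argument forms for td.summary and td.kmeans are assumptions here, so consult the package documentation:

library(teradataR)
tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")

tbl <- td.data.frame("myTbl")      # a virtual data.frame; the rows stay in Teradata
td.summary(tbl)                    # like R's summary(), computed in-database
td.cor(tbl[3:9])                   # correlation matrix, as on the next slide
td.kmeans(tbl[3:9], centers = 5)   # k-means clustering (centers argument assumed)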
21. ANALYSIS OF A TABLE WITH > 1B ROWS
> library(RJDBC)
> library(teradataR)
> tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")
> system.time(myTbldf <- td.data.frame("myTbl"))
   user  system elapsed
  0.092   0.054 140.071
> dim(myTbldf)
[1] 1131670269          9
> system.time(cor <- td.cor(myTbldf[3:9]))
   user  system elapsed
  0.021   0.003   6.722
C D E F G H I
C 1.0000000 0.7096425 0.22154483 0.24186862 0.13354501 0.4954111 0.19577803
D 0.7096425 1.0000000 0.24272691 0.27590234 0.13358632 0.4279517 0.14634683
E 0.2215448 0.2427269 1.00000000 0.08940507 0.03734827 0.1631614 0.04401034
F 0.2418686 0.2759023 0.08940507 1.00000000 0.07664496 0.1686094 0.04744032
G 0.1335450 0.1335863 0.03734827 0.07664496 1.00000000 0.1247046 0.05837435
H 0.4954111 0.4279517 0.16316144 0.16860940 0.12470460 1.0000000 0.35395733
I 0.1957780 0.1463468 0.04401034 0.04744032 0.05837435 0.3539573 1.00000000
22. CONCLUSION
• Big data is here. See through the hype
• Analyze big data to extract value
• Multiple technologies & analytics tools are out there
• Match platform, tools and approach
• Delegate massive processing to big clusters
24. BIG DATA EMPOWERS ALGORITHMS
Banko & Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”
Editor's notes
Big data is here, and corporations leverage MPP platforms like Hadoop and Teradata for cost-effective storage and processing of vast amounts of data. However, mining the business benefits of big data requires new approaches to deep analytics, including predictive modeling and statistical analysis. Modeling big data requires a comprehensive process that handles noisy data of different structures and runs in parallel on a large number of processors. We still need to perform the analytics tasks in a cost-effective manner. We describe our experience running statistical analysis and modeling of big data. We will review and compare the platforms we use to store and process the data, then describe integrating processing with R, Python and SQL on Hadoop and Teradata for a range of analytics tasks.
The large volumes of data need to be stored and processed on data platforms: clusters of computers with vast storage and processing power. The data consist of a combination of structured, semi-structured and unstructured data that needs special processing for cleansing, reshaping for modeling, and a large set of algorithms to extract the value. Big data contains enough information to analyze market segments that would otherwise be too small. The sheer number of combinations of those segments can yield a wealth of patterns that can be mined for the corporation. The more people get to view and explore the data, the more patterns will be identified, increasing the value to the corporation. Thus, making big-data analysis feasible for large groups of people, beyond a few developers, will lead to more interaction with the data and hence to more benefits.
Big data offers corporations many opportunities to extract signals that guide profitable decisions. A large portion of the new big data comes from the wild in unstructured and semi-structured formats. These data need to be cleansed and structured to enable the computation of statistical metrics and the construction of predictive models. The volume, the formats and the wealth of analysis tasks require different tools and environments to store and process the data. The patterns and signals in the data are more likely to be extracted when a large number of analysts are given access and can construct their own models. Thus, make the tools available and accessible to the many.
The most common architecture for big data is MPP; RDBMS and Hadoop are the most common platforms. They are similar in employing a large number of processors and disks and in distributing the processing to where the data are. RDBMS and Hadoop offer different programming environments and performance characteristics. Companies are increasingly deploying both platforms to accommodate a wide spectrum of business-analytics needs. When supporting multiple concurrent user jobs, they have to deliver not only data and computation but also a quality of service that matches users' expectations. How to allocate workloads to platforms to maximize value is an area of active research. A large number of programming languages and tools have been developed for these platforms: Java, Pig, Hive and Scala are powerful tools that many organizations have adopted. We have found Python and R to be particularly attractive for the analytics tasks that we perform. They are well-established languages that many analysts have been using for years on smaller datasets. When combined in the Streaming framework, R and Python can be used to create models quickly, in code that is clear and concise. Their packages provide many models and processing tasks out of the box. Teradata offers a strong SQL implementation with many extension UDFs designed for processing semi-structured data in textual format. An R package was recently published that enables using the processing power of the cluster to run many statistical functions on massive datasets.
This table compares the platforms based on the types of processing tasks. For example, scanning large tables of text is most suitable for Hadoop, whereas jobs that modify tables or search based on a primary index are more efficiently performed in Teradata. Special functions can be written more easily for Hadoop, whereas a join of two large tables is more easily done on Teradata. When data are replicated across multiple platforms, such tables are used to decide on the best platform to run particular jobs.
We now turn to the topic of creating and running the actual analytics tasks. R is a powerful language that was designed for data analysis and statistical modeling. It has functions and packages for every step of the data-analysis cycle: sourcing the data from an RDBMS, flat files, or the web; data preparation; exploratory data analysis; model creation for every imaginable statistical test or algorithm; design of experiments; model validation; variable selection; all the way to the creation of charts and graphs for presenting the results. R is gaining in popularity and has been placed among the top 20 programming languages. However, in our experience we found Python to be more effective for text processing, which calls for using both languages in Hadoop tasks.
On Hadoop, the Streaming framework enables us to run the mapper and reducer in different languages. In this environment, the mapper is written in Python and the reducer in R. The cleansed and filtered map data is sent to the framework with proper keys, which deliver the data to the reducer in logical chunks, each of which is treated as a statistical dataset in the form of a data.frame. The model is built on these data frames in the same way it has traditionally been done in R; only in Hadoop, all reducers perform this task in parallel.