SoftServe Innovation Conference in Austin, Texas 2013
Building Predictive Analytics on Big Data Platforms presented by Olha Hrytsay (BI Consultant) and Serhiy Shelpuk (Lead Data Scientist)
9. “Data are becoming the new raw material of business”
- Craig Mundie, head of research and strategy, Microsoft
10. Modeling true risk
Network data analysis to predict failure
Customer churn analysis
Threat analysis
Recommendations
Feature Usage analysis
Ad targeting
…
11. Collect and Store
• Complex data (text files, audio, video, images, …)
• Multiple sources
• Lots of data
Process
• Batch processing
• Parallel execution
• Cluster solution
Analyze
• Simple visualization (reports, dashboards)
• Text mining
• Sentiment analysis
• Prediction models
• Collaborative filtering
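The Process step above (batch processing, parallel execution) can be illustrated with a small map/reduce-style word count. This is only a sketch under stated assumptions: the function names are hypothetical, and a thread pool stands in for the cluster nodes that a real big data platform would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(records, batch_size):
    """Cut the input into fixed-size batches; each batch is one unit of work."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def count_words(batch):
    """Map step: count words in one batch, independently of all other batches."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """Reduce step: merge the per-batch counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def batch_word_count(records, batch_size=2, workers=4):
    """Run the map step on all batches in parallel, then reduce."""
    batches = split_into_batches(records, batch_size)
    # Assumption: threads stand in for cluster nodes. On a real cluster each
    # batch would be shipped to a different machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, batches))
    return merge_counts(partials)


if __name__ == "__main__":
    sample = ["error disk full", "info login ok", "error network down"]
    print(batch_word_count(sample).most_common(3))
```

Because each batch is processed without reference to the others, the same split/map/reduce shape carries over when the batches live on different machines.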
12.
13. Event sources (Log files, Windows Event Log, WMI, SNMP, database, etc.)
Layers: Event Ingestion, Event Storage, Event Processing and Analytics, Presentation
Components:
• Event Transport
• Event Aggregation and Transformation
• Event Serialization and Archiving
• Event DB
• Full-text Index
• Query Engine
• Full-text Search engine
• Rules Engine
• Predictive Analytics
• Interactive Search
• Reports and Dashboards
• Visualization
• Alerts (E-mail, SMS, SNMP, etc.)
• Operational Management Tools
User
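How events might flow through the components named on this slide (ingestion, storage, query, rules, alerts) can be sketched in a few lines. Everything here is an illustrative assumption rather than the presenters' design: the `source|severity|message` line format, the class and function names, and the use of JSON as a stand-in for a binary serialization format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Event:
    source: str
    severity: str
    message: str


def ingest(raw_lines):
    """Event ingestion: parse 'source|severity|message' lines, skip malformed ones."""
    for line in raw_lines:
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) == 3:
            # Transformation: normalize severity to upper case.
            yield Event(parts[0], parts[1].upper(), parts[2])


class EventStore:
    """Event storage: keeps serialized events (JSON is a stand-in format)."""

    def __init__(self):
        self._rows = []

    def append(self, event):
        self._rows.append(json.dumps(asdict(event)))

    def query(self, predicate):
        """Query engine: deserialize stored events and filter them."""
        events = (Event(**json.loads(row)) for row in self._rows)
        return [e for e in events if predicate(e)]


def run_rules(events, rules):
    """Rules engine: each rule is (name, predicate); every match becomes an alert."""
    return [
        f"ALERT {name}: {e.source}: {e.message}"
        for e in events
        for name, predicate in rules
        if predicate(e)
    ]


def run_pipeline(raw_lines, rules):
    store = EventStore()
    for event in ingest(raw_lines):
        store.append(event)
    return run_rules(store.query(lambda e: True), rules)


if __name__ == "__main__":
    lines = ["db|error|disk full", "not an event", "web|info|request served"]
    rules = [("errors", lambda e: e.severity == "ERROR")]
    for alert in run_pipeline(lines, rules):
        print(alert)
```

In a production system each function would be replaced by a dedicated component; the sketch only shows the order in which the responsibilities hand off to each other.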
14. Event sources (Log files, Windows Event Log, WMI, SNMP, database, etc.)
Layers: Event Ingestion, Event Storage, Event Processing and Analytics, Presentation
Components and technologies:
• Event Transport, Event Aggregation and Transformation: Apache Flume
• Event Serialization and Archiving: Protobuf, Avro, Thrift, MessagePack
• Event DB: HDFS, HBase, Cassandra
• Full-text Index
• Query Engine: Impala
• Full-text Search engine: Solr, ElasticSearch
• Rules Engine: Drools
• Predictive Analytics: R
• Interactive Search: Custom
• Reports and Dashboards: JasperSoft, Tableau
• Visualization: Custom
• Alerts (E-mail, SMS, SNMP, etc.)
• Operational Management Tools: Cloudera Manager, Apache Ambari
User
15. “The idea that the future is unpredictable is undermined every day by the ease with which the past is explained”
― Daniel Kahneman, Thinking, Fast and Slow
16. More data is available to companies
Storage technologies make it possible to store and process it
Advanced analytics can be applied to this new data to achieve competitive advantage
18. Management levels: Senior (Executive) Management, Middle Management, Junior (Line) Management
Ambiguity
• The goals to be achieved or the problem to be solved are unclear.
• Alternatives are difficult to define.
• Information about outcomes is unavailable.
Uncertainty
• Managers know which goals they wish to achieve.
• Information about alternatives and future events is incomplete.
Risk
• A decision has clear goals and good information is available, but the future outcomes associated with each alternative are subject to chance.
Certainty
• All of the information the decision maker needs is fully available.
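The Risk condition above (clear goals, known alternatives, outcomes subject to chance) is the case where alternatives can be compared by expected value. The sketch below assumes each alternative is described by hypothetical (probability, payoff) pairs; the names and numbers are made up for illustration and do not come from the presentation.

```python
def expected_value(outcomes):
    """Expected payoff of one alternative given (probability, payoff) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)


def best_alternative(alternatives):
    """Return the name of the alternative with the highest expected payoff."""
    return max(alternatives, key=lambda name: expected_value(alternatives[name]))


# Hypothetical example: payoffs in arbitrary currency units.
ALTERNATIVES = {
    "launch_campaign": [(0.6, 100.0), (0.4, -50.0)],
    "do_nothing": [(1.0, 10.0)],
}

if __name__ == "__main__":
    for name, outcomes in ALTERNATIVES.items():
        print(name, expected_value(outcomes))
    print("best:", best_alternative(ALTERNATIVES))
```

Under Uncertainty or Ambiguity the probabilities themselves are missing, which is where predictive models are meant to help by estimating them from data.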
19. Define objective
• Increase customer satisfaction level
• Identify prospective customers
• Identify cross-selling opportunities
• Decrease time to market
• Decrease costs of marketing campaigns
Identify data sets
• Historical data on customers from the CRM system
• Geographical location data
• Smartphone data
• Social network data
• Text data from Internet pages
• Image data from medical sources
Design the model
• Classification model for Internet users that determines what each user is interested in
• Adaptive control models for managing IT and network infrastructure
• Probabilistic model for determining creditworthiness
Design the solution
• Data storage type
• Logical database design
• Availability and scalability of the solution
• Integration into corporate information environment
• Solution deployment model
Implement the solution
• Add new functionality to the existing corporate BI platform
• Implement new BI solution
• Enrich existing business system (CRM, ERP) with the predictive analytics functionality
20. Business Tasks, Model Family, Algorithms
Classification
Business tasks:
• Identify prospective customers
• Detect traffic jams in the city
• Recommend restaurants and menus
• Adjust UI to the particular user
• Classify the body part on an X-ray image
Algorithms:
• Naïve Bayes
• Logistic regression
• Support Vector Machines
• Neural Networks
Clustering
Business tasks:
• Identify a market niche
• Identify influencers in social networks
• Identify similar customers or projects in a portfolio
• Identify informal groups in the organization
Algorithms:
• K-Means
• K nearest neighbor
• Self-organizing maps
• Mixture of Gaussians
Anomaly Detection
Business tasks:
• Detect fraudulent bank transactions
• Detect network intrusion attempts
• Provide automatic aircraft engine testing
• Provide automatic IT infrastructure monitoring
• Provide clinical test analysis
Algorithms:
• Mixture of Gaussians
• Self-learning anomaly detection
Optimization
Business tasks:
• Determine the best price for goods or services to maximize profits
• Determine the best working schedule for the store
• Determine the best amount of production
• Determine the best business rules
Algorithms:
• Gradient descent
• Simplex method
• Newton’s method
• Normal equations
• Genetic algorithms
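Two entries from this slide fit together naturally: logistic regression (Classification) trained with gradient descent (Optimization). The following is a minimal, dependency-free sketch of that combination, not the presenters' implementation. The toy data set, learning rate, and epoch count are assumptions chosen only to keep the example small.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Linear score w . x + b for one feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Gradient of the log-loss w.r.t. the score is (prediction - label).
            error = sigmoid(score(weights, bias, x)) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        # Step against the averaged gradient.
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(score(weights, bias, x)) >= 0.5 else 0


# Hypothetical toy data: one feature, two linearly separable classes.
FEATURES = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
LABELS = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    w, b = train_logistic_regression(FEATURES, LABELS)
    print([predict(w, b, x) for x in FEATURES])
```

Swapping the update rule for Newton's method, or the model for another family, changes only the training function; the split between "model family" and "optimization algorithm" is the point the slide's columns make.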
21. Google to Buy Waze for $1.3 Billion
Xerox plans to clear traffic on I-10
The promise of better data has MetLife investing $300M in new tech
Gracenote did a whole business on recommending music
Obama’s data scientists built a volunteer army on Facebook
22.
23. Description:
Cloud-based service that provides more accurate creditworthiness estimates (loan scoring) using publicly available data from social networks. The service is intended for use by banks.
Technologies:
Amazon EC2
MySQL
SAP HANA
R
Java
Credit Score
25. Description:
Computer-aided diagnostic system that can recognize a human body part on an X-ray image and detect broken or fractured bones
X-Ray Image
Technologies:
Matlab/Octave
Python
PyBrain
NumPy
SciPy
Analytical Engine
This is a hand.
Broken bone detected