9. Preference and Similarity Recommendations
Graph: User and Movie vertices
1.7MM Nodes
23.9MM Edges
Edge types: prefers, similar cast, similar topic
Example vertices:
• User: userId: A0A22A5
• Movie: title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]
• Movie: title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]
• Movie: title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]
Example edge weights: 11.8, 0.67, 0.03, 14.98
Method: Min-cost path search
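A minimal sketch of how a min-cost path search could rank movies for a user on a graph like the one above. It uses Dijkstra's algorithm and treats each edge weight as a traversal cost. The slide does not say which edge carries which weight, so the assignment in `EDGES` is an assumption, as are the function names and the choice to traverse edges in both directions.

```python
import heapq

# Hypothetical edge list. The slide gives the weights 11.8, 0.67, 0.03 and
# 14.98 but not which edge carries which value, so this mapping is assumed.
EDGES = [
    ("user:A0A22A5", "The Godfather", "prefers", 0.03),
    ("The Godfather", "Scarface", "similar cast", 0.67),
    ("The Godfather", "The Departed", "similar topic", 14.98),
    ("Scarface", "The Departed", "similar topic", 11.8),
]


def build_adjacency(edges):
    """Turn the edge list into an undirected adjacency map."""
    adj = {}
    for src, dst, _label, weight in edges:
        adj.setdefault(src, []).append((dst, weight))
        adj.setdefault(dst, []).append((src, weight))
    return adj


def min_cost_paths(adj, source):
    """Dijkstra's algorithm: cheapest total cost from source to each vertex."""
    cost = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        current, node = heapq.heappop(heap)
        if current > cost.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in adj.get(node, []):
            candidate = current + weight
            if candidate < cost.get(neighbor, float("inf")):
                cost[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return cost


def recommend(edges, user, already_liked):
    """Rank unseen movies by the cost of the cheapest path from the user."""
    costs = min_cost_paths(build_adjacency(edges), user)
    candidates = [
        (c, name) for name, c in costs.items()
        if name != user and name not in already_liked
    ]
    return [name for _c, name in sorted(candidates)]


costs = min_cost_paths(build_adjacency(EDGES), "user:A0A22A5")
ranking = recommend(EDGES, "user:A0A22A5", already_liked={"The Godfather"})
print(ranking)
```

Under these assumed weights, Scarface ranks ahead of The Departed because the path through the low-cost "similar cast" edge is cheaper than either path to The Departed.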
10. Loopy Belief Propagation on the (semantic) web
Data: URL Ground-Truth Data, IP/Domain Reputations
420MM Records
74.5MM Nodes
185MM Edges
Vertex types: URL, Domain, IP Address
Steps: Calculation of priors, LBP Messaging
Example vertices: 84.231.82.93, 86.39.155.137, forum.vsichko.com, hermansonskok.se, euskzzbz.nonetheups.com, keesenbep.spaces.live.com
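The two steps named on the slide, calculating priors and then passing LBP messages, can be illustrated with a small sum-product sketch on a pairwise model with two states (benign, malicious). Everything concrete here is an assumption for illustration: the placeholder domain and IP names, the prior values, the `coupling` edge potential, and the fixed iteration count. It is not the implementation behind the slide.

```python
STATES = (0, 1)  # 0 = benign, 1 = malicious


def loopy_bp(priors, edges, coupling=0.9, iterations=20):
    """Sum-product loopy belief propagation with binary states.

    priors:   {vertex: (p_benign, p_malicious)}, e.g. from ground-truth data
    edges:    iterable of undirected (u, v) pairs
    coupling: assumed probability that two linked vertices share a state
    """
    neighbors = {v: set() for v in priors}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    def psi(a, b):
        # Edge potential: linked vertices tend to have the same state.
        return coupling if a == b else 1.0 - coupling

    # messages[(u, v)][s] is u's message to v about v being in state s.
    messages = {(u, v): [0.5, 0.5] for u in neighbors for v in neighbors[u]}

    for _ in range(iterations):
        updated = {}
        for (u, v) in messages:
            out = []
            for sv in STATES:
                total = 0.0
                for su in STATES:
                    term = priors[u][su] * psi(su, sv)
                    for w in neighbors[u]:
                        if w != v:
                            term *= messages[(w, u)][su]
                    total += term
                out.append(total)
            norm = sum(out)
            updated[(u, v)] = [x / norm for x in out]
        messages = updated  # synchronous update of all messages

    beliefs = {}
    for v in priors:
        b = [priors[v][s] for s in STATES]
        for w in neighbors[v]:
            for s in STATES:
                b[s] *= messages[(w, v)][s]
        norm = sum(b)
        beliefs[v] = [x / norm for x in b]
    return beliefs


# Hypothetical priors: two vertices with known reputation, the rest unknown.
priors = {
    "bad.example": (0.05, 0.95),
    "good.example": (0.95, 0.05),
    "unknown.example": (0.5, 0.5),
    "other.example": (0.5, 0.5),
    "198.51.100.7": (0.5, 0.5),
    "203.0.113.9": (0.5, 0.5),
}
# Hypothetical "domain resolves to IP" edges.
edges = [
    ("bad.example", "198.51.100.7"),
    ("unknown.example", "198.51.100.7"),
    ("good.example", "203.0.113.9"),
    ("other.example", "203.0.113.9"),
]
beliefs = loopy_bp(priors, edges)
for vertex, (benign, malicious) in sorted(beliefs.items()):
    print(f"{vertex}: P(malicious) = {malicious:.3f}")
```

In this toy graph the unlabeled domain that shares an IP address with the known-bad domain ends up with a belief that leans malicious, which is the kind of reputation spreading the slide describes.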
13. You may actually need this
• When the problem is an information network
• When a graph is a natural way of expressing the algorithm
• When you want to study specific relationships
• When you want faster machine learning or solvers on sparse data
Examples: shortest path, central influence, sub networks, triangle count
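As one concrete instance of the relationship queries listed above, triangle counting can be sketched in a few lines. The edge list and function name below are made up for illustration, and the approach (intersecting neighbor sets and counting each triangle once by vertex order) is a common textbook method, not necessarily the one behind the slide.

```python
def triangle_count(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    triangles = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbors of u and v close a triangle; requiring
                # u < v < w counts each triangle exactly once.
                for w in adj[u] & adj[v]:
                    if v < w:
                        triangles += 1
    return triangles


# Hypothetical graph: a, b and c form a triangle; d hangs off c.
example_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(triangle_count(example_edges))
```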
14. But there are challenges.
Handling all that data.
Finding people good at both handling all that data and data analysis.
Putting exploratory work into production fast enough to keep up with the competition.
16. It’s a demanding job
Pipeline stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize
Pain points:
Skills shortage at intersection of systems engineering and data analysis
Painful data ingestion and preparation
Workflows that are not designed with loopbacks in mind
Few tools for analyzing semantics at scale
Composing pipeline is DIY
18. IMAGINE A PLATFORM FOR DATA SCIENTISTS
DOCS + SEMANTICS + MACHINE LEARNING
19. Ease-of-use: Making big data familiar
Interfaces: Python, R, Dataflow, GUI, ...
Tiers: Datacenter / Cloud, Network, Client
BIG DATA API
Connect, Manage, Secure, Analyze (distributed and parallel)
Manage, Secure, Connect, Analyze (local), Query
Big Data Java/Scala/C++
Computational Frameworks
Big Data Algorithms
Cluster Workload Mgmt
Cluster Storage
Machine Learning & Statistics
Data Wrangling
Analyst Skills
The Other Skills
20. Delivering it
FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
APACHE HADOOP, APACHE SPARK
DATA WRANGLING
MACHINE LEARNING AND STATISTICS
Graphical Algorithms
Classical Algorithms
Graph Construction Tools
Useful String Manipulation
Useful Math Operators
BIG DATA API
DATA SCIENCE SERVER (Query and Scripting)
Intel Analytics Toolkit
A UNIFIED DOCUMENT + SEMANTIC STORE
The Ask
21. Bringing a full spectrum of possibilities
Approach shown: Graph
Algorithm | Category | Applications/Use Cases
Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
Label Propagation | Structured Prediction | Personalized recommendations
Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
Connected Components | Graph Analytics | Network manipulation, image analysis
Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
Structure Attribute | Clustering | Network analysis, consumer seg
K-Truss | Clustering | Social network analysis
KNN* | Clustering | Recommenders
Logistic Regression* | Classification | Fraud detection
Random Forest* | Classification | Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
Association Rule Mining | Data Mining | Market basket analysis, recommenders
Frequent Pattern Mining* | Data Mining | Pattern Recognition
22. Article Tagging Problem
• Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords
• Process is resource-intensive – can we automate it?
• Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary
24. Demo: Graph Analytics For Medical Journal Analysis
Pipeline stages: INGEST & CLEAN, ENGINEER FEATURES, STRUCTURE GRAPH, QUERY & ANALYZE, LEARN, VISUALIZE
Demo steps: PARSE AND EXTRACT WORDS, CREATE ARTICLE/WORD LIST, BUILD GRAPH, QUERY/VISUALIZE DATA, DETECT CLUSTERS USING LDA
Step details:
• Medline™ XML
• MeSH Ontology XML
• Create list of unique words
• Stemming and lemmatization
• Index word list
• Transform articles into list of article/word pairs
• Extract vertices
• Assign id columns to vertex property
• Assign year and count edge properties
• Gremlin query for each visual
• Python web server and other libraries
• Select optimization parameters
• Invoke LDA
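The word-list and graph-building steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the toolkit's API: the two article records are invented stand-ins for parsed Medline XML, `stem` is a naive suffix stripper standing in for real stemming and lemmatization, and the vertex and edge dictionaries are a hypothetical layout that carries the id, year and count properties named on the slide.

```python
import re
from collections import Counter

# Hypothetical records standing in for articles parsed from Medline XML.
ARTICLES = [
    {"id": "a1", "year": 2012, "text": "Sleep stages and dreaming in rats"},
    {"id": "a2", "year": 2013, "text": "Dreams, sleep and arousal"},
]


def stem(word):
    """Naive suffix stripping; a placeholder for real stemming/lemmatization."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-letters, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_graph(articles):
    """Return (vertices, edges, word_index) for an article/word graph."""
    pair_counts = Counter()  # (article id, word) -> occurrences
    years = {}
    for article in articles:
        years[article["id"]] = article["year"]
        for word in tokenize(article["text"]):
            pair_counts[(article["id"], word)] += 1

    # Unique word list, indexed so each word gets a stable integer id.
    words = sorted({word for _aid, word in pair_counts})
    word_index = {word: i for i, word in enumerate(words)}

    vertices = [{"id": f"article:{a['id']}", "type": "article"} for a in articles]
    vertices += [
        {"id": f"word:{word_index[w]}", "type": "word", "word": w} for w in words
    ]
    # One edge per article/word pair, carrying count and year properties.
    edges = [
        {
            "src": f"article:{aid}",
            "dst": f"word:{word_index[word]}",
            "count": count,
            "year": years[aid],
        }
        for (aid, word), count in sorted(pair_counts.items())
    ]
    return vertices, edges, word_index


vertices, edges, word_index = build_graph(ARTICLES)
print(len(vertices), "vertices,", len(edges), "edges")
```

The resulting edge list is the bipartite article/word structure that a topic model such as LDA would consume; loading it into a graph store and invoking LDA are left out because the slide does not specify those interfaces.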
26. The Real Playbook
Planned steps: PARSE AND EXTRACT WORDS, CREATE ARTICLE/WORD LIST, BUILD GRAPH, QUERY/VISUALIZE DATA, DETECT CLUSTERS USING LDA
What actually happens:
Parse
Correct mistake
Prepare graph data
Correct schema mistake
Correct aggregation mistake
Data validation
Correct dataset mistake
Guess LDA settings
Tune and re-run
Detect bias in dataset
27. WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS
35. Following Analysis
Figure: Top MeSH terms that predict which category an article will be assigned (bar chart, axis from 0 to 0.08)
Terms shown: Wakefulness, Sleep, Animals, Electroencephalography, Circadian Rhythm, Arousal, Sleep Stages, REM, Mental Recall, Attention, Rats, Child, Evoked Potentials, Aged, Schizophrenia, Ocular, Conditioning, Infant, Psychophysics, Dreams
36. Reimagining 2014
New partnerships in big data
Contributions to the open source community
The Intel Analytics Toolkit – COMING SOON
SEMANTICS + MACHINE LEARNING
TOGETHER AT LAST!
37. INTERESTED IN THE INTEL ANALYTICS TOOLKIT?
THEODORE.L.WILLKE@INTEL.COM