Lynx Analytics develops a big graph analysis engine on top of Apache Spark. One of our recent developments is a recurrent neural network library that learns from the structure of a graph in order to predict missing features of its vertices.
A real-life use case is demographic estimation: predicting the ages of a telco's customers from their connections to other people, the ages of those people, and classical features such as internet or phone usage patterns.
One of the main challenges we faced was designing a suitable training process. The usual way of training a supervised learning algorithm treats each vertex as an independent prediction problem, but because our algorithm uses the connections between vertices, we cannot treat vertices independently. On the other hand, if we consider the whole graph as one problem, then we have no separate training data at all. In this talk we show some tricks we used to perform prediction and training on the same graph.
The other main challenge is handling graphs so big that they do not fit into the memory of a single machine, and performing very resource-intensive computations on them. To tackle this, the graph must be stored and processed in a distributed fashion. The difficulty is that we cannot simply cut the graph into smaller pieces, since the training process needs to propagate data along the edges.
In the talk we will show core algorithmic ideas to tackle the above-mentioned problems and present some experimental results.
2. Who is Lynx Analytics?
Big data analytics with a focus on graph data.
Our core product is a graph-oriented analytics application called LynxKite.
• Web UI for fluid exploration workflow with rapid visualization.
• API and automation for autonomous operation.
• Implemented with Apache Spark.
• Machine learning toolbox.
Telecommunications · Financial Services · Smart City · Transport
Outline: the challenge → the prior solutions → the novel solution
3. Problem statement
A single giant graph, such as the friendship relations in a social network.
A partially populated vertex attribute, such as age.
Predict the missing attribute values!
4. Old approach I: Machine Learning
Prediction? Use machine learning! Vertex attributes ⇒ label.
How does it perform?
Entirely ignores edges.
25% accuracy
5. Experiment setup
Anonymized friendship data from a now-defunct social network.
• Filtered to a single city for faster experimentation
• 27,783 profiles
• 1,095,707 friendships
• Age & gender attributes
6. The challenge: Multi-class classification by age
• Age is bucketed into quartiles.
• Model is trained on training data (85%).
• Accuracy is evaluated on test data (15%).
• Accuracy = correct predictions / test set size.
• Final number is the median accuracy from 11 trials.
Experiment setup
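The setup above can be sketched in a few lines of plain Python. This is an illustrative sketch of the evaluation protocol only, not the actual LynxKite code; all function names are assumptions.

```python
import random
import statistics

def quartile_buckets(ages):
    """Return a function mapping an age to its quartile bucket (0..3),
    based on the three quartile cut points of the given ages."""
    cuts = statistics.quantiles(ages, n=4)  # three cut points
    return lambda age: sum(age > c for c in cuts)

def accuracy(predicted, actual):
    """Accuracy = correct predictions / test set size."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def run_experiment(trial, n_trials=11, seed=0):
    """Run `trial` (a function rng -> accuracy) 11 times, report the median."""
    rng = random.Random(seed)
    return statistics.median(trial(rng) for _ in range(n_trials))
```

Taking the median over 11 trials makes the reported number robust to a lucky or unlucky random train/test split.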
7. Old approach II: ML with graph metrics
To present the graph structure to the ML algorithm, calculate every graph metric we can
think of: degree, PageRank, clustering coefficient, harmonic centrality, graph coloring...
(Easy with LynxKite.)
How does it perform?
Expert has to find metrics that are good predictors.
Network neighborhood still largely ignored.
32% accuracy
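As a sketch of one such metric, here is PageRank by power iteration on an adjacency-list graph in plain Python. This is only an illustration of the kind of feature being computed; LynxKite's actual implementation runs distributed on Spark.

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank on an adjacency dict {vertex: [neighbors]}."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in adj}
        for v, nbrs in adj.items():
            if nbrs:
                share = damping * rank[v] / len(nbrs)
                for u in nbrs:
                    nxt[u] += share
            else:  # dangling vertex: spread its mass evenly
                for u in adj:
                    nxt[u] += damping * rank[v] / n
        rank = nxt
    return rank
```

Each vertex's metric values (degree, PageRank, ...) are then concatenated into its feature vector for the classifier.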
8. Old approach III: Neighborhood
Take average value of neighbors.
How does it perform?
Not adaptive. (Average not best option in all cases.)
70% accuracy
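The neighborhood-average baseline is simple enough to sketch directly (an illustrative sketch, not the production code):

```python
def neighborhood_average(adj, labels):
    """Predict each vertex's value as the mean of its labeled neighbors.
    adj: {vertex: [neighbors]}; labels: {vertex: known value}."""
    pred = {}
    for v, nbrs in adj.items():
        known = [labels[u] for u in nbrs if u in labels]
        if known:
            pred[v] = sum(known) / len(known)
    return pred
```

Vertices with no labeled neighbor get no prediction, which is one reason this approach is not adaptive.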
9. Old approach IV: Communities
Find communities. (E.g. connected components in overlapping maximal cliques.)
Take the average value from the most homogeneous community that meets minimal criteria.
How does it perform?
Expert has to pick a good way to identify communities.
Not adaptive. (Why average? Why most homogeneous?)
73% accuracy
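Assuming communities have already been detected upstream, the "most homogeneous community" rule can be sketched like this (an illustration; the function name, the variance-based homogeneity measure, and the size threshold are assumptions):

```python
import statistics

def predict_from_communities(vertex, communities, labels, min_size=3):
    """Among the vertex's communities with enough labeled members,
    take the average label of the most homogeneous one (lowest variance)."""
    best = None
    for comm in communities:
        if vertex not in comm:
            continue
        known = [labels[u] for u in comm if u in labels and u != vertex]
        if len(known) < min_size:
            continue
        spread = statistics.pvariance(known)
        if best is None or spread < best[0]:
            best = (spread, sum(known) / len(known))
    return best[1] if best else None
```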
10. Old approach V: ML with neighborhood data
In addition to metrics, provide the machine learning with the neighborhood average.
How does it perform?
Expert has to manufacture features for ML.
Not perfectly adaptive. (Why average?)
75% accuracy
11. New approach: ML on graph data
Avoid all expert decisions. Just train the model on the raw graph. Model can learn to
identify communities or calculate PageRank if those are required for optimal predictions.
How does it perform?
No expert knowledge required.
Adaptively computes the best features.
81% accuracy
12. Model
Strong recent results with deep learning on graphs, e.g. graph convolutional networks.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, Richard Zemel (2016). Gated Graph Sequence Neural Networks. arXiv:1511.05493 [cs.LG]
A recurrent neural network (GRU) in every vertex, with shared parameters.
State is communicated along edges.
Trained on many small labeled graphs; gives predictions on small unlabeled graphs.
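The core idea can be sketched with a minimal 1-dimensional GRU shared by all vertices: at each step, a vertex's input is the sum of its neighbors' states, and every vertex applies the same cell. This is a deliberately tiny sketch (scalar states, summed messages); the real model uses vector states and learned message functions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ScalarGRU:
    """A 1-dimensional GRU cell; the same parameters are shared by every vertex."""
    def __init__(self, wz, uz, wr, ur, wh, uh):
        self.wz, self.uz, self.wr, self.ur, self.wh, self.uh = wz, uz, wr, ur, wh, uh

    def step(self, x, h):
        z = sigmoid(self.wz * x + self.uz * h)            # update gate
        r = sigmoid(self.wr * x + self.ur * h)            # reset gate
        h_cand = math.tanh(self.wh * x + self.uh * r * h) # candidate state
        return (1.0 - z) * h + z * h_cand

def propagate(adj, state, cell, steps=3):
    """Each step, every vertex reads the sum of its neighbors' states
    and updates its own state with the shared GRU cell."""
    for _ in range(steps):
        message = {v: sum(state[u] for u in adj[v]) for v in adj}
        state = {v: cell.step(message[v], state[v]) for v in adj}
    return state
```

After a few steps, information from a labeled vertex has diffused several hops away, which is exactly what the baselines above could not learn to do adaptively.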
13. Simplified architecture
[Diagram: stacked copies of the same GRU network (shared weights W). Initial state: label (if known) + features. Intermediate state flows along the graph edges between copies of the same network, ending in a prediction.]
14. Problems
• Hard to apply supervised learning when we have a single graph.
• Hard to do anything when this graph does not fit on a single machine.
15. Supervised learning on a single graph
If the network sees none of the known labels, the model ignores existing labels. If the network sees all the known labels, the model learns to just return its own label.
Solution:
• Show some of the known labels.
• Backpropagate error only from the vertices where the label was hidden.
• Hide different labels in each iteration.
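The label-hiding scheme can be sketched as a batch generator (an illustration; the function name and the hide fraction are assumptions, not the actual implementation):

```python
import random

def masked_batches(labeled_vertices, hide_fraction=0.5, iterations=4, seed=0):
    """Each iteration, hide a different random subset of the known labels.
    The visible labels are shown as inputs; the loss is computed only on
    the hidden vertices, whose true labels the network never saw."""
    rng = random.Random(seed)
    labeled = list(labeled_vertices)
    for _ in range(iterations):
        hidden = set(rng.sample(labeled, int(len(labeled) * hide_fraction)))
        visible = set(labeled) - hidden
        yield visible, hidden
```

Because a different subset is hidden each time, every labeled vertex eventually serves both as input evidence and as a training target.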
16. Distributed training and prediction
• Pick representative subgraphs.
• Train in parallel locally on subgraphs. Periodically combine adjustments.
• Prediction on the whole graph is fully distributed.
• Accuracy impact depends on amount of computational resources. Great for scaling.
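One simple way to "combine adjustments" from the parallel workers is to average their parameters periodically. This is only a sketch of that idea; the talk does not specify the exact combination scheme, so the averaging rule here is an assumption.

```python
def combine_adjustments(worker_params):
    """Average the parameter dicts trained in parallel on different subgraphs.
    worker_params: list of {parameter_name: value} dicts, one per worker."""
    n = len(worker_params)
    return {k: sum(p[k] for p in worker_params) / n
            for k in worker_params[0]}
```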
17. Implementation
Closed-source. Sorry.
• Evaluating classical methods and preparing data: LynxKite on Spark
• Researching and prototyping neural networks: TensorFlow
• Distributed forward pass: LynxKite on Spark
• Distributed training (in development): LynxKite on Spark + TensorFlow (TensorFrames?)
18. @LynxAnalytics @Hanna_Gabor @DanielDarabos
You can find us at booth K2. Swing by to see if we have any swag left!
Special thanks to Gabor Olah, Andras Nemeth,
and many others at Lynx for their contributions.