Title: Using Graph Theory to understand User Intent
Subtitle: Graph-based Natural Language Processing applied to real-time Machine Learning
Abstract:
We are in a Graph Renaissance period. The advent of high-performance free/open-source software, combined with inexpensive Cloud computing platforms, enables graphs of information to be manipulated and utilised at scales never seen before. While use cases like mining social and web data with graphs are commonplace, their use in Natural Language Processing has largely been overlooked. In this presentation Michael Cutler will describe how TUMRA have used graph-based NLP algorithms as a core component of their upcoming digital marketing product TUMRA Optimize.
Presenter: Michael Cutler
Bio:
Michael is the CTO and co-founder of TUMRA, a Data Science startup based in Chiswick, West London. Having first discovered Hadoop back in 2008, Michael has followed the bleeding edge of ‘Big Data’ technology since before it was called ‘Big Data’ and has applied it to solve real-world problems.
Before starting TUMRA, Michael was a senior researcher in the R&D labs for British Sky Broadcasting, inventing new technologies and solutions for everything from Satellite, Video and Network systems through to Web and Mobile-based applications.
Website: http://tumra.com http://cotdp.com
Twitter: @tumra @cotdp
Using Graph theory to understand Intent & Concepts - Neo4j User Group (January 2013)
1. Using Graph Theory to understand Intent & Concepts – January 2013
tumra.com
2. UNDERSTANDING INTENT & CONCEPTS
• Use case:
- Enhancing Social TV user experience
- Matching users to content that interests them
• Topics we’ll cover:
- Natural Language Processing
- Graph Theory
- Machine Learning
3. USE CASE ENHANCED SOCIAL TV
• Objectives:
- Increase engagement with content
- Enhance multi-channel user experience
• We built a prototype solution:
- Mines unstructured data in real-time
- Understands:
- What interests individual users
- Entities & Concepts (People, Places, Events)
4. THE CHALLENGE
Help users to “follow the story” regardless of the
news outlet, integrated web / second-screen
Photo Credit: byrion on Flickr (cc)
6. THE PROBLEM
• Little useful data to work with…
- Streams of continuous live TV
- Have to create metadata
• Where did we start?
- Ingest several live news channels
- Extract whatever data was available:
- In-video text using OCR
- Subtitles / Closed Captions
7. STEP 1 NAMED ENTITY RECOGNITION
We used a simple N-Gram model for exact matches;
then Apache Lucene for everything else…
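As a rough illustration of the exact-match stage, the sketch below slides word N-grams over a sentence and looks each one up in a small dictionary. The `KNOWLEDGE_BASE` contents, the `find_entities` name and the longest-match-first rule are assumptions made for this example, not TUMRA's implementation, and the Lucene fallback is not shown.

```python
import re

# Hypothetical miniature knowledge base: lower-cased surface form -> entity type.
KNOWLEDGE_BASE = {
    "david cameron": "Person",
    "angela merkel": "Person",
    "german chancellor": "Concept",
    "debt": "Concept",
    "eurozone": "Place",
}


def find_entities(text, max_n=3):
    """Return (surface form, type) pairs found by exact N-gram lookup."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest N-gram first so a multi-word name beats a shorter match.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in KNOWLEDGE_BASE:
                found.append((candidate, KNOWLEDGE_BASE[candidate]))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no N-gram starting at this token matched
    return found


sentence = ("David Cameron and the German Chancellor Angela Merkel meet to "
            "discuss the debt crisis and signal their approval for greater "
            "eurozone integration.")
entities = find_entities(sentence)
```

Exact lookup like this cannot tell which “David Cameron” is meant; it only finds the surface form.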
8. EXAMPLE N.E.R.
“David Cameron and the German
Chancellor Angela Merkel meet to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
9. EXAMPLE N.E.R.
“David Cameron and the German
Chancellor Angela Merkel meet to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
10. INITIAL SOLUTION
[Diagram labels: Unstructured Data, NER, NoSQL, Awesomeness!]
12. DISAMBIGUATION
• Which “David Cameron”?
- We have many in our Knowledgebase
- Sportsmen, actors, painters & characters…
• Our initial simplistic approach was naïve
- Works great with unambiguous matches
- At best, returns the top-scoring entity
• We needed a smarter approach
13. RECAP
• We have an effectively ‘flat’ KB of Entities
- “David Cameron” -> Politician (Person)
- “Angela Merkel” -> Politician (Person)
- “German Chancellor” -> Political office (Concept)
- “Debt” -> Economic concept (Concept)
- “Eurozone” -> Economic area (Place)
• We needed a way to find relationships
between Entities
14. THE BIG IDEA
Graphs allow us to store relationships between entities, and
graph algorithms allow us to interrogate those connections…
15. GRAPH DATABASES
Neo4J, GraphLab, Apache Giraph, Golden Orb
… of course there are many more open-source & proprietary ones
16. SO, WHICH ONE?
???
… it had to be fast, scalable and under active development
17. STEP 2 BUILDING RELATIONSHIPS
We had 250 million Nodes, and 4 billion Edges…
great initial results but horrendously inefficient!
Example: “David Cameron” & “Angela Merkel”
20. INITIAL IMPROVEMENTS
• We didn’t need everything… just:
- People: “David Cameron”, “Angela Merkel”
- Places: “London”, “Downing Street”, “Eurozone”
- Concepts: “Debt”, “President”, “Eurozone”
- Things: Companies, Products etc.
• Pruned the graph using Map/Reduce
• This reduced the number of Entities…
- … but we still had billions of connections
21. EXAMPLE PEOPLE, PLACES, CONCEPTS
“David Cameron and the German
Chancellor Angela Merkel meet to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
22. EXAMPLE PEOPLE, PLACES, CONCEPTS
“David Cameron and the German
Chancellor Angela Merkel meet to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
[Highlight legend: People, Places, Concepts]
23. DISAMBIGUATION
[Graph diagram. Nodes shown: Angela Merkel; David Cameron (painter); David Cameron (footballer); David Cameron (actor); David Cameron (politician); Living Person; Politician; Head of State]
Possibilities: shortest path, number of common connections etc.
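Both possibilities can be sketched on a toy graph. The node names below come from the slide, but the edges are invented for illustration (the original diagram's connections are not recoverable here), and `common_connections` and `shortest_path_length` are hypothetical helper names.

```python
from collections import deque

# Hypothetical undirected edges; invented for illustration only.
EDGES = [
    ("Angela Merkel", "Living Person"),
    ("Angela Merkel", "Politician"),
    ("Angela Merkel", "Head of State"),
    ("David Cameron (politician)", "Living Person"),
    ("David Cameron (politician)", "Politician"),
    ("David Cameron (politician)", "Head of State"),
    ("David Cameron (footballer)", "Living Person"),
    ("David Cameron (actor)", "Living Person"),
    ("David Cameron (painter)", "Painter"),
]

CANDIDATES = [
    "David Cameron (painter)",
    "David Cameron (footballer)",
    "David Cameron (actor)",
    "David Cameron (politician)",
]


def build_graph(edges):
    """Build an adjacency map (node -> set of neighbours)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def common_connections(graph, a, b):
    """Nodes directly connected to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


GRAPH = build_graph(EDGES)
scores = {c: len(common_connections(GRAPH, "Angela Merkel", c)) for c in CANDIDATES}
best = max(scores, key=scores.get)
```

With these made-up edges, the politician shares the most connections with “Angela Merkel” and so wins the disambiguation.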
24. STEP 3 SIMPLIFYING THE GRAPH
Sure all that extra metadata was tasty but we didn’t
need it all to solve the use-case…
So we used Map/Reduce to count the common
connections
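The counting step can be sketched as a map phase and a reduce phase run in memory. This is plain Python that mimics the shape of a Map/Reduce job, not Hadoop code; the `ADJACENCY` data and the function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: connecting node -> entities linked to it.
ADJACENCY = {
    "Living Person": [
        "Angela Merkel",
        "David Cameron (politician)",
        "David Cameron (footballer)",
        "David Cameron (actor)",
    ],
    "Politician": ["Angela Merkel", "David Cameron (politician)"],
    "Head of State": ["Angela Merkel", "David Cameron (politician)"],
}


def map_phase(node, linked_entities):
    """Emit ((entity_a, entity_b), 1) for every pair of entities sharing this node."""
    for a, b in combinations(sorted(linked_entities), 2):
        yield (a, b), 1


def reduce_phase(pairs):
    """Sum the emitted counts per entity pair."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


emitted = [kv for node, linked in ADJACENCY.items() for kv in map_phase(node, linked)]
common_counts = reduce_phase(emitted)
```

Each key in `common_counts` is an alphabetically ordered pair of entities and each value is the number of connections they share, which is the single weight kept on the simplified edge. Hub nodes such as “Living Person” emit many pairs, so a real job would probably need to restrict which pairs it counts.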
25. SIMPLIFIED
[Graph diagram. Nodes shown: Angela Merkel; David Cameron (painter); David Cameron (footballer); David Cameron (actor); David Cameron (politician). Edge labels shown: 1, 3, 1]
Whoa … that looks a lot like a Least Cost Routing problem
26. LEAST COST PATH
[Graph diagram. Nodes shown: Angela Merkel; David Cameron (painter); David Cameron (footballer); David Cameron (actor); David Cameron (politician). Edge labels shown: 1/1, 1/3, 1/1]
1 / number of common connections = cost
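A minimal sketch of the least-cost idea, using Dijkstra's algorithm over edges whose cost is 1 / number of common connections. Which candidate carries which count is an assumption here (the slide's edge labels cannot be matched to nodes), and the function names are hypothetical.

```python
import heapq

# Hypothetical common-connection counts between the context entity and candidates.
COMMON = {
    ("Angela Merkel", "David Cameron (politician)"): 3,
    ("Angela Merkel", "David Cameron (footballer)"): 1,
    ("Angela Merkel", "David Cameron (actor)"): 1,
}

CANDIDATES = [
    "David Cameron (painter)",
    "David Cameron (footballer)",
    "David Cameron (actor)",
    "David Cameron (politician)",
]


def build_weighted_graph(common):
    """Turn counts into undirected edges with cost = 1 / count."""
    graph = {}
    for (a, b), count in common.items():
        cost = 1.0 / count
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def least_costs(graph, source):
    """Dijkstra's algorithm: cheapest total cost from source to each reachable node."""
    costs = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > costs.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < costs.get(nxt, float("inf")):
                costs[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return costs


GRAPH = build_weighted_graph(COMMON)
costs = least_costs(GRAPH, "Angela Merkel")
# Unreachable candidates (no common connections) get infinite cost.
best = min(CANDIDATES, key=lambda c: costs.get(c, float("inf")))
```

Inverting the count means strongly connected candidates are cheap to reach, so the cheapest candidate is the preferred interpretation.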
27. UPDATED SOLUTION
[Diagram labels: Unstructured Data, NER, NoSQL, Disambiguation, Neo4J, Awesomeness!]
28. RECAP
• Graphs allow us to interrogate relationships
- Disambiguate when faced with multiple possibilities
- Infer more about the context of what’s happening
• Went through iterations of improvements
- Kept our Entity data in NoSQL = TBs
- Used the Graph as an index of sorts = GBs
• Neo4j was a great fit for our needs
29. STEP 4 MAKING IT WORK REAL-TIME
Some queries were taking ‘seconds’ and we needed
to go a lot faster because TV won’t wait for us …
Do we really need to check the Graph every time?
30. ENTER MACHINE LEARNING
• We can use simple predictors to estimate
the likelihood of Entities occurring
- i.e. every time we’ve looked for “David Cameron” in
the past the best match was the Politician
• Keeping a ‘probabilistic context’ of recent
Entities allows us to detect shifts in topics
- Works especially well on News channels
- Reduces the demand on Graph lookups
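One way to picture such a predictor is a table of past resolutions per mention, with older observations decayed so that recent context counts for more. The class below is a sketch under those assumptions: the `MentionPredictor` name, the exponential decay and the 0.8 threshold are all invented for illustration.

```python
from collections import defaultdict


class MentionPredictor:
    """Guess the entity for a mention from decayed counts of past resolutions."""

    def __init__(self, decay=0.9, threshold=0.8):
        self.decay = decay          # weight applied to older observations
        self.threshold = threshold  # minimum probability to skip the graph lookup
        self.weights = defaultdict(lambda: defaultdict(float))

    def observe(self, mention, entity):
        """Record a resolved mention, for example after a graph lookup."""
        seen = self.weights[mention]
        for key in seen:
            seen[key] *= self.decay  # fade older evidence
        seen[entity] += 1.0

    def predict(self, mention):
        """Return (entity, probability) when confident, otherwise None."""
        seen = self.weights.get(mention)
        if not seen:
            return None
        total = sum(seen.values())
        entity, weight = max(seen.items(), key=lambda kv: kv[1])
        probability = weight / total
        if probability >= self.threshold:
            return entity, probability
        return None  # not confident: fall back to the graph


predictor = MentionPredictor()
for _ in range(5):
    predictor.observe("David Cameron", "David Cameron (politician)")
confident = predictor.predict("David Cameron")

# A different resolution signals a possible topic shift.
predictor.observe("David Cameron", "David Cameron (footballer)")
after_shift = predictor.predict("David Cameron")
```

In this sketch a single surprising resolution is enough to drop the confidence below the threshold, so the next lookup goes back to the graph.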
31. BAYES THEOREM
Looks complicated, but it’s basically just counting & division
Photo Credit: mattbuck007 on Flickr (cc)
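For reference, Bayes' theorem written in terms of this use case, where e is a candidate entity and m is the observed mention (the symbols and the count notation are chosen here for illustration):

```latex
P(e \mid m) = \frac{P(m \mid e)\, P(e)}{P(m)}
            \approx \frac{\mathrm{count}(m, e)}{\mathrm{count}(m)}
```

When every probability is estimated from the same set of past observations, the totals cancel and the estimate reduces to the number of times m resolved to e divided by the number of times m was seen.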
32. STEP 5 MAKING IT WORK WORLDWIDE
We solved the problem for English, but what about
other languages?
33. LANGUAGE
• Our core Entities of ‘People’, ‘Places’, &
‘Concepts’ are language agnostic…
• We needed a way to ditch ‘language’ and
jump straight to entities…
- The colour ‘Red’ means the same thing regardless of
whether you call it ‘Rot’, ‘Rouge’ or ‘赤’
• Again, Graphs could solve the problem
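A minimal sketch of the idea, assuming each language-specific label is an edge pointing at one language-neutral entity node. The entity ID `colour/red`, the `LABEL_EDGES` data and the function names are hypothetical.

```python
# Hypothetical label edges: (language, surface form) -> language-neutral entity ID.
LABEL_EDGES = {
    ("en", "Red"): "colour/red",
    ("de", "Rot"): "colour/red",
    ("fr", "Rouge"): "colour/red",
    ("ja", "赤"): "colour/red",
}


def build_label_index(label_edges):
    """Index labels case-insensitively, ignoring the language they came from."""
    index = {}
    for (_language, label), entity_id in label_edges.items():
        index.setdefault(label.casefold(), set()).add(entity_id)
    return index


def resolve(index, word):
    """Return the set of entity IDs a word may refer to (empty if unknown)."""
    return index.get(word.casefold(), set())


INDEX = build_label_index(LABEL_EDGES)
```

The lookup returns a set because one surface form can point at several entities across languages, at which point the same disambiguation approach applies.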
35. PROBLEM SOLVED
Typical response time ~30ms … relevancy improves
over time and the system learns new entities ‘online’
36. FINAL SOLUTION
[Diagram labels: Unstructured Data, NER, NoSQL, Language Model, Machine Learning, Disambiguation, Neo4J, Awesomeness!]
37. ABOUT US
• We’ve built a product…
- Our ‘Digital Marketing Optimization’ platform
improves conversion rates & customer satisfaction
for eCommerce & Marketing campaigns
- Launches Q1 2013
• What else do we do?
- ‘Big Data’ & ‘Data Science’ professional services
- Bespoke prototype & solution development
“TUMRA” is a transliteration of the Sanskrit word for “BIG”;
we thought it was a great name … (and the .COM was available)
38. TUMRA
You?
We’re hiring!
Data Scientists & Developers
work@tumra.com
39. THANKS FOR LISTENING
Questions?
tumra.com
hello@tumra.com
twitter.com/tumra