Slides from LA Algorithmic Trading event (http://www.meetup.com/LA-Algorithmic-Trading/events/98963812/) on using BigData and Algorithms in your business. Covers how What'sGood uses algorithms to allow users to make choices about food on the go and the "BigData" infrastructure we've built to support them.
Includes topics such as "big" data ingestion, in-stream processing, NLP algorithms, assessing "popularity", assigning relevancy weights in search, adding "dimensionality" to restaurant menus by cleansing public data sets, and mapping loosely correlated datasets into your own.
3. Welcome
Tim Shea tim@whatsgood.com @sheanineseven
Data Scientist
Ad Agency Guy (Razorfish, Universal, TBWAChiatDay)
Founder and CTO of WhatsGood.com
Big interest in convergence of Tech and Finance communities
4. Elevator Pitch
Digital Menu Platform for picky eaters on-the-go.
Data-centric POV
Search/Sort/Slice/Dice, Answering “What’s Good Here?”
The “Good” in WhatsGood varies by person.
5. “Dimensionality”
Hundreds of data points *behind* each menu item.
This data is *hidden* by traditional analog menus.
Dimensionality = Personalization.
9. Problem
Noise
80/20 - In any scenario where you’re ordering food (e.g. at-home,
in-restaurant, etc.), 80% of menu info is noise.
Bad In-store. Worse when considering multiple locations.
Paper menus don’t help this situation at all.
10. Result
Human error.
Leads to:
Frustration - “I’ll just get what I usually get”
Alienation - “I’m going out with my meat-eating friends, I’ll just bring a granola bar”
Accidents - “The waiter didn’t know there was soy sauce in there, and I ended up in
the hospital”
11. Hypothesis
BigData + Machine Learning + The Crowd
Will remove these pain points.
And create something truly valuable for people.
Literally improve the way we discover food, permanently.
15. ClydeStorm
Menu Ingestion - Every 2 weeks, reconcile 400,000
Restaurants and 50MM Menu Items (Add/Edit/Delete)
NLP Classifiers - Then, for every dish, we run 8 NLP
classifiers to determine (V,G,N,L,P,&Pop)
Data Mapping - Orthogonal datasets that “don’t quite fit”
Search - Handles all the modern indexing and retrieval
operations consumers are accustomed to.
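The Add/Edit/Delete reconciliation in the Menu Ingestion step can be pictured as a diff between two snapshots. The following is only a minimal sketch under stated assumptions: the `reconcile` function, the dish ids, and the in-memory dictionaries are hypothetical and are not taken from the ClydeStorm codebase, which would have to work against a database at a far larger scale.

```python
def reconcile(current, incoming):
    """Diff two {item_id: record} snapshots into add/edit/delete lists."""
    current_ids = set(current)
    incoming_ids = set(incoming)
    # Items only in the new feed are additions.
    to_add = sorted(incoming_ids - current_ids)
    # Items only in the stored snapshot have disappeared from the menu.
    to_delete = sorted(current_ids - incoming_ids)
    # Items in both snapshots are edits only if the record changed.
    to_edit = sorted(
        item_id
        for item_id in current_ids & incoming_ids
        if current[item_id] != incoming[item_id]
    )
    return {"add": to_add, "edit": to_edit, "delete": to_delete}


# Hypothetical snapshots for illustration only.
stored = {"d1": "Ribeye Steak", "d2": "Vegan Chili", "d3": "Caesar Salad"}
fresh = {"d2": "Vegan Chili", "d3": "Kale Caesar Salad", "d4": "Tofu Scramble"}
changes = reconcile(stored, fresh)
print(changes)
```

Splitting the result into three lists keeps the expensive downstream work (re-running classifiers) limited to dishes that were added or edited.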
16. Vegas8
Based on a simple human intuition:
“Signal Words” help us make 1 of 3 determinations:
1. Definitely Positive - “Vegan”: All bets are off, obviously vegan.
2. Strongly Negative - “Ribeye Steak”: Pretty damn confident, not vegan.
3. Fuzzy Signal - Not enough info, conflicting info, fuzzy signal.
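The three determinations can be sketched as a tiny rule-based classifier. This is an illustrative sketch, not the Vegas8 implementation: the word lists and the `classify_vegan` function are assumptions made up for this example, and treating conflicting signals as "fuzzy" is one possible reading of the slide.

```python
import re

# Hypothetical signal word lists; a real system would use far larger ones.
POSITIVE = {"vegan"}
NEGATIVE = {"ribeye", "steak", "bacon", "chicken"}


def classify_vegan(text):
    """Return one of the three determinations for a dish title."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_positive = bool(words & POSITIVE)
    has_negative = bool(words & NEGATIVE)
    if has_positive and not has_negative:
        return "definitely_positive"
    if has_negative and not has_positive:
        return "strongly_negative"
    # Either no signal words at all, or signals that conflict.
    return "fuzzy"


for title in ["Vegan Chili", "Ribeye Steak", "House Salad", "Vegan Bacon Burger"]:
    print(title, "->", classify_vegan(title))
```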
18. FoodNet
Based loosely on WordNet - Open Source Princeton project
Lexical Knowledge Graph of word relations (vs a list)
ex. Obviously “MILK” is a signal for “Contains Lactose”
But so are all of its other permutations:
- Synonyms
- Hyper- & Hypo-nyms
- Other languages
- All the foods in the world that commonly use MILK as an ingredient
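The difference between a graph and a flat list is that one root signal can be expanded by following relations. Below is a minimal sketch that assumes a hand-written adjacency dictionary; `FOOD_GRAPH`, its entries, and `expand_signal` are hypothetical stand-ins and do not reflect the actual FoodNet data or the WordNet API.

```python
from collections import deque

# Hypothetical relations: each term points to narrower terms, translations,
# or dishes that commonly contain it.
FOOD_GRAPH = {
    "milk": ["cream", "butter", "cheese", "leche", "lait"],
    "cheese": ["mozzarella", "parmesan"],
    "cream": ["alfredo sauce"],
    "mozzarella": ["caprese salad"],
}


def expand_signal(root, graph):
    """Breadth-first walk collecting every term reachable from the root."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for neighbor in graph.get(term, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


lactose_signals = expand_signal("milk", FOOD_GRAPH)
print(sorted(lactose_signals))
```

With this structure, adding one new edge (for example a regional cheese) makes it a "Contains Lactose" signal without touching any other list.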
20. First Version
Read from Menu DB - 50MM Venue, Dish Title & Description
Read from Synonym DB - Slam it into a big RegEx
For Each record - Any matches?
Save Results
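The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the in-memory `menu` and `synonyms` lists stand in for the Menu DB and Synonym DB, and the field names are invented for the example.

```python
import re

# Stand-in for the Synonym DB (hypothetical terms).
synonyms = ["milk", "cream", "butter", "cheese", "leche"]

# "Slam it into a big RegEx": one alternation with word boundaries.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in synonyms) + r")\b",
    re.IGNORECASE,
)

# Stand-in for the Menu DB (hypothetical records).
menu = [
    {"venue": "Cafe A", "title": "Fettuccine Alfredo",
     "description": "Pasta in a butter and cream sauce"},
    {"venue": "Cafe A", "title": "Garden Salad",
     "description": "Mixed greens, lemon vinaigrette"},
]


def scan(records):
    """For each record, collect the distinct synonyms that matched."""
    results = []
    for record in records:
        text = f"{record['title']} {record['description']}"
        matches = sorted({m.lower() for m in pattern.findall(text)})
        results.append({"title": record["title"], "matches": matches})
    return results


results = scan(menu)
print(results)
```

A single compiled alternation is simple to build, but it grows with every synonym and gives only a yes/no answer per term, which is part of what motivates looking for better tools.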
23. Stepping Back
How do we find better tools for the job?
How do we measure any improvements we make?
Is there a more “Algorithmic” approach?
Such as Machine Learning in general, or NLP specifically?
25. What is NLP?
Natural Language Processing
Attempt to formalize the ways in which humans understand
language, into a computer program.
Slippery - We’re not accustomed to thinking about how we
understand each other, we just do it.
26. Widely Applicable
Semantic Analysis - What’s the overall mood here?
Text Classification - What is this document I’m reading?
Knowledge Mapping - Which things relate to which?
Info Extraction - What are the major topics discussed?
32. Orthogonality
Rhombus - The What’sGood Decoder Ring
Library that attempts to resolve “Matching Problems”
For Example: Public Calorie Database - Can I even use it?
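One way to picture a "Matching Problem" is linking menu dish titles to the names used in an external calorie table. The sketch below uses fuzzy string matching from Python's standard `difflib` module as a stand-in technique; it is not how Rhombus works, and `CALORIE_DB`, its calorie values, and `match_dish` are placeholders invented for illustration.

```python
import difflib

# Hypothetical public dataset: names and values are placeholders.
CALORIE_DB = {"cheeseburger": 303, "caesar salad": 190, "french fries": 312}


def match_dish(title, reference_names, cutoff=0.7):
    """Return the closest reference name for a dish title, or None."""
    normalized = " ".join(title.lower().split())
    hits = difflib.get_close_matches(normalized, reference_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None


names = list(CALORIE_DB)
for title in ["Cheese Burger", "Classic Caesar Salad", "Pad Thai"]:
    match = match_dish(title, names)
    calories = CALORIE_DB[match] if match else None
    print(title, "->", match, calories)
```

The cutoff is the key design choice: too low and unrelated dishes inherit wrong data, too high and most of the public dataset goes unused.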
35. “Bag of Words”
Representation typically fed into a Naive Bayes Classifier
Tokenize
Remove Stop Words
Stemming the remaining words
Frequency Distribution - How many times did this occur?
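The four steps above can be sketched with the standard library alone. This is a simplified illustration: the stop word list is a tiny hypothetical sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "in", "on", "is", "was"}


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    # 1. Tokenize into lowercase alphabetic words.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stop words.
    kept = [t for t in tokens if t not in STOP_WORDS]
    # 3. Stem, then 4. count occurrences (the frequency distribution).
    return Counter(crude_stem(t) for t in kept)


bag = bag_of_words("Grilled chicken with grilled onions and the roasted peppers")
print(bag)
```

Word order is discarded entirely, so "grilled" counts twice no matter where it appears in the description.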
36. Edge Cases
Yelp Review - Comme Ca
“You’d expect a place with such a diverse selection of french food,
wonderfully accomodating staff, and a world class chef to live up to its
amazing reputation, but it just simply did not.”
38. Humans!!
National Weather Service
Tries to quantify the effect of humans:
- Precipitation forecasts - 25% lift
- Temperature forecasts - 10% lift
Traders
Need human judgement when a model is failing.
40. Popularity Algorithm
“Social Triangulation”
(A * (# star ratings)
+
B * (# of dish mentions/total reviews at restaurant)
+
C * (# of photos/avg mentions per restaurant in specific
geography)
) * Arbitrary population weight
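The formula above translates directly into a small function. This is a sketch under stated assumptions: the parameter names, the zero-division guards, and the example numbers are invented here, and the slide does not give actual values for A, B, C, or the population weight.

```python
def popularity(star_ratings, dish_mentions, total_reviews, photos,
               avg_mentions_in_area, a=1.0, b=1.0, c=1.0, population_weight=1.0):
    """Weighted sum of three social signals, scaled by a population weight."""
    # Share of the restaurant's reviews that mention this dish.
    mention_rate = dish_mentions / total_reviews if total_reviews else 0.0
    # Photo count relative to typical mention volume in the same geography.
    photo_rate = photos / avg_mentions_in_area if avg_mentions_in_area else 0.0
    raw = a * star_ratings + b * mention_rate + c * photo_rate
    return raw * population_weight


# Hypothetical inputs and weights for illustration only.
score = popularity(
    star_ratings=40, dish_mentions=10, total_reviews=50,
    photos=6, avg_mentions_in_area=3.0,
    a=0.5, b=20.0, c=2.0, population_weight=1.5,
)
print(score)
```

Because the three terms live on very different scales (a raw count versus two ratios), the choice of A, B, and C effectively decides which signal dominates.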
41. Search Weights
Which signals are more important:
Number of times your search query matched something?
Your previous searches & behaviors?
Does Proximity to you outweigh other factors?
Does Popularity?
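A simple way to experiment with these questions is a weighted sum over normalized signals, then watching how the ranking changes as weights shift. The sketch below is illustrative only: the signal names, the candidate restaurants, and every number are hypothetical, and a linear blend is just one of many possible scoring schemes.

```python
def rank(candidates, weights):
    """Sort candidates by a weighted sum of their signal values."""
    def score(candidate):
        return sum(weights[name] * candidate[name] for name in weights)
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates with signals already normalized to 0..1.
candidates = [
    {"name": "Taco Stand", "text_match": 0.5, "personal": 0.5,
     "proximity": 0.9, "popularity": 0.3},
    {"name": "Famous Taqueria", "text_match": 0.5, "personal": 0.5,
     "proximity": 0.2, "popularity": 0.95},
]

proximity_first = {"text_match": 1, "personal": 1, "proximity": 3, "popularity": 1}
popularity_first = {"text_match": 1, "personal": 1, "proximity": 1, "popularity": 3}

near = [c["name"] for c in rank(candidates, proximity_first)]
popular = [c["name"] for c in rank(candidates, popularity_first)]
print(near)
print(popular)
```

The same two candidates swap places depending on the weights, which is why the relative importance of each signal has to be decided (or learned) deliberately.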
49. Trading Parallels
Dynamic vs Static Systems
Knowledge/Signal Graph
If you’re monitoring “Apple” you’ll need to monitor:
- Apple, $AAPL, Tim Cook, iPhone, Foxconn
- And assign a signal weight and signal vector for each
Orthogonality
Using loosely correlative systems
50. Data Science
Burgeoning skill set:
Data
Programmer
Sys admin
Full stack knowledge
Stats
Probability
Algorithms
Empirical methodology
Business
“Real world” knowledge
Subjectivity
Modeling uncertainty