Algorithmic Trading
(In Los Angeles)
LA Algorithmic Trading Meetup - Winter 2013
What’s Good?
Welcome
Tim Shea  tim@whatsgood.com  @sheanineseven


 Data Scientist


 Ad Agency Guy (Razorfish, Universal, TBWAChiatDay)


 Founder and CTO of WhatsGood.com



 Big interest in convergence of Tech and Finance communities
Elevator Pitch
Digital Menu Platform for picky eaters on-the-go.


Data-centric POV


Search/Sort/Slice/Dice, Answering “What’s Good Here”?


The “Good” in WhatsGood varies by person.
“Dimensionality”
Hundreds of data points *behind* each menu item.


This data is *hidden* by traditional analog menus.


Dimensionality = Personalization.
Thomas Guide
Thomas Guide :: Google Earth

as

Paper Menu :: What’sGood
Hypothesis
We’re Empiricists
Problem
Noise

80/20 - In any scenario where you’re ordering food (e.g. at home,
in a restaurant, etc.), 80% of menu info is noise.


Bad In-store. Worse when considering multiple locations.


Paper menus don’t help this situation at all.
Result
Human error.


Leads to:


Frustration - “I’ll just get what I usually get”

Alienation - “I’m going out with my meat-eating friends, I’ll just bring a granola bar”

Accidents - “The waiter didn’t know there was soy sauce in there, and I ended up in
the hospital”
Hypothesis
BigData + Machine Learning + The Crowd


Will remove these pain points.



And create something truly valuable for people.



Literally improve the way we discover food, permanently.
What’sGood Algos
ClydeStorm
Components


FoodNet + Vegas8 + Rhombus
ClydeStorm
Menu Ingestion - Every 2 weeks, reconcile 400,000
Restaurants and 50MM Menu Items (Add/Edit/Delete)


NLP Classifiers - Then, for every dish, we run 8 NLP
classifiers to determine (V,G,N,L,P,&Pop)


Data Mapping - Orthogonal datasets that “don’t quite fit”


Search - Handles all the modern indexing and retrieval
operations consumers are accustomed to.
Vegas8
Based on a simple human Intuition:


“Signal Words” help us make 1 of 3 determinations:


1. Definitely Positive - “Vegan”: All bets are off, obviously vegan.

2. Strongly Negative - “Ribeye Steak”: Pretty damn confident, not vegan.

3. Fuzzy Signal - Not enough info, conflicting info, fuzzy signal.
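
The three determinations can be sketched in a few lines of Python. This is a minimal illustration, not the Vegas8 implementation: the signal-word sets and the function name `classify_vegan` are made up for the example, and the real vocabulary is not given in the slides.

```python
import re

# Hypothetical signal words; the actual Vegas8 lists are not in the slides.
POSITIVE_SIGNALS = {"vegan"}
NEGATIVE_SIGNALS = {"ribeye", "steak", "bacon", "chicken"}

def classify_vegan(text):
    """Return 'positive', 'negative', or 'fuzzy' for a dish description."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_positive = bool(words & POSITIVE_SIGNALS)
    has_negative = bool(words & NEGATIVE_SIGNALS)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    # Either no signal word at all, or signals that contradict each other.
    return "fuzzy"
```

A dish such as "Vegan bacon burger" carries both kinds of signal, so this sketch reports it as fuzzy rather than guessing.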
The Intuition
FoodNet
Based loosely on WordNet - Open Source Princeton project


Lexical Knowledge Graph or word relations (vs a list)

e.g. Obviously “MILK” is a signal for “Contains Lactose”


But so are all of its other permutations:


			- Synonyms
			- Hyper- & Hypo-nyms
			- Other languages
			- All the foods in the world that commonly use MILK as an ingredient
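
One way to picture a graph of word relations (versus a flat list) is a breadth-first walk that collects everything reachable from a root signal. The graph below is a toy: every edge is illustrative, the names `FOOD_GRAPH` and `expand_signal` are assumptions for this sketch, and none of it is real FoodNet data.

```python
from collections import deque

# Toy relation graph; all entries are illustrative, not FoodNet content.
FOOD_GRAPH = {
    "milk": ["cream", "butter", "leche", "lait"],  # derived foods and translations
    "cream": ["creme brulee", "alfredo sauce"],    # dishes that commonly use cream
    "butter": ["croissant"],
}

def expand_signal(root, graph):
    """Breadth-first walk returning every term reachable from root."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for neighbor in graph.get(term, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

lactose_signals = expand_signal("milk", FOOD_GRAPH)
```

With a graph, adding one edge (say, a new dish under "cream") extends every signal that already reaches "cream", which a flat synonym list cannot do.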
First Attempt
First Version
Read from Menu DB - 50MM Venue, Dish Title & Description


Read from Synonym DB - Slam it into a big RegEx


For Each record - Any matches?


Save Results
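
A rough reconstruction of that first pass, assuming in-memory lists in place of the two databases. The sample records, the synonym list, and the name `find_matches` are invented for illustration.

```python
import re

# Hypothetical stand-ins for the Synonym DB and the Menu DB.
synonyms = ["milk", "cream", "butter", "cheese"]
menu_records = [
    (1, "Mac & Cheese", "Elbow pasta, cheddar cheese, cream"),
    (2, "Thai Curry", "Coconut milk, basil, tofu"),
    (3, "Garden Salad", "Lettuce, tomato"),
]

# "Slam it into a big RegEx": one alternation over every synonym.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(s) for s in synonyms) + r")\b",
    re.IGNORECASE,
)

def find_matches(records):
    """Map each dish id to the sorted, de-duplicated synonyms found in its text."""
    results = {}
    for dish_id, title, description in records:
        hits = pattern.findall(f"{title} {description}")
        results[dish_id] = sorted({h.lower() for h in hits})
    return results

results = find_matches(menu_records)
```

Record 2 shows the kind of edge case this approach invites: "Coconut milk" matches the lactose synonym "milk" even though coconut milk contains no lactose.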
Results?
Mediocre

- Took forever to run
- Unexpected results (think: RegEx)
- Tons of edge cases
Algorithms and NLP
Stepping Back
How do we find better tools for the job?

How do we measure any improvements we make?

Is there a more “Algorithmic” approach?

Such as Machine Learning in general, or NLP specifically?
Not #NLP
*Not* Neuro-Linguistic Programming
What is NLP?
Natural Language Processing


Attempt to formalize the ways in which humans understand
language, into a computer program.


Slippery - We’re not accustomed to thinking about how we
understand each other, we just do it.
Widely Applicable
Sentiment Analysis - What’s the overall mood here?


Text Classification - What is this document I’m reading?


Knowledge Mapping - Which things relate to which?


Info Extraction - What are the major topics discussed?
What’sGood Use Cases
1. Similarity
Are these the same?

  A Frame, 12565 W Washington Blvd, Culver City, CA, 90066



  A-Frame, 12565 Washington Blvd, Los Angeles, CA, 90066
This Problem
         Creme Brulee
              vs
         Crème brûlée
              vs
       Cr�e Bru001lee
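
The first two spellings can be reconciled with Unicode normalization from the Python standard library. This is a sketch with an assumed helper name (`normalize_name`); it does not repair the third, mis-encoded variant, which needs the original bytes or a dedicated fix-up step.

```python
import unicodedata

def normalize_name(name):
    """Fold case, strip accents, and collapse whitespace for comparison."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())

same = normalize_name("Crème brûlée") == normalize_name("Creme Brulee")
```

Comparing normalized keys rather than raw strings keeps the original spelling available for display.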
This Problem
Orthogonality
Rhombus - The What’sGood Decoder Ring


Library that attempts to resolve “Matching Problems”	


For Example: Public Calorie Database - Can I even use it?
TextGrounder
Disambiguate:
- Georgia vs Georgia


Context:
- Melrose Heights vs West Hollywood vs Los Angeles
2. Sentiment
“Bag of Words”
Word-count features, often fed to a Naive Bayes Classifier
	
	Tokenize


	 Remove Stop Words


	 Stem the remaining words


	 Frequency Distribution - How many times did each word occur?
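
The four steps can be sketched with the standard library alone. The stop-word list is deliberately tiny and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as NLTK's Porter stemmer; both are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "to", "is", "in", "it"}

def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    tokens = re.findall(r"[a-z']+", text.lower())       # 1. tokenize
    kept = [t for t in tokens if t not in STOP_WORDS]   # 2. remove stop words
    stems = [naive_stem(t) for t in kept]               # 3. stem
    return Counter(stems)                               # 4. frequency distribution

counts = bag_of_words("Grilled chicken with grilled onions and the roasted peppers")
```

Word order is thrown away at the last step, which is what gives the model its name and also its blind spots.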
Edge Cases
Yelp Review - Comme Ca


“You’d expect a place with such a diverse selection of French food,
wonderfully accommodating staff, and a world class chef to live up to its
amazing reputation, but it just simply did not.”
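
To see why this review trips up word counting, here is a deliberately naive scorer. The two lexicons and the name `naive_sentiment` are invented for the example and are not a real sentiment resource.

```python
import re

# Tiny illustrative lexicons, chosen only to make the point.
POSITIVE = {"diverse", "wonderfully", "accommodating", "amazing"}
NEGATIVE = {"bad", "terrible", "awful", "bland"}

def naive_sentiment(text):
    """Positive word count minus negative word count; ignores word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

review = (
    "You'd expect a place with such a diverse selection of French food, "
    "wonderfully accommodating staff, and a world class chef to live up to "
    "its amazing reputation, but it just simply did not."
)
score = naive_sentiment(review)
```

The score comes out positive because every opinion word is complimentary; the reversal lives entirely in "but ... did not", which a bag of words cannot see.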
Other Great Tricks
Part of Speech Tagging


N-Grams


Levenshtein Distance


RevMiner
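
Two of these tricks fit in a few lines each. The sketch below shows a textbook dynamic-programming Levenshtein distance and a simple n-gram helper; the function names are chosen for this example, and part-of-speech tagging and RevMiner are not covered here.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For instance, `levenshtein("A Frame", "A-Frame")` is 1, so a small distance threshold would treat the two venue names as candidates for the same place.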
Humans!!
National Weather Service
	 Tries to quantify the effect of humans:
		 - Precipitation forecasts - 25% lift
		 - Temperature forecasts - 10% lift



Traders
	 Need human judgement when a model is failing.
3. Relevancy
Popularity Algorithm
“Social Triangulation”
(A * (# star ratings)
 +
 B * (# of dish mentions/total reviews at restaurant)
 +
 C * (# of photos/avg mentions per restaurant in specific
 geography)
) * Arbitrary population weight
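
Written as a function, the formula looks roughly like this. The parameter names, the default weights of 1.0, and the zero-denominator guards are assumptions added for the sketch; the slides do not give values for A, B, C or the population weight.

```python
def popularity_score(star_ratings, dish_mentions, total_reviews,
                     photos, avg_mentions_in_area,
                     a=1.0, b=1.0, c=1.0, population_weight=1.0):
    """Weighted sum of three signals, scaled by a population weight."""
    # Share of the restaurant's reviews that mention this dish.
    mention_rate = dish_mentions / total_reviews if total_reviews else 0.0
    # Photo count relative to typical mention volume in the same geography.
    photo_rate = photos / avg_mentions_in_area if avg_mentions_in_area else 0.0
    return (a * star_ratings + b * mention_rate + c * photo_rate) * population_weight
```

Note that the first term is a raw count while the other two are ratios, so A would need to be small relative to B and C for the ratios to matter.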
Search Weights
Which signals are more important:


 Number of times your search query matched something?

 						

 Your previous searches & behaviors?



 Does Proximity to you outweigh other factors?



 Does Popularity?
Infrastructure
Stack
Running on Windows


Web/REST Tier in the Cloud


Dedicated RDBMS on Solid State Drives


C# & SQL Server		


Python & NLTK


Solr / Lucene
Results
“Vegas8 - RegEx 1”
Raw RegEx, RackSpace Cloud, Shared CPU


 5 classifiers

 ~1 record/sec

 50MM Records = 50MM Seconds

 ~14,000 Hours


 ~578 Days
Results
Results                                      Records/Sec  Total Sec    Total Hours  Total Days
RegEx 1                                      1            50,000,000   13,888.89    578.70
Tokenize 1                                   25           2,000,000    555.56       23.15
Tokenize 2 (SSD & dedicated CPU)             110          454,545      126.26       5.26
Tokenize 3 with 50MM caching                 16           3,125,000    868.06       36.17
Tokenize 4 with 10K caching                  225          222,222      61.73        2.57
Token/Stem/Stop                              230          217,391      60.39        2.52
Token/Stem/Stop w/ 4 parallel processes      874          57,208       15.89        0.66
Levenshtein/Weights/Biz Rules/Hadoop         ??           ??           ??           ??
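
The totals in the table follow directly from throughput, reading the second column as records processed per second over the 50MM-record corpus. A short check, with `run_time` as a name chosen for this example:

```python
TOTAL_RECORDS = 50_000_000  # the 50MM menu items from the slides

def run_time(records_per_sec, total_records=TOTAL_RECORDS):
    """Return (total seconds, total hours, total days) for one full pass."""
    seconds = total_records / records_per_sec
    hours = seconds / 3600
    days = hours / 24
    return round(seconds), round(hours, 2), round(days, 2)

regex_run = run_time(1)       # the "RegEx 1" row
parallel_run = run_time(874)  # the 4-process row
```

Going from 1 to 874 records per second is what turns a pass of roughly 578 days into one that finishes in under a day.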
Improvements?
Serialization - eats ~40-60% of overhead. How do we
remove it?


Dedicated Hardware - SSDs & Dedicated CPU


Parallelization - Hadoop? RightScale? Custom Solution?


Indexing - SQL “dumb” storage. Solr for search.
Experts
Panel of Resident Nutritionists

Formalizing things like:

 “What is Hangover Food”

 “How to get Huge fast”

 “How to be a really annoying Yogi”
Final Thoughts
Trading Parallels
Dynamic vs Static Systems



Knowledge/Signal Graph

		 If you’re monitoring “Apple” you’ll need to monitor:

		 - Apple, $AAPL, Tim Cook, iPhone, Foxconn

		 - And assign a signal weight and signal vector for each



Orthogonality

		 Using loosely correlative systems
Data Science
Burgeoning skill set:

Data
 Programmer
 Sys admin
 Full stack knowledge
Stats
 Probability
 Algorithms
 Empirical methodology
Business
 “Real world” knowledge
 Subjectivity
 Modeling uncertainty
Resources
Data Science Toolkit


NLTK


Nate Silver - The Signal and the Noise
Tim Shea
 @sheanineseven
tim@whatsgood.com

BigData and Algorithms - LA Algorithmic Trading
