SlideShare ist ein Scribd-Unternehmen logo
1 von 40
Downloaden Sie, um offline zu lesen
Helping travelers make better hotel choices
500 million times a month*
Steffen Wenz, CTO TrustYou
For every hotel on the
planet, provide a summary
of traveler reviews.
What does TrustYou do?
✓ Excellent hotel!
✓ Excellent hotel!
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe »
✓ Excellent hotel!*
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe »
✓ Great for partying
“Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag,
KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
DBCrawling
Semantic
Analysis
TrustYou
Analytics
API
Kayak...
TrustYou Architecture
200 million
reqs/month
Crawling
/find?
q=Berlin
/find?
q=Munich
/meetup/
BerlinPyData
/meetup/
BerlinCyclists
/find?
q=Munich&pa
ge=2
/meetup/
BerlinPolitics
/meetup/
BerlinCyclists
/find?
q=Munich&pa
ge=3
Seed URLs
Frontier
Basic crawling setup
/find?
q=Berlin
/find?
q=Munich
/meetup/
BerlinPyData
/meetup/
BerlinCyclists
/find?
q=Munich&pa
ge=2
/meetup/
BerlinPolitics
/meetup/
BerlinCyclists
/find?
q=Munich&pa
ge=3
/find?
q=Munich&pa
ge=99999999
...
… if only it were so easy
facebok.
com/meetup
Seed URLs
Frontier
Scrapy
● Build your own web crawlers
○ Extract data via CSS selectors, XPath, regexes …
○ Handles queuing, request parallelism, cookies,
throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/
Frontier
Seed URLs
Intro to Scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class MySpider(CrawlSpider):
name = "my_spider"
# start with this URL
start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]
# follow these URLs, and call self.parse_meetup to extract data from them
rules = [
Rule(LinkExtractor(allow=[
"^http://www.meetup.com/[^/]+/$",
]), callback="parse_meetup"),
]
def parse_meetup(self, response):
# Extract data about meetup from HTML
m = MeetupItem()
yield m
Try it out!
$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy -
Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin",
"members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight
Catch Up", "members": "1"}
...
Full code on GitHub, dump of all Berlin meetups
(Note: Meetup also has an API …)
Number of registered meetups
Crawling at TrustYou scale
● 2 - 3 million new reviews/week
● Customers want alerts 8 - 24h
after review publication!
● Smart crawl frequency & depth,
but still high overhead
● Pools of constantly refreshed
EC2 proxy IPs
● Direct API connections with
many sites
Crawling at TrustYou scale
● Custom framework very similar to scrapy
● Runs on Hadoop cluster (100 nodes)
● … Though problem not 100% suitable for MapReduce
○ Nodes mostly waiting
○ Coordination/messaging between nodes required:
■ Distributed queue
■ Rate limiting
Textual Data
Treating textual data
raw text sentence
splitting
stopword
filtering
stemming
tokenization
Tokenization
>>> import nltk
>>> raw = "We are always looking for interesting talks, locations to
host meetups and enthusiastic volunteers. Please get in touch using
info@pydata.berlin."
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups
and enthusiastic volunteers.', 'Please get in touch using info@pydata.
berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',',
'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic',
'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@',
'pydata.berlin', '.']
“great rooms”
“great hotel”
“rooms are terrible”
“hotel is terrible”
JJ NN
JJ NN
NN VB JJ
NN VB JJ
Grammars and Parsing
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
>>> print tree
(OPINION (ADJ great) (NOUN rooms))
Grammars and Parsing
WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('coded', wn.VERB)
'code'
>>> wn.synsets("python")
[Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.
03')]
>>> wn.synset('python.n.01').hypernyms()
[Synset('boa.n.02')]
>>> # meh :/
● “Nice room”
● “Room wasn‘t so great”
● “The air-conditioning
was so powerful that we
were cold in the room
even when it was off.”
● “อาหารรสชาติดี”
● “‫ﺟﯾدة‬ ‫ﺧدﻣﺔ‬ ”
● 20 languages
● Linguistic system
(morphology, taggers,
grammars, parsers …)
● Hadoop: Scale out CPU
○ ~1B opinions in DB
● Python for ML & NLP
libraries
Semantic Analysis at TrustYou
Word2Vec
● Map words to vectors
● “Step up” from bag-of-
words model
● ‘Cats’ and ‘dogs’ should
be similar - because they
occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
-0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
# ...
-1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
-0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
-0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
dtype=float32)
Fun with Word2Vec
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django',
0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland',
0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
ML @ TrustYou
● gensim doc2vec model
to create hotel
embedding
● Used - together with
other features - for
various classifiers
Workflow Management
& Scaling Up
● Build complex pipelines of
batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive
Luigi
class MyTask(luigi.Task):
def requires(self):
return DependentTask()
def output(self):
return luigi.LocalTarget("data/my_task_output"))
def run(self):
with self.output().open("w") as out:
out.write("foo")
Luigi tasks vs. Makefiles
data/my_task_output: DependentTask
run
run
run ...
class CrawlTask(luigi.Task):
city = luigi.Parameter()
def output(self):
output_path = os.path.join("data", "{}.jsonl".format(self.city))
return luigi.LocalTarget(output_path)
def run(self):
tmp_output_path = self.output().path + "_tmp"
subprocess.check_output(["scrapy", "crawl", "city", "-a", "city={}".
format(self.city), "-o", tmp_output_path, "-t", "jsonlines"])
os.rename(tmp_output_path, self.output().path)
Example: Wrap crawl in Luigi task
Luigi dependency graphs
Hadoop!
● MapReduce: Programming model for distributed
computation problems
● Express your algorithm as sequences of operations:
a. Map: Do a linear pass over your data, emit (k, v)
b. (Distributed sort)
c. Reduce: Linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop streaming, MRJob, Luigi
(Just go learn PySpark instead)
Luigi Hadoop integration
class HadoopTask(luigi.hadoop.JobTask):
def output(self):
return luigi.HdfsTarget("output_in_hdfs")
def requires(self):
return {
"some_task": SomeTask(),
"some_other_task": SomeOtherTask()
}
def mapper(self, line):
key, value = line.rstrip().split("t")
yield key, value
def reducer(self, key, values):
yield key, ", ".join(values)
Luigi Hadoop integration
class HadoopTask(luigi.hadoop.JobTask):
def output(self):
return luigi.HdfsTarget("output_in_hdfs")
def requires(self):
return {
"some_task": SomeTask(),
"some_other_task": SomeOtherTask()
}
def mapper(self, line):
key, value = line.rstrip().split("t")
yield key, value
def reducer(self, key, values):
yield key, ", ".join(values)
1. Your input data is sitting in
distributed file system (HDFS)
2. Luigi creates a .tar.gz, Hadoop
moves your code on machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS
● Batch, never real time
● Slow even for batch
(lots of disk IO)
● Limited expressiveness
(remedies/crutches:
MRJob, Pig, Hive)
● Spark: More complete
Python support
Beyond MapReduce
Workflows at TrustYou
Workflows at TrustYou
We’re hiring! steffen@trustyou.com

Weitere ähnliche Inhalte

Was ist angesagt?

When RegEx is not enough
When RegEx is not enoughWhen RegEx is not enough
When RegEx is not enoughNati Cohen
 
Naughty And Nice Bash Features
Naughty And Nice Bash FeaturesNaughty And Nice Bash Features
Naughty And Nice Bash FeaturesNati Cohen
 
Concurrent applications with free monads and stm
Concurrent applications with free monads and stmConcurrent applications with free monads and stm
Concurrent applications with free monads and stmAlexander Granin
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboyKenneth Geisshirt
 
ClojureScript loves React, DomCode May 26 2015
ClojureScript loves React, DomCode May 26 2015ClojureScript loves React, DomCode May 26 2015
ClojureScript loves React, DomCode May 26 2015Michiel Borkent
 
ClojureScript for the web
ClojureScript for the webClojureScript for the web
ClojureScript for the webMichiel Borkent
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
 
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppet
 
Why my Go program is slow?
Why my Go program is slow?Why my Go program is slow?
Why my Go program is slow?Inada Naoki
 
Full-Stack JavaScript with Node.js
Full-Stack JavaScript with Node.jsFull-Stack JavaScript with Node.js
Full-Stack JavaScript with Node.jsMichael Lehmann
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRick Copeland
 
Продвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev ToolsПродвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev ToolsFDConf
 
Assignment no39
Assignment no39Assignment no39
Assignment no39Jay Patel
 
Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! aleks-f
 
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013Gregg Donovan
 

Was ist angesagt? (20)

When RegEx is not enough
When RegEx is not enoughWhen RegEx is not enough
When RegEx is not enough
 
Naughty And Nice Bash Features
Naughty And Nice Bash FeaturesNaughty And Nice Bash Features
Naughty And Nice Bash Features
 
Concurrent applications with free monads and stm
Concurrent applications with free monads and stmConcurrent applications with free monads and stm
Concurrent applications with free monads and stm
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboy
 
ClojureScript loves React, DomCode May 26 2015
ClojureScript loves React, DomCode May 26 2015ClojureScript loves React, DomCode May 26 2015
ClojureScript loves React, DomCode May 26 2015
 
ClojureScript for the web
ClojureScript for the webClojureScript for the web
ClojureScript for the web
 
Python Objects
Python ObjectsPython Objects
Python Objects
 
Python GC
Python GCPython GC
Python GC
 
Full Stack Clojure
Full Stack ClojureFull Stack Clojure
Full Stack Clojure
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
 
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
 
Why my Go program is slow?
Why my Go program is slow?Why my Go program is slow?
Why my Go program is slow?
 
Full-Stack JavaScript with Node.js
Full-Stack JavaScript with Node.jsFull-Stack JavaScript with Node.js
Full-Stack JavaScript with Node.js
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
 
Продвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev ToolsПродвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev Tools
 
Assignment no39
Assignment no39Assignment no39
Assignment no39
 
Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! 
 
Advanced JavaScript
Advanced JavaScriptAdvanced JavaScript
Advanced JavaScript
 
node ffi
node ffinode ffi
node ffi
 
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
 

Andere mochten auch

Managing Online Reputation: ATM Dubai 2012
Managing Online Reputation: ATM Dubai 2012Managing Online Reputation: ATM Dubai 2012
Managing Online Reputation: ATM Dubai 2012TrustYou
 
The TrustYou Culture Book
The TrustYou Culture BookThe TrustYou Culture Book
The TrustYou Culture BookTrustYou
 
DevTalks Cluj - Predictions for Machine Learning in 2020
DevTalks Cluj - Predictions for Machine Learning in 2020DevTalks Cluj - Predictions for Machine Learning in 2020
DevTalks Cluj - Predictions for Machine Learning in 2020Steffen Wenz
 
Unmanaged Tags - Data Protection in the Age of Mindless Proliferation
Unmanaged Tags - Data Protection in the Age of Mindless Proliferation Unmanaged Tags - Data Protection in the Age of Mindless Proliferation
Unmanaged Tags - Data Protection in the Age of Mindless Proliferation Eike Pierstorff
 
Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...
Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...
Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...Die Socialisten
 
Get ready for SharePoint 2016
Get ready for SharePoint 2016Get ready for SharePoint 2016
Get ready for SharePoint 2016Next Iteration
 
Designing For Smarties
Designing For SmartiesDesigning For Smarties
Designing For SmartiesReadWrite
 
Securing the Cloud
Securing the CloudSecuring the Cloud
Securing the CloudGGV Capital
 
10 Event Technology Trends to Watch in 2016
10 Event Technology Trends to Watch in 201610 Event Technology Trends to Watch in 2016
10 Event Technology Trends to Watch in 2016Eventbrite UK
 
DESIGN THE PRIORITY, PERFORMANCE 
AND UX
DESIGN THE PRIORITY, PERFORMANCE 
AND UXDESIGN THE PRIORITY, PERFORMANCE 
AND UX
DESIGN THE PRIORITY, PERFORMANCE 
AND UXPeter Rozek
 
Forgotten women in tech history.
Forgotten women in tech history.Forgotten women in tech history.
Forgotten women in tech history.Domo
 
Publishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the CloudPublishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the CloudDeanta
 
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...ryanorban
 
Pollen VC Building A Digital Lending Business
Pollen VC Building A Digital Lending BusinessPollen VC Building A Digital Lending Business
Pollen VC Building A Digital Lending BusinessPollen VC
 
Developing an Intranet Strategy
Developing an Intranet StrategyDeveloping an Intranet Strategy
Developing an Intranet StrategyDNN
 

Andere mochten auch (18)

Managing Online Reputation: ATM Dubai 2012
Managing Online Reputation: ATM Dubai 2012Managing Online Reputation: ATM Dubai 2012
Managing Online Reputation: ATM Dubai 2012
 
The TrustYou Culture Book
The TrustYou Culture BookThe TrustYou Culture Book
The TrustYou Culture Book
 
DevTalks Cluj - Predictions for Machine Learning in 2020
DevTalks Cluj - Predictions for Machine Learning in 2020DevTalks Cluj - Predictions for Machine Learning in 2020
DevTalks Cluj - Predictions for Machine Learning in 2020
 
Unmanaged Tags - Data Protection in the Age of Mindless Proliferation
Unmanaged Tags - Data Protection in the Age of Mindless Proliferation Unmanaged Tags - Data Protection in the Age of Mindless Proliferation
Unmanaged Tags - Data Protection in the Age of Mindless Proliferation
 
Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...
Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...
Entwicklung von Social-Web-Applikationen auf Facebook und anderen Plattformen...
 
Get ready for SharePoint 2016
Get ready for SharePoint 2016Get ready for SharePoint 2016
Get ready for SharePoint 2016
 
Session 1 | «Von der Strategie zum Image.» | Hello Apple
Session 1 | «Von der Strategie zum Image.» | Hello AppleSession 1 | «Von der Strategie zum Image.» | Hello Apple
Session 1 | «Von der Strategie zum Image.» | Hello Apple
 
Designing For Smarties
Designing For SmartiesDesigning For Smarties
Designing For Smarties
 
Social digital tech trends 2016
Social digital tech trends 2016 Social digital tech trends 2016
Social digital tech trends 2016
 
Securing the Cloud
Securing the CloudSecuring the Cloud
Securing the Cloud
 
NYU Talk
NYU TalkNYU Talk
NYU Talk
 
10 Event Technology Trends to Watch in 2016
10 Event Technology Trends to Watch in 201610 Event Technology Trends to Watch in 2016
10 Event Technology Trends to Watch in 2016
 
DESIGN THE PRIORITY, PERFORMANCE 
AND UX
DESIGN THE PRIORITY, PERFORMANCE 
AND UXDESIGN THE PRIORITY, PERFORMANCE 
AND UX
DESIGN THE PRIORITY, PERFORMANCE 
AND UX
 
Forgotten women in tech history.
Forgotten women in tech history.Forgotten women in tech history.
Forgotten women in tech history.
 
Publishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the CloudPublishing Production: From the Desktop to the Cloud
Publishing Production: From the Desktop to the Cloud
 
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
 
Pollen VC Building A Digital Lending Business
Pollen VC Building A Digital Lending BusinessPollen VC Building A Digital Lending Business
Pollen VC Building A Digital Lending Business
 
Developing an Intranet Strategy
Developing an Intranet StrategyDeveloping an Intranet Strategy
Developing an Intranet Strategy
 

Ähnlich wie PyData Berlin Meetup

Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
MongoDB at ZPUGDC
MongoDB at ZPUGDCMongoDB at ZPUGDC
MongoDB at ZPUGDCMike Dirolf
 
.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#Bertrand Le Roy
 
Why Rust? - Matthias Endler - Codemotion Amsterdam 2016
Why Rust? - Matthias Endler - Codemotion Amsterdam 2016Why Rust? - Matthias Endler - Codemotion Amsterdam 2016
Why Rust? - Matthias Endler - Codemotion Amsterdam 2016Codemotion
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for CassandraEdward Capriolo
 
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"DataStax Academy
 
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesAyudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesBig Data Colombia
 
Web Performance Workshop - Velocity London 2013
Web Performance Workshop - Velocity London 2013Web Performance Workshop - Velocity London 2013
Web Performance Workshop - Velocity London 2013Andy Davies
 
Immutable Deployments with AWS CloudFormation and AWS Lambda
Immutable Deployments with AWS CloudFormation and AWS LambdaImmutable Deployments with AWS CloudFormation and AWS Lambda
Immutable Deployments with AWS CloudFormation and AWS LambdaAOE
 
Mongodb intro
Mongodb introMongodb intro
Mongodb introchristkv
 
Rust: Systems Programming for Everyone
Rust: Systems Programming for EveryoneRust: Systems Programming for Everyone
Rust: Systems Programming for EveryoneC4Media
 
Node.js - async for the rest of us.
Node.js - async for the rest of us.Node.js - async for the rest of us.
Node.js - async for the rest of us.Mike Brevoort
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.GeeksLab Odessa
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBantoinegirbal
 
2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introductionantoinegirbal
 
2019 hashiconf consul-templaterb
2019 hashiconf consul-templaterb2019 hashiconf consul-templaterb
2019 hashiconf consul-templaterbPierre Souchay
 

Ähnlich wie PyData Berlin Meetup (20)

Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
MongoDB at ZPUGDC
MongoDB at ZPUGDCMongoDB at ZPUGDC
MongoDB at ZPUGDC
 
.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#.NET Foundation, Future of .NET and C#
.NET Foundation, Future of .NET and C#
 
Why Rust? - Matthias Endler - Codemotion Amsterdam 2016
Why Rust? - Matthias Endler - Codemotion Amsterdam 2016Why Rust? - Matthias Endler - Codemotion Amsterdam 2016
Why Rust? - Matthias Endler - Codemotion Amsterdam 2016
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for Cassandra
 
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
 
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesAyudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
 
Web Performance Workshop - Velocity London 2013
Web Performance Workshop - Velocity London 2013Web Performance Workshop - Velocity London 2013
Web Performance Workshop - Velocity London 2013
 
Immutable Deployments with AWS CloudFormation and AWS Lambda
Immutable Deployments with AWS CloudFormation and AWS LambdaImmutable Deployments with AWS CloudFormation and AWS Lambda
Immutable Deployments with AWS CloudFormation and AWS Lambda
 
Mongodb intro
Mongodb introMongodb intro
Mongodb intro
 
Rust: Systems Programming for Everyone
Rust: Systems Programming for EveryoneRust: Systems Programming for Everyone
Rust: Systems Programming for Everyone
 
Node.js - async for the rest of us.
Node.js - async for the rest of us.Node.js - async for the rest of us.
Node.js - async for the rest of us.
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
 
Node azure
Node azureNode azure
Node azure
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Modern C++
Modern C++Modern C++
Modern C++
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction
 
2019 hashiconf consul-templaterb
2019 hashiconf consul-templaterb2019 hashiconf consul-templaterb
2019 hashiconf consul-templaterb
 

Kürzlich hochgeladen

DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024APNIC
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebJames Anderson
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Onlineanilsa9823
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxellan12
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersDamian Radcliffe
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...tanu pandey
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Delhi Call girls
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Sheetaleventcompany
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445ruhi
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...tanu pandey
 

Kürzlich hochgeladen (20)

DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 

PyData Berlin Meetup

  • 1. Helping travelers make better hotel choices 500 million times a month* Steffen Wenz, CTO TrustYou
  • 2. For every hotel on the planet, provide a summary of traveler reviews. What does TrustYou do?
  • 4. ✓ Excellent hotel! ✓ Nice building “Clean, hip & modern, excellent facilities” ✓ Great view « Vue superbe »
  • 5. ✓ Excellent hotel!* ✓ Nice building “Clean, hip & modern, excellent facilities” ✓ Great view « Vue superbe » ✓ Great for partying “Nice weekend getaway or for partying” ✗ Solo travelers complain about TVs ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt. *) nhow Berlin (Full summary)
  • 6.
  • 7.
  • 8.
  • 13. Scrapy ● Build your own web crawlers ○ Extract data via CSS selectors, XPath, regexes … ○ Handles queuing, request parallelism, cookies, throttling … ● Comprehensive and well-designed ● Commercial support by http://scrapinghub.com/
  • 14. Frontier Seed URLs Intro to Scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule class MySpider(CrawlSpider): name = "my_spider" # start with this URL start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"] # follow these URLs, and call self.parse_meetup to extract data from them rules = [ Rule(LinkExtractor(allow=[ "^http://www.meetup.com/[^/]+/$", ]), callback="parse_meetup"), ] def parse_meetup(self, response): # Extract data about meetup from HTML m = MeetupItem() yield m
  • 15. Try it out! $ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null {"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"} {"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"} {"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"} {"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"} {"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"} {"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"} ... Full code on GitHub, dump of all Berlin meetups (Note: Meetup also has an API …)
  • 17. Crawling at TrustYou scale ● 2 - 3 million new reviews/week ● Customers want alerts 8 - 24h after review publication! ● Smart crawl frequency & depth, but still high overhead ● Pools of constantly refreshed EC2 proxy IPs ● Direct API connections with many sites
  • 18. Crawling at TrustYou scale ● Custom framework very similar to scrapy ● Runs on Hadoop cluster (100 nodes) ● … Though problem not 100% suitable for MapReduce ○ Nodes mostly waiting ○ Coordination/messaging between nodes required: ■ Distributed queue ■ Rate limiting
  • 20. Treating textual data raw text sentence splitting stopword filtering stemming tokenization
  • 21. Tokenization >>> import nltk >>> raw = "We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers. Please get in touch using info@pydata.berlin." >>> nltk.sent_tokenize(raw) ['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata. berlin.'] >>> nltk.word_tokenize(raw) ['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
  • 22. “great rooms” “great hotel” “rooms are terrible” “hotel is terrible” JJ NN JJ NN NN VB JJ NN VB JJ Grammars and Parsing >>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible")) [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
  • 23. >>> grammar = nltk.CFG.fromstring(""" ... OPINION -> NN COP JJ ... OPINION -> JJ NN ... NN -> 'hotel' | 'rooms' ... COP -> 'is' | 'are' ... JJ -> 'great' | 'terrible' ... """) >>> parser = nltk.ChartParser(grammar) >>> sent = nltk.word_tokenize("great rooms") >>> for tree in parser.parse(sent): >>> print tree (OPINION (ADJ great) (NOUN rooms)) Grammars and Parsing
  • 24. WordNet >>> from nltk.corpus import wordnet as wn >>> wn.morphy('coded', wn.VERB) 'code' >>> wn.synsets("python") [Synset('python.n.01'), Synset('python.n.02'), Synset('python.n. 03')] >>> wn.synset('python.n.01').hypernyms() [Synset('boa.n.02')] >>> # meh :/
  • 25. ● “Nice room” ● “Room wasn‘t so great” ● “The air-conditioning was so powerful that we were cold in the room even when it was off.” ● “อาหารรสชาติดี” ● “‫ﺟﯾدة‬ ‫ﺧدﻣﺔ‬ ” ● 20 languages ● Linguistic system (morphology, taggers, grammars, parsers …) ● Hadoop: Scale out CPU ○ ~1B opinions in DB ● Python for ML & NLP libraries Semantic Analysis at TrustYou
  • 26. Word2Vec ● Map words to vectors ● “Step up” from bag-of- words model ● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts >>> m["python"] array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709, -0.0200, -0.0325, 0.0166, 0.3312, -0.0928, -0.0967, -0.0199, -0.2498, -0.4445, -0.0445, # ... -1.0090, -0.2553, 0.2686, -0.4121, 0.3116, -0.0639, -0.3688, -0.0273, -0.1266, -0.2606, -0.1549, 0.0023, 0.0084, 0.2169, 0.0060], dtype=float32)
  • 27. Fun with Word2Vec >>> # trained from 100k meetup descriptions! >>> m = gensim.models.Word2Vec.load("data/word2vec") >>> m.most_similar(positive=["python"])[:3] [(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)] >>> m.doesnt_match(["python", "c++", "javascript"]) 'c++' >>> m.most_similar(positive=["berlin"])[:3] [(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)] >>> m.most_similar(positive=["ladies"])[:3] [(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
  • 28. ML @ TrustYou ● gensim doc2vec model to create hotel embedding ● Used - together with other features - for various classifiers
  • 30. ● Build complex pipelines of batch jobs ○ Dependency resolution ○ Parallelism ○ Resume failed jobs ● Some support for Hadoop ● Pythonic replacement for Oozie ● Can be combined with Pig, Hive Luigi
  • 31. class MyTask(luigi.Task): def requires(self): return DependentTask() def output(self): return luigi.LocalTarget("data/my_task_output")) def run(self): with self.output().open("w") as out: out.write("foo") Luigi tasks vs. Makefiles data/my_task_output: DependentTask run run run ...
  • 32. class CrawlTask(luigi.Task): city = luigi.Parameter() def output(self): output_path = os.path.join("data", "{}.jsonl".format(self.city)) return luigi.LocalTarget(output_path) def run(self): tmp_output_path = self.output().path + "_tmp" subprocess.check_output(["scrapy", "crawl", "city", "-a", "city={}". format(self.city), "-o", tmp_output_path, "-t", "jsonlines"]) os.rename(tmp_output_path, self.output().path) Example: Wrap crawl in Luigi task
  • 34. Hadoop! ● MapReduce: Programming model for distributed computation problems ● Express your algorithm as sequences of operations: a. Map: Do a linear pass over your data, emit (k, v) b. (Distributed sort) c. Reduce: Linear pass over all (k, v) for the same k ● Python on Hadoop: Hadoop streaming, MRJob, Luigi (Just go learn PySpark instead)
  • 35. Luigi Hadoop integration class HadoopTask(luigi.hadoop.JobTask): def output(self): return luigi.HdfsTarget("output_in_hdfs") def requires(self): return { "some_task": SomeTask(), "some_other_task": SomeOtherTask() } def mapper(self, line): key, value = line.rstrip().split("t") yield key, value def reducer(self, key, values): yield key, ", ".join(values)
  • 36. Luigi Hadoop integration class HadoopTask(luigi.hadoop.JobTask): def output(self): return luigi.HdfsTarget("output_in_hdfs") def requires(self): return { "some_task": SomeTask(), "some_other_task": SomeOtherTask() } def mapper(self, line): key, value = line.rstrip().split("t") yield key, value def reducer(self, key, values): yield key, ", ".join(values) 1. Your input data is sitting in distributed file system (HDFS) 2. Luigi creates a .tar.gz, Hadoop moves your code on machines 3. mapper() gets run (distributed) 4. Data gets re-sorted by key 5. reducer() gets run (distributed) 6. Output gets saved in HDFS
  • 37. ● Batch, never real time ● Slow even for batch (lots of disk IO) ● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive) ● Spark: More complete Python support Beyond MapReduce