Helping travelers make better hotel choices - 500 million times a month
TrustYou analyzes online hotel reviews to create a summary for every hotel in the world. What do travelers think of the service? Is this hotel suitable for business travelers? TrustYou data is integrated on countless websites (Trivago, Wego, Kayak), helping travelers make better choices. Try it out yourself on http://www.trust-score.com/
TrustYou runs almost exclusively on Python. Every week, we find 3 million new hotel reviews on the web, process them, analyze the text using Natural Language Processing, and update our database of 600,000 hotels. In this talk, Steffen will give insights into how Python is used at TrustYou to collect, analyze and visualize these large amounts of data.
5. ✓ Excellent hotel!*
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe » (“Superb view”)
✓ Great for partying
“Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
13. Scrapy
● Build your own web crawlers
○ Extract data via CSS selectors, XPath, regexes …
○ Handles queuing, request parallelism, cookies, throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/
14. Intro to Scrapy
(Diagram: seed URLs feeding the crawl frontier)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "my_spider"

    # start with this URL
    start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]

    # follow these URLs, and call self.parse_meetup to extract data from them
    rules = [
        Rule(LinkExtractor(allow=[
            "^http://www.meetup.com/[^/]+/$",
        ]), callback="parse_meetup"),
    ]

    def parse_meetup(self, response):
        # Extract data about the meetup from the HTML response
        m = MeetupItem()
        yield m
15. Try it out!
$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
...
Full code on GitHub, dump of all Berlin meetups
(Note: Meetup also has an API …)
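The JSON Lines output above is easy to post-process with the standard library alone; a minimal sketch (the `load_meetups` helper is illustrative, not part of the talk's code):

```python
import json

def load_meetups(path):
    """Parse a scrapy JSON Lines dump into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# one line of the crawl output shown above:
line = '{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}'
meetup = json.loads(line)
print(meetup["name"], int(meetup["members"]))  # Berlin Scrum Meetup 368
```

Note that scrapy serializes every field as a string here, so numeric fields like `members` need an explicit `int()`.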
17. Crawling at TrustYou scale
● 2 - 3 million new reviews/week
● Customers want alerts 8 - 24h after review publication!
● Smart crawl frequency & depth, but still high overhead
● Pools of constantly refreshed EC2 proxy IPs
● Direct API connections with many sites
18. Crawling at TrustYou scale
● Custom framework, very similar to Scrapy
● Runs on Hadoop cluster (100 nodes)
● … though the problem is not 100% suitable for MapReduce
○ Nodes mostly waiting
○ Coordination/messaging between nodes required:
■ Distributed queue
■ Rate limiting
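The rate limiting mentioned above can be sketched as a token bucket; this is an illustrative stdlib-only version, not TrustYou's actual implementation:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)   # e.g. ~2 requests/s per crawled domain
allowed = [bucket.allow() for _ in range(10)]
print(allowed.count(True))  # 5 - only the burst capacity gets through at once
```

In a distributed crawler one such bucket would live behind the coordination layer (the distributed queue), so all nodes share the same per-domain budget.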
21. Tokenization
>>> import nltk
>>> raw = "We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers. Please get in touch using info@pydata.berlin."
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
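For contrast, a naive stdlib-only tokenizer shows what NLTK is handling for you (the regex is a rough sketch, not NLTK's actual rules):

```python
import re

raw = ("We are always looking for interesting talks, locations to host "
       "meetups and enthusiastic volunteers. Please get in touch using "
       "info@pydata.berlin.")

# naive: split on whitespace only - punctuation sticks to the words
naive = raw.split()
print(naive[6])  # 'talks,' - the comma is not separated

# slightly better: runs of word chars (allowing . @ - inside), or single punctuation marks
tokens = re.findall(r"\w+(?:[.@-]\w+)*|[^\w\s]", raw)
print(tokens[:8])  # ['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',']
```

Even this small improvement keeps `info@pydata.berlin` together while splitting off the final period, which whitespace splitting cannot do.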
22. Grammars and Parsing
“great rooms”        → JJ NN
“great hotel”        → JJ NN
“rooms are terrible” → NN VB JJ
“hotel is terrible”  → NN VB JJ
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
23. Grammars and Parsing
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
(OPINION (JJ great) (NN rooms))
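As a sketch of what the chart parser checks, the same toy grammar can be recognized in a few lines of plain Python (illustrative only; not how nltk.ChartParser works internally):

```python
# Terminals of the toy grammar above
NN  = {"hotel", "rooms"}
COP = {"is", "are"}
JJ  = {"great", "terrible"}

def is_opinion(tokens):
    """Recognize OPINION -> JJ NN  |  NN COP JJ."""
    if len(tokens) == 2:
        return tokens[0] in JJ and tokens[1] in NN
    if len(tokens) == 3:
        return tokens[0] in NN and tokens[1] in COP and tokens[2] in JJ
    return False

print(is_opinion("great rooms".split()))        # True
print(is_opinion("rooms are terrible".split())) # True
print(is_opinion("terrible is rooms".split()))  # False
```

A real chart parser additionally builds the parse tree and handles ambiguity and recursion, which is why NLTK is used rather than hand-rolled checks like this.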
25. Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (“The food tastes good”)
● “خدمة جيدة” (“Good service”)
● 20 languages
● Linguistic system (morphology, taggers, grammars, parsers …)
● Hadoop: Scale out CPU
○ ~1B opinions in DB
● Python for ML & NLP libraries
26. Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
-0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
# ...
-1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
-0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
-0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
dtype=float32)
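The "similar contexts, similar vectors" claim is usually measured with cosine similarity; a stdlib-only sketch with made-up 3-dimensional vectors (real word2vec vectors have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# toy vectors - invented for illustration only
vec = {
    "cats":  [0.9, 0.8, 0.1],
    "dogs":  [0.8, 0.9, 0.2],
    "hotel": [0.1, 0.2, 0.9],
}

print(cosine(vec["cats"], vec["dogs"]))   # close to 1: similar contexts
print(cosine(vec["cats"], vec["hotel"]))  # much lower: different contexts
```

Gensim's word2vec model exposes exactly this measure via its `similarity()` method, operating on learned vectors like the `m["python"]` array above.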
30. Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive
31. Luigi tasks vs. Makefiles
class MyTask(luigi.Task):
    def requires(self):
        return DependentTask()
    def output(self):
        return luigi.LocalTarget("data/my_task_output")
    def run(self):
        with self.output().open("w") as out:
            out.write("foo")
The equivalent Makefile rule:
data/my_task_output: DependentTask
    run
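Luigi's make-like behavior (run a task only if its output is missing, after satisfying its requirements) can be sketched without Luigi at all; the `Task` class below is a hypothetical stand-in, not the real `luigi.Task` API:

```python
import os
import tempfile

class Task:
    """Minimal stand-in for a Luigi task: an output path, requirements, and run()."""
    def __init__(self, name, output, requires=()):
        self.name, self.output, self.requires = name, output, list(requires)
        self.ran = False

    def complete(self):
        # like Make, completeness is "does the target file exist?"
        return os.path.exists(self.output)

    def run(self):
        self.ran = True
        with open(self.output, "w") as f:
            f.write(self.name)

def build(task):
    """Depth-first: satisfy requirements first, then run only if output is missing."""
    for dep in task.requires:
        build(dep)
    if not task.complete():
        task.run()

tmp = tempfile.mkdtemp()
dep = Task("dep", os.path.join(tmp, "dep.out"))
top = Task("top", os.path.join(tmp, "top.out"), requires=[dep])
build(top)    # runs dep, then top

rerun = Task("top", os.path.join(tmp, "top.out"), requires=[dep])
build(rerun)  # no-op: the output already exists, so the failed-job resume is free
print(top.ran, rerun.ran)  # True False
```

This "output exists, skip" rule is what makes resuming failed pipelines cheap: finished tasks are never re-run.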
32. Example: Wrap crawl in Luigi task
class CrawlTask(luigi.Task):
    city = luigi.Parameter()

    def output(self):
        output_path = os.path.join("data", "{}.jsonl".format(self.city))
        return luigi.LocalTarget(output_path)

    def run(self):
        tmp_output_path = self.output().path + "_tmp"
        subprocess.check_output(["scrapy", "crawl", "city",
                                 "-a", "city={}".format(self.city),
                                 "-o", tmp_output_path, "-t", "jsonlines"])
        os.rename(tmp_output_path, self.output().path)
34. Hadoop!
● MapReduce: Programming model for distributed computation problems
● Express your algorithm as a sequence of operations:
a. Map: Do a linear pass over your data, emit (k, v)
b. (Distributed sort)
c. Reduce: Linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop streaming, MRJob, Luigi
(Just go learn PySpark instead)
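The map → (distributed sort) → reduce sequence can be demonstrated in-process with the canonical word-count example (a sketch of the programming model, not of Hadoop itself):

```python
from itertools import groupby
from operator import itemgetter

lines = ["great hotel", "rooms are great", "great rooms"]

# a. Map: linear pass over the data, emit (k, v) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# b. (Distributed sort): Hadoop brings all pairs with the same key together
mapped.sort(key=itemgetter(0))

# c. Reduce: one linear pass over all (k, v) for the same k
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'are': 1, 'great': 3, 'hotel': 1, 'rooms': 2}
```

Hadoop's contribution is running steps a and c on many machines in parallel and doing the sort across the cluster; the program structure stays this simple.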
35. Luigi Hadoop integration
class HadoopTask(luigi.hadoop.JobTask):
    def output(self):
        return luigi.HdfsTarget("output_in_hdfs")

    def requires(self):
        return {
            "some_task": SomeTask(),
            "some_other_task": SomeOtherTask()
        }

    def mapper(self, line):
        key, value = line.rstrip().split("\t")
        yield key, value

    def reducer(self, key, values):
        yield key, ", ".join(values)
36. Luigi Hadoop integration
1. Your input data is sitting in a distributed file system (HDFS)
2. Luigi creates a .tar.gz of your code; Hadoop distributes it to the worker machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS
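Steps 3-5 can be simulated locally by piping lines through the mapper, sorting by key, and grouping for the reducer; the mapper/reducer below mirror the shape of the Luigi ones shown earlier (the example data is made up):

```python
from itertools import groupby

def mapper(line):
    # one (key, value) pair per tab-separated input line
    key, value = line.rstrip().split("\t")
    yield key, value

def reducer(key, values):
    # one output record per key, over all its values
    yield key, ", ".join(values)

lines = ["fruit\tapple", "veg\tcarrot", "fruit\tpear"]

# 3. mapper() runs over every input line
pairs = [kv for line in lines for kv in mapper(line)]
# 4. data gets re-sorted by key
pairs.sort(key=lambda kv: kv[0])
# 5. reducer() runs once per key
out = [kv for key, grp in groupby(pairs, key=lambda kv: kv[0])
          for kv in reducer(key, (v for _, v in grp))]
print(out)  # [('fruit', 'apple, pear'), ('veg', 'carrot')]
```

This is essentially what Hadoop streaming does, except the sort and the mapper/reducer invocations are distributed across the cluster.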
37. Beyond MapReduce
● Batch, never real time
● Slow even for batch (lots of disk IO)
● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
● Spark: More complete Python support