Helping travelers make better hotel choices - 500 million times a month
TrustYou analyzes online hotel reviews to create a summary for every hotel in the world. What do travelers think of the service? Is this hotel suitable for business travelers? TrustYou data is integrated on countless websites (Trivago, Wego, Kayak), helping travelers make better choices. Try it out yourself on http://www.trust-score.com/
TrustYou runs almost exclusively on Python. Every week, we find 3 million new hotel reviews on the web, process them, analyze the text using Natural Language Processing, and update our database of 600,000 hotels. In this talk, Steffen will give insights into how Python is used at TrustYou to collect, analyze and visualize these large amounts of data.
5. ✓ Excellent hotel!*
✓ Nice building
“Clean, hip & modern, excellent facilities”
✓ Great view
« Vue superbe » (“Superb view”)
✓ Great for partying
“Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
13. Scrapy
● Build your own web crawlers
○ Extract data via CSS selectors, XPath, regexes …
○ Handles queuing, request parallelism, cookies, throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/
14. Intro to Scrapy
(Diagram: seed URLs feeding the crawl frontier)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "my_spider"

    # start with this URL
    start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]

    # follow these URLs, and call self.parse_meetup to extract data from them
    rules = [
        Rule(LinkExtractor(allow=[
            "^http://www.meetup.com/[^/]+/$",
        ]), callback="parse_meetup"),
    ]

    def parse_meetup(self, response):
        # Extract data about the meetup from the HTML response
        m = MeetupItem()
        yield m
15. Try it out!
$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
...
Full code on GitHub, dump of all Berlin meetups
(Note: Meetup also has an API …)
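The JSON Lines output above is easy to post-process with the standard library alone; a minimal sketch (the `load_meetups` helper is illustrative, not part of the talk's code):

```python
import json

def load_meetups(path):
    """Parse a scrapy JSON Lines dump into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# one line of the crawl output shown above:
line = '{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}'
meetup = json.loads(line)
print(meetup["name"], int(meetup["members"]))  # Berlin Scrum Meetup 368
```

Note that scrapy serializes every field as a string here, so numeric fields like `members` need an explicit `int()`.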
17. Crawling at TrustYou scale
● 2 - 3 million new reviews/week
● Customers want alerts 8 - 24h after review publication!
● Smart crawl frequency & depth, but still high overhead
● Pools of constantly refreshed EC2 proxy IPs
● Direct API connections with many sites
18. Crawling at TrustYou scale
● Custom framework, very similar to Scrapy
● Runs on Hadoop cluster (100 nodes)
● … though the problem is not 100% suitable for MapReduce
○ Nodes mostly waiting
○ Coordination/messaging between nodes required:
■ Distributed queue
■ Rate limiting
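The rate limiting mentioned above can be sketched as a token bucket; this is an illustrative stdlib-only version, not TrustYou's actual implementation:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)   # e.g. ~2 requests/s per crawled domain
allowed = [bucket.allow() for _ in range(10)]
print(allowed.count(True))  # 5 - only the burst capacity gets through at once
```

In a distributed crawler one such bucket would live behind the coordination layer (the distributed queue), so all nodes share the same per-domain budget.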
21. Tokenization
>>> import nltk
>>> raw = "We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers. Please get in touch using info@pydata.berlin."
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
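For contrast, a naive stdlib-only tokenizer shows what NLTK is handling for you (the regex is a rough sketch, not NLTK's actual rules):

```python
import re

raw = ("We are always looking for interesting talks, locations to host "
       "meetups and enthusiastic volunteers. Please get in touch using "
       "info@pydata.berlin.")

# naive: split on whitespace only - punctuation sticks to the words
naive = raw.split()
print(naive[6])  # 'talks,' - the comma is not separated

# slightly better: runs of word chars (allowing . @ - inside), or single punctuation marks
tokens = re.findall(r"\w+(?:[.@-]\w+)*|[^\w\s]", raw)
print(tokens[:8])  # ['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',']
```

Even this small improvement keeps `info@pydata.berlin` together while splitting off the final period, which whitespace splitting cannot do.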
22. Grammars and Parsing
“great rooms”        → JJ NN
“great hotel”        → JJ NN
“rooms are terrible” → NN VB JJ
“hotel is terrible”  → NN VB JJ
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
23. Grammars and Parsing
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
(OPINION (JJ great) (NN rooms))
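As a sketch of what the chart parser checks, the same toy grammar can be recognized in a few lines of plain Python (illustrative only; not how nltk.ChartParser works internally):

```python
# Terminals of the toy grammar above
NN  = {"hotel", "rooms"}
COP = {"is", "are"}
JJ  = {"great", "terrible"}

def is_opinion(tokens):
    """Recognize OPINION -> JJ NN  |  NN COP JJ."""
    if len(tokens) == 2:
        return tokens[0] in JJ and tokens[1] in NN
    if len(tokens) == 3:
        return tokens[0] in NN and tokens[1] in COP and tokens[2] in JJ
    return False

print(is_opinion("great rooms".split()))        # True
print(is_opinion("rooms are terrible".split())) # True
print(is_opinion("terrible is rooms".split()))  # False
```

A real chart parser additionally builds the parse tree and handles ambiguity and recursion, which is why NLTK is used rather than hand-rolled checks like this.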
25. Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (“The food tastes good”)
● “خدمة جيدة” (“Good service”)
● 20 languages
● Linguistic system (morphology, taggers, grammars, parsers …)
● Hadoop: Scale out CPU
○ ~1B opinions in DB
● Python for ML & NLP libraries
26. Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
-0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
# ...
-1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
-0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
-0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
dtype=float32)
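The "similar contexts, similar vectors" claim is usually measured with cosine similarity; a stdlib-only sketch with made-up 3-dimensional vectors (real word2vec vectors have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# toy vectors - invented for illustration only
vec = {
    "cats":  [0.9, 0.8, 0.1],
    "dogs":  [0.8, 0.9, 0.2],
    "hotel": [0.1, 0.2, 0.9],
}

print(cosine(vec["cats"], vec["dogs"]))   # close to 1: similar contexts
print(cosine(vec["cats"], vec["hotel"]))  # much lower: different contexts
```

Gensim's word2vec model exposes exactly this measure via its `similarity()` method, operating on learned vectors like the `m["python"]` array above.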
30. Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive
31. Luigi tasks vs. Makefiles
class MyTask(luigi.Task):
    def requires(self):
        return DependentTask()
    def output(self):
        return luigi.LocalTarget("data/my_task_output")
    def run(self):
        with self.output().open("w") as out:
            out.write("foo")
The equivalent Makefile rule:
data/my_task_output: DependentTask
    run
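Luigi's make-like behavior (run a task only if its output is missing, after satisfying its requirements) can be sketched without Luigi at all; the `Task` class below is a hypothetical stand-in, not the real `luigi.Task` API:

```python
import os
import tempfile

class Task:
    """Minimal stand-in for a Luigi task: an output path, requirements, and run()."""
    def __init__(self, name, output, requires=()):
        self.name, self.output, self.requires = name, output, list(requires)
        self.ran = False

    def complete(self):
        # like Make, completeness is "does the target file exist?"
        return os.path.exists(self.output)

    def run(self):
        self.ran = True
        with open(self.output, "w") as f:
            f.write(self.name)

def build(task):
    """Depth-first: satisfy requirements first, then run only if output is missing."""
    for dep in task.requires:
        build(dep)
    if not task.complete():
        task.run()

tmp = tempfile.mkdtemp()
dep = Task("dep", os.path.join(tmp, "dep.out"))
top = Task("top", os.path.join(tmp, "top.out"), requires=[dep])
build(top)    # runs dep, then top

rerun = Task("top", os.path.join(tmp, "top.out"), requires=[dep])
build(rerun)  # no-op: the output already exists, so the failed-job resume is free
print(top.ran, rerun.ran)  # True False
```

This "output exists, skip" rule is what makes resuming failed pipelines cheap: finished tasks are never re-run.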
32. Example: Wrap crawl in Luigi task
class CrawlTask(luigi.Task):
    city = luigi.Parameter()

    def output(self):
        output_path = os.path.join("data", "{}.jsonl".format(self.city))
        return luigi.LocalTarget(output_path)

    def run(self):
        tmp_output_path = self.output().path + "_tmp"
        subprocess.check_output(["scrapy", "crawl", "city",
                                 "-a", "city={}".format(self.city),
                                 "-o", tmp_output_path, "-t", "jsonlines"])
        os.rename(tmp_output_path, self.output().path)
34. Hadoop!
● MapReduce: Programming model for distributed computation problems
● Express your algorithm as a sequence of operations:
a. Map: Do a linear pass over your data, emit (k, v)
b. (Distributed sort)
c. Reduce: Linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop streaming, MRJob, Luigi
(Just go learn PySpark instead)
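The map → (distributed sort) → reduce sequence can be demonstrated in-process with the canonical word-count example (a sketch of the programming model, not of Hadoop itself):

```python
from itertools import groupby
from operator import itemgetter

lines = ["great hotel", "rooms are great", "great rooms"]

# a. Map: linear pass over the data, emit (k, v) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# b. (Distributed sort): Hadoop brings all pairs with the same key together
mapped.sort(key=itemgetter(0))

# c. Reduce: one linear pass over all (k, v) for the same k
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'are': 1, 'great': 3, 'hotel': 1, 'rooms': 2}
```

Hadoop's contribution is running steps a and c on many machines in parallel and doing the sort across the cluster; the program structure stays this simple.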
35. Luigi Hadoop integration
class HadoopTask(luigi.hadoop.JobTask):
    def output(self):
        return luigi.HdfsTarget("output_in_hdfs")

    def requires(self):
        return {
            "some_task": SomeTask(),
            "some_other_task": SomeOtherTask()
        }

    def mapper(self, line):
        key, value = line.rstrip().split("\t")
        yield key, value

    def reducer(self, key, values):
        yield key, ", ".join(values)
36. Luigi Hadoop integration
1. Your input data is sitting in a distributed file system (HDFS)
2. Luigi creates a .tar.gz of your code; Hadoop distributes it to the worker machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS
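Steps 3-5 can be simulated locally by piping lines through the mapper, sorting by key, and grouping for the reducer; the mapper/reducer below mirror the shape of the Luigi ones shown earlier (the example data is made up):

```python
from itertools import groupby

def mapper(line):
    # one (key, value) pair per tab-separated input line
    key, value = line.rstrip().split("\t")
    yield key, value

def reducer(key, values):
    # one output record per key, over all its values
    yield key, ", ".join(values)

lines = ["fruit\tapple", "veg\tcarrot", "fruit\tpear"]

# 3. mapper() runs over every input line
pairs = [kv for line in lines for kv in mapper(line)]
# 4. data gets re-sorted by key
pairs.sort(key=lambda kv: kv[0])
# 5. reducer() runs once per key
out = [kv for key, grp in groupby(pairs, key=lambda kv: kv[0])
          for kv in reducer(key, (v for _, v in grp))]
print(out)  # [('fruit', 'apple, pear'), ('veg', 'carrot')]
```

This is essentially what Hadoop streaming does, except the sort and the mapper/reducer invocations are distributed across the cluster.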
37. Beyond MapReduce
● Batch, never real time
● Slow even for batch (lots of disk IO)
● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
● Spark: More complete Python support