Data Science at Tumblr
1. Data at Tumblr
Adam Laiacano
NYC Data Science Meetup
@adamlaiacano
adamlaiacano.tumblr.com
Monday, April 8, 13
2. What I Needed to Learn
When I Started My Job
3. About Me
Electrical Engineering background
Worked at CBS to learn more about stats / data
Joined Tumblr in August 2011
40th employee, now over 160
4. About Tumblr
blogging platform / social network
100,000,000 blogs!
unique signals:
asynchronous following graph
reblogs, likes, replies
5. About You
Country  Month  Value
USA      March  10000
USA      April  12000
USA      May    14000
Canada   March  7000
Canada   April  6500
Canada   May    5000
France   March  1200
France   April  1400
France   May    2000

Country  March  April  May
USA      10000  12000  14000
Canada   7000   6500   5000
France   1200   1400   2000
6. About You
Pivot Table!
8. About You
pivoted <- cast(melted, country~month)
melted <- melt.data.frame(pivoted, id.vars='country')
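For readers who do not use R, the same round trip between the long and wide tables can be sketched in plain Python. The helper names `pivot` and `melt` below are illustrative, not a library API, and the rows are the ones from the table on the slide.

```python
# Long format: one (country, month, value) row per observation.
long_rows = [
    ("USA", "March", 10000), ("USA", "April", 12000), ("USA", "May", 14000),
    ("Canada", "March", 7000), ("Canada", "April", 6500), ("Canada", "May", 5000),
    ("France", "March", 1200), ("France", "April", 1400), ("France", "May", 2000),
]


def pivot(rows):
    """Long -> wide: one dict of month -> value per country."""
    wide = {}
    for country, month, value in rows:
        wide.setdefault(country, {})[month] = value
    return wide


def melt(wide):
    """Wide -> long: flatten back to (country, month, value) rows."""
    return [
        (country, month, value)
        for country, months in wide.items()
        for month, value in months.items()
    ]


pivoted = pivot(long_rows)
melted = melt(pivoted)
print(pivoted["Canada"])
```

Melting the pivoted table gives back the original rows, which is the point of the slide: both layouts carry the same information.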
10. About You
Who Cares?
14. What tools we use
What we do with those tools
15. Plumbing
John D. Cook "The plumber programmer"
November 2011 http://bit.ly/XfcXrt
16. Pipes
1. Record events / actions
2. Store / archive everything
3. Extract information
a. Reports / BI
b. Back to Tumblr application
18. Scribe
Web Servers -> Scribe Servers: writing continuously
Scribe Servers -> HDFS: daily cron
20. Step 2: Store in Hadoop
One huge computer:
300TB hard drive
7.8TB of RAM
85 x 2 = 170 hex-core processors
One huge PITA:
awful docs (search-hadoop.com helps)
Java everywhere
fragmented community
21. Hadoop
hive
pig
map/reduce
22. Hive
"Basically SQL"
Compiles to Java map/reduce
About 100 Hive tables
Each "table" is really a directory of flat files

10 most liked posts:
    SELECT
        root_post_id,
        count(*) AS likes
    FROM posts
    WHERE
        action='like'
    GROUP BY root_post_id
    ORDER BY likes DESC
    LIMIT 10;
23. Hive Partitions
File location in HDFS      Hive partition value
/posts/2013/03/26/*.lzo    dt='2013-03-26'
/posts/2013/03/27/*.lzo    dt='2013-03-27'
/posts/2013/03/28/*.lzo    dt='2013-03-28'

Filtering on the partition column: 204 mappers
    SELECT action, COUNT(*) AS views
    FROM pageviews
    WHERE dt = "2012-03-05"
    GROUP BY action

Filtering on the timestamp: 22,895 mappers
    SELECT action, COUNT(*) AS views
    FROM pageviews
    WHERE ts > 1330927200
    AND ts < 1331013600
    GROUP BY action
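The gap in mapper counts comes from partition pruning: a filter on `dt` lets Hive read a single day's directory, while a filter on `ts` has to scan every file. A rough Python sketch of the idea follows. The helpers `partition_path` and `files_to_scan` and the file listing are made up for illustration and are not part of Hive.

```python
from datetime import date


def partition_path(table, day):
    """Map a date to the directory layout shown above: /<table>/YYYY/MM/DD."""
    return "/{}/{:04d}/{:02d}/{:02d}".format(table, day.year, day.month, day.day)


def files_to_scan(all_files, table, day=None):
    """With a partition filter only one directory is read; without it, everything."""
    if day is None:
        return list(all_files)
    prefix = partition_path(table, day) + "/"
    return [f for f in all_files if f.startswith(prefix)]


# Hypothetical file listing, one file per day.
all_files = [
    "/pageviews/2012/03/04/part-0.lzo",
    "/pageviews/2012/03/05/part-0.lzo",
    "/pageviews/2012/03/06/part-0.lzo",
]

pruned = files_to_scan(all_files, "pageviews", date(2012, 3, 5))
full = files_to_scan(all_files, "pageviews")
print(len(pruned), len(full))
```

With years of daily directories the same difference grows from one file versus three to one directory versus all of them.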
24. Extending Hive: Streaming
• Add all .py files you’ll need to the query
• Sends each record to the Python script via stdin
• Can be used as a subquery in a “normal” Hive query

Hive query:
    add file helpers.py;
    FROM users
    SELECT TRANSFORM(id, email)
    USING 'helpers.py'
    AS (id_with_gmail)

Python script:
    #!/usr/bin/python
    ## helpers.py
    import sys, re
    # Match addresses that end in @gmail.com
    gmail = re.compile(r'.+@gmail\.com$')
    for row in sys.stdin:
        # Hive streams one record per line, columns separated by tabs
        id, email = row.rstrip('\n').split('\t')
        if gmail.match(email):
            print(id)
25. Pig
"Basically SQL" if you had to explain it piece by piece.
"DataBag" == "DataFrame"

    posts = LOAD 'posts.tsv' AS (
        root_post_id:int,
        action:chararray
    );
    likes = FILTER posts BY action=='like';
    grouped = GROUP likes BY root_post_id;
    counted = FOREACH grouped GENERATE
        group AS root_post_id,
        COUNT(likes.root_post_id) AS likes;
    sorted = ORDER counted BY likes DESC;
    top10 = LIMIT sorted 10;
    STORE top10 INTO 'top10.csv';
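The "piece by piece" style maps closely onto ordinary Python. The sketch below mirrors each Pig statement on a tiny in-memory list. The sample rows are invented and stand in for posts.tsv.

```python
from collections import Counter

# Hypothetical rows of (root_post_id, action), standing in for posts.tsv.
posts = [
    (1, "like"), (2, "like"), (1, "reblog"), (1, "like"),
    (3, "like"), (2, "like"), (1, "like"),
]

# likes = FILTER posts BY action == 'like'
likes = [row for row in posts if row[1] == "like"]

# grouped = GROUP likes BY root_post_id; counted = COUNT per group
counted = Counter(root_post_id for root_post_id, _ in likes)

# sorted = ORDER counted BY likes DESC; top10 = LIMIT sorted 10
top10 = sorted(counted.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)
```

Each intermediate name holds a complete collection, which is what makes the comparison between a Pig DataBag and a data frame reasonable.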
26. Extending Pig: Python UDFs
Extract word prefixes for type-ahead tag search
    def prefixes(input, max_len=3):
        nchar = min(len(input), max_len) + 1
        return [input[:i] for i in range(1, nchar)]

    >>> prefixes('museum', max_len=6)
    ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
27. Extending Pig: Python UDFs
Extract word prefixes for type-ahead tag search
    @outputSchema("t:(prefix:chararray)")
    def prefixes(input, max_len=3):
        nchar = min(len(input), max_len) + 1
        return [input[:i] for i in range(1, nchar)]

    >>> prefixes('museum', max_len=6)
    ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
28. Extending Pig: Java UDFs
    package com.tumblr.swine;

    import java.util.ArrayList;
    import java.util.List;

    public class Prefixes {
        private int maxTermLen;

        public Prefixes() {
            this.maxTermLen = Integer.MAX_VALUE;
        }

        public Prefixes(int maxTermLen) {
            this.maxTermLen = maxTermLen;
        }

        public List<String> get(String s) {
            int size = s.length() < maxTermLen ? s.length() : maxTermLen;
            ArrayList<String> results = new ArrayList<String>();
            for (int i=1; i < size + 1; i++) {
                results.add(s.substring(0,i));
            }
            return results;
        }
    }
29. Extending Pig: Java UDFs
    package com.tumblr.swine.pig;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.FuncSpec;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.DataType;
    import org.apache.pig.data.DefaultBagFactory;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.logicalLayer.FrontendException;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class Prefixes extends EvalFunc<DataBag> {
        public DataBag exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0)
                return null;
            try{
                DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
                String word = (String)input.get(0);
                int max = Integer.MAX_VALUE;
                if (input.size() == 2) {
                    max = (Integer)input.get(1);
                }
                com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
                for (String prefix : prefixes.get(word)) {
                    Tuple t = TupleFactory.getInstance().newTuple(1);
                    t.set(0, prefix);
                    output.add(t);
                }
                return output;
            }catch(Exception e){
                System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
                return null;
            }
        }

        @Override
        public Schema outputSchema(Schema input) {
            Schema bagSchema = new Schema();
            bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
            try{
                return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
                    bagSchema, DataType.BAG));
            }catch (FrontendException e){
                return null;
            }
        }

        @Override
        public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
            List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
            Schema s = new Schema();
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
            // Allow specifying optional max length of prefix
            s = new Schema();
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            s.add(new Schema.FieldSchema(null, DataType.INTEGER));
            funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
            return funcSpecs;
        }
    }
30. HUE
Keeps query history
Preview tables / results
Save queries & templates
31. What tools we use
What we do with those tools
32. Spam
Classic example of supervised learning
Don't get too clever
Build good tooling!
33. Spam: Vowpal Wabbit
Online (continuously learning) system
Updates parameters with every new piece of information
Parallelizable, can run as service, very fast.
Loss functions:
•squared
•logistic
•hinge
•quantile
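To make "updates parameters with every new piece of information" concrete, here is a minimal online learner with squared loss, written from scratch. It is only a sketch of the general technique (stochastic gradient descent on a sparse linear model) and is not how Vowpal Wabbit is implemented. The class name `OnlineLinearModel`, the learning rate, and the toy stream are assumptions for illustration.

```python
class OnlineLinearModel:
    """Sparse linear model trained one example at a time with squared loss."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}

    def predict(self, features):
        # Features are binary indicators, so the prediction is a sum of weights.
        return sum(self.weights.get(f, 0.0) for f in features)

    def update(self, features, label):
        # For loss 0.5 * (prediction - label)^2 the gradient with respect to
        # each active weight is simply (prediction - label).
        error = self.predict(features) - label
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) - self.learning_rate * error
        return error


model = OnlineLinearModel()
# Hypothetical stream of (features, label) pairs; label 1 means suspended.
stream = [(["free_ipad", "warez"], 1), (["cats", "gif"], 0)] * 50
for features, label in stream:
    model.update(features, label)

print(round(model.predict(["free_ipad", "warez"]), 2))
```

Nothing is stored except the weights, so the model can keep learning as new examples arrive. Swapping the loss changes only the line that computes the gradient.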
34. Spam: Vowpal Wabbit
Post:
    blog: 'adamlaiacano',
    tags: ['free ipad', 'warez'],
    location: 'US~NY-New York',
    is_suspended: 0 or 1
Model: is_suspended ~ free_ipad + warez + US~NY-New_York + .....
Square loss function
Very high dimension: L1 regularization to avoid overfitting
Great precision, decent recall
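One way to turn a post like the one above into model input is sketched below. It assumes Vowpal Wabbit's plain-text input format, where a line is a label, a `|`, and space-separated feature names that contain no spaces, colons, or pipes. The helpers `to_feature` and `to_vw_line` are hypothetical and are not taken from the talk.

```python
def to_feature(value):
    """Make a token safe to use as a feature name: no spaces, ':' or '|'."""
    return value.strip().replace(" ", "_").replace(":", "_").replace("|", "_")


def to_vw_line(post):
    """Build 'label | feature feature ...' from a post dict."""
    features = [to_feature(tag) for tag in post["tags"]]
    features.append(to_feature(post["location"]))
    return "{} | {}".format(post["is_suspended"], " ".join(features))


post = {
    "blog": "adamlaiacano",
    "tags": ["free ipad", "warez"],
    "location": "US~NY-New York",
    "is_suspended": 1,
}
print(to_vw_line(post))
```

Every distinct tag and location becomes its own feature, which is where the very high dimension comes from.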
35. Type-ahead search
Most popular tags for any letter combination
Store daily results in distributed Redis cluster
m: [me, model, mine]
mu: [muscle, muscles, music video]
mus: [muscle, muscles, music video]
muse: [muse, museum, nine muses]
museu: [museum, metropolitan museum of art,
natural history museum]
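A small sketch of how such an index could be built: expand every word of every tag into its prefixes, then keep the most used tags per prefix. The tag counts, `top_n`, `max_len`, and the function names are invented for illustration, and writing the result to Redis is left out.

```python
from collections import defaultdict


def prefixes(word, max_len=5):
    """All prefixes of word, up to max_len characters."""
    return [word[:i] for i in range(1, min(len(word), max_len) + 1)]


def build_index(tag_counts, top_n=3, max_len=5):
    """Map each word prefix to the top_n most used tags containing that word."""
    candidates = defaultdict(set)
    for tag in tag_counts:
        for word in tag.split():
            for prefix in prefixes(word, max_len):
                candidates[prefix].add(tag)
    return {
        prefix: sorted(tags, key=lambda t: (-tag_counts[t], t))[:top_n]
        for prefix, tags in candidates.items()
    }


# Hypothetical daily usage counts per tag.
tag_counts = {
    "muse": 90, "museum": 80, "nine muses": 70,
    "music video": 60, "muscle": 50,
}
index = build_index(tag_counts)
print(index["muse"])
```

Splitting on words is what lets a tag such as "nine muses" show up under "muse" even though the tag itself starts with "n".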
36. Type-ahead search
Only keep popular prefixes: tag must occur 10 times
Only update keys that have changed.
- muse: [muse, museum, nine muses]
+ muse: [muse, museum, arizona muse]
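Both rules are easy to express in code. The sketch below filters out rare tags and then compares yesterday's and today's index so that only changed keys need to be written. The function names and the sample data are assumptions, and keys that disappear entirely are not handled here.

```python
def popular_tags(tag_counts, min_count=10):
    """Drop tags used fewer than min_count times."""
    return {tag: n for tag, n in tag_counts.items() if n >= min_count}


def changed_keys(old_index, new_index):
    """Return only the entries that are new or differ from the old index."""
    return {
        prefix: tags
        for prefix, tags in new_index.items()
        if old_index.get(prefix) != tags
    }


yesterday = {
    "mus": ["muscle", "muscles", "music video"],
    "muse": ["muse", "museum", "nine muses"],
}
today = {
    "mus": ["muscle", "muscles", "music video"],
    "muse": ["muse", "museum", "arizona muse"],
}
updates = changed_keys(yesterday, today)
print(updates)
```

Only the "muse" entry would be rewritten in this example, which keeps the daily load on the key-value store small.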
37. Questions?
@adamlaiacano
http://adamlaiacano.tumblr.com