Data at Tumblr

                            Adam Laiacano
                        NYC Data Science Meetup

                           @adamlaiacano
                        adamlaiacano.tumblr.com

Monday, April 8, 13
What I Needed to Learn
                      When I Started My Job




About Me


                           Electrical Engineering background
                      Worked at CBS to learn more about stats / data

                              Joined Tumblr in August 2011
                              40th employee, now over 160




About Tumblr
                      blogging platform / social network
                             100,000,000 blogs!

                              unique signals:
                       asynchronous following graph
                           reblogs, likes, replies




About You
                Country   Month   Value
                USA       March   10000
                USA       April   12000
                USA       May     14000
                Canada    March    7000
                Canada    April    6500
                Canada    May      5000
                France    March    1200
                France    April    1400
                France    May      2000

                Country   March   April   May
                USA       10000   12000   14000
                Canada     7000    6500    5000
                France     1200    1400    2000




                                      Pivot Table!
         library(reshape)  # provides cast() and melt.data.frame()
         pivoted <- cast(melted, country~month)                  # long -> wide
         melted <- melt.data.frame(pivoted, id.vars='country')   # wide -> long
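The same reshaping can be sketched outside R. The following is a minimal Python illustration using only the standard library; the function names `pivot` and `melt` and the tuple layout are assumptions made for this example, not code from the talk.

```python
from collections import OrderedDict

# Long-format rows from the slide: (country, month, value).
long_rows = [
    ("USA", "March", 10000), ("USA", "April", 12000), ("USA", "May", 14000),
    ("Canada", "March", 7000), ("Canada", "April", 6500), ("Canada", "May", 5000),
    ("France", "March", 1200), ("France", "April", 1400), ("France", "May", 2000),
]

def pivot(rows):
    """Long -> wide: one {month: value} mapping per country."""
    wide = OrderedDict()
    for country, month, value in rows:
        wide.setdefault(country, OrderedDict())[month] = value
    return wide

def melt(wide):
    """Wide -> long: the inverse of pivot()."""
    return [(country, month, value)
            for country, months in wide.items()
            for month, value in months.items()]

wide = pivot(long_rows)
```

Here `wide["Canada"]["April"]` gives 6500, and `melt(wide)` returns the original long-format rows, which is the round trip the two R lines express.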


                                    Who Cares?
One more question:




Hadoop




What tools we use

                      What we do with those tools




Plumbing




                             John D. Cook "The plumber programmer"
                             November 2011 http://bit.ly/XfcXrt

Pipes

                      1. Record events / actions
                      2. Store / archive everything
                      3. Extract information
                        a. Reports / BI
                        b. Back to Tumblr application




Step 1: Log Events
                      GiantOctopus: in-house event logging system.

                 Built-in Variables
                 •timestamp
                 •referring page
                 •user identifier
                 •action identifier
                 •location (city)
                 •language setting

                 GiantOctopus::log(
                     'posts',
                     array('send_to_fb'=>1,
                           'send_to_twitter'=>0
                     )
                 );

Scribe
                      Web Servers --(continuously writing)--> Scribe Servers --(daily cron)--> HDFS

Step 2: Store in Hadoop
                              One huge computer:
                               300TB hard drive
                                 7.8TB of RAM
                       85 x 2 = 170 hex-core processors


                               One huge PITA:
                         awful docs (search-hadoop.com helps)
                               java everywhere
                           fragmented community


Hadoop

                         hive

                         pig

                      map/reduce




Hive

           "Basically SQL"

           Compiles to Java map/reduce

           About 100 hive tables

           Each "table" is really a directory of flat files

           10 most liked posts

           SELECT
               root_post_id,
               count(*) AS likes
           FROM posts
           WHERE
               action='like'
           GROUP BY root_post_id
           ORDER BY likes DESC
           LIMIT 10;




Hive Partitions
                        File location in HDFS         Hive partition value
                      /posts/2013/03/26/*.lzo          dt='2013-03-26'
                      /posts/2013/03/27/*.lzo          dt='2013-03-27'
                      /posts/2013/03/28/*.lzo          dt='2013-03-28'

        Filtering on the partition column: 204 mappers

        SELECT action, COUNT(*) AS views
            FROM pageviews
            WHERE dt = "2012-03-05"
            GROUP BY action

        Filtering on the timestamp only: 22,895 mappers

        SELECT action, COUNT(*) AS views
            FROM pageviews
            WHERE ts > 1330927200
                AND ts < 1331013600
            GROUP BY action
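The difference in mapper counts comes from which directories have to be read. A rough Python sketch of the path-to-partition mapping follows, under the assumption that each day's data sits in its own `/table/YYYY/MM/DD/` directory as on the slide and that days are cut at UTC midnight (the talk does not say which timezone is used). The helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

def partition_path(table, day):
    """HDFS glob for one day's files, following the layout shown on the slide."""
    return "/{}/{:%Y/%m/%d}/*.lzo".format(table, day)

def partition_value(day):
    """The dt value associated with that directory."""
    return day.strftime("%Y-%m-%d")

def partitions_for_range(table, start_ts, end_ts):
    """Directories that could contain rows with start_ts < ts < end_ts.

    Assumes partitions are split on UTC day boundaries (an assumption).
    """
    first = datetime.fromtimestamp(start_ts, timezone.utc).date()
    last = datetime.fromtimestamp(end_ts, timezone.utc).date()
    n_days = (last - first).days
    return [partition_path(table, first + timedelta(days=i))
            for i in range(n_days + 1)]

paths = partitions_for_range("pageviews", 1330927200, 1331013600)
```

Under these assumptions the timestamp range from the second query touches only two daily directories, so adding the matching `dt` conditions alongside the `ts` filter would let the query skip the rest of the table.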

Extending Hive: Streaming
                      •Add all .py files you’ll need to the query
                      •Sends each record to python script via stdin
                      •Can be used as a subquery in a “normal” hive query

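             add file helpers.py;
             FROM
                 users
             SELECT
               TRANSFORM(id, email)
               USING 'helpers.py'
               AS (id_with_gmail)

             #!/usr/bin/python
             ## helpers.py
             import sys, re
             # Escape the dot and anchor the end so only gmail.com addresses match
             gmail = re.compile(r'.+@gmail\.com$')
             for row in sys.stdin:
                 # Hive streams one tab-separated record per line
                 id, email = row.rstrip('\n').split('\t')
                 if gmail.match(email):
                     print(id)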



Pig
            "Basically SQL" if you had to explain it piece by piece.

            "DataBag" == "DataFrame"

            posts = LOAD 'posts.tsv' AS (
                root_post_id:int,
                action:chararray
            );
            likes = FILTER posts BY action=='like';
            grouped = GROUP likes BY root_post_id;

            counted = FOREACH grouped GENERATE
                group AS root_post_id,
                COUNT(likes.root_post_id) AS likes;

            sorted = ORDER counted BY likes DESC;

            top10 = LIMIT sorted 10;

            STORE top10 INTO 'top10.csv';
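For readers more familiar with Python than Pig, the same pipeline can be sketched one relation at a time. The sample records below are made up for illustration; each step is commented with the Pig statement it roughly corresponds to.

```python
from collections import defaultdict

# Hypothetical sample of (root_post_id, action) records.
posts = [(1, "like"), (2, "like"), (1, "reblog"), (1, "like"),
         (3, "like"), (2, "like"), (1, "like")]

# likes = FILTER posts BY action == 'like';
likes = [record for record in posts if record[1] == "like"]

# grouped = GROUP likes BY root_post_id;
grouped = defaultdict(list)
for root_post_id, action in likes:
    grouped[root_post_id].append((root_post_id, action))

# counted = FOREACH grouped GENERATE group, COUNT(...);
counted = [(root_post_id, len(bag)) for root_post_id, bag in grouped.items()]

# sorted = ORDER counted BY likes DESC;
sorted_counts = sorted(counted, key=lambda row: row[1], reverse=True)

# top10 = LIMIT sorted 10;
top10 = sorted_counts[:10]
```

Each intermediate name holds a full collection, much like a Pig relation, which is what makes the script readable "piece by piece".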




Extending Pig: Python UDFs
                                Extract word prefixes for type-
                                      ahead tag search

                                @outputSchema("t:(prefix:chararray)")
                                def prefixes(input, max_len=3):
                                    nchar = min(len(input), max_len) + 1
                                    return [input[:i] for i in range(1,nchar)]


                                 >>> prefixes('museum', max_len=6)
                                 ['m', 'mu', 'mus', 'muse', 'museu', 'museum']




Extending Pig: Java UDFs
                package com.tumblr.swine;

                import java.util.ArrayList;
                import java.util.List;

                public class Prefixes {

                      private int maxTermLen;

                      public Prefixes() {
                          this.maxTermLen = Integer.MAX_VALUE;
                      }

                      public Prefixes(int maxTermLen) {
                          this.maxTermLen = maxTermLen;
                      }

                      public List<String> get(String s) {
                          int size = s.length() < maxTermLen ? s.length() : maxTermLen;
                          ArrayList<String> results = new ArrayList<String>();
                          for (int i=1; i < size + 1; i++) {
                              results.add(s.substring(0,i));
                          }
                          return results;
                      }
                }




Extending Pig: Java UDFs
                package com.tumblr.swine.pig;

                import java.io.IOException;
                import java.util.ArrayList;
                import java.util.List;

                import org.apache.pig.EvalFunc;
                import org.apache.pig.FuncSpec;
                import org.apache.pig.data.DataBag;
                import org.apache.pig.data.DataType;
                import org.apache.pig.data.DefaultBagFactory;
                import org.apache.pig.data.Tuple;
                import org.apache.pig.data.TupleFactory;
                import org.apache.pig.impl.logicalLayer.FrontendException;
                import org.apache.pig.impl.logicalLayer.schema.Schema;

                public class Prefixes extends EvalFunc<DataBag> {

                    public DataBag exec(Tuple input) throws IOException {
                        if (input == null || input.size() == 0)
                            return null;
                        try{
                            DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
                            String word = (String)input.get(0);
                            int max = Integer.MAX_VALUE;
                            if (input.size() == 2) {
                                max = (Integer)input.get(1);
                            }
                            com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
                            for (String prefix : prefixes.get(word)) {
                                Tuple t = TupleFactory.getInstance().newTuple(1);
                                t.set(0, prefix);
                                output.add(t);
                            }
                            return output;
                        }catch(Exception e){
                            System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
                            return null;
                        }
                    }

                    @Override
                    public Schema outputSchema(Schema input) {
                        Schema bagSchema = new Schema();
                        bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
                        try{
                            return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
                                    bagSchema, DataType.BAG));
                        }catch (FrontendException e){
                            return null;
                        }
                    }

                    @Override
                    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
                        List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
                        Schema s = new Schema();
                        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                        funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
                        // Allow specifying optional max length of prefix
                        s = new Schema();
                        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                        s.add(new Schema.FieldSchema(null, DataType.INTEGER));
                        funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
                        return funcSpecs;
                    }

                }




HUE


       Keeps query history

       Preview tables / results

       Save queries & templates




What tools we use

                      What we do with those tools




Spam


                      Classic example of supervised learning

                      Don't get too clever

                      Build good tooling!




Spam: Vowpal Wabbit
                 Online (continuously learning) system

                 Updates parameters with every new piece of information

                 Parallelizable, can run as service, very fast.

                 Loss functions:
                 •squared
                 •logistic
                 •hinge
                 •quantile
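To make "updates parameters with every new piece of information" concrete, here is a toy online learner in Python. It is only a sketch of the general idea (stochastic gradient descent on squared loss, one weight per feature name), not a description of how Vowpal Wabbit is implemented; the class name, learning rate, and sample features are all assumptions for illustration.

```python
from collections import defaultdict

class OnlineLinearModel:
    """Toy online learner: squared loss, plain SGD, one weight per feature name."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)

    def predict(self, features):
        # Features are treated as binary indicators, so the score is a sum of weights.
        return sum(self.weights[f] for f in features)

    def update(self, features, label):
        """Take one SGD step on 0.5 * (prediction - label)**2.

        Returns the prediction made before the update.
        """
        prediction = self.predict(features)
        error = prediction - label
        for f in features:
            self.weights[f] -= self.learning_rate * error
        return prediction

model = OnlineLinearModel()
# Hypothetical stream: the same spam and non-spam examples seen repeatedly.
stream = [(["free_ipad", "warez"], 1), (["cats", "gif"], 0)] * 50
for features, label in stream:
    model.update(features, label)
```

The model never sees the data set as a whole: each example adjusts the weights and is then discarded, which is what allows such a learner to run as a long-lived service.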
Spam: Vowpal Wabbit
                                blog:           'adamlaiacano',
                      Post:     tags:           ['free ipad', 'warez'],
                                location:       'US~NY-New York',
                                is_suspended:   0 or 1



                      Model:   is_suspended ~ free_ipad + warez + US~NY-New_York + .....




                      Square loss function
                      Very high dimension: L1 regularization to avoid overfitting
                      Great precision, decent recall
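One way a post like the one above could be turned into a line of Vowpal Wabbit's plain-text input (label, a `|` separator, then space-separated feature names) is sketched below. The talk does not show the actual encoding, so the choice to use only tags and location, and the helper names `sanitize` and `to_vw_line`, are assumptions.

```python
def sanitize(token):
    """Replace characters that are special in VW's text format (space, ':' and '|')."""
    for ch in " :|":
        token = token.replace(ch, "_")
    return token

def to_vw_line(post):
    """Format one post as 'label | feature feature ...'.

    Only tags and location are used as features here (an assumption).
    """
    features = [sanitize(tag) for tag in post["tags"]]
    features.append(sanitize(post["location"]))
    return "{} | {}".format(post["is_suspended"], " ".join(features))

post = {
    "blog": "adamlaiacano",
    "tags": ["free ipad", "warez"],
    "location": "US~NY-New York",
    "is_suspended": 1,
}
line = to_vw_line(post)
```

For the sample post this yields `1 | free_ipad warez US~NY-New_York`, matching the feature names in the model formula on the slide.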


Type-Ahead search

                      Most popular tags for any letter combination

                      Store daily results in distributed Redis cluster

                 m:            [me, model, mine]
                 mu:           [muscle, muscles, music video]
                 mus:          [muscle, muscles, music video]
                 muse:         [muse, museum, nine muses]
                 museu:        [museum, metropolitan museum of art,
                                natural history museum]
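A prefix table like this could be built roughly as follows. This Python sketch assumes prefixes are taken from every word of a tag (which would explain "nine muses" appearing under "muse"), that the three most popular tags are kept per prefix, and that rare tags are dropped at a threshold of 10 occurrences; the tag counts are invented for the example.

```python
from collections import defaultdict

def prefixes(word, max_len=5):
    """All leading substrings of word, up to max_len characters."""
    nchar = min(len(word), max_len) + 1
    return [word[:i] for i in range(1, nchar)]

def build_index(tag_counts, top_n=3, min_count=10, max_len=5):
    """Map each prefix to its top_n most popular tags."""
    candidates = defaultdict(set)
    for tag, count in tag_counts.items():
        if count < min_count:
            continue  # skip unpopular tags
        for word in tag.split():
            for prefix in prefixes(word, max_len):
                candidates[prefix].add((count, tag))
    index = {}
    for prefix, pairs in candidates.items():
        # Most popular first; ties broken alphabetically for stable output.
        ranked = sorted(pairs, key=lambda pair: (-pair[0], pair[1]))
        index[prefix] = [tag for _, tag in ranked[:top_n]]
    return index

# Hypothetical daily tag counts.
tag_counts = {"muse": 50, "museum": 40, "nine muses": 30,
              "music video": 20, "mustard": 5}
index = build_index(tag_counts)
```

The resulting dictionary has one entry per prefix, which maps naturally onto one key per prefix in a key-value store such as Redis.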

Type-Ahead search

                      Only keep popular prefixes: tag must occur 10 times

                      Only update keys that have changed.


                 - muse:        [muse, museum, nine muses]
                 + muse:        [muse, museum, arizona muse]
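The "only update keys that have changed" step amounts to diffing yesterday's table against today's. A minimal Python sketch, with the function name `changed_keys` and the sample tables chosen for illustration:

```python
def changed_keys(old_index, new_index):
    """Return (updates, deletions) needed to turn old_index into new_index."""
    updates = {key: tags for key, tags in new_index.items()
               if old_index.get(key) != tags}
    deletions = [key for key in old_index if key not in new_index]
    return updates, deletions

yesterday = {
    "muse": ["muse", "museum", "nine muses"],
    "museu": ["museum", "metropolitan museum of art", "natural history museum"],
}
today = {
    "muse": ["muse", "museum", "arizona muse"],
    "museu": ["museum", "metropolitan museum of art", "natural history museum"],
}
updates, deletions = changed_keys(yesterday, today)
```

Only the entries in `updates` (here just `muse`) would need to be written to the store, and anything in `deletions` removed; unchanged keys such as `museu` are left alone.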



Questions?



                             @adamlaiacano

                      http://adamlaiacano.tumblr.com




Monday, April 8, 13

Weitere ähnliche Inhalte

Kürzlich hochgeladen

A DAY IN THE LIFE OF A SALESMAN / WOMAN
A DAY IN THE LIFE OF A  SALESMAN / WOMANA DAY IN THE LIFE OF A  SALESMAN / WOMAN
A DAY IN THE LIFE OF A SALESMAN / WOMANIlamathiKannappan
 
How to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityHow to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityEric T. Tung
 
John Halpern sued for sexual assault.pdf
John Halpern sued for sexual assault.pdfJohn Halpern sued for sexual assault.pdf
John Halpern sued for sexual assault.pdfAmzadHosen3
 
Call Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine ServiceCall Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine Serviceritikaroy0888
 
Grateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdfGrateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdfPaul Menig
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...anilsa9823
 
7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...Paul Menig
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...rajveerescorts2022
 
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...amitlee9823
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒anilsa9823
 
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesMysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesDipal Arora
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Dave Litwiller
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Lviv Startup Club
 
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...lizamodels9
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...amitlee9823
 
It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayNZSG
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Roland Driesen
 
Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Servicediscovermytutordmt
 
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756dollysharma2066
 

Kürzlich hochgeladen (20)

A DAY IN THE LIFE OF A SALESMAN / WOMAN
A DAY IN THE LIFE OF A  SALESMAN / WOMANA DAY IN THE LIFE OF A  SALESMAN / WOMAN
A DAY IN THE LIFE OF A SALESMAN / WOMAN
 
How to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityHow to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League City
 
John Halpern sued for sexual assault.pdf
John Halpern sued for sexual assault.pdfJohn Halpern sued for sexual assault.pdf
John Halpern sued for sexual assault.pdf
 
Call Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine ServiceCall Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine Service
 
Grateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdfGrateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdf
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
 
7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
Data Science at Tumblr

  • 1. Data at Tumblr Adam Laiacano NYC Data Science Meetup @adamlaiacano adamlaiacano.tumblr.com Monday, April 8, 13
  • 2. What I Needed to Learn When I Started My Job
  • 3. About Me: Electrical Engineering background. Worked at CBS to learn more about stats / data. Joined Tumblr in August 2011 as the 40th employee; now over 160.
  • 4. About Tumblr: blogging platform / social network. 100,000,000 blogs! Unique signals: asynchronous following graph; reblogs, likes, replies.
  • 5. About You
      Country  Month  Value
      USA      March  10000
      USA      April  12000
      USA      May    14000
      Canada   March   7000
      Canada   April   6500
      Canada   May     5000
      France   March   1200
      France   April   1400
      France   May     2000

      Country  March  April  May
      USA      10000  12000  14000
      Canada    7000   6500   5000
      France    1200   1400   2000
  • 6. About You (same two tables): Pivot Table!
  • 7. About You (same two tables)
  • 8. About You (same two tables), converting between them in R:
      pivoted <- cast(melted, country~month)
      melted <- melt.data.frame(pivoted, id.vars='country')
  • 9. About You (same two tables)
  • 10. About You (same two tables): Who Cares?
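The pivot and melt steps from slides 5 to 10 can be sketched without any library. This is a minimal illustration in plain Python, not the R reshape code on slide 8; the helper names `pivot` and `melt` and the tuple layout are assumptions made for this example.

```python
# Long ("melted") form: one row per (country, month) observation.
LONG_ROWS = [
    ("USA", "March", 10000), ("USA", "April", 12000), ("USA", "May", 14000),
    ("Canada", "March", 7000), ("Canada", "April", 6500), ("Canada", "May", 5000),
    ("France", "March", 1200), ("France", "April", 1400), ("France", "May", 2000),
]


def pivot(rows):
    """Long -> wide: one {month: value} dict per country."""
    wide = {}
    for country, month, value in rows:
        wide.setdefault(country, {})[month] = value
    return wide


def melt(wide):
    """Wide -> long: flatten back to (country, month, value) rows."""
    return [
        (country, month, value)
        for country, months in wide.items()
        for month, value in months.items()
    ]


wide = pivot(LONG_ROWS)
print(wide["Canada"])            # the Canada row of the wide table
print(melt(wide) == LONG_ROWS)   # round trip gives back the long table
```

The round trip works because Python dicts (3.7+) keep insertion order, so melting the wide form returns rows in their original order.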
  • 14. What tools we use / What we do with those tools
  • 15. Plumbing: John D. Cook, "The plumber programmer", November 2011, http://bit.ly/XfcXrt
  • 16. Pipes: 1. Record events / actions. 2. Store / archive everything. 3. Extract information: a. Reports / BI; b. Back to Tumblr application.
  • 17. Step 1: Log Events. GiantOctopus: in-house event logging system.
      GiantOctopus::log('posts', array('send_to_fb' => 1, 'send_to_twitter' => 0));
      Built-in variables: timestamp, referring page, user identifier, action identifier, location (city), language setting.
  • 18. Scribe: Web Servers -> Scribe Servers (continuously writing) -> HDFS (daily cron)
  • 19. Step 2: Store in Hadoop. One huge computer: 300TB hard drive; 7.8TB of RAM; 85 x 2 = 170 hex-core processors.
  • 20. Step 2: Store in Hadoop. One huge computer: 300TB hard drive; 7.8TB of RAM; 85 x 2 = 170 hex-core processors. One huge PITA: awful docs (search-hadoop.com helps); Java everywhere; fragmented community.
  • 21. Hadoop: hive, pig, map/reduce
  • 22. Hive: "Basically SQL". Compiles to Java map/reduce. About 100 hive tables. Each "table" is really a directory of flat files. Example, 10 most liked posts:
      SELECT root_post_id, count(*) AS likes
      FROM posts
      WHERE action='like'
      GROUP BY root_post_id
      ORDER BY likes DESC
      LIMIT 10;
  • 23. Hive Partitions
      File location in HDFS      Hive partition value
      /posts/2013/03/26/*.lzo    dt='2013-03-26'
      /posts/2013/03/27/*.lzo    dt='2013-03-27'
      /posts/2013/03/28/*.lzo    dt='2013-03-28'
      Filtering on the raw timestamp scans every partition (22,895 mappers):
      SELECT action, COUNT(*) AS views FROM pageviews WHERE ts > 1330927200 AND ts < 1331013600 GROUP BY action
      Filtering on the partition column reads one day's directory (204 mappers):
      SELECT action, COUNT(*) AS views FROM pageviews WHERE dt = "2012-03-05" GROUP BY action
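Partition filters only help if a query names the `dt` values it needs. The following is a minimal sketch of turning a Unix timestamp range into partition values and HDFS globs in the layout shown on slide 23. The function names are hypothetical, and it assumes partitions are keyed by UTC date, which the slides do not state.

```python
from datetime import date, datetime, timedelta, timezone


def partition_path(table, day):
    """HDFS glob for one day, following the /posts/2013/03/26/*.lzo layout."""
    return "/{}/{:%Y/%m/%d}/*.lzo".format(table, day)


def partitions_for_range(ts_start, ts_end):
    """Return the dt values (assumed UTC dates) covering [ts_start, ts_end)."""
    start = datetime.fromtimestamp(ts_start, tz=timezone.utc).date()
    # Subtract one second so an end timestamp exactly at midnight
    # does not pull in the following day.
    end = datetime.fromtimestamp(ts_end - 1, tz=timezone.utc).date()
    days = []
    day = start
    while day <= end:
        days.append(day.isoformat())
        day += timedelta(days=1)
    return days


def dt_filter(ts_start, ts_end):
    """Build a WHERE fragment that restricts the scan to the needed partitions."""
    values = ", ".join("'{}'".format(d) for d in partitions_for_range(ts_start, ts_end))
    return "dt IN ({})".format(values)


print(partition_path("posts", date(2013, 3, 26)))
print(dt_filter(1330927200, 1331013600))
```

Adding the generated `dt` filter alongside the `ts` bounds keeps the exact time window while letting Hive skip every other day's directory.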
  • 24. Extending Hive: Streaming
      - Add all .py files you'll need to the query
      - Sends each record to python script via stdin
      - Can be used as a subquery in a "normal" hive query
      helpers.py:
      #!/usr/bin/python
      ## helpers.py
      import sys, re
      # Match addresses ending in @gmail.com (dot escaped so it is literal)
      gmail = re.compile(r'.+@gmail\.com$')
      for row in sys.stdin:
          # Hive sends one tab-separated record per line
          id, email = row.rstrip('\n').split('\t')
          if gmail.match(email):
              print(id)
      Hive query:
      add file helpers.py;
      FROM users
      SELECT TRANSFORM(id, email)
      USING 'helpers.py'
      AS (id_with_gmail)
  • 25. Pig: "Basically SQL" if you had to explain it piece by piece. "DataBag" == "DataFrame".
      posts = LOAD 'posts.tsv' AS (root_post_id:int, action:chararray);
      likes = FILTER posts BY action=='like';
      grouped = GROUP likes BY root_post_id;
      counted = FOREACH grouped GENERATE group AS root_post_id, COUNT(likes.root_post_id) AS likes;
      sorted = ORDER counted BY likes DESC;
      top10 = LIMIT sorted 10;
      STORE top10 INTO 'top10.csv';
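Read piece by piece, the Pig script is a filter, a group, a count, a sort, and a limit. The same steps can be written in plain Python as a rough in-memory analogue. Pig would run them as map/reduce jobs over HDFS instead; the sample rows and the name `top_liked` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample of (root_post_id, action) rows, shaped like posts.tsv.
posts = [
    (1, "like"), (2, "like"), (1, "reblog"), (1, "like"),
    (3, "like"), (2, "like"), (1, "like"),
]


def top_liked(rows, n=10):
    """Return the n most liked posts as (root_post_id, likes) pairs."""
    # FILTER posts BY action == 'like'
    likes = [post_id for post_id, action in rows if action == "like"]
    # GROUP likes BY root_post_id, then COUNT per group
    counted = Counter(likes)
    # ORDER counted BY likes DESC, then LIMIT n
    return counted.most_common(n)


print(top_liked(posts))
```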
  • 26. Extending Pig: Python UDFs. Extract word prefixes for type-ahead tag search.
      def prefixes(input, max_len=3):
          nchar = min(len(input), max_len) + 1
          return [input[:i] for i in range(1,nchar)]
      >>> prefixes('museum', max_len=6)
      ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
  • 27. Extending Pig: Python UDFs. The same function with its Pig output schema declared:
      @outputSchema("t:(prefix:chararray)")
      def prefixes(input, max_len=3):
          nchar = min(len(input), max_len) + 1
          return [input[:i] for i in range(1,nchar)]
      >>> prefixes('museum', max_len=6)
      ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
  • 28. Extending Pig: Java UDFs
      package com.tumblr.swine;

      import java.util.ArrayList;
      import java.util.List;

      public class Prefixes {
          private int maxTermLen;

          public Prefixes() {
              this.maxTermLen = Integer.MAX_VALUE;
          }

          public Prefixes(int maxTermLen) {
              this.maxTermLen = maxTermLen;
          }

          public List<String> get(String s) {
              int size = s.length() < maxTermLen ? s.length() : maxTermLen;
              ArrayList<String> results = new ArrayList<String>();
              for (int i=1; i < size + 1; i++) {
                  results.add(s.substring(0,i));
              }
              return results;
          }
      }
  • 29. Extending Pig: Java UDFs. The Pig wrapper, shown on the slide next to the plain Prefixes class from slide 28:
      package com.tumblr.swine.pig;

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.FuncSpec;
      import org.apache.pig.data.DataBag;
      import org.apache.pig.data.DataType;
      import org.apache.pig.data.DefaultBagFactory;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.data.TupleFactory;
      import org.apache.pig.impl.logicalLayer.FrontendException;
      import org.apache.pig.impl.logicalLayer.schema.Schema;

      public class Prefixes extends EvalFunc<DataBag> {
          public DataBag exec(Tuple input) throws IOException {
              if (input == null || input.size() == 0)
                  return null;
              try{
                  DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
                  String word = (String)input.get(0);
                  int max = Integer.MAX_VALUE;
                  if (input.size() == 2) {
                      max = (Integer)input.get(1);
                  }
                  com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
                  for (String prefix : prefixes.get(word)) {
                      Tuple t = TupleFactory.getInstance().newTuple(1);
                      t.set(0, prefix);
                      output.add(t);
                  }
                  return output;
              }catch(Exception e){
                  System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
                  return null;
              }
          }

          @Override
          public Schema outputSchema(Schema input) {
              Schema bagSchema = new Schema();
              bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
              try{
                  return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
                                                           bagSchema, DataType.BAG));
              }catch (FrontendException e){
                  return null;
              }
          }

          @Override
          public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
              List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
              Schema s = new Schema();
              s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
              funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
              // Allow specifying optional max length of prefix
              s = new Schema();
              s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
              s.add(new Schema.FieldSchema(null, DataType.INTEGER));
              funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
              return funcSpecs;
          }
      }
  • 30. HUE: Keeps query history. Preview tables / results. Save queries & templates.
  • 31. What tools we use / What we do with those tools
  • 32. Spam: Classic example of supervised learning. Don't get too clever. Build good tooling!
  • 33. Spam: Vowpal Wabbit. Online (continuously learning) system. Updates parameters with every new piece of information. Parallelizable, can run as service, very fast. Loss functions: squared, logistic, hinge, quantile.
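"Updates parameters with every new piece of information" can be made concrete with a toy online learner. This is a minimal sketch of stochastic gradient descent on the logistic loss over sparse binary features. It is not Vowpal Wabbit's actual update rule; the class name, learning rate, and training stream are all assumptions for illustration.

```python
import math


class OnlineLogistic:
    """Sparse logistic regression trained one example at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight, absent means 0.0

    def predict(self, features):
        """Probability that an example with these binary features is spam."""
        score = sum(self.weights.get(f, 0.0) for f in features)
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        """Take one gradient step on the logistic loss; label is 0 or 1."""
        error = self.predict(features) - label
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) - self.learning_rate * error


model = OnlineLogistic()
# Made-up stream: the same spam and non-spam example seen 50 times each.
stream = [(["free_ipad", "warez"], 1), (["cats", "gif"], 0)] * 50
for features, label in stream:
    model.update(features, label)

print(model.predict(["free_ipad", "warez"]))  # close to 1
print(model.predict(["cats", "gif"]))         # close to 0
```

Because each update touches only the features present in one example, the model never needs the whole data set in memory, which is the property the slide highlights.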
  • 34. Spam: Vowpal Wabbit
      Post: blog: 'adamlaiacano', tags: ['free ipad', 'warez'], location: 'US~NY-New York', is_suspended: 0 or 1
      Model: is_suspended ~ free_ipad + warez + US~NY-New_York + .....
      Square loss function. Very high dimension: L1 regularization to avoid overfitting. Great precision, decent recall.
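The step from a post to model terms like `free_ipad` and `US~NY-New_York` can be sketched as a conversion to Vowpal Wabbit's plain-text input, where a line is a label, a `|`, and space-separated feature names. The helpers `vw_feature` and `to_vw_line` are hypothetical, and the label value in the example is chosen for illustration; the slides do not show how Tumblr actually encoded its features.

```python
import re


def vw_feature(text):
    """Make a token safe as a VW feature name: no whitespace, '|' or ':'."""
    return re.sub(r"[\s|:]+", "_", text.strip())


def to_vw_line(post):
    """Build one training line: '<label> | <feature> <feature> ...'."""
    features = [vw_feature(tag) for tag in post["tags"]]
    features.append(vw_feature(post["location"]))
    return "{} | {}".format(post["is_suspended"], " ".join(features))


post = {
    "blog": "adamlaiacano",
    "tags": ["free ipad", "warez"],
    "location": "US~NY-New York",
    "is_suspended": 1,  # illustrative label; the slide says 0 or 1
}
print(to_vw_line(post))  # 1 | free_ipad warez US~NY-New_York
```

Replacing spaces with underscores matters because VW splits features on whitespace, so the tag "free ipad" would otherwise become two unrelated features.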
  • 35. Type-ahead search. Most popular tags for any letter combination. Store daily results in distributed Redis cluster.
      m: [me, model, mine]
      mu: [muscle, muscles, music video]
      mus: [muscle, muscles, music video]
      muse: [muse, museum, nine muses]
      museu: [museum, metropolitan museum of art, natural history museum]
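A prefix-to-tags table like the one on slide 35 can be built from tag counts in a few lines. This is a minimal sketch, not Tumblr's Pig job: the names and counts are made up, and matching on the start of any word in a tag is an assumption inferred from "nine muses" appearing under `muse`.

```python
from collections import defaultdict


def prefixes(word, max_len=5):
    """All prefixes of word up to max_len characters."""
    return [word[:i] for i in range(1, min(len(word), max_len) + 1)]


def build_index(tag_counts, top_n=3, max_len=5):
    """Map each prefix to its top_n tags, most frequent first."""
    candidates = defaultdict(list)
    for tag, count in tag_counts.items():
        # Collect prefixes of every word so multi-word tags match mid-phrase.
        seen = set()
        for word in tag.split():
            seen.update(prefixes(word, max_len))
        for prefix in seen:
            candidates[prefix].append((count, tag))
    index = {}
    for prefix, tags in candidates.items():
        # Sort by descending count, then alphabetically to break ties.
        ranked = sorted(tags, key=lambda pair: (-pair[0], pair[1]))
        index[prefix] = [tag for _, tag in ranked[:top_n]]
    return index


# Hypothetical daily tag counts.
tag_counts = {"muse": 90, "museum": 80, "nine muses": 70, "music video": 60, "me": 50}
index = build_index(tag_counts)
print(index["muse"])   # ['muse', 'museum', 'nine muses']
print(index["museu"])  # ['museum']
```

Each resulting key and list could then be written to a key-value store, matching the one-key-per-prefix layout on the slide.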
  • 36. Type-ahead search. Only keep popular prefixes: tag must occur 10 times. Only update keys that have changed.
      - muse: [muse, museum, nine muses]
      + muse: [muse, museum, arizona muse]
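The two rules on slide 36 amount to a filter and a diff. This is a minimal sketch of both with hypothetical function names. It reads "must occur 10 times" as "at least 10", which is an assumption, and it only computes which keys to write, leaving the actual Redis calls out.

```python
def popular_only(tag_counts, min_count=10):
    """Drop tags seen fewer than min_count times (assumed to mean >= 10)."""
    return {tag: n for tag, n in tag_counts.items() if n >= min_count}


def changed_keys(old_index, new_index):
    """Compare two daily indexes and return (to_set, to_delete)."""
    # Keys that are new or whose tag list differs from yesterday.
    to_set = {k: v for k, v in new_index.items() if old_index.get(k) != v}
    # Keys that existed yesterday but are gone today.
    to_delete = [k for k in old_index if k not in new_index]
    return to_set, to_delete


yesterday = {
    "mus": ["muscle", "muscles", "music video"],
    "muse": ["muse", "museum", "nine muses"],
}
today = {
    "mus": ["muscle", "muscles", "music video"],
    "muse": ["muse", "museum", "arizona muse"],
}
to_set, to_delete = changed_keys(yesterday, today)
print(to_set)     # only the 'muse' key needs rewriting
print(to_delete)  # nothing to remove
```

Writing only the changed keys keeps the daily update proportional to what actually moved rather than to the size of the whole index.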
  • 37. Questions? @adamlaiacano http://adamlaiacano.tumblr.com