Crab
                A Python Framework for Building
                    Recommendation Engines
                  PythonBrasil 2011, São Paulo, SP


Marcel Caraciolo Ricardo Caspirro                Bruno Melo
   @marcelcaraciolo        @ricardocaspirro          @brunomelo
What is Crab ?

 A python framework for building recommendation engines
A Scikit module for collaborative, content and hybrid filtering
       Mahout Alternative for Python Developers :D
             Open-Source under the BSD license


             https://github.com/muricoca/crab
When started ?

It began one year ago
Community-driven, 4 members
Since April 2011, incorporated into the Muriçoca open-source labs
Since April 2011, being rewritten as a Scikit




                https://github.com/muricoca/
Knowing Scikits
Scikits are SciPy Toolkits: independent projects hosted
                under a common namespace.


                       Scikits Image
                     Scikits MlabWrap
                     Scikits AudioLab
                      Scikit Learn
                             ....

           http://scikits.appspot.com/scikits
Knowing Scikits

                        Scikit-Learn

    Machine Learning Algorithms + scientific Python packages
                (Numpy, Scipy and Matplotlib)

           http://scikit-learn.sourceforge.net/


Our goal: turn Crab into a Scikit and contribute
           some parts of it to scikit-learn
Why Recommendations ?
The world is an over-crowded place
 !"#$%&'()$*+$,-$&.#'/0'&%)#)$1(,0#
Why Recommendations
     * +,&-.$/).#&0#/"1.#$%234(".#                   ?
       $/)#5(&6 7&.2.#"$4,#)$8
                   We are overloaded
     * 93((3&/.#&0#:&'3".;#5&&<.#
         $/)#:-.34#2%$4<.#&/(3/"
Thousands of news articles and blog posts each day
       * =/#>$/&3;#?#@A#+B#4,$//"(.;#
          2,&-.$/).#&0#7%&6%$:.#
 Millions of movies, books and music tracks online
          "$4,#)$8
          Several Places, Offers and Events

     * =/#C"1#D&%<;#."'"%$(#
  Even Friends sometimes we are overloaded !

         2,&-.$/).#&0#$)#:"..$6".#
         ."/2#2&#-.#7"%#)$8
Why Recommendations ?
We really need and consume only a few of them!

   “A lot of times, people don’t know what
   they want until you show it to them.”
                                         Steve Jobs

  “We are leaving the Information age, and
  entering into the Recommendation age.”
                      Chris Anderson, from book Long Tail
Why Recommendations ?
Can Google help ?
  Yes, but only when we really know what we are looking for
           But what does "interesting" mean?
Can Facebook help ?
  Yes, I tend to find my friends' stuff interesting
   What if I had only a few friends, and what they like does not always
                             appeal to me?
Can experts help ?
  Yes, but it doesn't scale well.
    And it is what they like, not what I like. Exactly the same advice for everyone!
Why Recommendations ?
         Recommendation Systems
Systems designed to recommend to me something I may like
Why Recommendations ?
      Recommendation Systems


                       Graph Representation
The current Crab

Collaborative Filtering algorithms
 User-Based, Item-Based and Factorization Matrix (SVD)

Evaluation of the Recommender Algorithms
 Precision, Recall, F1-Score, RMSE




                           Precision-Recall Charts
The current Crab




   Precision-Recall Charts
Collaborative Filtering

Items: Thor, O Vento Levou, Armagedon, Toy Store
Users: Marcel, Rafael, Amanda
Users like items; similar users drive what gets recommended
The current Crab
>>>#load the dataset

>>> from crab.datasets import load_sample_movies
>>> data = load_sample_movies()
>>> data
{'DESCR': 'sample_movies data set was collected by the book called
          \nProgramming the Collective Intelligence by Toby Segaran\n\nNotes\n-----
          \nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.',
 'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
  2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
  3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
  4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
  5: {2: 4.5, 3: 1.0, 4: 4.0},
  6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
  7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}},
 'item_ids': {1: 'Lady in the Water',
  2: 'Snakes on a Planet',
  3: 'You, Me and Dupree',
  4: 'Superman Returns',
  5: 'The Night Listener',
  6: 'Just My Luck'},
 'user_ids': {1: 'Jack Matthews',
  2: 'Mick LaSalle',
  3: 'Claudia Puig',
  4: 'Lisa Rose',
  5: 'Toby',
  6: 'Gene Seymour',
  7: 'Michael Phillips'}}
The current Crab

>>> from crab.models import MatrixPreferenceDataModel
>>> m = MatrixPreferenceDataModel(data.data)

>>> print m
MatrixPreferenceDataModel (7 by 6)
         1          2          3          4            5        ...
1        3.000000   4.000000   3.500000   5.000000   3.000000
2        3.000000   4.000000   2.000000   3.000000   3.000000
3           ---     3.500000   2.500000   4.000000   4.500000
4        2.500000   3.500000   2.500000   3.500000   3.000000
5           ---     4.500000   1.000000   4.000000       ---
6        3.000000   3.500000   3.500000   5.000000   3.000000
7        2.500000   3.000000       ---    3.500000   4.000000
The current Crab
>>> #import pairwise distance
>>> from crab.metrics.pairwise import euclidean_distances
>>> #import similarity
>>> from crab.similarities import UserSimilarity
>>> similarity = UserSimilarity(m, euclidean_distances)
>>> similarity[1]
[(1, 1.0),
 (6, 0.66666666666666663),
 (4, 0.34054242658316669),
 (3, 0.32037724101704074),
 (7, 0.32037724101704074),
 (2, 0.2857142857142857),
 (5, 0.2674788903885893)]
The current Crab

>>> from crab.recommenders.knn import UserBasedRecommender
>>> recsys = UserBasedRecommender(model=m,
        similarity=similarity, capper=True, with_preference=True)

>>> recsys.recommend(5)
array([[ 5.        ,  3.45712869],
       [ 1.        ,  2.78857832],
       [ 6.        ,  2.38193068]])

>>> recsys.recommended_because(user_id=5, item_id=1)
array([[ 2. ,  3. ],
       [ 1. ,  3. ],
       [ 6. ,  3. ],
       [ 7. ,  2.5],
       [ 4. ,  2.5]])
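The slides list Item-Based filtering as implemented as well; below is a minimal sketch of how that variant might be wired up, assuming an ItemSimilarity class and an ItemBasedRecommender class mirror the user-based names above (those exact names are an assumption, they are not shown on these slides):

>>> # item-based variant -- class names assumed, mirroring the user-based API above
>>> from crab.similarities import ItemSimilarity
>>> from crab.recommenders.knn import ItemBasedRecommender
>>> item_similarity = ItemSimilarity(m, euclidean_distances)
>>> item_recsys = ItemBasedRecommender(model=m,
        similarity=item_similarity, with_preference=True)
>>> item_recsys.recommend(5)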
The current Crab




Using REST APIs to deploy the recommender
          django-piston, django-rest, django-tastypie
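None of the REST glue is shown on the slide; the sketch below is a plain Django view wrapping recsys.recommend(), just to illustrate the shape of such an endpoint. The view name, the URL wiring and the in-process recsys object are assumptions; django-piston or django-tastypie would replace the hand-rolled JSON.

# hypothetical Django view exposing the recommender built earlier in the talk
import json
from django.http import HttpResponse

def recommend_view(request, user_id):
    items = recsys.recommend(int(user_id))            # [[item_id, score], ...]
    payload = [{'item_id': int(i), 'score': float(s)} for i, s in items]
    return HttpResponse(json.dumps(payload), mimetype='application/json')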
Crab is already in production

   News from Abril Publisher recommendations!
                    Collecting over 10 magazines, 20 books and 100+ articles




  Running on Python
      + Scipy +
       Django

Content-Based-Filtering


Easy-to-use interface

  Still in development
Content Based Filtering

Items: Duro de Matar, O Vento Levou, Armagedon, Toy Store
User: Marcel
Items similar to the ones the user likes are recommended
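A framework-independent sketch of the idea above: describe each item by its own attributes and recommend the items closest to what the user already likes. The genre vectors below are made up for illustration; this is not Crab API.

# toy content-based step: items as genre vectors (action, romance, comedy), cosine similarity
import numpy as np

items = {
    'Duro de Matar': np.array([1.0, 0.0, 0.0]),
    'Armagedon':     np.array([0.9, 0.1, 0.0]),
    'O Vento Levou': np.array([0.0, 1.0, 0.1]),
    'Toy Store':     np.array([0.1, 0.1, 1.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

liked = 'Duro de Matar'
ranked = sorted(((cosine(items[liked], vec), name)
                 for name, vec in items.items() if name != liked), reverse=True)
print ranked[0]   # the item most similar to the one Marcel likes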
Crab is already in production

        PythonBrasil keynotes Recommender
               Recommending keynotes based on a hybrid approach




  Running on Python
      + Scipy +
       Django
Content-Based-Filtering
          +
Collaborative Filtering

   Schedule your
     keynotes

   Still in development
Crab is already in production

                    Hybrid Meta Approach
   A meta recommender that aggregates a content-based filter over the
   product/service repository with a collaborative filter over the user
   review repository, plus text mining of review polarity, targeted at a
   mobile recommendation application.
   (Fig. 1. Meta Recommender Architecture; Fig. 2. User Reviews from the
   Foursquare Social Network; Fig. 3. Mobile Recommender System Architecture)
Crab is already in production

  Brazilian Social Network called Atepassar.com
         Educational network with more than 60,000 students and 120 video-classes




     Running on Python
    + Numpy + Scipy and
          Django


Backend for Recommendations
MongoDB - mongoengine

   Daily Recommendations
    with Explanations
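A minimal sketch of how the MongoDB backend could store the daily recommendations with mongoengine; the document class and field names below are illustrative assumptions, not Atepassar's actual schema.

# illustrative mongoengine document for the daily recommendations with explanations
import datetime
from mongoengine import Document, StringField, FloatField, DateTimeField

class DailyRecommendation(Document):
    user_id = StringField(required=True)
    item_id = StringField(required=True)
    score = FloatField()
    explanation = StringField()                     # e.g. "users similar to you watched this"
    created_at = DateTimeField(default=datetime.datetime.utcnow)

# DailyRecommendation(user_id='42', item_id='video-class-10', score=4.2,
#                     explanation='users with similar activity watched this').save()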
Evaluating your recommender
 Crab implements the most used recommender metrics.
     Precision, Recall, F1-Score, RMSE



     Using matplotlib
     for a plotter utility

 Implement new metrics

Simulations support maybe (??)
Evaluating your recommender
>>> from crab.metrics.classes import CfEvaluator
>>> evaluator = CfEvaluator()

>>> evaluator.evaluate(recommender=recsys,metric='rmse')
   {'rmse': 0.69467177857026907}
>>> evaluator.evaluate_on_split(recommender=recsys, at=2)
({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568},
            {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788},
            {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}],
  'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall': 0.55677},
         {'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955},
         {'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]},
 {'final_score': {'avg': {'f1score': 0.495955,
                          'mae': 0.429292,
                          'nmae': 0.373739,
                          'precision': 0.63932929,
                          'recall': 0.729939393,
                          'rmse': 0.3466868},
                  'stdev': {'f1score': 0.09938383,
                            'mae': 0.0593933,
                            'nmae': 0.03393939,
                            'precision': 0.0192929,
                            'recall': 0.031293939,
                            'rmse': 0.234949494}}})
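The plotter utility mentioned above builds on matplotlib; here is a plain-matplotlib sketch of how the per-fold precision/recall numbers from this output could be charted (this is not the Crab utility itself):

# chart the 'ir' scores from the evaluate_on_split() output above
import matplotlib.pyplot as plt

folds = [1, 2, 3]
precision = [0.78557, 0.67865, 0.74744]
recall = [0.55677, 0.785955, 0.858585]

plt.plot(folds, precision, 'o-', label='precision')
plt.plot(folds, recall, 's-', label='recall')
plt.xlabel('fold')
plt.ylabel('score')
plt.title('Precision and Recall per fold (at=2)')
plt.legend(loc='best')
plt.show()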
Distributing the recommendation computations


Use Hadoop and Map-Reduce intensively
  Investigating the Yelp mrjob framework     https://github.com/pfig/mrjob



Implement the Netflix-winning and other state-of-the-art techniques
    Matrix Factorization, Singular Value Decomposition (SVD), Boltzmann machines



The most commonly used is the Slope One technique.
   Simple algebra: predictors of the form y = x + b (the slope is fixed at 1)
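A self-contained sketch of the Slope One idea, written against the {user: {item: rating}} dicts used earlier in the talk (a toy illustration, not the planned Crab implementation): average the rating differences between the target item and each co-rated item, then shift the user's own ratings by those differences.

# toy Slope One predictor over a {user: {item: rating}} dict
from collections import defaultdict

def slope_one_predict(ratings, user, target_item):
    diffs, counts = defaultdict(float), defaultdict(int)
    for user_ratings in ratings.values():             # deviations: target_item - j
        if target_item in user_ratings:
            for j, r in user_ratings.items():
                if j != target_item:
                    diffs[j] += user_ratings[target_item] - r
                    counts[j] += 1
    num = den = 0.0
    for j, r in ratings[user].items():                # shift the user's ratings by the deviations
        if j in counts:
            num += (r + diffs[j] / counts[j]) * counts[j]
            den += counts[j]
    return num / den if den else None

# >>> slope_one_predict(data.data, 5, 1)   # estimate user 5's rating for item 1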
Cache/Parallelism with joblib
                               http://packages.python.org/joblib/index.html


     from joblib import Memory
     memory = Memory(cachedir='', verbose=0)

     class UserSimilarity(BaseSimilarity):
         ...

         @memory.cache
         def get_similarity(self, source_id, target_id):
             source_preferences = self.model.preferences_from_user(source_id)
             target_preferences = self.model.preferences_from_user(target_id)
             ...
             return self.distance(source_preferences, target_preferences) \
                 if not source_preferences.shape[1] == 0 \
                    and not target_preferences.shape[1] == 0 else np.array([[np.nan]])

         def get_similarities(self, source_id):
             return [(other_id, self.get_similarity(source_id, other_id))
                     for other_id, v in self.model]


>>> #Without memory.cache                       >>> #With memory.cache
>>> timeit similarity.get_similarities           >>> timeit similarity.get_similarities
       ('marcel_caraciolo')                             ('marcel_caraciolo')
   100 loops, best of 3: 978 ms per loop             100 loops, best of 3: 434 ms per loop
Cache/Parallelism with joblib
                      http://packages.python.org/joblib/index.html




 Investigate how to use multiprocessing and parallel packages with similarities
                                  computation




    from joblib import Parallel, delayed
    ...

    def get_similarities(self, source_id):
        other_ids = [other_id for other_id, v in self.model]
        sims = Parallel(n_jobs=3)(delayed(self.get_similarity)(source_id, other_id)
                                  for other_id in other_ids)
        return zip(other_ids, sims)
Distributed Computing with mrJob
                          https://github.com/Yelp/mrjob


"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()


It supports Amazon's Elastic MapReduce (EMR) service, your own Hadoop cluster or
                                 local mode (for testing)
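Running such a job is a single command; assuming the script above is saved as word_freq.py (file name chosen here for illustration), the runner is picked with a flag:

   python word_freq.py input.txt              # local run, handy for testing
   python word_freq.py -r hadoop input.txt    # on your own Hadoop cluster
   python word_freq.py -r emr input.txt       # on Amazon Elastic MapReduce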
Distributed Computing with mrJob
                                         https://github.com/Yelp/mrjob

Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
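A rough sketch of how that pairwise idea could look with mrjob for ratings data: group users by the item they rated, emit every co-rating user pair, then sum the counts as a raw ingredient for a similarity score. The input format and the job below are assumptions for illustration (MRStep is available in current mrjob releases); this is not Crab code.

# toy two-step mrjob: co-rating counts per user pair, a building block for similarities
from itertools import combinations
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRCoRatedCounts(MRJob):
    # assumed input lines: "user_id<TAB>item_id<TAB>rating"
    def steps(self):
        return [MRStep(mapper=self.mapper_by_item, reducer=self.reducer_pairs),
                MRStep(reducer=self.reducer_sum)]

    def mapper_by_item(self, _, line):
        user, item, rating = line.split('\t')
        yield item, user

    def reducer_pairs(self, item, users):
        for u, v in combinations(sorted(users), 2):   # every pair that co-rated this item
            yield '%s|%s' % (u, v), 1

    def reducer_sum(self, pair, ones):
        yield pair, sum(ones)

if __name__ == '__main__':
    MRCoRatedCounts.run()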
Future studies with Sparse Matrices
     Real datasets come with lots of empty values
      http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html



   Solutions:

          scipy.sparse package

          Sharding operations

          Matrix Factorization
           techniques (SVD)




  Crab implements a Matrix
Factorization with Expectation
   Maximization algorithm
      scikits.crab.svd package
                                                      Apontador Reviews Dataset
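A minimal sketch of the scipy.sparse route using the {user: {item: rating}} dict loaded earlier (plain SciPy, not the scikits.crab.svd package):

# build a users x items sparse matrix; only the observed ratings are stored
import numpy as np
from scipy.sparse import dok_matrix

ratings = data.data                                   # from load_sample_movies() earlier
n_users = max(ratings)
n_items = max(max(r) for r in ratings.values())

M = dok_matrix((n_users, n_items), dtype=np.float64)
for user_id, user_ratings in ratings.items():
    for item_id, value in user_ratings.items():
        M[user_id - 1, item_id - 1] = value

M = M.tocsr()                                         # fast row slicing for per-user operations
print '%d stored ratings out of %d cells' % (M.nnz, n_users * n_items)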
Optimizations with Cython
                                                   http://cython.org/


Cython is a Python extension that lets developers annotate functions so they can be compiled to C.

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# for notes on compiler flags see:
# http://docs.python.org/install/index.html

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("spearman_correlation_cython",
                             ["spearman_correlation_cython.pyx"])]
)


                            http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
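After building in place (python setup.py build_ext --inplace), the compiled module imports like any other Python module. The sketch below assumes the .pyx file exposes a function named spearman_coefficient(x, y); that name is hypothetical, it is not shown on the slide.

# usage sketch; requires the extension above to be built in place first:
#   python setup.py build_ext --inplace
import numpy as np
from spearman_correlation_cython import spearman_coefficient   # hypothetical function name

x = np.array([3.0, 4.0, 3.5, 5.0, 3.0])
y = np.array([3.0, 4.0, 2.0, 3.0, 3.0])
print spearman_coefficient(x, y)    # rank correlation between two users' ratings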
Benchmarks

 Dataset                              Pure Python w/ dicts    Python w/ Scipy and Numpy
 MovieLens 100k                       15.32 s                 9.56 s
 http://www.grouplens.org/node/73

                                      Old Crab                New Crab


 Time elapsed (recommend 5 items)
Why migrate ?
The old Crab ran using pure Python only
     Recommendations demand heavy maths calculations and lots of processing

Compatible with Numpy and Scipy libraries
   High Standard and popular scientific libraries optimized for scientific calculations in Python

Scikits projects are amazing!
    Active Communities, Scientific Conferences and updated projects (e.g. scikit-learn)

Make the Crab framework visible to the community
 Join the scientific researchers and machine learning developers around the globe coding with
                                 Python to help us in this project


                              Be Fast and Furious
Why migrate ?



Numpy optimized with PyPy

     2.x - 48.x Faster



  http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html
How are we working ?
            Sprints, Online Discussions and Issues




https://github.com/muricoca/crab/wiki/UpcomingEvents
How are we working ?
      Our Project’s Home Page




http://muricoca.github.com/crab
Future Releases
       Planned Release 0.1
   Collaborative Filtering Algorithms working, sample datasets to load and test


       Planned Release 0.11
                Sparse Matrices and Database Models support


       Planned Release 0.12
                Slope One Algorithm, new factorization techniques implemented



....
Join us!

1. Read our Wiki Page
    https://github.com/muricoca/crab/wiki/Developer-Resources

2. Check out our current sprints and open issues
    https://github.com/muricoca/crab/issues

3. Fork the repo and send Pull Requests
4. Join us at irc.freenode.net #muricoca or at our
                     discussion list
                  http://groups.google.com/group/scikit-crab
Recommended Books




Toby Segaran, Programming Collective Intelligence, O'Reilly, 2007
Satnam Alag, Collective Intelligence in Action, Manning Publications, 2009



   ACM RecSys, KDD , SBSC...
Crab
              A Python Framework for Building
                  Recommendation Engines

           https://github.com/muricoca/crab

Marcel Caraciolo Ricardo Caspirro                            Bruno Melo
   @marcelcaraciolo           @ricardocaspirro                 @brunomelo

                      {marcel, ricardo,bruno}@muricoca.com

More Related Content

What's hot

Moose workshop (Ynon Perek)
Tom Critchlow - Data Feed SEO & Advanced Site Architecture (auexpo Conference)
OO Perl with Moose (Nelo Onyiah)
Moose talk at FOSDEM 2011 (Perl devroom) (xSawyer)
Writing and Sharing Great Modules with the Puppet Forge (Puppet)
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax (Lucidworks)

Others also liked

Apache Spark Machine Learning (Carol McDonald)
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f... (Varad Meru)
MLlib: Spark's Machine Learning Library (jeykottalam)
Collaborative Filtering using KNN (Şeyda Hatipoğlu)
Recommender Systems with Apache Spark's ALS Function (Will Johnson)
Collaborative Filtering and Recommender Systems By Navisro Analytics (Navisro Analytics)
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta... (Spark Summit)
Machine Learning using Apache Spark MLlib (IMC Institute)

Similar to Crab: A Python Framework for Building Recommender Systems

Introduction to Crab - Python Framework for Building Recommender Systems (Marcel Caraciolo)
Php Code Audits (PHP UK 2010) (Damien Seguy)
Symfony & Javascript. Combining the best of two worlds (Ignacio Martín)
Advanced Topics in Continuous Deployment (Mike Brittain)
Semantic search for Earth Observation products (Gasperi Jerome)
Solving the Riddle of Search: Using Sphinx with Rails (freelancing_god)
Machine Learning, Key to Your Classification Challenges (Marc Borowczak)
Architectural Tradeoff in Learning-Based Software (Pooyan Jamshidi)
CoffeeScript Design Patterns (TrevorBurnham)
Cheapass.in — presented at JSFoo 2016 (Aakash Goel)
Django's nasal passage (Erik Rose)
Socket applications (João Moura)
Why GC is eating all my CPU? (Roman Elizarov)
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin... (Matt Raible)
Automated release management - DevConFu 2014 (Kristoffer Deinoff)
Choosing JavaScript Libraries - Refresh-Detroit.org (Chris Lee)
Python在豆瓣的应用 (Qiangning Hong)
What's new in Puppet 3.0 (Eric Sorenson)
Monkeybars in the Manor (martinbtt)
Machine Learning with Apache Mahout (Daniel Glauser)

More from Marcel Caraciolo (20)

Como interpretar seu próprio genoma com Python
Joblib: Lightweight pipelining for parallel jobs (v2)
Construindo softwares de bioinformática para análises clínicas : Desafios e...
Como Python ajudou a automatizar o nosso laboratório v.2
Como Python pode ajudar na automação do seu laboratório
Python on Science ? Yes, We can.
Oficina Python: Hackeando a Web com Python 3
Recommender Systems with Ruby (adding machine learning, statistics, etc)
Opensource - Como começar e dá dinheiro ?
Big Data com Python
Benchy, python framework for performance benchmarking of Python Scripts
Python e 10 motivos por que devo conhece-la ?
Benchy: Lightweight framework for Performance Benchmarks
Python, A pílula Azul da programação
Construindo Soluções Científicas com Big Data & MapReduce
Como Python está mudando a forma de aprendizagem à distância no Brasil
Novas Tendências para a Educação a Distância: Como reinventar a educação ?
Aula WebCrawlers com Regex - PyCursos
Arquivos Zip com Python - Aula PyCursos
PyFoursquare: Python Library for Foursquare
 


Recently uploaded (20)

A Framework for Development in the AI Age (Cprime)
Potential of AI (Generative AI) in Business: Learnings and Insights (Ravi Sanghani)
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object... (AliaaTarek5)
So einfach geht modernes Roaming fuer Notes und Nomad.pdf (panagenda)
Rise of the Machines: Known As Drones... (Rick Flair)
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx (LoriGlavin3)
Testing tools and AI - ideas what to try with some tool examples (Kari Kakkonen)
Generative Artificial Intelligence: How generative AI works.pdf (Ingrid Airi González)
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx (LoriGlavin3)
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx (LoriGlavin3)
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci... (Wes McKinney)
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24 (Mark Goldstein)
Take control of your SAP testing with UiPath Test Suite (DianaGray10)
How AI, OpenAI, and ChatGPT impact business and software. (Curtis Poe)
Time Series Foundation Models - current state and future directions (Nathaniel Shimoni)
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx (LoriGlavin3)
Manual 508 Accessibility Compliance Audit (Skynet Technologies)
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy (TrustArc)
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx (LoriGlavin3)
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024 (BookNet Canada)
 


Crab: A Python Framework for Building Recommender Systems

  • 1. Crab A Python Framework for Building Recommendation Engines PythonBrasil 2011, São Paulo, SP Marcel Caraciolo Ricardo Caspirro Bruno Melo @marcelcaraciolo @ricardocaspirro @brunomelo
  • 2. What is Crab ? A python framework for building recommendation engines A Scikit module for collaborative, content and hybrid filtering Mahout Alternative for Python Developers :D Open-Source under the BSD license https://github.com/muricoca/crab
  • 3. When started ? It began one year ago Community-driven, 4 members Since April,2011 the open-source labs Muriçoca incorporated it Since April,2011 rewritting it as Scikit https://github.com/muricoca/
  • 4. Knowing Scikits Scikits are Scipy Toolkits - independent and projects hosted under a common namespace. Scikits Image Scikits MlabWrap Scikits AudioLab Scikit Learn .... http://scikits.appspot.com/scikits
  • 5. Knowing Scikits Scikit-Learn Machine Learning Algorithms + scientific Python packages (Numpy, Scipy and Matplotlib) http://scikit-learn.sourceforge.net/ Our goal: Incorporate the Crab as Scikit and incorporate some parts of them at Scikit-learn
  • 6. Why Recommendations ? The world is an over-crowded place
  • 7. Why Recommendations ? We are overloaded Thousands of news articles and blog posts each day Millions of movies, books and music tracks online Several Places, Offers and Events Even Friends sometimes we are overloaded !
  • 8. Why Recommendations ? We really need and consume only a few of them! “A lot of times, people don’t know what they want until you show it to them.” Steve Jobs “We are leaving the Information age, and entering into the Recommendation age.” Chris Anderson, from book Long Tail
  • 9. Why Recommendations ? Can Google help ? Yes, but only when we really know what we are looking for But, what’s does it mean by “interesting” ? Can Facebook help ? Yes, I tend to find my friends’ stuffs interesting What if i had only few friends and what they like do not always attract me ? Can experts help ? Yes, but it won’t scale well. But it is what they like, not me! Exactly same advice!
  • 10. Why Recommendations ? Recommendation Systems Systems designed to recommend to me something I may like
  • 11. Why Recommendations ? Recommendation Systems Graph Representation (diagram of users linked to the items they rate)
  • 12. The current Crab Collaborative Filtering algorithms User-Based, Item-Based and Factorization Matrix (SVD) Evaluation of the Recommender Algorithms Precision, Recall, F1-Score, RMSE Precision-Recall Charts
  • 13. The current Crab Precision-Recall Charts
  • 14. Collaborative Filtering (diagram: users Marcel, Rafael and Amanda; items O Vento Levou, Toy Store, Thor and Armagedon; a user similar to Marcel likes an item, so the system recommends it to him)
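A worked form of what the diagram above conveys: in user-based collaborative filtering, the predicted preference of user u for item i is typically a similarity-weighted average of the ratings given by u's neighbours. One common formulation (one of several; the exact weighting inside Crab may differ) is:

    \hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} s(u,v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u)} |s(u,v)|}

where N(u) is the set of neighbours of u who rated item i, s(u,v) is a user-user similarity and \bar{r}_u is the mean rating of user u.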
  • 17. The current Crab >>>#load the dataset >>> from crab.datasets import load_sample_movies
  • 18. The current Crab >>>#load the dataset >>> from crab.datasets import load_sample_movies >>> data = load_sample_movies()
  • 19. The current Crab >>>#load the dataset >>> from crab.datasets import load_sample_movies >>> data = load_sample_movies() >>> data
  • 20. The current Crab >>>#load the dataset >>> from crab.datasets import load_sample_movies >>> data = load_sample_movies() >>> data {'DESCR': 'sample_movies data set was collected by the book called \nProgramming the Collective Intelligence by Toby Segaran \n\nNotes\n-----\nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.',  'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},   2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},   3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},   4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},   5: {2: 4.5, 3: 1.0, 4: 4.0},   6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},   7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}},  'item_ids': {1: 'Lady in the Water',   2: 'Snakes on a Planet',   3: 'You, Me and Dupree',   4: 'Superman Returns',   5: 'The Night Listener',   6: 'Just My Luck'},  'user_ids': {1: 'Jack Matthews',   2: 'Mick LaSalle',   3: 'Claudia Puig',   4: 'Lisa Rose',   5: 'Toby',   6: 'Gene Seymour',   7: 'Michael Phillips'}}
  • 22. The current Crab >>> from crab.models import MatrixPreferenceDataModel
  • 23. The current Crab >>> from crab.models import MatrixPreferenceDataModel >>> m = MatrixPreferenceDataModel(data.data)
  • 24. The current Crab >>> from crab.models import MatrixPreferenceDataModel >>> m = MatrixPreferenceDataModel(data.data) >>> print m MatrixPreferenceDataModel (7 by 6)          1 2 3 4 5 ... 1 3.000000 4.000000 3.500000 5.000000 3.000000 2 3.000000 4.000000 2.000000 3.000000 3.000000 3 --- 3.500000 2.500000 4.000000 4.500000 4 2.500000 3.500000 2.500000 3.500000 3.000000 5 --- 4.500000 1.000000 4.000000 --- 6 3.000000 3.500000 3.500000 5.000000 3.000000 7 2.500000 3.000000 --- 3.500000 4.000000
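The model printed above is just the dict-of-dicts returned by load_sample_movies laid out as a user-by-item matrix, with missing preferences marked. A minimal sketch of that conversion in plain numpy (a hypothetical helper for illustration; MatrixPreferenceDataModel does the equivalent internally):

    import numpy as np

    def to_dense_matrix(prefs):
        """Turn {user_id: {item_id: rating}} into (matrix, user_ids, item_ids); NaN marks a missing rating."""
        user_ids = sorted(prefs)
        item_ids = sorted({item for ratings in prefs.values() for item in ratings})
        matrix = np.empty((len(user_ids), len(item_ids)))
        matrix.fill(np.nan)
        for row, user in enumerate(user_ids):
            for col, item in enumerate(item_ids):
                if item in prefs[user]:
                    matrix[row, col] = prefs[user][item]
        return matrix, user_ids, item_ids

    # usage with the sample data from the previous slides:
    # matrix, users, items = to_dense_matrix(data.data)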
  • 26. The current Crab >>> #import pairwise distance
  • 27. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances
  • 28. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity
  • 29. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity
  • 30. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity >>> similarity = UserSimilarity(m, euclidean_distances)
  • 31. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity >>> similarity = UserSimilarity(m, euclidean_distances) >>> similarity[1]
  • 32. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity >>> similarity = UserSimilarity(m, euclidean_distances) >>> similarity[1] [(1, 1.0), (6, 0.66666666666666663), (4, 0.34054242658316669), (3, 0.32037724101704074), (7, 0.32037724101704074), (2, 0.2857142857142857), (5, 0.2674788903885893)]
  • 33. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity >>> similarity = UserSimilarity(m, euclidean_distances) >>> similarity[1] [(1, 1.0), (6, 0.66666666666666663), (4, 0.34054242658316669), (3, 0.32037724101704074), (7, 0.32037724101704074), (2, 0.2857142857142857), (5, 0.2674788903885893)] (shown next to the MatrixPreferenceDataModel (7 by 6) preference matrix from slide 24)
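The scores above are consistent with a simple distance-to-similarity transform: users whose ratings on shared items are close end up near 1.0. A rough sketch of that idea, assuming a 1 / (1 + euclidean distance) mapping over co-rated items (which matches the numbers shown, e.g. for users 1 and 6; the exact code path in crab.metrics.pairwise is not reproduced here):

    import numpy as np

    def euclidean_similarity(prefs, user_a, user_b):
        """Similarity in (0, 1]: 1 / (1 + euclidean distance over the items both users rated)."""
        shared = set(prefs[user_a]) & set(prefs[user_b])
        if not shared:
            return 0.0
        diffs = np.array([prefs[user_a][item] - prefs[user_b][item] for item in shared])
        return 1.0 / (1.0 + np.sqrt((diffs ** 2).sum()))

    # euclidean_similarity(data.data, 1, 6) gives 0.666..., as in the output above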
  • 35. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender
  • 36. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True,with_preference=True)
  • 37. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True,with_preference=True) >>> recsys.recommend(5) array([[ 5. , 3.45712869],        [ 1. , 2.78857832],        [ 6. , 2.38193068]])
  • 38. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True,with_preference=True) >>> recsys.recommend(5) array([[ 5. , 3.45712869],        [ 1. , 2.78857832],        [ 6. , 2.38193068]]) >>> recsys.recommended_because(user_id=5,item_id=1) array([[ 2. , 3. ],        [ 1. , 3. ],        [ 6. , 3. ],        [ 7. , 2.5],        [ 4. , 2.5]])
  • 39. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True, with_preference=True) >>> recsys.recommend(5) array([[ 5. , 3.45712869],        [ 1. , 2.78857832],        [ 6. , 2.38193068]]) >>> recsys.recommended_because(user_id=5, item_id=1) array([[ 2. , 3. ],        [ 1. , 3. ],        [ 6. , 3. ],        [ 7. , 2.5],        [ 4. , 2.5]]) (shown next to the MatrixPreferenceDataModel (7 by 6) preference matrix from slide 24)
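A stripped-down sketch of what a user-based recommend step does with the pieces built so far: score every item the target user has not rated by a similarity-weighted average of the neighbours' ratings, then rank. Illustrative only, using a plain weighted average (without the mean-centering shown in the note after slide 14); the similarity argument is any callable returning [(user_id, score), ...] standing in for Crab's UserSimilarity, and this is not a copy of UserBasedRecommender's internals:

    def recommend(prefs, similarity, user, how_many=3):
        """Rank the items the user has not rated by similarity-weighted neighbour ratings."""
        scores, weights = {}, {}
        for other, sim in similarity(prefs, user):
            if other == user or sim <= 0:
                continue
            for item, rating in prefs[other].items():
                if item in prefs[user]:
                    continue  # only predict unseen items
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
        ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
        return [(item, round(score, 4)) for score, item in ranked[:how_many]]

    # usage sketch, reusing euclidean_similarity from the note after slide 33:
    # sims = lambda prefs, u: [(v, euclidean_similarity(prefs, u, v)) for v in prefs if v != u]
    # recommend(data.data, sims, 5)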
  • 40. The current Crab Using REST APIs to deploy the recommender django-piston, django-rest, django-tastypie
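None of those frameworks is strictly required just to expose a recommender over HTTP; a plain Django view is enough for a first cut. A minimal hand-rolled sketch (hypothetical module, view and URL names, not the django-piston or django-tastypie API; recsys stands for the recommender built in the earlier slides):

    # views.py -- plain Django, no extra REST framework
    import json
    from django.http import HttpResponse
    from myapp.engine import recsys  # hypothetical module holding the UserBasedRecommender

    def recommendations(request, user_id):
        """Return the top-N recommendations for a user as JSON."""
        items = recsys.recommend(int(user_id))  # rows of [item_id, score]
        payload = [{"item_id": int(item), "score": float(score)} for item, score in items]
        return HttpResponse(json.dumps(payload), content_type="application/json")

    # urls.py (sketch)
    # from django.conf.urls import url
    # urlpatterns = [url(r'^recommend/(?P<user_id>\d+)/$', recommendations)]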
  • 41. Crab is already in production News recommendations for the publisher Abril! Collecting over 10 magazines, 20 books and 100+ articles Running on Python + Scipy + Django Content-Based Filtering Easy-to-use interface Still in development
  • 42. Content-Based Filtering (diagram: user Marcel likes an item, and similar items such as Duro de Matar, O Vento Levou, Toy Store and Armagedon are recommended to him)
  • 43. Crab is already in production PythonBrasil keynote recommender Recommending keynotes based on a hybrid approach Running on Python + Scipy + Django Content-Based Filtering + Collaborative Filtering Schedule your keynotes Still in development
  • 44. Crab is already in production (slide shows an excerpt from a research paper on a hybrid meta recommender architecture: content-based filtering over product/service catalogues combined with collaborative filtering over user reviews collected from location-based services such as Foursquare and Google HotPot, delivering mobile product recommendations together with explanations; figures referenced: Meta Recommender Architecture, User Reviews from Foursquare Social Network, Mobile Recommender System Architecture)
  • 45. Crab is already in production Brazilian social network called Atepassar.com Educational network with more than 60,000 students and 120 video classes Running on Python + Numpy + Scipy and Django Backend for Recommendations MongoDB - mongoengine Daily Recommendations with Explanations
  • 46. Evaluating your recommender Crab implements the most commonly used recommender metrics: Precision, Recall, F1-Score, RMSE Using matplotlib for a plotter utility Implement new metrics Simulations support maybe (??)
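For reference, the metrics listed above in their standard definitions (textbook formulations, not a transcription of Crab's code), with R the set of recommended items, T the set of relevant ones, and \hat{r}_{u,i} the prediction for the observed rating r_{u,i}:

    \mathrm{Precision} = \frac{|R \cap T|}{|R|} \qquad
    \mathrm{Recall} = \frac{|R \cap T|}{|T|} \qquad
    F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

    \mathrm{RMSE} = \sqrt{\frac{1}{|D|} \sum_{(u,i) \in D} \left(\hat{r}_{u,i} - r_{u,i}\right)^2}

where D is the set of user-item pairs held out for testing.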
  • 48. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator
  • 49. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator()
  • 50. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse')
  • 51. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse') {'rmse': 0.69467177857026907}
  • 52. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse') {'rmse': 0.69467177857026907} >>> evaluator.evaluate_on_split(recommender=recsys, at =2)
  • 53. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse') {'rmse': 0.69467177857026907} >>> evaluator.evaluate_on_split(recommender=recsys, at =2) ({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568}, {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}, {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}], 'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall':0.55677}, {'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955}, {'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]}, {'final_score': {'avg': {'f1score': 0.495955, 'mae': 0.429292, 'nmae': 0.373739, 'precision': 0.63932929, 'recall': 0.729939393, 'rmse': 0.3466868}, 'stdev': {'f1score': 0.09938383 , 'mae': 0.0593933, 'nmae': 0.03393939, 'precision': 0.0192929, 'recall': 0.031293939, 'rmse': 0.234949494}}})
  • 54. Distributing the recommendation computations Use Hadoop and Map-Reduce intensively Investigating the Yelp mrjob framework https://github.com/pfig/mrjob Implement the state-of-the-art techniques popularized by the Netflix Prize: Matrix Factorization, Singular Value Decomposition (SVD), Boltzmann machines Another commonly used technique is Slope One: simple algebra, a predictor of the form y = a*x + b with the slope a fixed to 1
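Slope One really is that one-line algebra: for every pair of items, learn the average rating difference, then predict a user's rating for an unseen item from the items they already rated plus that difference. A compact weighted Slope One sketch (illustrative, not the implementation planned for Crab):

    from collections import defaultdict

    def slope_one_deviations(prefs):
        """Average rating difference dev[i][j] and support freq[i][j] for every pair of co-rated items."""
        freq = defaultdict(lambda: defaultdict(int))
        dev = defaultdict(lambda: defaultdict(float))
        for ratings in prefs.values():
            for i, r_i in ratings.items():
                for j, r_j in ratings.items():
                    if i == j:
                        continue
                    dev[i][j] += r_i - r_j
                    freq[i][j] += 1
        for i in dev:
            for j in dev[i]:
                dev[i][j] /= freq[i][j]
        return dev, freq

    def slope_one_predict(prefs, dev, freq, user, item):
        """Predict a rating as the support-weighted mean of (r_j + dev[item][j]) over the user's items."""
        num = den = 0.0
        for j, r_j in prefs[user].items():
            if j == item or freq[item][j] == 0:
                continue
            num += (r_j + dev[item][j]) * freq[item][j]
            den += freq[item][j]
        return num / den if den else None

    # dev, freq = slope_one_deviations(data.data); slope_one_predict(data.data, dev, freq, 5, 1)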
  • 55. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html from joblib import Memory memory = Memory(cachedir=’’, verbose=0) class UserSimilarity(BaseSimilarity):     ...     @memory.cache  def get_similarity(self, source_id, target_id):          source_preferences = self.model.preferences_from_user(source_id)          target_preferences = self.model.preferences_from_user(target_id) ...         return self.distance(source_preferences, target_preferences)             if not source_preferences.shape[1] == 0                 and not target_preferences.shape[1] == 0 else np.array([[np.nan]]) def get_similarities(self, source_id):         return[(other_id, self.get_similarity(source_id, other_id)) for other_id, v in self.model]
  • 56. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html from joblib import Memory memory = Memory(cachedir=’’, verbose=0) class UserSimilarity(BaseSimilarity):     ...     @memory.cache  def get_similarity(self, source_id, target_id):          source_preferences = self.model.preferences_from_user(source_id)          target_preferences = self.model.preferences_from_user(target_id) ...         return self.distance(source_preferences, target_preferences)             if not source_preferences.shape[1] == 0                 and not target_preferences.shape[1] == 0 else np.array([[np.nan]]) def get_similarities(self, source_id):         return[(other_id, self.get_similarity(source_id, other_id)) for other_id, v in self.model] >>> #Without memory.cache
  • 57. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html from joblib import Memory memory = Memory(cachedir=’’, verbose=0) class UserSimilarity(BaseSimilarity):     ...     @memory.cache  def get_similarity(self, source_id, target_id):          source_preferences = self.model.preferences_from_user(source_id)          target_preferences = self.model.preferences_from_user(target_id) ...         return self.distance(source_preferences, target_preferences)             if not source_preferences.shape[1] == 0                 and not target_preferences.shape[1] == 0 else np.array([[np.nan]]) def get_similarities(self, source_id):         return[(other_id, self.get_similarity(source_id, other_id)) for other_id, v in self.model] >>> #Without memory.cache >>># With memory.cache
  • 58. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html from joblib import Memory memory = Memory(cachedir=’’, verbose=0) class UserSimilarity(BaseSimilarity):     ...     @memory.cache  def get_similarity(self, source_id, target_id):          source_preferences = self.model.preferences_from_user(source_id)          target_preferences = self.model.preferences_from_user(target_id) ...         return self.distance(source_preferences, target_preferences)             if not source_preferences.shape[1] == 0                 and not target_preferences.shape[1] == 0 else np.array([[np.nan]]) def get_similarities(self, source_id):         return[(other_id, self.get_similarity(source_id, other_id)) for other_id, v in self.model] >>> #Without memory.cache >>># With memory.cache >>> timeit similarity.get_similarities (‘marcel_caraciolo’)
  • 59. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html from joblib import Memory memory = Memory(cachedir=’’, verbose=0) class UserSimilarity(BaseSimilarity):     ...     @memory.cache  def get_similarity(self, source_id, target_id):          source_preferences = self.model.preferences_from_user(source_id)          target_preferences = self.model.preferences_from_user(target_id) ...         return self.distance(source_preferences, target_preferences)             if not source_preferences.shape[1] == 0                 and not target_preferences.shape[1] == 0 else np.array([[np.nan]]) def get_similarities(self, source_id):         return[(other_id, self.get_similarity(source_id, other_id)) for other_id, v in self.model] >>> #Without memory.cache >>># With memory.cache >>> timeit similarity.get_similarities >>> timeit similarity.get_similarities (‘marcel_caraciolo’) (‘marcel_caraciolo’)
  • 60. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html from joblib import Memory memory = Memory(cachedir=’’, verbose=0) class UserSimilarity(BaseSimilarity):     ...     @memory.cache  def get_similarity(self, source_id, target_id):          source_preferences = self.model.preferences_from_user(source_id)          target_preferences = self.model.preferences_from_user(target_id) ...         return self.distance(source_preferences, target_preferences)             if not source_preferences.shape[1] == 0                 and not target_preferences.shape[1] == 0 else np.array([[np.nan]]) def get_similarities(self, source_id):         return[(other_id, self.get_similarity(source_id, other_id)) for other_id, v in self.model] >>> #Without memory.cache >>># With memory.cache >>> timeit similarity.get_similarities >>> timeit similarity.get_similarities (‘marcel_caraciolo’) (‘marcel_caraciolo’) 100 loops, best of 3: 978 ms per loop
  • 61. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html from joblib import Memory memory = Memory(cachedir=’’, verbose=0) class UserSimilarity(BaseSimilarity):     ...     @memory.cache  def get_similarity(self, source_id, target_id):          source_preferences = self.model.preferences_from_user(source_id)          target_preferences = self.model.preferences_from_user(target_id) ...         return self.distance(source_preferences, target_preferences)             if not source_preferences.shape[1] == 0                 and not target_preferences.shape[1] == 0 else np.array([[np.nan]]) def get_similarities(self, source_id):         return[(other_id, self.get_similarity(source_id, other_id)) for other_id, v in self.model] >>> #Without memory.cache >>># With memory.cache >>> timeit similarity.get_similarities >>> timeit similarity.get_similarities (‘marcel_caraciolo’) (‘marcel_caraciolo’) 100 loops, best of 3: 978 ms per loop 100 loops, best of 3: 434 ms per loop
  • 62. Cache/Paralelism with joblib http://packages.python.org/joblib/index.html Investigate how to use multiprocessing and parallel packages with similarities computation from joblib import Parallel ... def get_similarities(self, source_id):         return Parallel(n_jobs=3) ((other_id, delayed(self.get_similarity) (source_id, other_id)) for other_id, v in self.model)
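The Parallel line above is still exploratory and would not run as written: joblib's Parallel expects an iterable of delayed(...) calls, not (id, delayed) tuples, and Memory(cachedir=’’) uses curly quotes. A corrected, self-contained sketch of the same caching-plus-parallelism idea, using plain functions instead of methods so they pickle cleanly (names are illustrative, not Crab's API, and the cachedir keyword follows the joblib 0.x releases shown on the slide):

    import numpy as np
    from joblib import Memory, Parallel, delayed

    memory = Memory(cachedir='/tmp/crab_cache', verbose=0)

    @memory.cache
    def get_similarity(prefs, source_id, target_id):
        """Cached 1/(1 + euclidean) similarity over co-rated items, as in the note after slide 33."""
        shared = set(prefs[source_id]) & set(prefs[target_id])
        if not shared:
            return np.nan
        diffs = np.array([prefs[source_id][i] - prefs[target_id][i] for i in shared])
        return 1.0 / (1.0 + np.sqrt((diffs ** 2).sum()))

    def get_similarities(prefs, source_id, n_jobs=3):
        """Fan the pairwise computations out over n_jobs worker processes."""
        others = [other for other in prefs if other != source_id]
        scores = Parallel(n_jobs=n_jobs)(
            delayed(get_similarity)(prefs, source_id, other) for other in others)
        return zip(others, scores)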
  • 63. Distributed Computing with mrJob https://github.com/Yelp/mrjob
  • 64. Distributed Computing with mrJob https://github.com/Yelp/mrjob It supports Amazon’s Elastic MapReduce(EMR) service, your own Hadoop cluster or local (for testing)
  • 65. Distributed Computing with mrJob https://github.com/Yelp/mrjob It supports Amazon’s Elastic MapReduce(EMR) service, your own Hadoop cluster or local (for testing)
  • 66. Distributed Computing with mrJob https://github.com/Yelp/mrjob """The classic MapReduce job: count the frequency of words.""" from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob):     def mapper(self, _, line):         for word in WORD_RE.findall(line):             yield (word.lower(), 1)     def reducer(self, word, counts):         yield (word, sum(counts)) if __name__ == '__main__':     MRWordFreqCount.run() It supports Amazon's Elastic MapReduce (EMR) service, your own Hadoop cluster or local (for testing)
  • 67. Distributed Computing with mrJob https://github.com/Yelp/mrjob Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
  • 68. Distributed Computing with mrJob https://github.com/Yelp/mrjob Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
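As a flavour of how the Elsayed et al. idea carries over to mrjob: group ratings by user, emit every pair of items that user co-rated, then sum the pair counts so a later step can turn co-occurrence into similarity. A toy two-step job, assuming tab-separated 'user item rating' input lines (a simplification, not the paper's full algorithm):

    from itertools import combinations
    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRCoRatedItems(MRJob):
        """Count, for every pair of items, how many users rated both."""

        def steps(self):
            return [MRStep(mapper=self.mapper_get_items, reducer=self.reducer_pair_items),
                    MRStep(reducer=self.reducer_sum_pairs)]

        def mapper_get_items(self, _, line):
            user, item, rating = line.split('\t')
            yield user, item

        def reducer_pair_items(self, user, items):
            # every pair of items this user rated co-occurs once
            for pair in combinations(sorted(items), 2):
                yield pair, 1

        def reducer_sum_pairs(self, pair, counts):
            yield pair, sum(counts)

    if __name__ == '__main__':
        MRCoRatedItems.run()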
  • 69. Future studies with Sparse Matrices Real datasets come with lots of empty values http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html Solutions: scipy.sparse package Sharding operations Matrix Factorization techniques (SVD) Apontador Reviews Dataset
  • 70. Future studies with Sparse Matrices Real datasets come with lots of empty values http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html Solutions: scipy.sparse package Sharding operations Matrix Factorization techniques (SVD) Crab implements a Matrix Factorization with Expectation Maximization algorithm Apontador Reviews Dataset
  • 71. Future studies with Sparse Matrices Real datasets come with lots of empty values http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html Solutions: scipy.sparse package Sharding operations Matrix Factorization techniques (SVD) Crab implements a Matrix Factorization with Expectation Maximization algorithm scikits.crab.svd package Apontador Reviews Dataset
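A small sketch of the two ideas above: keep the ratings in a scipy.sparse matrix rather than a dense array, and use a truncated SVD to get a low-rank reconstruction whose filled-in cells can be ranked as recommendations. Illustrative only; Crab's scikits.crab.svd package uses its own Matrix Factorization with Expectation Maximization, which this does not reproduce:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    # toy ratings: rows = users, columns = items, 0.0 = missing
    dense = np.array([[5.0, 3.0, 0.0, 1.0],
                      [4.0, 0.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0, 5.0],
                      [0.0, 1.0, 5.0, 4.0]])
    ratings = csr_matrix(dense)

    # rank-2 truncated SVD: ratings ~ U * diag(s) * Vt
    u, s, vt = svds(ratings, k=2)
    reconstructed = np.dot(u, np.dot(np.diag(s), vt))

    # the reconstruction has a value everywhere, including the previously empty cells
    print(np.round(reconstructed, 2))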
  • 72. Optimizations with Cython http://cython.org/ Cython is a Python extension that lets developers annotate functions so they can be compiled to C. http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
  • 73. Optimizations with Cython http://cython.org/ Cython is a Python extension that lets developers annotate functions so they can be compiled to C. # setup.py from distutils.core import setup from distutils.extension import Extension from Cython.Distutils import build_ext # for notes on compiler flags see: # http://docs.python.org/install/index.html setup( cmdclass = {'build_ext': build_ext}, ext_modules = [Extension("spearman_correlation_cython", ["spearman_correlation_cython.pyx"])] ) http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
  • 74. Optimizations with Cython http://cython.org/ Cython is a Python extension that lets developers annotate functions so they can be compiled to C. # setup.py from distutils.core import setup from distutils.extension import Extension from Cython.Distutils import build_ext # for notes on compiler flags see: # http://docs.python.org/install/index.html setup( cmdclass = {'build_ext': build_ext}, ext_modules = [Extension("spearman_correlation_cython", ["spearman_correlation_cython.pyx"])] ) http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
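Cython compiles plain Python as-is; the speedups come from adding cdef type annotations afterwards. A pure-Python version of a Spearman correlation that the setup.py above could compile (hypothetical contents for spearman_correlation_cython.pyx; the module's real code is not shown on the slides, and ties are not rank-averaged here):

    # spearman_correlation_cython.pyx -- valid plain Python; add cdef types for the real speedup
    import numpy as np

    def spearman_correlation(x, y):
        """Spearman rank correlation: Pearson correlation computed on the ranks of x and y."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        rank_x = np.argsort(np.argsort(x)).astype(float)
        rank_y = np.argsort(np.argsort(y)).astype(float)
        rank_x -= rank_x.mean()
        rank_y -= rank_y.mean()
        denom = np.sqrt((rank_x ** 2).sum() * (rank_y ** 2).sum())
        return float((rank_x * rank_y).sum() / denom) if denom else 0.0

    # build: python setup.py build_ext --inplace, then import spearman_correlation_cython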
  • 75. Benchmarks (table: MovieLens 100k dataset, http://www.grouplens.org/node/73; Old Crab, pure Python with dicts: 15.32 s; New Crab, Python with Scipy and Numpy: 9.56 s)
  • 76. Benchmarks (same table as slide 75, plus a bar chart of the time elapsed to recommend 5 items, 0-16 s scale, comparing Old Crab and New Crab)
  • 77. Benchmarks (same table and chart)
  • 78. Benchmarks (same table and chart)
  • 79. Why migrate ? The old Crab ran on pure Python only, while recommendations demand heavy math and lots of processing Compatible with the Numpy and Scipy libraries High-standard, popular scientific libraries optimized for numerical computation in Python Scikits projects are amazing! Active communities, scientific conferences and actively maintained projects (e.g. scikit-learn) Make the Crab framework visible to the community Join the scientific researchers and machine learning developers around the globe coding with Python to help us in this project Be Fast and Furious
  • 80. Why migrate ? Numpy optimized with PyPy: 2x - 48x faster http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html
  • 81. How are we working ? Sprints, Online Discussions and Issues https://github.com/muricoca/crab/wiki/UpcomingEvents
  • 82. How are we working ? Our Project’s Home Page http://muricoca.github.com/crab
  • 83. Future Releases Planned Release 0.1 Collaborative Filtering Algorithms working, sample datasets to load and test Planned Release 0.11 Sparse Matrices and Database Models support Planned Release 0.12 Slope One Algorithm, new factorization techniques implemented ....
  • 84. Join us! 1. Read our Wiki Page https://github.com/muricoca/crab/wiki/Developer-Resources 2. Check out our current sprints and open issues https://github.com/muricoca/crab/issues 3. Forks, Pull Requests mandatory 4. Join us at irc.freenode.net #muricoca or at our discussion list http://groups.google.com/group/scikit-crab
  • 85. Recommended Books Toby Segaran, Programming Collective Intelligence, O'Reilly, 2007 Satnam Alag, Collective Intelligence in Action, Manning Publications, 2009 ACM RecSys, KDD, SBSC...
  • 86. Crab A Python Framework for Building Recommendation Engines https://github.com/muricoca/crab Marcel Caraciolo Ricardo Caspirro Bruno Melo @marcelcaraciolo @ricardocaspirro @brunomelo {marcel, ricardo,bruno}@muricoca.com