Sentiment Analysis of Texts

Ekaterina Alekseevna Vylomova: evylomova@gmail.com

Moscow, 2013

Part 1. Experience Project
Description

   Experience Project is a site where users share their personal stories.
   Readers can tag each story with one of five reaction categories and leave a comment.

   Example: I really hate being shy ... I just want to be able to talk to someone about anything and
   everything and be myself ... That's all I've ever wanted.

         Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0
         Author age: 21
         Author gender: female
         Text group: friends

Part 1. Data

   Let's load the data:

   ep = read.csv('ep3-context.csv')

   Here:
   Count is the number of occurrences of the given word in the corresponding Category and Group,
   for the given Gender and Age.
   Total is the total number of words in the corresponding Category and Group, for the given
   Gender and Age.

Part 1. Working with the data

   The values of any of the parameters can be listed:

   levels(ep$Word)

Part 1. Words and categories

   Let's see how words relate to categories:

   funny = epCollapsedFrame(ep, 'funny')
   plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')

Part 1. Words and categories

   Strange, isn't it? So what went wrong?

Part 1. Words and categories

   The counts must be normalized by category size!

   funny$Count / funny$Total
   funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
   plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')

Part 1. Words and categories

   Much better!

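To see why raw counts mislead, here is a minimal sketch on made-up numbers (the counts are hypothetical and are not taken from ep3-context.csv). A category that simply contains more text collects more occurrences of any word, so only the ratio Count/Total is comparable across categories.

```r
# Hypothetical counts for one word across the five reaction categories.
toy <- data.frame(
  Category = c("hugs", "rock", "teehee", "understand", "wow"),
  Count    = c(40, 30, 60, 200, 20),      # occurrences of the word
  Total    = c(1e5, 5e4, 2e4, 1e6, 4e4)   # all words in the category
)

# Relative frequency: an estimate of P(word | category).
toy$Freq <- toy$Count / toy$Total

# The category with the largest raw count is not the one with the
# largest relative frequency.
as.character(toy$Category[which.max(toy$Count)])  # "understand"
as.character(toy$Category[which.max(toy$Freq)])   # "teehee"
```

In this toy table "understand" wins on raw counts only because it holds far more text than the other categories.
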
Part 1. On to probability theory

   Freq corresponds to the conditional probability P(word|category). These values are very small,
   so let's compute P(category|word) instead (everyone remembers Bayes' rule? :-) ).

   funny$Freq / sum(funny$Freq)
   funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
   plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total)/sum(Count/Total)', main='funny')

   Question: which other words might characterize one of the categories well?

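Why dividing Freq by its sum gives P(category|word) can be spelled out with Bayes' rule. The derivation below is a sketch that assumes a uniform prior over the K categories; that prior is an assumption made here for illustration, since the slide does not state one.

```latex
P(c \mid w) = \frac{P(w \mid c)\,P(c)}{\sum_{c'} P(w \mid c')\,P(c')}
\qquad\text{and, with } P(c) = \tfrac{1}{K} \text{ for every } c,\qquad
P(c \mid w) = \frac{P(w \mid c)}{\sum_{c'} P(w \mid c')}
            = \frac{\mathrm{Freq}_c}{\sum_{c'} \mathrm{Freq}_{c'}}
```

The constant prior cancels between numerator and denominator, which is why no term for category size remains.
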
Part 1. Comparing the values with the expected probability

   funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)

   How is the expected value computed?

   category.probs = funny$Total / sum(funny$Total)
   funny.count = sum(funny$Count)
   funny.expected = funny.count * category.probs
   funny.expected

   Now look at the ratio of observed to expected counts:

   (funny$Count / funny.expected) - 1

   If it is below 0, the word is under-represented in the category; if it is above 0, the word is
   over-represented.

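The observed/expected comparison can be checked by hand on a small table. This is a sketch on hypothetical counts, chosen so that the arithmetic is easy to follow; it uses only base R and none of the ep helper functions.

```r
# Hypothetical counts; Count is the observed number of occurrences.
toy <- data.frame(
  Category = c("hugs", "rock", "teehee", "understand", "wow"),
  Count    = c(10, 20, 30, 20, 20),
  Total    = c(1000, 2000, 1000, 4000, 2000)
)

# Share of all text that falls into each category.
category.probs <- toy$Total / sum(toy$Total)

# Expected count if the word were spread in proportion to category size.
expected <- sum(toy$Count) * category.probs

# Observed/expected minus 1: below 0 is under-represented,
# above 0 is over-represented.
oe <- toy$Count / expected - 1
names(oe) <- toy$Category
oe
# hugs 0, rock 0, teehee 2, understand -0.5, wow 0
```

Here the word appears three times as often as expected in "teehee" and half as often as expected in "understand".
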
Part 1. Taking context into account

   par(mfrow=c(1, 3))
   epPlot(ep, eptok, 'awesome', genders='male', probs=TRUE)
   epPlot(ep, eptok, 'awesome', genders='female', probs=TRUE)
   epPlot(ep, eptok, 'awesome', genders='unknown', probs=TRUE)

Part 1. Taking context into account

   par(mfrow=c(2, 3))
   for (i in 1:5) { epPlot(ep, eptok, 'awesome', ages=i, probs=TRUE) }

Part 1. Taking context into account

   Let's look at how the parameter changes within each category separately:

   epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=TRUE, type='b')

Part 1. Taking context into account

   Exercise: for any of the parameters (Age, Group, Gender), find words that show why
   the parameter matters.

Part 1. Taking context into account

   Stories containing the word drunk depend strongly on the author's age:

   epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=TRUE, type='b')

Part 1. Building a regression model

   Let's try to model this dependence with logistic regression:

   drunk = epFullFrame(ep, 'drunk', age=c(1, 2, 3, 4, 5))
   drunk$Age = as.numeric(drunk$Age)
   fit.glm = glm(cbind(Count, Total-Count) ~ Category - 1 + Age, data=drunk, family=binomial)
   summary(fit.glm)

Part 1. Building a regression model

   Let's write a function that predicts the value for a given category and age:

   FittedGlmFunc = function(fit, category, age) {
       coefs = fit$coef
       # With "- 1" in the formula, each category has its own intercept named Category<level>
       cat.coef = coefs[[paste('Category', category, sep='')]]
       # Inverse logit of the linear predictor
       prediction = plogis(cat.coef + coefs[['Age']] * age)
       return(prediction)
   }

   Calling the function:

   FittedGlmFunc(fit.glm, 'wow', 1)

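The manual prediction above can be sanity-checked against R's own predict(). The sketch below does this on a small hypothetical table (two categories, three age groups), because the ep helpers and data are not reproduced here; the column names mirror those on the slides, the numbers are invented.

```r
# Hypothetical data: 2 categories x 3 age groups (not the real ep data).
toy <- data.frame(
  Category = factor(rep(c("hugs", "wow"), each = 3)),
  Age      = rep(1:3, times = 2),
  Count    = c(30, 20, 12, 10, 16, 25),
  Total    = rep(1000, 6)
)

# Binomial GLM on (successes, failures). "- 1" drops the global intercept,
# so every category gets its own intercept named "Category<level>".
fit <- glm(cbind(Count, Total - Count) ~ Category - 1 + Age,
           data = toy, family = binomial)

# Manual prediction: inverse logit of intercept + slope * age.
manual <- plogis(coef(fit)[["Categorywow"]] + coef(fit)[["Age"]] * 2)

# The same quantity from predict().
newdata <- data.frame(
  Category = factor("wow", levels = levels(toy$Category)),
  Age      = 2
)
builtin <- predict(fit, newdata = newdata, type = "response")

all.equal(manual, unname(builtin))  # TRUE
```

Writing the prediction by hand, as the slide does, makes the role of each coefficient visible; predict() is the less error-prone choice once the model grows.
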
Part 1. Building a regression model

   Let's plot the fitted values and compare them with the observed ones:

   par(mfrow=c(2, 3))
   cats = levels(ep$Category)
   for (i in 1:5) {
       epPlot(ep, eptok, 'drunk', ages=i)
       for (j in 1:5) {
           val = FittedGlmFunc(fit.glm, cats[j], i)
           points(j, val, col='red', pch=19)
       }
   }

Part 1. Computing the expected value

   [Figure]

Part 2

   Analysis of adverb-adjective phrases, using ratings data as an example.

Part 2. Data

   We use data from several rating systems (Amazon.com, OpenTable.com,
   Goodreads.com, IMDB.com). Let's load them:

   d = read.csv('ratings-advadj.csv')
   head(d)

Part 2. Extracting subsets

   horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
   nrow(horrid)
   head(horrid)

   With a modifier specified:

   horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
   nrow(horrid)
   head(horrid)

Part 2. Estimating the sentiment of individual adjectives

   horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
   horrid

Part 2. Sentiment plot

   par(mfrow=c(1, 2))
   ratingPlot(d, 'horrid', probs=FALSE)
   ratingPlot(d, 'horrid', probs=TRUE)

   Question: suggest adjectives whose curve peaks in the middle of the plot.

Part 2. Computing the expected value

   Let's try to predict the category from the adjective. To do so, we compute the expected value
   of the category:

   sum(horrid$Category * horrid$Pr)

   The ExpectedCategory function does the same:

   ExpectedCategory(horrid)

   Adding the expected value to the plot:

   ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)

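The expected category is a probability-weighted average of the rating values. A minimal sketch with a hypothetical distribution for a strongly negative adjective on a 1 to 5 scale (the numbers are invented, not read from ratings-advadj.csv):

```r
# Hypothetical P(rating | word); the probabilities must sum to 1.
toy <- data.frame(
  Category = 1:5,
  Pr       = c(0.50, 0.25, 0.10, 0.10, 0.05)
)
stopifnot(isTRUE(all.equal(sum(toy$Pr), 1)))

# Expected rating: each rating weighted by its probability.
expected.rating <- sum(toy$Category * toy$Pr)
expected.rating  # 1.95
```

A value near the bottom of the scale marks the word as negative, a value near the top as positive.
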
Part 2. Regression model

   Let's try to build a model that estimates the probability of the word occurring in a category.

   fit.horrid = glm(cbind(Count, Total-Count) ~ Category, family=quasibinomial, data=horrid)
   fit.horrid

Part 2. Regression model

   Let's improve the model by making it quadratic (adding the square of the category).

   GlmWordQuadratic = function(pf) {
       pf$Category2 = pf$Category^2
       fit = glm(cbind(Count, Total-Count) ~ Category + Category2, family=quasibinomial, data=pf)
       return(fit)
   }
   par(mfrow=c(2, 2))
   ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
   ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
   ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
   ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))

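What the squared term buys can be seen on a word whose frequency peaks at mid-scale ratings, a shape a linear predictor cannot follow. The sketch below uses hypothetical counts and plain glm() calls; it does not depend on ratingPlot or on the ratings file.

```r
# Hypothetical counts for a word that peaks at mid-scale ratings.
toy <- data.frame(
  Category = 1:5,
  Count    = c(10, 30, 50, 30, 10),
  Total    = rep(10000, 5)
)
toy$Category2 <- toy$Category^2

fit.linear <- glm(cbind(Count, Total - Count) ~ Category,
                  family = quasibinomial, data = toy)
fit.quad   <- glm(cbind(Count, Total - Count) ~ Category + Category2,
                  family = quasibinomial, data = toy)

# Residual deviance: lower means a closer fit to the observed proportions.
c(linear = deviance(fit.linear), quadratic = deviance(fit.quad))

# A negative coefficient on the squared term gives the inverted-U shape.
coef(fit.quad)[["Category2"]] < 0  # TRUE
```

Because the linear model is nested in the quadratic one, the quadratic fit can never have a higher residual deviance; the question in practice is whether the drop is large enough to justify the extra term.
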
Part 3

   Vector space models.

Part 3. Source data

   The IMDB dataset is used.
   The input is a term x term matrix in which each element is the co-occurrence frequency of
   two terms within the same context (document, sentence, etc.).

   source('vsm.R')
   imdb = Csv2Matrix('imdb-wordword.csv')
   imdb[100:110, 100:110]

Part 3. Basic vector concepts

   Euclidean distance:

      \[ \mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

   Vector length:

      \[ \mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2} \]

   Normalizing a vector means dividing each of its components by the vector's length.

   Cosine distance between two vectors (one minus the cosine of the angle between them):

      \[ \mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)} \]

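The three definitions translate almost literally into base R. The function names below are illustrative and are not claimed to match those defined in vsm.R.

```r
# Illustrative helpers; names are assumptions, not the vsm.R API.
vector_length  <- function(x) sqrt(sum(x^2))
euclidean_dist <- function(x, y) sqrt(sum((x - y)^2))
length_norm    <- function(x) x / vector_length(x)
cosine_dist    <- function(x, y) {
  1 - sum(x * y) / (vector_length(x) * vector_length(y))
}

x <- c(3, 4)
y <- c(4, 3)
vector_length(x)               # 5
euclidean_dist(x, y)           # sqrt(2)
cosine_dist(x, y)              # 1 - 24/25 = 0.04
cosine_dist(x, 10 * x)         # 0 up to rounding: same direction
vector_length(length_norm(x))  # 1
```

Cosine distance ignores vector length entirely, so a vector and any positive multiple of it are at distance zero.
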
Part 3. Semantically similar words

   df = Neighbors(imdb, 'happy')
   head(df)

Part 3. Semantically similar words

   Problem:

   a = c(1000, 2000, 3000)
   b = c(1, 2, 3)
   a / sum(a)
  > [1] 0.1666667 0.3333333 0.5000000
   b / sum(b)
  > [1] 0.1666667 0.3333333 0.5000000
   LengthNorm(a)
  > [1] 0.2672612 0.5345225 0.8017837
   LengthNorm(b)
  > [1] 0.2672612 0.5345225 0.8017837

   Both normalizations map a and b to the same vector, although a rests on a thousand times
   more observations than b.

Part 3. PMI - Pointwise mutual information

   How can this be avoided? With PMI!

      \[ \mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)} \]

   Normalizing PMI (discounting):

      \[ \mathrm{NewPMI}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right) + 1} \]

   Here p(i, j) = M[i, j] / sum(M), where M is the term-term matrix.

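The basic PMI formula can be sketched in a few lines of base R. This is an illustration of positive PMI only (negative values clipped to zero, no discounting); the function name ppmi and the toy counts are assumptions, and the code is not the PMI function from vsm.R.

```r
# Positive PMI for a co-occurrence count matrix M (a sketch).
ppmi <- function(M) {
  p     <- M / sum(M)                      # joint probabilities p(i, j)
  p.row <- rowSums(p)                      # marginals p(i)
  p.col <- colSums(p)                      # marginals p(j)
  pmi   <- log(p / outer(p.row, p.col))    # log(p(i,j) / (p(i) p(j)))
  pmi[!is.finite(pmi) | pmi < 0] <- 0      # zero counts and negative PMI -> 0
  pmi
}

# Hypothetical counts: 'good' and 'great' co-occur more often than chance,
# while 'good' and 'awful' co-occur less often than chance.
words <- c("good", "great", "awful")
M <- matrix(c(10,  8,  1,
               8, 10,  1,
               1,  1, 10),
            nrow = 3, byrow = TRUE, dimnames = list(words, words))
round(ppmi(M), 3)
```

PMI compares each cell with what independence would predict, so a large raw count contributes only to the extent that it exceeds the count expected from the row and column totals.
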
Part 3. PMI - Pointwise mutual information

   imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
   df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)
   head(df)

Part 3. Semantic orientation method

         Define two seed sets of words, S1 and S2
         Choose a distance measure
         For a given word w, compute the sum of distances to the vectors of S1 and to those of S2
         The sentiment score is the difference between the two sums of distances

   neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
   pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
   SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
  > 0.8923544
   SemanticOrientation(imdb.ppcd, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
  > -0.04741898

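The four steps above can be sketched in base R as follows. The function semantic_orientation, the helper cosine_dist and the two-dimensional word vectors are all hypothetical; they illustrate the scoring rule and are not the SemanticOrientation implementation from vsm.R.

```r
cosine_dist <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Sum of distances from `word` to each seed in seeds1, minus the same sum
# for seeds2. With the negative seeds passed first, a positive score means
# the word lies farther from the negative seeds than from the positive ones.
semantic_orientation <- function(M, word, seeds1, seeds2,
                                 distfunc = cosine_dist) {
  dist_to <- function(seeds) {
    sum(sapply(seeds, function(s) distfunc(M[word, ], M[s, ])))
  }
  dist_to(seeds1) - dist_to(seeds2)
}

# Hypothetical 2-dimensional word vectors (rows), not real IMDB data.
M <- rbind(
  good   = c(5, 1),
  nice   = c(4, 1),
  bad    = c(1, 5),
  nasty  = c(1, 4),
  great  = c(6, 1),
  horrid = c(1, 6)
)
neg <- c("bad", "nasty")
pos <- c("good", "nice")
semantic_orientation(M, "great",  seeds1 = neg, seeds2 = pos)  # positive
semantic_orientation(M, "horrid", seeds1 = neg, seeds2 = pos)  # negative
```

The sign convention follows the slide's call, where the negative seeds are seeds1 and the positive seeds are seeds2.
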

Weitere ähnliche Inhalte

Kürzlich hochgeladen

ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning ProjectNuckles
 
محاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdf
محاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdfمحاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdf
محاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdfKhaled Elbattawy
 
Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...
Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...
Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...Eesti Loodusturism
 
Català Individual 3r - Víctor.pdf JOCS FLORALS
Català Individual 3r - Víctor.pdf JOCS FLORALSCatalà Individual 3r - Víctor.pdf JOCS FLORALS
Català Individual 3r - Víctor.pdf JOCS FLORALSErnest Lluch
 
Català parelles 3r - Emma i Ariadna (1).pdf
Català parelles 3r - Emma i Ariadna (1).pdfCatalà parelles 3r - Emma i Ariadna (1).pdf
Català parelles 3r - Emma i Ariadna (1).pdfErnest Lluch
 
Castellà parelles 2n - Abril i Irina.pdf
Castellà parelles 2n - Abril i Irina.pdfCastellà parelles 2n - Abril i Irina.pdf
Castellà parelles 2n - Abril i Irina.pdfErnest Lluch
 
RESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptx
RESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptxRESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptx
RESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptxscbastidasv
 

Kürzlich hochgeladen (8)

ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
محاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdf
محاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdfمحاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdf
محاضرات الاحصاء التطبيقي لطلاب علوم الرياضة.pdf
 
Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...
Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...
Saunanaine_Helen Moppel_JUHENDATUD SAUNATEENUSE JA LOODUSMATKA SÜNERGIA_strat...
 
Català Individual 3r - Víctor.pdf JOCS FLORALS
Català Individual 3r - Víctor.pdf JOCS FLORALSCatalà Individual 3r - Víctor.pdf JOCS FLORALS
Català Individual 3r - Víctor.pdf JOCS FLORALS
 
Català parelles 3r - Emma i Ariadna (1).pdf
Català parelles 3r - Emma i Ariadna (1).pdfCatalà parelles 3r - Emma i Ariadna (1).pdf
Català parelles 3r - Emma i Ariadna (1).pdf
 
Castellà parelles 2n - Abril i Irina.pdf
Castellà parelles 2n - Abril i Irina.pdfCastellà parelles 2n - Abril i Irina.pdf
Castellà parelles 2n - Abril i Irina.pdf
 
RESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptx
RESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptxRESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptx
RESOLUCION DEL SIMULACRO UNMSM 2023 ii 2.pptx
 
Díptic IFE (2) ifeifeifeife ife ife.pdf
Díptic IFE (2)  ifeifeifeife ife ife.pdfDíptic IFE (2)  ifeifeifeife ife ife.pdf
Díptic IFE (2) ifeifeifeife ife ife.pdf
 

Empfohlen

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Empfohlen (20)

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 

Sentiment analysis

  • 1. Àíàëèç òîíàëüíîñòè òåêñòîâ Âûëîìîâà Åêàòåðèíà Àëåêñååâíà: evylomova@gmail.com Ìîñêâà, 2013 Âûëîìîâà Åêàòåðèíà Àëåêñååâíà: evylomova@gmail.com () Àíàëèç òîíàëüíîñòè òåêñòîâ Ìîñêâà, 2013 1 / 41
  • 2. ×àñòü 1. Experience project Îïèñàíèå Experience project: ïðîåêò, â ðàìêàõ êîòîðîãî ïîëüçîâàòåëè äåëÿòñÿ ñâîèìè èñòîðèÿìè. Êàæäîé èñòîðèè ÷èòàòåëè ìîãóò âûñòàâèòü îäíó èç ïÿòè êàòåãîðèé è íàïèñàòü êîììåíòàðèé. Ïðèìåð: I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself ... That's all I've ever wanted. Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0; Author age: 21 Author gender:female Text group: friends Âûëîìîâà Åêàòåðèíà Àëåêñååâíà: evylomova@gmail.com () Àíàëèç òîíàëüíîñòè òåêñòîâ Ìîñêâà, 2013 2 / 41
  • 3. Part 1. Data
    Load the data:
        ep = read.csv('ep3-context.csv')
    Here, Count is the number of occurrences of the given word in the corresponding Category and Group for the given Gender and Age, and Total is the total number of words in the corresponding Category and Group for the given Gender and Age.
  • 4. Part 1. Working with the data
    Data can be retrieved for any of the parameters:
        levels(ep$Word)
  • 5. Part 1. Words and categories
    Let us see how words relate to categories:
        funny = epCollapsedFrame(ep, 'funny')
        plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')
  • 6. Part 1. Words and categories
    Strange, isn't it? What is wrong?
  • 7. Part 1. Words and categories
    The counts must be normalized by category size!
        funny$Count / funny$Total
        funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
        plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')
  • 8. Part 1. Words and categories
    Much better!
  • 9. Part 1. On to probability theory
    Freq corresponds to the conditional probability P(word|category). This value is very small, so we compute the probability P(category|word) instead (everyone remembers Bayes' formula? :-) ).
        funny$Freq / sum(funny$Freq)
        funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
        plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total) / sum(Count/Total)', main='funny')
    Question: which other words might characterize one of the categories well?
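The step from Freq to Pr on slide 9 can be checked on a toy table. The sketch below uses made-up counts (not the Experience Project data) and assumes a uniform prior over the five categories, which is what dividing Freq by sum(Freq) amounts to; the data frame `toy` and the vector `prior` are illustrative names only.

```r
# Toy counts for one word across the five reaction categories (made-up numbers).
toy = data.frame(
  Category = c('hugs', 'rock', 'teehee', 'understand', 'wow'),
  Count    = c(10, 20, 90, 30, 50),
  Total    = c(100000, 50000, 60000, 300000, 100000)
)

# P(word | category): relative frequency of the word within each category.
toy$Freq = toy$Count / toy$Total

# P(category | word) by Bayes' rule with a uniform prior P(category) = 1/5
# (an assumption): the prior cancels, leaving Freq / sum(Freq).
prior = rep(1 / nrow(toy), nrow(toy))
toy$Pr = (toy$Freq * prior) / sum(toy$Freq * prior)

print(toy)
```

With a non-uniform prior the two expressions would differ, so Pr is best read as a size-normalized profile of the word across categories.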
  • 10. Part 1. Comparing the values with the expected probability
        funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)
    How is the expected value computed?
        category.probs = funny$Total / sum(funny$Total)
        funny.count = sum(funny$Count)
        funny.expected = funny.count * category.probs
        funny.expected
    Look at the ratio:
        (funny$observed / funny.expected) - 1
    If it is below 0, the word is under-represented in the category; if it is above 0, the word is over-represented.
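The observed/expected comparison on slide 10 can be reproduced without the course helpers. This is a minimal sketch on made-up counts; `Count`, `Total`, `expected` and `oe` are plain vectors introduced for illustration, not columns of the real `ep` data.

```r
# Made-up counts for one word in five categories (illustration only).
Count = c(10, 20, 90, 30, 50)
Total = c(100000, 50000, 60000, 300000, 100000)

# Expected counts if the word were spread in proportion to category size.
category.probs = Total / sum(Total)
expected = sum(Count) * category.probs

# Observed/expected ratio minus 1: negative means under-represented,
# positive means over-represented.
oe = Count / expected - 1
print(round(oe, 3))
```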
  • 11. Part 1. Accounting for context
        par(mfrow=c(1, 3))
        epPlot(ep, eptok, 'awesome', genders='male', probs=T)
        epPlot(ep, eptok, 'awesome', genders='female', probs=T)
        epPlot(ep, eptok, 'awesome', genders='unknown', probs=T)
  • 12. Part 1. Accounting for context
  • 13. Part 1. Accounting for context
        par(mfrow=c(2, 3))
        for (i in 1:5) {
          epPlot(ep, eptok, 'awesome', ages=i, probs=T)
        }
  • 14. Part 1. Accounting for context
  • 15. Part 1. Accounting for context
    Look at how the parameter changes for each category separately:
        epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=T, type='b')
  • 16. Part 1. Accounting for context
    Exercise: for any parameter (Age, Group, Gender), find words that highlight the importance of that parameter.
  • 17. Part 1. Accounting for context
    Stories containing the word drunk depend strongly on the author's age:
        epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=T, type='b')
  • 18. Part 1. Accounting for context
  • 19. Part 1. Building a regression model
    Let us try to model this dependence with logistic regression:
        drunk = epFullFrame(ep, 'drunk', age=c(1, 2, 3, 4, 5))
        drunk$Age = as.numeric(drunk$Age)
        fit.glm = glm(cbind(Count, Total - Count) ~ Category - 1 + Age, data=drunk, family=binomial)
        summary(fit.glm)
  • 20. Part 1. Building a regression model
    Write a function that predicts values as a function of category and age:
        FittedGlmFunc = function(fit, category, age) {
          coefs = fit$coef
          cat.coef = coefs[[paste('Category', category, sep='')]]
          prediction = plogis(cat.coef + coefs[['Age']] * age)
          return(prediction)
        }
    Calling the function:
        FittedGlmFunc(fit.glm, 'wow', 1)
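To see that the hand-written prediction on slide 20 agrees with R's own machinery, the sketch below fits the same model formula to synthetic data and compares FittedGlmFunc with base R's predict(). The data frame `toy`, the `true.intercepts` and the age slope of -0.2 are invented for illustration; the real fit would use the `drunk` frame from the slides.

```r
# Synthetic data (not the Experience Project counts): 5 categories x 5 age bins.
cats = c('hugs', 'rock', 'teehee', 'understand', 'wow')
toy = expand.grid(Category = cats, Age = 1:5)
toy$Category = factor(toy$Category, levels = cats)
toy$Total = 100000
true.intercepts = c(-8, -7.5, -7, -8.5, -7.2)   # invented values
toy$Count = round(toy$Total *
  plogis(true.intercepts[as.integer(toy$Category)] - 0.2 * toy$Age))

# Same model as on the slides: one intercept per category plus a shared Age slope.
fit = glm(cbind(Count, Total - Count) ~ Category - 1 + Age,
          data = toy, family = binomial)

# Prediction function as given on slide 20.
FittedGlmFunc = function(fit, category, age) {
  coefs = coef(fit)
  cat.coef = coefs[[paste('Category', category, sep = '')]]
  plogis(cat.coef + coefs[['Age']] * age)
}

# Cross-check against predict() on the response (probability) scale.
manual = FittedGlmFunc(fit, 'wow', 1)
builtin = predict(fit,
                  newdata = data.frame(Category = factor('wow', levels = cats), Age = 1),
                  type = 'response')
print(c(manual = manual, builtin = as.numeric(builtin)))
```

The function works because `Category - 1` drops the global intercept, so each category coefficient is directly the log-odds at Age 0.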
  • 21. Part 1. Building a regression model
    Visualize the fitted values and compare them with the true ones:
        par(mfrow=c(2, 3))
        cats = levels(ep$Category)
        for (i in 1:5) {
          epPlot(ep, eptok, 'drunk', age=i)
          for (j in 1:5) {
            val = FittedGlmFunc(fit.glm, cats[j], i)
            points(j, val, col='red', pch=19)
          }
        }
  • 22. Part 1. Computing the expected value
  • 23. Part 2
    Analysis of adverb-adjective phrases using rating data.
  • 24. Part 2. Data
    We use data from several rating systems (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com). Load them:
        d = read.csv('ratings-advadj.csv')
        head(d)
  • 25. Part 2. Extracting subsets
        horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
        nrow(horrid)
        head(horrid)
    With a modifier specified:
        horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
        nrow(horrid)
        head(horrid)
  • 26. Part 2. Estimating the sentiment of individual adjectives
        horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
        horrid
  • 27. Part 2. Sentiment plot
        par(mfrow=c(1, 2))
        ratingPlot(d, 'horrid', probs=FALSE)
        ratingPlot(d, 'horrid', probs=TRUE)
    Question: suggest adjectives whose distribution peaks in the middle of the plot.
  • 28. Part 2. Computing the expected value
    Let us try to predict the category from the adjective. To do so, compute the mathematical expectation:
        sum(horrid$Category * horrid$Pr)
    The ExpectedCategory function does the same:
        ExpectedCategory(horrid)
    Adding the expected value to the plot:
        ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)
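As a small worked example of the expectation on slide 28, the sketch below uses an invented rating distribution on a 1-5 scale; the numbers are not taken from the ratings data, and `toy` is an illustrative name.

```r
# Made-up distribution P(rating | word) for one adjective on a 1-5 scale.
toy = data.frame(Category = 1:5, Pr = c(0.50, 0.20, 0.15, 0.10, 0.05))

# Expected rating: each rating weighted by its probability.
expected.category = sum(toy$Category * toy$Pr)

# Base R's weighted.mean gives the same value when the weights sum to 1.
same = weighted.mean(toy$Category, toy$Pr)
print(c(expected.category, same))
```

A low expected rating like this one would place the word toward the negative end of the scale.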
  • 29. Part 2. Computing the expected value
  • 30. Part 2. Regression model
    Let us try to build a model that estimates the probability of a word occurring in a category:
        fit.horrid = glm(cbind(horrid$Count, horrid$Total - horrid$Count) ~ Category, family=quasibinomial, data=horrid)
        fit.horrid
  • 31. Part 2. Regression model
  • 32. Part 2. Regression model
    Improve the model by making it quadratic (add the square of the category):
        GlmWordQuadratic = function(pf) {
          pf$Category2 = pf$Category^2
          fit = glm(cbind(Count, Total - Count) ~ Category + Category2, family=quasibinomial, data=pf)
          return(fit)
        }
        par(mfrow=c(2, 2))
        ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
        ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
        ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
        ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
  • 33. Part 2. Regression model
  • 34. Part 3
    Vector space models.
  • 35. Part 3. Source data
    The IMDB database is used. The initial data is a term x term matrix in which each element is the co-occurrence frequency of two terms in the same context (document, sentence, etc.).
        source('vsm.R')
        imdb = Csv2Matrix('imdb-wordword.csv')
        imdb[100:110, 100:110]
  • 36. Part 3. Basic vector concepts
    Euclidean distance:
        \[ \mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
    Vector length:
        \[ \mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2} \]
    Vector normalization: dividing each component by the length.
    Cosine distance (one minus the cosine of the angle between the vectors):
        \[ \mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i \, y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)} \]
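The definitions on slide 36 translate directly into a few lines of base R. The functions below are stand-in implementations written for illustration; they are not the code from vsm.R, whose internals are not shown in the slides.

```r
# Stand-in implementations of the slide 36 formulas (not taken from vsm.R).
EuclideanDist = function(x, y) sqrt(sum((x - y)^2))
VectorLength  = function(x) sqrt(sum(x^2))
LengthNorm    = function(x) x / VectorLength(x)
CosineDist    = function(x, y) {
  1 - sum(x * y) / (VectorLength(x) * VectorLength(y))
}

# Two vectors that point in the same direction but differ in magnitude.
a = c(1000, 2000, 3000)
b = c(1, 2, 3)

print(EuclideanDist(a, b))   # large: Euclidean distance is sensitive to magnitude
print(CosineDist(a, b))      # essentially zero: cosine distance only sees direction
print(LengthNorm(a))
print(LengthNorm(b))         # identical to LengthNorm(a)
```

The contrast between the two distances is why length normalization or cosine distance is usually preferred when raw co-occurrence counts vary widely between words.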
  • 37. Part 3. Semantically close words
        df = Neighbors(imdb, 'happy')
        head(df)
  • 38. Part 3. Semantically close words
    Problem: after normalization, vectors built from very different counts become identical:
        a = c(1000, 2000, 3000)
        b = c(1, 2, 3)
        a / sum(a)
        > [1] 0.1666667 0.3333333 0.5000000
        b / sum(b)
        > [1] 0.1666667 0.3333333 0.5000000
        LengthNorm(a)
        > [1] 0.2672612 0.5345225 0.8017837
        LengthNorm(b)
        > [1] 0.2672612 0.5345225 0.8017837
  • 39. Part 3. PMI - Pointwise mutual information
    How can this be avoided? With PMI!
        \[ \mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x) \, p(y)} \]
    PMI normalization:
        \[ \mathrm{NewPMI}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j), \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j), \sum_{k=1}^{n} p(i, k)\right) + 1} \]
    Here p(i, j) = M[i, j] / sum(M), where M is the term matrix.
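The basic PMI formula on slide 39 can be sketched on a tiny co-occurrence matrix. `ToyPMI` below is a hypothetical helper written for this example, not the PMI function from vsm.R; it omits the discounting factor, and setting the PMI of zero-count cells to 0 is a convention chosen here, not something the slides specify.

```r
# Toy co-occurrence counts (made up); rows and columns are terms.
M = matrix(c(10, 0, 2,
              0, 8, 1,
              2, 1, 6),
           nrow = 3, byrow = TRUE,
           dimnames = list(c('good', 'bad', 'film'), c('good', 'bad', 'film')))

# Hypothetical helper: plain (optionally positive) PMI, without discounting.
ToyPMI = function(M, positive = TRUE) {
  p  = M / sum(M)                  # joint probabilities p(x, y)
  px = rowSums(p)                  # marginal p(x)
  py = colSums(p)                  # marginal p(y)
  pmi = log(p / outer(px, py))     # log of observed vs. expected under independence
  pmi[!is.finite(pmi)] = 0         # zero counts give -Inf; set to 0 (assumed convention)
  if (positive) pmi[pmi < 0] = 0   # positive PMI keeps only positive associations
  pmi
}

pmi = ToyPMI(M)
print(round(pmi, 3))
```

Words that co-occur more often than their marginal frequencies predict receive positive scores, while pairs that never co-occur (such as good and bad in this toy matrix) receive 0.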
  • 40. Part 3. PMI - Pointwise mutual information
        imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
        df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)
        head(df)
  • 41. Part 3. Semantic orientation method
    Define two sets of words, S1 and S2.
    Choose a similarity measure.
    For a word w, compute the sum of distances to the vectors of the sets S1 and S2.
    The sentiment score is the difference between the sums of distances.
        neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
        pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
        SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
        > 0.8923544
        SemanticOrientation(imdb.ppci, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
        > -0.04741898
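The four steps on slide 41 can be sketched as follows. `ToySemanticOrientation` is a hypothetical re-implementation on invented three-dimensional word vectors, not the SemanticOrientation function from vsm.R; the sign convention (distance to seeds1 minus distance to seeds2, so that words near seeds2 score positive) is an assumption inferred from the slide's outputs for great and horrid.

```r
# Cosine distance, as defined on slide 36.
CosineDist = function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical sketch: summed distance to seeds1 minus summed distance to seeds2.
ToySemanticOrientation = function(mat, word, seeds1, seeds2, distfunc = CosineDist) {
  d1 = sum(sapply(seeds1, function(s) distfunc(mat[word, ], mat[s, ])))
  d2 = sum(sapply(seeds2, function(s) distfunc(mat[word, ], mat[s, ])))
  d1 - d2
}

# Invented word vectors (rows); real vectors would come from the PMI-weighted matrix.
vecs = rbind(
  bad    = c(5, 0, 1),
  poor   = c(4, 1, 1),
  good   = c(0, 5, 1),
  nice   = c(1, 4, 1),
  great  = c(0, 6, 2),
  horrid = c(6, 0, 2)
)

neg = c('bad', 'poor')
pos = c('good', 'nice')
score.great  = ToySemanticOrientation(vecs, 'great',  seeds1 = neg, seeds2 = pos)
score.horrid = ToySemanticOrientation(vecs, 'horrid', seeds1 = neg, seeds2 = pos)
print(c(great = score.great, horrid = score.horrid))
```

In this toy setup great is far from the negative seeds and close to the positive ones, so its score is positive, while horrid scores negative.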