SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions




                        Does size matter? When small is better.

                     A.L. Gentile               A.E. Cano A.-S. Dadzie                V. Lanfranchi
                                                        N. Ireson

                                                  Department of Computer Science,
                                                       University of Sheffield


                                                         2011, 23rd March




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson               Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Outline


        1     Introduction

        2     Email Corpus

        3     Dynamic Topic Classification Of Short Texts

        4     Experiments

        5     Conclusions




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Settings


                 Main Goal: observation of the influence of the size of documents
                 on the accuracy of a defined text processing task
                 Hypothesis: results obtained using longer texts may be
                 approximated by short texts, of micropost size, i.e., maximum
                 length 140 characters
                 Dataset: artificially generated comparable corpora, consisting of
                 truncated emails, from micropost size (140 characters), and
                 successive multiples thereof
                 Methodology:
                          corpus-driven topic extraction
                          document topic classification


A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Research Questions


                 statistical analysis of emails exchanged via oak mailing list
                 (over a period of six months) to determine if email is indeed used
                 as a short messaging service
                 content analysis of emails as microposts, to evaluate to what
                 degree the knowledge content of truncated or abbreviated
                 messages can be compared to the complete message
                 determine if the knowledge content of short emails may be used
                 to obtain useful information about e.g., topics of interest or
                 expertise within an organisation, as a basis for carrying out tasks
                 such as expert finding or content-based social network analysis
                 (SNA)


A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Micropost Services


                 mainly used for social information exchange, but also used in
                 more formal (working) environments
                 [Herbsleb et al., 2002, Isaacs et al., 2002]
                 sometimes perceived negatively in the workplace, as they may be
                 seen to reduce productivity [TNS US Group, 2009], and/or pose
                 threats to security and privacy
                 where restrictions to use are in place, alternatives are sought that
                 obtain the same benefits
                          same communication patterns in alternative media: email as a
                          short message service for communication via, e.g., mailing lists




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Oak mailing list based corpus
        six month period (July - January 2011), totalling 659 emails




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Dynamic Topic Classification Of Short Texts

                 Goal: evaluate to what degree the knowledge content of a shorter
                 message can be compared to that of a full message.
                 Chosen task: text classification on non-predefined topics.
                 Test bed: generated by preprocessing the oak email corpus to
                 obtain several fixed-size corpora.
                 Method:
                          Corpus-driven topic extraction: a number of topics are
                          automatically extracted from a document collection; each topic is
                          represented as a weighted vector of terms;
                          Document topic classification: each document is labelled with the
                          topic it is most similar to, and classified into the corresponding
                          cluster.


A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Topic Extraction: Proximity-based Clustering



                 Document corpus D = {d1 , ..., dk }
                     each document di = {t1 , ..., tv } is a vector of weighted terms
                 Term clusters C = {c1 , ..., ck }
                 (clustering performed by using as feature space the inverted
                 index of D)
                      each cluster ck = {t1 , ..., tn } is a vector of weighted terms
                          each cluster ideally represents a topic in the document collection




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Email Topic Classification




                 sim(d , Ci ): similarity between documents and clusters (by cosine
                 similarity)
                 labelDoc (D ): each document d is mapped to the topic Ci , which
                 maximises the similarity sim(d , Ci )




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Dataset Preparation: Comparable Corpora



           Corpus Name                               Maximum text length of each document
              corpus140                   truncated at length 140 (if longer, full text otherwise)
              corpus280                   truncated at length 280 (if longer, full text otherwise)
              corpus420                   truncated at length 420 (if longer, full text otherwise)
              corpus560                   truncated at length 560 (if longer, full text otherwise)
              corpus700                   truncated at length 700 (if longer, full text otherwise)
              corpus840                   truncated at length 840 (if longer, full text otherwise)
              corpus980                   truncated at length 980 (if longer, full text otherwise)
            mainCorpus                                                full email body



A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson               Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Experimental Approach




                 corpus-driven topic extraction: using mainCorpus
                 email classification: apply the labelDoc procedure to all different
                 comparable corpora, including mainCorpus.




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Corpus Topics




                                         Figura: Visualisation of topic clusters.
A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson          Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Results

                                                              Precision   Recall   F-Measure
                                              corpus140         0.80       0.63       0.69
                                              corpus280         0.90       0.86       0.87
                                              corpus420         0.93       0.92       0.92
                                              corpus560         0.98       0.96       0.97
                                              corpus700         0.95       0.94       0.94
                                              corpus840         0.98       0.98       0.98
                                              corpus980         0.99       0.99       0.99




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson                      Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Conclusions



                 a fair portion of the emails exchanged, for the corpus generated
                 from a mailing list, are very short, with approximately 40% falling
                 within the single micropost size, and 65% up to two microposts;
                 for the text classification task described, the accuracy of
                 classification for micropost size texts is an acceptable
                 approximation of classification performed on longer texts, with a
                 decrease of only ∼ 5% for up to the second micropost block
                 within a long e-mail.




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 Future Work


                 enriching the micro-emails with semantic information (e.g.,
                 concepts extracted from domain and standard ontologies) would
                 improve the results obtained using unannotated text
                 investigate the influence of other similarity measures
                 application to expert finding tasks, exploiting dynamic topic
                 extraction as a means to determine authors’ and recipients’ areas
                 of expertise
                 formal evaluation of topic validity will be required, including the
                 human (expert) annotator in the loop




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson       Does size matter? When small is better.
Introduction
                                                     Email Corpus
                         Dynamic Topic Classification Of Short Texts
                                                      Experiments
                                                      Conclusions


 References



                Herbsleb, J. D., Atkins, D. L., Boyer, D. G., Handel, M., and Finholt, T. A. (2002).
                Introducing instant messaging and chat in the workplace.
                In Proc., SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pages
                171–178.

                Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D. J., and Kamm, C. (2002).
                The character, functions, and styles of instant messaging in the workplace.
                In Proc., ACM conference on Computer supported cooperative work, pages 11–20.

                TNS US Group (2009).
                Social media exploding: More than 40% use online social networks.
                http://www.tns-us.com/news/social_media_exploding_more_than.php.




A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson                                Does size matter? When small is better.

Weitere ähnliche Inhalte

Andere mochten auch

Relative Trends in Scientific Terms on Twitter
Relative Trends in Scientific Terms on TwitterRelative Trends in Scientific Terms on Twitter
Relative Trends in Scientific Terms on Twitter
aba-sah
 

Andere mochten auch (7)

Relative Trends in Scientific Terms on Twitter
Relative Trends in Scientific Terms on TwitterRelative Trends in Scientific Terms on Twitter
Relative Trends in Scientific Terms on Twitter
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Data
 
The impact of innovation on travel and tourism industries (World Travel Marke...
The impact of innovation on travel and tourism industries (World Travel Marke...The impact of innovation on travel and tourism industries (World Travel Marke...
The impact of innovation on travel and tourism industries (World Travel Marke...
 
Open Source Creativity
Open Source CreativityOpen Source Creativity
Open Source Creativity
 
Reuters: Pictures of the Year 2016 (Part 2)
Reuters: Pictures of the Year 2016 (Part 2)Reuters: Pictures of the Year 2016 (Part 2)
Reuters: Pictures of the Year 2016 (Part 2)
 
The Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsThe Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post Formats
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome Economy
 

Does Size Matter? When Small is Good Enough

  • 1. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Does size matter? When small is better. A.L. Gentile A.E. Cano A.-S. Dadzie V. Lanfranchi N. Ireson Department of Computer Science, University of Sheffield 2011, 23rd March A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 2. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Outline 1 Introduction 2 Email Corpus 3 Dynamic Topic Classification Of Short Texts 4 Experiments 5 Conclusions A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 3. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Settings Main Goal: observation of the influence of the size of documents on the accuracy of a defined text processing task Hypothesis: results obtained using longer texts may be approximated by short texts, of micropost size, i.e., maximum length 140 characters Dataset: artificially generated comparable corpora, consisting of truncated emails, from micropost size (140 characters), and successive multiples thereof Methodology: corpus-driven topic extraction document topic classification A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 4. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Research Questions statistical analysis of emails exchanged via oak mailing list (over a period of six months) to determine if email is indeed used as a short messaging service content analysis of emails as microposts, to evaluate to what degree the knowledge content of truncated or abbreviated messages can be compared to the complete message determine if the knowledge content of short emails may be used to obtain useful information about e.g., topics of interest or expertise within an organisation, as a basis for carrying out tasks such as expert finding or content-based social network analysis (SNA) A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 5. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Micropost Services mainly used for social information exchange, but also used in more formal (working) environments [Herbsleb et al., 2002, Isaacs et al., 2002] sometimes perceived negatively in the workplace, as they may be seen to reduce productivity [TNS US Group, 2009], and/or pose threats to security and privacy where restrictions to use are in place, alternatives are sought that obtain the same benefits same communication patterns in alternative media: email as a short message service for communication via, e.g., mailing lists A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 6. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Oak mailing list based corpus six month period (July - January 2011), totalling 659 emails A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 7. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Dynamic Topic Classification Of Short Texts Goal: evaluate to what degree the knowledge content of a shorter message can be compared to that of a full message. Chosen task: text classification on non-predefined topics. Test bed: generated by preprocessing the oak email corpus to obtain several fixed-size corpora. Method: Corpus-driven topic extraction: a number of topics are automatically extracted from a document collection; each topic is represented as a weighted vector of terms; Document topic classification: each document is labelled with the topic it is most similar to, and classified into the corresponding cluster. A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 8. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Topic Extraction: Proximity-based Clustering Document corpus D = {d1 , ..., dk } each document di = {t1 , ..., tv } is a vector of weighted terms Term clusters C = {c1 , ..., ck } (clustering performed by using as feature space the inverted index of D) each cluster ck = {t1 , ..., tn } is a vector of weighted terms each cluster ideally represents a topic in the document collection A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 9. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Email Topic Classification sim(d , Ci ): similarity between documents and clusters (by cosine similarity) labelDoc (D ): each document d is mapped to the topic Ci , which maximises the similarity sim(d , Ci ) A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 10. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Dataset Preparation: Comparable Corpora Corpus Name Maximum text length of each document corpus140 truncated at length 140 (if longer, full text otherwise) corpus280 truncated at length 280 (if longer, full text otherwise) corpus420 truncated at length 420 (if longer, full text otherwise) corpus560 truncated at length 560 (if longer, full text otherwise) corpus700 truncated at length 700 (if longer, full text otherwise) corpus840 truncated at length 840 (if longer, full text otherwise) corpus980 truncated at length 980 (if longer, full text otherwise) mainCorpus full email body A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 11. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Experimental Approach corpus-driven topic extraction: using mainCorpus email classification: apply the labelDoc procedure to all different comparable corpora, including mainCorpus. A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 12. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Corpus Topics Figura: Visualisation of topic clusters. A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 13. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Results Precision Recall F-Measure corpus140 0.80 0.63 0.69 corpus280 0.90 0.86 0.87 corpus420 0.93 0.92 0.92 corpus560 0.98 0.96 0.97 corpus700 0.95 0.94 0.94 corpus840 0.98 0.98 0.98 corpus980 0.99 0.99 0.99 A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 14. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Conclusions a fair portion of the emails exchanged, for the corpus generated from a mailing list, are very short, with approximately 40% falling within the single micropost size, and 65% up to two microposts; for the text classification task described, the accuracy of classification for micropost size texts is an acceptable approximation of classification performed on longer texts, with a decrease of only ∼ 5% for up to the second micropost block within a long e-mail. A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 15. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions Future Work enriching the micro-emails with semantic information (e.g., concepts extracted from domain and standard ontologies) would improve the results obtained using unannotated text investigate the influence of other similarity measures application to expert finding tasks, exploiting dynamic topic extraction as a means to determine authors’ and recipients’ areas of expertise formal evaluation of topic validity will be required, including the human (expert) annotator in the loop A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.
  • 16. Introduction Email Corpus Dynamic Topic Classification Of Short Texts Experiments Conclusions References Herbsleb, J. D., Atkins, D. L., Boyer, D. G., Handel, M., and Finholt, T. A. (2002). Introducing instant messaging and chat in the workplace. In Proc., SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pages 171–178. Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D. J., and Kamm, C. (2002). The character, functions, and styles of instant messaging in the workplace. In Proc., ACM conference on Computer supported cooperative work, pages 11–20. TNS US Group (2009). Social media exploding: More than 40% use online social networks. http://www.tns-us.com/news/social_media_exploding_more_than.php. A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.