Does Size Matter? When Small is Good Enough

Introduction
Email Corpus
Dynamic Topic Classiﬁcation Of Short Texts
Experiments
Conclusions

Does size matter? When small is better.

A.L. Gentile A.E. Cano A.-S. Dadzie V. Lanfranchi
N. Ireson

Department of Computer Science,
University of Shefﬁeld

2011, 23rd March

A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better.

Introduction
Email Corpus
Experiments
Conclusions

Outline

1 Introduction

2 Email Corpus

3 Dynamic Topic Classiﬁcation Of Short Texts

4 Experiments

5 Conclusions


Introduction
Email Corpus
Experiments
Conclusions

Settings

Main Goal: observation of the influence of the size of documents
on the accuracy of a defined text processing task
Hypothesis: results obtained using longer texts may be
approximated by short texts, of micropost size, i.e., maximum
length 140 characters
Dataset: artificially generated comparable corpora, consisting of
truncated emails, from micropost size (140 characters), and
successive multiples thereof
Methodology:
corpus-driven topic extraction
document topic classification


Introduction
Email Corpus
Experiments
Conclusions

Research Questions

statistical analysis of emails exchanged via oak mailing list
(over a period of six months) to determine if email is indeed used
as a short messaging service
content analysis of emails as microposts, to evaluate to what
degree the knowledge content of truncated or abbreviated
messages can be compared to the complete message
determine if the knowledge content of short emails may be used
to obtain useful information about e.g., topics of interest or
expertise within an organisation, as a basis for carrying out tasks
such as expert ﬁnding or content-based social network analysis
(SNA)


Introduction
Email Corpus
Experiments
Conclusions

Micropost Services

mainly used for social information exchange, but also used in
more formal (working) environments
[Herbsleb et al., 2002, Isaacs et al., 2002]
sometimes perceived negatively in the workplace, as they may be
seen to reduce productivity [TNS US Group, 2009], and/or pose
threats to security and privacy
where restrictions to use are in place, alternatives are sought that
obtain the same beneﬁts
same communication patterns in alternative media: email as a
short message service for communication via, e.g., mailing lists


Introduction
Email Corpus
Experiments
Conclusions

Oak mailing list based corpus
six month period (July - January 2011), totalling 659 emails


Introduction
Email Corpus
Experiments
Conclusions


Goal: evaluate to what degree the knowledge content of a shorter
message can be compared to that of a full message.
Chosen task: text classification on non-predefined topics.
Test bed: generated by preprocessing the oak email corpus to
obtain several fixed-size corpora.
Method:
Corpus-driven topic extraction: a number of topics are
automatically extracted from a document collection; each topic is
represented as a weighted vector of terms;
Document topic classification: each document is labelled with the
topic it is most similar to, and classified into the corresponding
cluster.


Introduction
Email Corpus
Experiments
Conclusions

Topic Extraction: Proximity-based Clustering

Document corpus D = {d1 , ..., dk }
each document di = {t1 , ..., tv } is a vector of weighted terms
Term clusters C = {c1 , ..., ck }
(clustering performed by using as feature space the inverted
index of D)
each cluster ck = {t1 , ..., tn } is a vector of weighted terms
each cluster ideally represents a topic in the document collection


Introduction
Email Corpus
Experiments
Conclusions

Email Topic Classiﬁcation

sim(d , Ci ): similarity between documents and clusters (by cosine
similarity)
labelDoc (D ): each document d is mapped to the topic Ci , which
maximises the similarity sim(d , Ci )


Introduction
Email Corpus
Experiments
Conclusions

Dataset Preparation: Comparable Corpora

Corpus Name Maximum text length of each document
corpus140 truncated at length 140 (if longer, full text otherwise)
mainCorpus full email body


Introduction
Email Corpus
Experiments
Conclusions

Experimental Approach

corpus-driven topic extraction: using mainCorpus
email classiﬁcation: apply the labelDoc procedure to all different
comparable corpora, including mainCorpus.


Introduction
Email Corpus
Experiments
Conclusions

Corpus Topics

Figura: Visualisation of topic clusters.

Introduction
Email Corpus
Experiments
Conclusions

Results

Precision Recall F-Measure
corpus140 0.80 0.63 0.69
corpus280 0.90 0.86 0.87
corpus420 0.93 0.92 0.92
corpus560 0.98 0.96 0.97
corpus700 0.95 0.94 0.94
corpus840 0.98 0.98 0.98
corpus980 0.99 0.99 0.99


Introduction
Email Corpus
Experiments
Conclusions

Conclusions

a fair portion of the emails exchanged, for the corpus generated
from a mailing list, are very short, with approximately 40% falling
within the single micropost size, and 65% up to two microposts;
for the text classification task described, the accuracy of
classification for micropost size texts is an acceptable
approximation of classification performed on longer texts, with a
decrease of only ∼ 5% for up to the second micropost block
within a long e-mail.


Introduction
Email Corpus
Experiments
Conclusions

Future Work

enriching the micro-emails with semantic information (e.g.,
concepts extracted from domain and standard ontologies) would
improve the results obtained using unannotated text
investigate the inﬂuence of other similarity measures
application to expert ﬁnding tasks, exploiting dynamic topic
extraction as a means to determine authors’ and recipients’ areas
of expertise
formal evaluation of topic validity will be required, including the
human (expert) annotator in the loop


Introduction
Email Corpus
Experiments
Conclusions

References

Herbsleb, J. D., Atkins, D. L., Boyer, D. G., Handel, M., and Finholt, T. A. (2002).
Introducing instant messaging and chat in the workplace.
In Proc., SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pages
171–178.

Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D. J., and Kamm, C. (2002).
The character, functions, and styles of instant messaging in the workplace.
In Proc., ACM conference on Computer supported cooperative work, pages 11–20.

TNS US Group (2009).
Social media exploding: More than 40% use online social networks.
http://www.tns-us.com/news/social_media_exploding_more_than.php.


Does Size Matter? When Small is Good Enough

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Andere mochten auch

Andere mochten auch (7)

Does Size Matter? When Small is Good Enough