Authorship Attribution – Reza Ramezani
Danielle Jones
• Last seen 18th June 2001.
• After her disappearance a series of text
messages were sent from her phone.
• Linguistic analysis showed that the later
messages were sent by her uncle,
Stuart Campbell.
• Campbell was convicted of Danielle’s
murder on 19th December 2002, in part
because of the linguistic evidence.
Jenny Nicholl
• Last seen 30th June 2005.
• After her disappearance a series of
text messages were sent from her
phone.
• Linguistic analysis showed that the
later messages were sent by her
classmate, David Hodgson.
• Hodgson was convicted of Jenny’s
murder on 19th February 2008, in part
because of the linguistic evidence.
Importance
Authorship Attribution
• Definition
– In the typical authorship attribution problem, a text of unknown authorship is
assigned to one candidate author, given a set of candidate authors for whom text
samples of undisputed authorship are available.
– From a machine learning point of view, this can be viewed as a multiclass, single-
label text-categorization task.
– This task is also called authorship (or author) identification.
• Idea
– The main idea behind authorship attribution is to quantify writing style by measuring textual features.
– Authorship attribution is supported by statistical or computational methods.
– This scientific field takes advantage of research advances in areas such as machine
learning, information retrieval, and natural language processing.
Stylometry
• Representation
– Research in authorship attribution centers on defining features that quantify
writing style, a line of research known as “stylometry”
• Sentence length, word length, word frequencies, character frequencies, and
vocabulary richness functions
– Rudman (1998) estimated that as many as 1,000 different measures had been proposed
Stylometric Features
• Lexical and Character Features
– Consider a text as a mere sequence of word-tokens or characters, respectively.
• Syntactic and Semantic Features
– Require deeper linguistic analysis
• Application-Specific Features
– Can be defined only in certain text domains or languages.
1. Lexical Features
• A simple and natural way to view a text is as a:
– Sequence of tokens grouped into sentences, with each token corresponding to a
word, number, or punctuation mark.
• Method
– The very first attempts to attribute authorship were based on simple measures
such as sentence length counts and word length counts.
• Advantage
– They can be applied to any language and any corpus with no additional
requirements except the availability of a tokenizer.
• The most straightforward approach to represent texts is by vectors of word
frequencies (vast majority of authorship attribution studies).
Classification & Authorship Attribution
• Difference in style-based and topic-based text classification
– The most common words (articles, prepositions, pronouns, etc.) are found to be
among the best features to discriminate between authors.
– Such words are usually excluded from the feature set of the topic-based text-
classification methods since they do not carry any semantic information, and they
are usually called “function words”.
• Various sets of function words have been used for English (sets of 150, 303, 365,
480, and 675 words), but limited information was provided about how they were selected.
Methods
• A Simple Method
– Extract the most frequent words found in the available corpus (comprising all the
texts of the candidate authors).
– Then, a decision has to be made about the number of frequent words that will
be used as features.
• In the earlier studies, sets of at most 100 frequent words were considered adequate
to represent the style of an author.
– Another factor that affects the feature-set size is the classification algorithm that
will be used since many algorithms over-fit the training data when the
dimensionality of the problem increases.
– Some machine learning algorithms (such as SVM) can deal with thousands of
features.
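The simple method above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular study: the function name and the whitespace-plus-lowercase tokenizer are assumptions, and a real system would use a proper tokenizer.

```python
from collections import Counter

def frequent_word_features(corpus_texts, text, top_n=100):
    """Represent `text` by the relative frequencies of the top_n most
    frequent words in the pooled corpus of all candidate authors."""
    # Pool all candidate authors' texts to pick the shared vocabulary.
    pooled = Counter()
    for t in corpus_texts:
        pooled.update(t.lower().split())          # naive tokenizer (assumption)
    vocabulary = [w for w, _ in pooled.most_common(top_n)]

    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in vocabulary]
```

Each text (training or unseen) is mapped to a fixed-length vector over the same vocabulary, which can then be fed to any standard classifier.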
Methods (Cont’d)
• Required Routines
– Tokenizer (Word Extraction)
– Conversion to lowercase
– Stemmers
– Lemmatizers
– Detectors of common homographic forms
• Disadvantages
– The bag-of-words approach provides a simple and efficient solution, but
disregards word-order (i.e., contextual) information.
• One Possible Solution
– n-grams
Methods (Cont’d)
• n-grams
– Sequences of n contiguous words, also known as word collocations
• Features
– The dimensionality of the problem following this approach increases considerably
with n to account for all the possible combinations between words.
– The representation produced by this approach is very sparse since most of the
word combinations are not encountered in a given (especially short) text, making it
very difficult to be handled effectively by a classification algorithm.
– The classification accuracy achieved by word n-grams is not always better than
individual word features.
Methods (Cont’d)
• Writing Error Measures
– Spelling errors
• Letter omissions
• Insertions
– Formatting errors
• “all caps” words
– This method needs an accurate spell checker.
Methods
• Vocabulary Richness
– The vocabulary richness functions are attempts to quantify the diversity of the
vocabulary of a text.
– A typical example is the type-token ratio V/N, where V is the size of the
vocabulary (unique tokens) and N is the total number of tokens of the text.
• Unreliable Measures
– Vocabulary size (V) depends heavily on text length (as the text length increases, the
vocabulary also increases, quickly at the beginning and then more and more
slowly).
– Various functions have been proposed to achieve stability over text length,
including K (Yule, 1944), and R (Honore, 1979), with questionable results.
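The type-token ratio is trivial to compute, and Yule's K can be computed from the frequency spectrum of the text. The sketch below uses one common formulation of K; the function names are illustrative.

```python
from collections import Counter

def type_token_ratio(tokens):
    """V/N: number of distinct word types over total token count."""
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    """Yule's K in one common formulation: 10^4 * (M2 - N) / N^2, where
    M2 = sum over frequencies f of f^2 times the number of types that
    occur f times. A larger K indicates a less diverse vocabulary."""
    n = len(tokens)
    freqs = Counter(tokens)             # type -> frequency
    spectrum = Counter(freqs.values())  # frequency -> number of types
    m2 = sum(f * f * v for f, v in spectrum.items())
    return 1e4 * (m2 - n) / (n * n)
```

For an all-distinct text K is 0, and it grows as words repeat; unlike raw V/N, K was designed to be more stable over text length, though (as noted above) with questionable results.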
2. Character Features
• A text is viewed as a mere sequence of characters.
• Character-level Measures
– Alphabetic characters count
– Digit characters count
– Uppercase and lowercase characters count
– Punctuation marks count
– And so on …
• Features
– This type of information is easily available for any natural language and corpus.
– It has been proven to be quite useful for quantifying writing style.
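The character-level counts listed above reduce to a few built-in character tests; a minimal sketch (the dictionary keys are illustrative):

```python
import string

def char_level_measures(text):
    """Simple character-level counts used as stylometric features."""
    return {
        "alphabetic": sum(c.isalpha() for c in text),
        "digits": sum(c.isdigit() for c in text),
        "uppercase": sum(c.isupper() for c in text),
        "lowercase": sum(c.islower() for c in text),
        "punctuation": sum(c in string.punctuation for c in text),
    }
```

These counts are usually normalized by text length before being used as features.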
Methods
• Character n-gram
– Extract frequencies of n-grams on the character level.
• Features
– An advantage of this representation is its tolerance to noise.
– In cases of lexicon errors the character n-gram representation is not affected
dramatically.
• The words “simplistic” and “simpilstc” would produce many common character
3-grams.
– For East Asian languages such as Chinese, where tokenization is difficult,
character n-grams offer a suitable solution.
– The procedure of extracting the most frequent n-grams is language-independent
and requires no special tools.
• Compression-based Approaches
– Will be discussed later …
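Extracting character n-gram frequencies takes one line, which also makes the noise-tolerance point concrete: the two misspelling variants mentioned above still share character 3-grams. A minimal sketch:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Frequencies of contiguous character n-grams of `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# The lexicon-error example from above: despite the misspelling,
# the two words share several character 3-grams.
a = char_ngrams("simplistic")
b = char_ngrams("simpilstc")
shared = set(a) & set(b)
```

No tokenizer, stemmer, or other language-specific tool is needed, which is exactly the language-independence advantage noted above.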
3. Syntactic Features
• Employing syntactic information
• Idea
– The idea is that authors tend to unconsciously use similar syntactic patterns.
– Therefore, syntactic information is considered a more reliable authorial
fingerprint than lexical information.
– This type of information requires robust and accurate NLP tools able to
perform syntactic analysis of texts.
• The syntactic measure extraction is a language dependent procedure
• Such features will produce noisy datasets due to unavoidable errors made by
the parser.
Methods
• Rewrite Rules
– Extract rewrite-rule frequencies from a full parse tree produced for each
sentence.
– Rewrite rules capture part of the syntactic analysis.
– Consider the following rewrite rule:
A : PP → P : PREP + PC : NP
– It means that an adverbial prepositional phrase is constituted by a preposition
followed by a noun phrase as a prepositional complement.
– This information describes “how the words are combined to form phrases or other
structures”.
– Experimental results have shown that this type of measure performs better than
lexical and character features.
• It needs an accurate, fully automated parser, able to provide a detailed syntactic
analysis of sentences.
Methods (Cont’d)
• Phrase Analysis
– Another attempt to exploit syntactic information was proposed by Stamatatos.
– That sentence itself would be analyzed as follows:
– NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed]
PP[by Stamatatos]
– Where NP, VP, and PP stand for noun phrase, verb phrase, and prepositional
phrase, respectively.
– This type of information is simpler than Rewrite Rules.
– It could be extracted automatically with relatively high accuracy.
– The extracted measures include noun phrase counts, verb phrase counts,
lengths of noun phrases, lengths of verb phrases, and so on.
Methods (Cont’d)
• Part-of-Speech (POS)
– A POS tagger is a tool that assigns a tag of morpho-syntactic information to each
word-token based on contextual information.
– Several researchers have used POS tag frequencies or POS tag n-gram frequencies
to represent style
– POS tag information provides only a hint of the structural analysis of sentences
since it is not clear:
• How the words are combined to form phrases
• How the phrases are combined into higher level structures
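Once a text has been tagged, POS tag n-gram frequencies are computed like any other n-gram feature. The sketch below assumes the tag sequence comes from an upstream POS tagger (not shown); the example tag sequence is hypothetical.

```python
from collections import Counter

def pos_ngram_freqs(tags, n=2):
    """Frequencies of contiguous POS-tag n-grams; `tags` is assumed
    to be the output of an upstream POS tagger."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Hypothetical Penn-Treebank-style tags for a short sentence.
tags = ["DT", "NN", "VBD", "VBN", "IN", "NNP"]
bigrams = pos_ngram_freqs(tags, n=2)
```

As the slide notes, such counts only hint at structure: they record which tags follow which, not how words combine into phrases.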
4. Semantic Features
• Low-Level vs. High-Level
– The previous methods operate at the surface level (low-level), not the semantic level (high-level).
– NLP tools can be applied successfully to low-level tasks such as:
• Sentence splitting
• POS tagging
• Text chunking
• Partial parsing
– More complicated tasks cannot yet be handled adequately by current NLP
technology for unrestricted text, such as:
• Full syntactic parsing
• Semantic analysis
• Pragmatic analysis
– As a result, very few attempts have been made to exploit high-level features for
stylometric purposes.
Semantic Features Tools
• Produce semantic dependency graphs
– Gamon, Michael. "Linguistic correlates of style: authorship classification with
deep linguistic analysis features." In Proceedings of the 20th international
conference on Computational Linguistics, p. 611. Association for Computational
Linguistics, 2004.
• Extracting semantic measures based on WordNet
– McCarthy, Philip M., Gwyneth A. Lewis, David F. Dufty, and Danielle S.
McNamara. "Analyzing writing styles with Coh-Metrix." In Proceedings of the
Florida Artificial Intelligence Research Society International Conference
(FLAIRS), pp. 764-769. 2006.
• Using the theory of Systemic Functional Grammar (SFG)
– Argamon, Shlomo, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu
Garg, and Shlomo Levitan. "Stylistic text classification using functional lexical
features." Journal of the American Society for Information Science and
Technology 58, no. 6 (2007): 802-822.
5. Application-Specific Features
• Application-Specific Features
– One can define application-specific measures to better represent the nuances of
style in a given text domain.
– Structural measures can be defined to quantify authorial style in special domains,
such as e-mail messages and online-forum messages.
– Structural measures include:
• The use of greetings and farewells in the messages,
• Types of signatures,
• Use of indentation,
• Paragraph length,
• and so on …
– Other types of application-specific features can be defined only for certain natural
languages, such as Greek.
Attribution Methods (Cont’d)
• Profile-Based Approaches
– Cumulatively (per author)
• Concatenating all the available training texts per author in one big file
(author’s profile)
• The stylometric measures extracted from the concatenated file may be quite
different in comparison to each of the original training texts.
• Instance-Based Approaches
– Individually (per author)
1.2. Compression Models
• Compression Models
– Such methods do not produce a concrete vector representation of the author’s
profile.
• Steps
– Initially, the compression algorithm is applied to each candidate author’s
concatenated training text xa, producing a compressed file of size C(xa).
– Then, the unseen text x is appended to each xa, and the compression algorithm is
called again to obtain C(xa + x).
– The difference in bit-wise size d(x, xa) = C(xa + x) − C(xa) indicates the
similarity of the unseen text to each candidate author (smaller is more similar).
– These models are applied only to character sequences, not word sequences.
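The steps above can be sketched with any off-the-shelf compressor; this minimal illustration uses zlib, which is one possible choice rather than the algorithm of any particular study.

```python
import zlib

def compressed_size(text):
    """C(x): size in bits of the compressed byte stream of x."""
    return 8 * len(zlib.compress(text.encode("utf-8")))

def compression_distance(author_profile, unseen):
    """d(x, xa) = C(xa + x) - C(xa): how many extra bits the unseen
    text costs when appended to an author's profile. A smaller value
    means the unseen text is more similar to that author's writing."""
    return compressed_size(author_profile + unseen) - compressed_size(author_profile)
```

Attribution then picks the candidate author whose profile yields the smallest d(x, xa).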
1.3. Common n-grams (CNG) (Cont’d)
• Parameters
– The CNG method has two important parameters that should be tuned:
– The profile size L
• How many n-grams constitute the profile.
– And the character n-gram length n
• How long each n-gram in the profile is.
– Keselj et al. (2003) reported their best results for 1,000 ≤ L ≤ 5,000 and 3 ≤ n ≤ 5.
– The CNG distance function performs well when the training corpus is relatively
balanced.
– But it fails in imbalanced cases where at least one author’s profile is shorter than L.
• CNG variants
– Variants have been proposed to solve the class-imbalance problem.
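A sketch of the CNG profile and its relative-frequency dissimilarity, following the description in Keselj et al. (2003); the exact normalization details here are a plain reading of that formula, not a verified reimplementation.

```python
from collections import Counter

def profile(text, n=3, L=1000):
    """Author profile: the L most frequent character n-grams of the
    text, mapped to their normalized frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}

def cng_distance(p1, p2):
    """Relative-frequency dissimilarity over the union of both
    profiles; identical profiles have distance 0."""
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d
```

The imbalance failure mode is visible here: if an author's training text yields fewer than L distinct n-grams, that profile is truncated and the distance is distorted.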
2.1. Vector Space Models
• Definition
– Each text can be considered as a vector in a multivariate space.
– Then, a variety of powerful statistical and machine learning algorithms can be
used to build a classification model, including:
• Discriminant Analysis
• SVM
• Decision Trees
• Neural Networks
• Genetic Algorithms
• Memory-based Learners
• Classifier Ensemble Methods
• and so on.
– Some of these algorithms can effectively handle high-dimensional, noisy, and
sparse data, allowing more expressive representations of texts.
– The effectiveness of these methods is diminished by class imbalance.
2.2. Similarity-based Models
• Idea
– Calculate pairwise similarity measures between the unseen text and all the
training texts,
– and then estimate the most likely author with a nearest-neighbor algorithm.
• Example
– Compression Model
• Each training text is compressed separately using an off-the-shelf algorithm
• C(x) is the bit-wise size of the compression of file x
• The difference C(x +y) − C(x) indicates the similarity of a training text x with the
unseen text y.
2.3. Meta-learning Models
• Definition
– More complex algorithms specifically designed for authorship attribution.
– The main goal is to use such meta-data to understand how automatic learning can
become flexible in solving different kinds of learning problems:
• Hence to improve the performance of existing learning algorithms.
– The most interesting approach of this kind is the unmasking method.
Unmasking Method
• Unmasking Method
– In the unmasking method, for each unseen text, an SVM classifier is built to
discriminate it from the training texts of each candidate author.
– Thus, for n candidate authors, n classifiers are built for each unseen text.
– Then, in an iterative procedure, a predefined number of the most important
features of each classifier is removed and the drop in accuracy is measured.
– At the beginning, all the classifiers have more or less the same very high accuracy.
– After a few iterations, the accuracy of the classifier that discriminates between the
unseen text and the true author drops sharply, while the accuracy of the other
classifiers remains relatively high.
– This happens because the differences between the unseen text and the other
authors are manifold, so removing a few features does not affect their accuracy
dramatically.
3. Hybrid Approaches
• Hybrid Approaches
– Methods that borrow some elements from both profile-based and instance-based
approaches.
• Example
– All the training text samples are represented separately, as in the
instance-based approaches.
– The representation vectors for the texts of each author are then averaged
feature-wise to produce a single profile vector per author, as in the
profile-based approaches.
– The distance between the profile of an unseen text and the profile of each author
is then calculated by a weighted feature-wise function.
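The two hybrid steps above are simple vector operations. This sketch uses a weighted L1 distance purely as an illustrative choice of "weighted feature-wise function"; the source does not specify which function is used.

```python
def average_profile(vectors):
    """Feature-wise average of an author's instance vectors,
    yielding a single profile vector for that author."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def weighted_distance(u, v, weights):
    """Weighted feature-wise distance between two profile vectors;
    the weighting scheme here is an illustrative assumption."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, u, v))
```

The unseen text's vector is compared against each author's averaged profile, and the nearest profile wins.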
Glossary
• Lemmatisation (or lemmatization): in linguistics, the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. (Inflection: a change in the form of a word.)
• Oriental: of the East, e.g. China.
• Nuance: 1. A subtle or slight degree of difference, as in meaning, feeling, or tone; a gradation. 2. Expression or appreciation of subtle shades of meaning, feeling, or tone: a rich artistic performance, full of nuance.