Basic Definitions
Stylometry
Features
Algorithms
Authorship Attribution– Reza Ramezani
Danielle Jones
• Last seen 18th June 2001.
• After her disappearance a series of text
messages were sent from her phone.
• Linguistic analysis showed that the later
messages were sent by her uncle,
Stuart Campbell.
• Campbell was convicted of Danielle’s
murder 19th December 2002 in part
because of the linguistic evidence.
Jenny Nicholl
• Last seen 30th June 2005.
• After her disappearance a series of
text messages were sent from her
phone.
• Linguistic analysis showed that the
later messages were sent by her
former lover, David Hodgson.
• Hodgson was convicted of Jenny’s murder on 19th February 2008,
in part because of the linguistic evidence.
Importance
Authorship Attribution
• Definition
– In the typical authorship attribution problem, a text of unknown authorship is
assigned to one candidate author, given a set of candidate authors for whom text
samples of undisputed authorship are available.
– From a machine learning point of view, this can be viewed as a multiclass, single-
label text-categorization task.
– This task is also called authorship (or author) identification.
• Idea
– The main idea behind authorship attribution is that, by measuring some textual features,
we can distinguish between texts written by different authors.
– Authorship attribution is supported by statistical or computational methods.
– This scientific field takes advantage of research advances in areas such as machine
learning, information retrieval, and natural language processing.
Supervised Learning
Stylometry
• Representation
– Research in authorship attribution has been dominated by attempts to define features for
quantifying writing style, a line of research known as “stylometry”.
• Sentence length, word length, word frequencies, character frequencies, and
vocabulary richness functions
– Rudman (1998) estimated that nearly 1,000 different measures had been proposed.
Stylometric Features
• Lexical and Character Features
– Consider a text as a mere sequence of word-tokens or characters, respectively.
• Syntactic and Semantic Features
– Require deeper linguistic analysis
• Application-Specific Features
– Can be defined only in certain text domains or languages.
1. Lexical Features
• A simple and natural way to view a text is as a:
– Sequence of tokens grouped into sentences, with each token corresponding to a
word, number, or punctuation mark.
• Method
– The very first attempts to attribute authorship were based on simple measures
such as sentence length counts and word length counts.
• Advantage
– They can be applied to any language and any corpus with no additional
requirements except the availability of a tokenizer.
• The most straightforward approach to represent texts is by vectors of word
frequencies (used in the vast majority of authorship attribution studies).
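The simple lexical measures named on this slide (sentence length, word length, word frequencies) can be sketched in Python as follows. The regular-expression tokenizer and sentence splitter are simplifying assumptions made for illustration; a real study would use a proper tokenizer for the language at hand.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough approximation)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def lexical_measures(text):
    """Return average sentence length, average word length, and word counts."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    return {
        "avg_sentence_length": len(tokens) / len(sentences),   # words per sentence
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "word_frequencies": Counter(tokens),
    }

sample = "The cat sat. The dog ran away!"
measures = lexical_measures(sample)
print(measures["avg_sentence_length"], measures["word_frequencies"]["the"])
```

The only language-specific component here is the tokenizer, which is the advantage the slide points out.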
Classification & Authorship Attribution
• Difference between style-based and topic-based text classification
– The most common words (articles, prepositions, pronouns, etc.) are found to be
among the best features to discriminate between authors.
– Such words, usually called “function words”, are typically excluded from the feature
set of topic-based text-classification methods since they do not carry any semantic
information.
• Various sets of function words have been used for English (of 150, 303, 365, 480, and
675 words), but limited information was provided about the way they were selected.
Methods
• A Simple Method
– Extract the most frequent words found in the available corpus (comprising all the
texts of the candidate authors).
– Then, a decision has to be made about the number of frequent words that will
be used as features.
• In the earlier studies, sets of at most 100 frequent words were considered adequate
to represent the style of an author.
– Another factor that affects the feature-set size is the classification algorithm that
will be used since many algorithms over-fit the training data when the
dimensionality of the problem increases.
– Some machine learning algorithms (such as SVMs) can deal with thousands of
features.
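A minimal sketch of the simple method above: pool the texts of all candidate authors, keep the k most frequent words as the feature set, and represent each text by the relative frequencies of those words. The toy corpus, the value of k, and the helper names (`most_frequent_words`, `frequency_vector`) are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Naive lowercase word tokenizer (an assumption; any tokenizer would do)."""
    return re.findall(r"[a-z']+", text.lower())

def most_frequent_words(texts, k):
    """Return the k most frequent words over the whole corpus."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]

def frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1          # avoid division by zero on empty texts
    return [counts[w] / total for w in vocabulary]

corpus = {
    "author_a": "the ship and the sea and the storm",
    "author_b": "of mice and of men and of a farm",
}
vocab = most_frequent_words(corpus.values(), k=3)
vectors = {author: frequency_vector(text, vocab) for author, text in corpus.items()}
print(vocab, vectors)
```

With a realistic corpus the selected words are mostly function words, which is why k (the feature-set size) has to be chosen with the classifier in mind.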
Methods (Cont’d)
• Required Routines
– Tokenizer (Word Extraction)
– Conversion to lowercase
– Stemmers
– Lemmatizers
– Detectors of common homographic forms
• Disadvantages
– The bag-of-words approach provides a simple and efficient solution, but
disregards word-order (i.e., contextual) information.
• One Possible Solution
– n-grams
Methods (Cont’d)
• n-grams
– Sequences of n contiguous words, also known as word collocations
• Features
– The dimensionality of the problem following this approach increases considerably
with n to account for all the possible combinations between words.
– The representation produced by this approach is very sparse since most of the
word combinations are not encountered in a given (especially short) text, making it
very difficult to be handled effectively by a classification algorithm.
– The classification accuracy achieved by word n-grams is not always better than
that of individual word features.
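The sparsity problem can be illustrated with a short sketch that extracts word n-grams and compares the number of bigrams actually observed with the number of word pairs that are possible in principle. The example sentence and the helper name `word_ngrams` are chosen only for illustration.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all sequences of n contiguous tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be that is the question".split()
unigrams = Counter(word_ngrams(tokens, 1))
bigrams = Counter(word_ngrams(tokens, 2))

possible_bigrams = len(unigrams) ** 2      # every ordered pair of known words
print(len(bigrams), "observed of", possible_bigrams, "possible bigrams")
```

Even in this tiny text only a small fraction of the possible word pairs occurs, and the gap widens quickly as n or the vocabulary grows.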
Methods (Cont’d)
• Writing Error Measures
– Spelling errors
• Letter omissions
• Insertions
– Formatting errors
• “all caps” words
– This method needs an accurate spell checker.
Methods
• Vocabulary Richness
– The vocabulary richness functions are attempts to quantify the diversity of the
vocabulary of a text.
– A typical example is the type-token ratio V/N, where V is the size of the
vocabulary (unique tokens) and N is the total number of tokens of the text.
• Unreliable Measures
– Vocabulary size (V) depends heavily on text length (as the text length increases, the
vocabulary also increases, quickly at the beginning and then more and more
slowly).
– Various functions have been proposed to achieve stability over text length,
including K (Yule, 1944) and R (Honoré, 1979), with questionable results.
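The three measures named above can be computed as in the sketch below. It assumes the usual textbook definitions, which are not spelled out on the slide: Yule's K = 10^4 · (Σ i²·V_i − N) / N² and Honoré's R = 100 · log N / (1 − V_1/V), where V_i is the number of word types that occur exactly i times.

```python
import math
from collections import Counter

def vocabulary_richness(tokens):
    """Type-token ratio, Yule's K and Honore's R for a list of tokens."""
    n = len(tokens)                         # N: total number of tokens
    counts = Counter(tokens)
    v = len(counts)                         # V: vocabulary size
    spectrum = Counter(counts.values())     # spectrum[i] = V_i
    v1 = spectrum[1]                        # words occurring exactly once
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    # R is undefined when every word occurs once (V_1 == V); report infinity.
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"V": v, "N": n, "TTR": v / n, "K": yule_k, "R": honore_r}

tokens = "the cat and the dog and the bird".split()
richness = vocabulary_richness(tokens)
print(richness)
```

Running the function on prefixes of increasing length of the same text is a quick way to see how strongly each measure depends on text length.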
2. Character Features
• A text is viewed as a mere sequence of characters.
• Character-level Measures
– Alphabetic characters count
– Digit characters count
– Uppercase and lowercase characters count
– Punctuation marks count
– And so on …
• Feature
– This type of information is easily available for any natural language and corpus
– It has been proven to be quite useful to quantify the writing style.
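The character-level measures listed above need nothing beyond the standard library. The sketch below is one possible implementation; the choice of `string.punctuation` (ASCII punctuation only) is an assumption and would need extending for other scripts.

```python
import string

def character_measures(text):
    """Count basic character classes in a text."""
    return {
        "alphabetic": sum(c.isalpha() for c in text),
        "digits": sum(c.isdigit() for c in text),
        "uppercase": sum(c.isupper() for c in text),
        "lowercase": sum(c.islower() for c in text),
        "punctuation": sum(c in string.punctuation for c in text),
    }

counts = character_measures("Call me at 5pm, OK?")
print(counts)
```

In practice the raw counts are usually divided by the text length so that texts of different sizes can be compared.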
Methods
• Character n-gram
– Extract frequencies of n-grams on the character level.
• Features
– An advantage of this representation is its tolerance to noise.
– When texts contain spelling errors, the character n-gram representation is not affected
dramatically.
• The words “simplistic” and “simpilstc” still share some character
3-grams (“sim” and “imp”).
– For East Asian languages, where tokenization is quite hard, character
n-grams offer a suitable solution.
– The procedure of extracting the most frequent n-grams is language-independent
and requires no special tools.
• Compression-based Approaches
– Will be discussed later …
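Character n-gram extraction, as described on this slide, takes only a few lines. The sketch below counts character 3-grams and lists those shared by the correctly spelled and the misspelled word from the example; the helper name `char_ngrams` is hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all contiguous character sequences of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

correct = char_ngrams("simplistic", 3)
misspelled = char_ngrams("simpilstc", 3)
shared = sorted(set(correct) & set(misspelled))
print(shared)
```

A word-level representation would treat the two spellings as unrelated tokens, whereas the character-level profiles still overlap.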
3. Syntactic Features
• Employing syntactic information
• Idea
– The idea is that authors tend to unconsciously use similar syntactic patterns.
– Therefore, syntactic information is considered a more reliable authorial
fingerprint in comparison to lexical information.
– This type of information requires robust and accurate NLP tools able to
perform syntactic analysis of texts.
• The syntactic measure extraction is a language dependent procedure
• Such features will produce noisy datasets due to unavoidable errors made by
the parser.
Methods
• Rewrite Rule
– Extract rewrite rule frequencies from a full parse tree of each
sentence.
– Rewrite rules describe parts of the syntactic analysis.
– Consider the following rewrite rule:
A : PP → P : PREP + PC : NP
– It means that an adverbial prepositional phrase is constituted by a preposition
followed by a noun phrase as a prepositional complement.
– This information describes “how the words are combined to form phrases or other
structures”.
– Experimental results have shown that this type of measure can perform better than
lexical and character features.
• It needs an accurate, fully automated parser able to provide a detailed syntactic
analysis of sentences.
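Once a parse tree is available, counting rewrite rules is a simple tree traversal. The sketch below assumes a hypothetical tree encoding (nested tuples of the form `(label, child, ...)` with words as plain strings) and a hand-built tree; producing such trees automatically is the hard part and is not shown.

```python
from collections import Counter

def rewrite_rules(tree, counts=None):
    """Count rules 'PARENT -> CHILD + CHILD ...' in a nested-tuple parse tree."""
    if counts is None:
        counts = Counter()
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return counts                      # node directly above words: no phrase rule
    rhs = " + ".join(child[0] for child in children)
    counts[f"{label} -> {rhs}"] += 1
    for child in children:
        rewrite_rules(child, counts)
    return counts

# Hand-built example tree (an assumption, not parser output).
tree = ("S",
        ("NP", ("DET", "the"), ("N", "parser")),
        ("VP", ("V", "fails"),
               ("PP", ("PREP", "on"), ("NP", ("N", "noise")))))
rules = rewrite_rules(tree)
print(rules)
```

The resulting counts, normalized per text, would serve as the syntactic feature vector.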
Methods (Cont’d)
• Chunk Analysis
– Another attempt to exploit syntactic information was proposed by Stamatatos.
– The sentence above would be analyzed as follows:
– NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed]
PP[by Stamatatos]
– Here NP, VP, and PP stand for noun phrase, verb phrase, and prepositional
phrase, respectively.
– This type of information is simpler than rewrite rules.
– It can be extracted automatically with relatively high accuracy.
– The extracted measures refer to noun phrase counts, verb phrase counts,
length of noun phrases, length of verb phrases, and so on.
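Given chunker output in the bracketed notation used on this slide, the phrase counts and phrase lengths can be derived as sketched below. The regular expression assumes exactly that `TAG[words]` notation with no nested brackets; the chunker itself is not shown.

```python
import re
from collections import defaultdict

CHUNK = re.compile(r"(\w+)\[([^\]]*)\]")   # matches TAG[word word ...]

def chunk_measures(chunked):
    """Return count and average length (in words) for each phrase type."""
    lengths = defaultdict(list)
    for tag, words in CHUNK.findall(chunked):
        lengths[tag].append(len(words.split()))
    return {
        tag: {"count": len(ls), "avg_length": sum(ls) / len(ls)}
        for tag, ls in lengths.items()
    }

chunked = ("NP[Another attempt] VP[to exploit] NP[syntactic information] "
           "VP[was proposed] PP[by Stamatatos]")
phrase_stats = chunk_measures(chunked)
print(phrase_stats)
```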
Methods (Cont’d)
• Part-of-Speech (POS)
– A POS tagger is a tool that assigns a tag of morpho-syntactic information to each
word-token based on contextual information.
– Several researchers have used POS tag frequencies or POS tag n-gram frequencies
to represent style.
– POS tag information provides only a hint of the structural analysis of sentences
since it is not clear:
• How the words are combined to form phrases
• How the phrases are combined into higher level structures
4. Semantic Features
• Low-Level vs. High-Level
– Previous methods are at context level (low-level), not semantic level (high-level)
– NLP tools can be applied successfully to low-level tasks such as:
• Sentence splitting
• POS tagging
• Text chunking
• Partial parsing
– More complicated tasks cannot yet be handled adequately by current NLP
technology for unrestricted text, such as:
• Full syntactic parsing
• Semantic analysis
• Pragmatic analysis
– As a result, very few attempts have been made to exploit high-level features for
stylometric purposes.
Semantic Features Tools
• Produce semantic dependency graphs
– Gamon, Michael. "Linguistic correlates of style: authorship classification with
deep linguistic analysis features." In Proceedings of the 20th international
conference on Computational Linguistics, p. 611. Association for Computational
Linguistics, 2004.
• Extracting semantic measures based on WordNet
– McCarthy, Philip M., Gwyneth A. Lewis, David F. Dufty, and Danielle S.
McNamara. "Analyzing writing styles with Coh-Metrix." In Proceedings of the
Florida Artificial Intelligence Research Society International Conference
(FLAIRS), pp. 764-769. 2006.
• Using the theory of Systemic Functional Grammar (SFG)
– Argamon, Shlomo, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu
Garg, and Shlomo Levitan. "Stylistic text classification using functional lexical
features." Journal of the American Society for Information Science and
Technology 58, no. 6 (2007): 802-822.
5. Application-Specific Features
• Application-Specific Features
– One can define application-specific measures to better represent the nuances of
style in a given text domain.
– Structural measures can be defined to quantify authorial style in specific domains.
• For example, e-mail messages and online-forum messages
– Structural measures include:
• The use of greetings and farewells in the messages,
• Types of signatures,
• Use of indentation,
• Paragraph length,
• and so on …
– Other types of application-specific features can be defined only for certain natural
languages, such as Greek.
Attribution Methods
• Profile-Based Approaches
– Probabilistic models
– Compression models
– Common n-grams and variants
• Instance-Based Approaches
– Vector space models
– Similarity-based models
– Meta-learning models
• Hybrid Approaches
– Average Methods
Attribution Methods (Cont’d)
• Profile-Based Approaches
– Training texts are treated cumulatively (per author)
• Concatenating all the available training texts per author in one big file
(author’s profile)
• The stylometric measures extracted from the concatenated file may be quite
different in comparison to each of the original training texts.
• Instance-Based Approaches
– Training texts are treated individually (per text): each one is a separate instance
of its author’s style
1. Profile-Based Approaches
1.1. Probabilistic Models
1.2. Compression Models
• Compression Models
– Such methods do not produce a concrete vector representation of the author’s
profile.
• Steps
– Initially, all the training texts of each candidate author a are concatenated into one
file xa, and a compression algorithm is called to produce a compressed file C(xa).
– Then, the unseen text x is appended to each file xa, and the compression algorithm is
called again to produce C(xa + x).
– The difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) − C(xa),
indicates the similarity of the unseen text to each candidate author: the smaller the
difference, the more similar the texts.
– These models are applied only to character sequences, not word sequences.
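The steps above can be sketched with an off-the-shelf compressor. The sketch uses `zlib` from the Python standard library as a stand-in (the slide does not name a particular algorithm), and the toy training texts are invented for illustration.

```python
import zlib

def c(text):
    """Size in bits of the zlib-compressed text (stand-in for C(.))."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def attribute(unseen, training):
    """training maps each author to a list of texts; returns (best author, distances)."""
    # Profile-based: concatenate all texts of an author into one file xa.
    profiles = {author: " ".join(texts) for author, texts in training.items()}
    # d(x, xa) = C(xa + x) - C(xa); smaller means more similar.
    distances = {author: c(xa + " " + unseen) - c(xa) for author, xa in profiles.items()}
    return min(distances, key=distances.get), distances

training = {
    "a": ["the quick brown fox jumps over the lazy dog", "again and again"],
    "b": ["lorem ipsum dolor sit amet consectetur adipiscing elit"],
}
best, distances = attribute("the quick brown fox jumps over the lazy dog", training)
print(best, distances)
```

No feature vector is ever built: the compressor's internal model of xa plays the role of the author's profile.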
1.3. Common n-grams (CNG)
1.3. Common n-grams (CNG) (Cont’d)
• Parameters
– The CNG method has two important parameters that should be tuned:
– The profile size L
• How many strings constitute the profile.
– The character n-gram length n
• How long the strings that constitute the profile are.
– Keselj et al. (2003) reported their best results for 1,000 ≤ L ≤ 5,000 and 3 ≤ n ≤ 5.
– The CNG distance function performs well when the training corpus is relatively
balanced.
– But it fails in imbalanced cases where at least one author’s profile is shorter than L.
• CNG variant
– Variants of CNG have been proposed to address the class-imbalance problem.
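A minimal sketch of CNG with its two parameters, assuming the dissimilarity of Keselj et al. (2003): a profile is the set of the L most frequent character n-grams with their normalized frequencies, and the distance between profiles P1 and P2 is the sum over all n-grams g in either profile of (2·(f1(g) − f2(g)) / (f1(g) + f2(g)))², with a missing n-gram counted as frequency 0. The function names and toy texts are illustrative only.

```python
from collections import Counter

def profile(text, n, L):
    """The L most frequent character n-grams with normalized frequencies."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: count / total for g, count in grams.most_common(L)}

def cng_dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over both profiles."""
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        d += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return d

def attribute(unseen, author_texts, n=3, L=1000):
    """Return the author whose profile is least dissimilar to the unseen text."""
    x = profile(unseen, n, L)
    scores = {a: cng_dissimilarity(x, profile(t, n, L)) for a, t in author_texts.items()}
    return min(scores, key=scores.get)

authors = {"a": "the cat sat on the mat", "b": "zzz qqq xxx www"}
guess = attribute("the cat sat", authors)
print(guess)
```

Every n-gram present in only one profile contributes the maximum value 4, which hints at why the measure misbehaves when one author's profile is much shorter than L.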
2. Instance-Based Approaches
2.1. Vector Space Models
• Definition
– Each text can be considered as a vector in a multivariate space.
– Then, a variety of powerful statistical and machine learning algorithms can be
used to build a classification model, including:
• Discriminant Analysis
• SVM
• Decision Trees
• Neural Networks
• Genetic Algorithms
• Memory-based Learners
• Classifier Ensemble Methods
• and so on.
– Some of these algorithms can effectively handle high-dimensional, noisy, and
sparse data, allowing more expressive representations of texts.
– The effectiveness of these methods is diminished by class imbalance.
2.2. Similarity-based Models
• Idea
– Calculation of pairwise similarity measures between the unseen text and all the
training texts,
– And then estimating the most likely author based on a nearest-neighbor algorithm.
• Example
– Compression Model
• Each training text is compressed into a separate file using an off-the-shelf algorithm.
• C(x) is the bit-wise size of the compression of file x.
• The difference C(x + y) − C(x) indicates the similarity of a training text x to the
unseen text y (the smaller the difference, the more similar the texts).
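The similarity-based idea can be sketched as a 1-nearest-neighbor search in which every training text is kept as a separate instance. As before, `zlib` is used as an assumed off-the-shelf compressor, and the labelled toy texts are invented for illustration.

```python
import zlib

def c(text):
    """Size in bits of the zlib-compressed text (stand-in for C(.))."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def nearest_author(unseen, training_texts):
    """training_texts: list of (author, text) pairs, one per training text."""
    best_author, best_d = None, None
    for author, x in training_texts:
        d = c(x + " " + unseen) - c(x)     # C(x + y) - C(x)
        if best_d is None or d < best_d:
            best_author, best_d = author, d
    return best_author

training_texts = [
    ("a", "the quick brown fox jumps over the lazy dog"),
    ("a", "a stitch in time saves nine"),
    ("b", "lorem ipsum dolor sit amet consectetur"),
]
predicted = nearest_author("quick brown fox jumps over the lazy dog", training_texts)
print(predicted)
```

Unlike the profile-based compression model, nothing is concatenated here: the unseen text is compared with each training text on its own, and the label of the closest one is returned.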
2.3. Meta-learning Models
• Definition
– More complex algorithms specifically designed for authorship attribution.
– Meta-learning in general uses meta-data about learning tasks to understand how
automatic learning can become flexible in solving different kinds of learning problems,
• and hence to improve the performance of existing learning algorithms.
– The most interesting approach of this kind is the unmasking method.
Unmasking Method
• Unmasking Method
– In the unmasking method, for each unseen text, an SVM classifier is built to
discriminate it from the training texts of each candidate author.
– Thus, for n candidate authors, n classifiers are built for each unseen text.
– Then, in an iterative procedure, a predefined number of the most important
features for each classifier are removed and the drop in accuracy is measured.
– At the beginning, all the classifiers have more or less the same very high accuracy.
– After a few iterations, the accuracy of the classifier that discriminates between the
unseen text and the true author drops sharply, while the accuracy of the other
classifiers remains relatively high.
– This happens because the differences between the unseen text and the other
authors are manifold, so removing a few features does not affect the accuracy
dramatically, whereas the unseen text differs from the true author’s texts in only a
few features.
3. Hybrid Approaches
• Hybrid Approaches
– Methods that borrow some elements from both profile-based and instance-based
approaches.
• Example
– All the training text samples are represented separately, as happens with the
instance-based approaches.
– The representation vectors for the texts of each author are averaged feature-wise
to produce a single profile vector for each author, as happens with the profile-based
approaches.
– The distance of the profile of an unseen text from the profile of each author is then
calculated by a weighted feature-wise function.
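The hybrid example can be sketched as follows. The slide does not specify the weighted distance function, so a weighted Euclidean distance with caller-supplied weights is assumed here, and the two-dimensional feature vectors are made-up numbers.

```python
def average_profile(vectors):
    """Feature-wise average of an author's per-text vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def weighted_distance(u, v, weights):
    """Weighted Euclidean distance (an assumed choice of distance function)."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)) ** 0.5

def attribute(unseen_vector, training, weights):
    """training maps each author to a list of per-text feature vectors."""
    profiles = {author: average_profile(vs) for author, vs in training.items()}
    return min(profiles,
               key=lambda a: weighted_distance(unseen_vector, profiles[a], weights))

training = {
    "a": [[0.10, 0.02], [0.12, 0.04]],   # one vector per training text
    "b": [[0.03, 0.09], [0.05, 0.11]],
}
author = attribute([0.10, 0.03], training, weights=[1.0, 1.0])
print(author)
```

Each text is first represented on its own (instance-based step), and only the averaged vector per author is kept for comparison (profile-based step).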
36
Authorship Attribution – Reza Ramezani
37

More Related Content

What's hot

Forensic linguistics
Forensic linguistics Forensic linguistics
Forensic linguistics mimizin
 
Language policy dilemma in pakistan
Language policy dilemma in pakistanLanguage policy dilemma in pakistan
Language policy dilemma in pakistanAhdi Hassan
 
Forensic linguistics ppt by roshna
Forensic linguistics ppt by roshnaForensic linguistics ppt by roshna
Forensic linguistics ppt by roshnaG.P.G.C Mardan
 
The Linguistic Variables
The Linguistic VariablesThe Linguistic Variables
The Linguistic VariablesDr. Cupid Lucid
 
Forensic linguistics
Forensic linguistics Forensic linguistics
Forensic linguistics dentpress
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguisticsAlicia Ruiz
 
Sapir Whorf hypothesis
Sapir Whorf hypothesisSapir Whorf hypothesis
Sapir Whorf hypothesisAhmet Ateş
 
Linguistics presentacion
Linguistics presentacionLinguistics presentacion
Linguistics presentacionFranklin Pérez
 
An introduction to systemic functional linguistics
An introduction to systemic functional linguisticsAn introduction to systemic functional linguistics
An introduction to systemic functional linguisticsiendah lestari
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysisVivaAs
 
Interlanguage and interlanguage theory
Interlanguage and interlanguage theoryInterlanguage and interlanguage theory
Interlanguage and interlanguage theoryAbibAfzal
 
What is Applied Linguistics?
What is Applied Linguistics?What is Applied Linguistics?
What is Applied Linguistics?Shajaira Lopez
 
Language revitalization presentation
Language revitalization presentationLanguage revitalization presentation
Language revitalization presentationkashir6142
 

What's hot (20)

Forensic linguistics
Forensic linguistics Forensic linguistics
Forensic linguistics
 
Language policy dilemma in pakistan
Language policy dilemma in pakistanLanguage policy dilemma in pakistan
Language policy dilemma in pakistan
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Forensic linguistics ppt by roshna
Forensic linguistics ppt by roshnaForensic linguistics ppt by roshna
Forensic linguistics ppt by roshna
 
The Linguistic Variables
The Linguistic VariablesThe Linguistic Variables
The Linguistic Variables
 
Systemic Functional Linguistics
Systemic Functional LinguisticsSystemic Functional Linguistics
Systemic Functional Linguistics
 
Forensic linguistics
Forensic linguistics Forensic linguistics
Forensic linguistics
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Linguistics
LinguisticsLinguistics
Linguistics
 
Sapir Whorf hypothesis
Sapir Whorf hypothesisSapir Whorf hypothesis
Sapir Whorf hypothesis
 
Linguistics presentacion
Linguistics presentacionLinguistics presentacion
Linguistics presentacion
 
An introduction to systemic functional linguistics
An introduction to systemic functional linguisticsAn introduction to systemic functional linguistics
An introduction to systemic functional linguistics
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysis
 
Interlanguage and interlanguage theory
Interlanguage and interlanguage theoryInterlanguage and interlanguage theory
Interlanguage and interlanguage theory
 
What is Applied Linguistics?
What is Applied Linguistics?What is Applied Linguistics?
What is Applied Linguistics?
 
Language revitalization presentation
Language revitalization presentationLanguage revitalization presentation
Language revitalization presentation
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Forensic Linguistics
Forensic LinguisticsForensic Linguistics
Forensic Linguistics
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Approaches to discourse
Approaches to discourseApproaches to discourse
Approaches to discourse
 

Viewers also liked

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...osify
 
Co authorship and attribution
Co authorship and attributionCo authorship and attribution
Co authorship and attributionJenny Delasalle
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...Maarten van Wesel
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...Ahmed Mater
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaTraian Rebedea
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarismguestf17a2e
 
NLTK и Python для работы с текстами
NLTK и Python для работы с текстами  NLTK и Python для работы с текстами
NLTK и Python для работы с текстами NLProc.by
 
The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...yosra Yassora
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk Vijay Ganti
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detectionankit_saluja
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLPbutest
 
plagiarism detection tools and techniques
plagiarism detection tools and techniquesplagiarism detection tools and techniques
plagiarism detection tools and techniquesNimisha T
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningSujit Pal
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applicationsdahveed123
 

Viewers also liked (17)

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
 
Chernyak_defense
Chernyak_defenseChernyak_defense
Chernyak_defense
 
Co authorship and attribution
Co authorship and attributionCo authorship and attribution
Co authorship and attribution
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 
Plag detection
Plag detectionPlag detection
Plag detection
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarism
 
NLTK и Python для работы с текстами
NLTK и Python для работы с текстами  NLTK и Python для работы с текстами
NLTK и Python для работы с текстами
 
The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detection
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
 
plagiarism detection tools and techniques
plagiarism detection tools and techniquesplagiarism detection tools and techniques
plagiarism detection tools and techniques
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep Learning
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applications
 

Similar to Authorship attribution

Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4DigiGurukul
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckTao Xie
 
Natural Language Processing Course in AI
Natural Language Processing Course in AINatural Language Processing Course in AI
Natural Language Processing Course in AISATHYANARAYANAKB
 
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmmUnit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmmDhruvKushwaha12
 
Natural Language Processing basics presentation
Natural Language Processing basics presentationNatural Language Processing basics presentation
Natural Language Processing basics presentationPREETHIRRA2011003040
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
 
Sanskrit in Natural Language Processing
Sanskrit in Natural Language ProcessingSanskrit in Natural Language Processing
Sanskrit in Natural Language ProcessingHitesh Joshi
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.netwww.myassignmenthelp.net
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
Natural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyNatural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyRimzim Thube
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Supporting the authoring process with linguistic software
Supporting the authoring process with linguistic softwareSupporting the authoring process with linguistic software
Supporting the authoring process with linguistic softwarevsrtwin
 
Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxSHIBDASDUTTA
 

Authorship attribution

  • 3. Importance (Authorship Attribution – Reza Ramezani)
      • Danielle Jones
          – Last seen 18th June 2001.
          – After her disappearance a series of text messages were sent from her phone.
          – Linguistic analysis showed that the later messages were sent by her uncle, Stuart Campbell.
          – Campbell was convicted of Danielle’s murder on 19th December 2002, in part because of the linguistic evidence.
      • Jenny Nicholl
          – Last seen 30th June 2005.
          – After her disappearance a series of text messages were sent from her phone.
          – Linguistic analysis showed that the later messages were sent by her classmate, David Hodgson.
          – Hodgson was convicted of Jenny’s murder on 19th February 2008, in part because of the linguistic evidence.
  • 4. Authorship Attribution
      • Definition
          – In the typical authorship attribution problem, a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available.
          – From a machine learning point of view, this can be viewed as a multiclass, single-label text-categorization task.
          – This task is also called authorship (or author) identification.
      • Idea
          – The main idea behind authorship attribution is that measuring certain textual features makes it possible to distinguish between texts written by different authors.
          – Authorship attribution is supported by statistical or computational methods.
          – The field takes advantage of research advances in areas such as machine learning, information retrieval, and natural language processing.
  • 5. Supervised Learning
  • 6. Stylometry
      • Representation
          – Research in authorship attribution is dominated by attempts to define features for quantifying writing style, a line of research known as “stylometry”.
              • Examples: sentence length, word length, word frequencies, character frequencies, and vocabulary richness functions.
          – Rudman (1998) estimated that roughly 1,000 different measures had been proposed.
  • 7. Stylometric Features
      • Lexical and Character Features
          – Consider a text as a mere sequence of word-tokens or characters, respectively.
      • Syntactic and Semantic Features
          – Require deeper linguistic analysis.
      • Application-Specific Features
          – Can be defined only in certain text domains or languages.
  • 8. 1. Lexical Features
      • A simple and natural way to view a text is as a sequence of tokens grouped into sentences, with each token corresponding to a word, number, or punctuation mark.
      • Method
          – The very first attempts to attribute authorship were based on simple measures such as sentence length counts and word length counts.
      • Advantage
          – Such measures can be applied to any language and any corpus with no additional requirements except the availability of a tokenizer.
      • The most straightforward way to represent texts is by vectors of word frequencies; the vast majority of authorship attribution studies do so.
  • 9. Classification & Authorship Attribution
      • Difference between style-based and topic-based text classification
          – The most common words (articles, prepositions, pronouns, etc.) are found to be among the best features for discriminating between authors.
          – Such words, usually called “function words”, are typically excluded from the feature set of topic-based text-classification methods since they do not carry any semantic information.
      • Various sets of function words have been used for English (with 150, 303, 365, 480, and 675 words), but limited information was provided about how they were selected.
  • 10. Methods
      • A Simple Method
          – Extract the most frequent words found in the available corpus (comprising all the texts of the candidate authors).
          – Then decide how many of the frequent words will be used as features.
              • In earlier studies, sets of at most 100 frequent words were considered adequate to represent the style of an author.
          – Another factor that affects the feature-set size is the classification algorithm, since many algorithms overfit the training data when the dimensionality of the problem increases.
          – Some machine learning algorithms (such as SVMs) can deal with thousands of features.
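The simple method above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the regular-expression tokenizer, the function names (`tokenize`, `top_words`, `frequency_vector`), and the toy corpus are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(corpus, k):
    """Return the k most frequent words over all texts of all candidate authors."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]

def frequency_vector(text, vocabulary):
    """Represent one text by the relative frequency of each vocabulary word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in vocabulary]

# Toy corpus, invented for illustration only.
corpus = ["The cat sat on the mat.", "The dog and the cat ran to the park."]
vocabulary = top_words(corpus, 3)
vectors = [frequency_vector(text, vocabulary) for text in corpus]
```

The value of `k` is the feature-set size discussed above; the resulting vectors can be passed to any classifier that accepts fixed-length numeric input.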
  • 11. Methods (Cont’d)
      • Required Routines
          – Tokenizer (word extraction)
          – Conversion to lowercase
          – Stemmers
          – Lemmatizers
          – Detectors of common homographic forms
      • Disadvantage
          – The bag-of-words approach provides a simple and efficient solution, but disregards word-order (i.e., contextual) information.
      • One Possible Solution
          – n-grams
  • 12. Methods (Cont’d)
      • n-grams
          – Sequences of n contiguous words, also known as word collocations.
      • Features
          – The dimensionality of the problem increases considerably with n, since all the possible combinations of words must be accounted for.
          – The resulting representation is very sparse, since most word combinations do not occur in a given (especially short) text; this makes it very difficult for a classification algorithm to handle effectively.
          – The classification accuracy achieved by word n-grams is not always better than that of individual word features.
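A minimal sketch of word n-gram extraction, assuming whitespace tokenization and an invented example sentence; `word_ngrams` is a hypothetical helper name. It shows how the number of distinct features grows when moving from single words to word pairs.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return every sequence of n contiguous tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented example sentence; real texts would go through a proper tokenizer.
tokens = "the cat sat on the mat and the cat slept".split()
unigram_counts = Counter(word_ngrams(tokens, 1))
bigram_counts = Counter(word_ngrams(tokens, 2))
```

Even in this ten-word example most bigrams occur exactly once, which hints at the sparsity problem described above.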
  • 13. Methods (Cont’d)
      • Writing Error Measures
          – Spelling errors
              • Letter omissions
              • Insertions
          – Formatting errors
              • “All caps” words
          – This method needs an accurate spell checker.
  • 14. Methods
      • Vocabulary Richness
          – Vocabulary richness functions are attempts to quantify the diversity of the vocabulary of a text.
          – A typical example is the type-token ratio V/N, where V is the size of the vocabulary (unique tokens) and N is the total number of tokens of the text.
      • Unreliable Measures
          – Vocabulary size (V) depends heavily on text length: as the text length increases, the vocabulary also increases, quickly at the beginning and then more and more slowly.
          – Various functions have been proposed to achieve stability over text length, including K (Yule, 1944) and R (Honoré, 1979), with questionable results.
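The three measures named above can be computed as in the sketch below. The slides do not give the formulas for K and R; the definitions used here are the commonly cited ones, K = 10^4 (Σ i² V_i − N) / N² and R = 100 log N / (1 − V_1 / V), where V_i is the number of word types occurring exactly i times. Treat them as an assumption, along with the function name `richness`.

```python
import math
from collections import Counter

def richness(tokens):
    """Return (type-token ratio, Yule's K, Honore's R) for a non-empty token list."""
    n = len(tokens)                      # N: total number of tokens
    freqs = Counter(tokens)
    v = len(freqs)                       # V: vocabulary size
    spectrum = Counter(freqs.values())   # V_i: number of types occurring i times
    ttr = v / n
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    v1 = spectrum.get(1, 0)              # number of types occurring exactly once
    # R is undefined when every type occurs exactly once (V_1 == V).
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return ttr, yule_k, honore_r

ttr, yule_k, honore_r = richness("a b a c a b".split())
```

Running the function on prefixes of increasing length of the same text is a quick way to observe the dependence on text length mentioned above.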
  • 15. 2. Character Features
      • A text is viewed as a mere sequence of characters.
      • Character-level Measures
          – Alphabetic character count
          – Digit character count
          – Uppercase and lowercase character counts
          – Punctuation mark count
          – And so on …
      • Features
          – This type of information is easily available for any natural language and corpus.
          – It has proven to be quite useful for quantifying writing style.
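As a rough illustration of the character-level measures listed above, the following sketch counts each character class and normalizes by text length. Normalizing by length, the dictionary keys, and the use of `string.punctuation` (ASCII punctuation only) are assumptions made for this example.

```python
import string

def char_features(text):
    """Return the share of each character class in the text."""
    total = len(text) or 1  # avoid division by zero for empty texts
    return {
        "alpha": sum(c.isalpha() for c in text) / total,
        "digit": sum(c.isdigit() for c in text) / total,
        "upper": sum(c.isupper() for c in text) / total,
        "lower": sum(c.islower() for c in text) / total,
        "punct": sum(c in string.punctuation for c in text) / total,
    }

features = char_features("Hi, R2!")
```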
  • 16. Methods
      • Character n-grams
          – Extract frequencies of n-grams on the character level.
      • Features
          – An advantage of this representation is its tolerance to noise.
          – In cases of spelling errors, the character n-gram representation is not affected dramatically.
              • The words “simplistic” and “simpilstc” still share some character 3-grams, whereas they do not match at all as whole words.
          – For oriental languages, where tokenization is quite hard, character n-grams offer a suitable solution.
          – The procedure of extracting the most frequent n-grams is language-independent and requires no special tools.
      • Compression-based Approaches
          – Will be discussed later …
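The noise tolerance of character n-grams can be checked directly on the slide's word pair. The helper names `char_ngrams` and `shared_ngrams` are hypothetical; how large the overlap is depends on n and on where the errors fall in the word.

```python
def char_ngrams(text, n):
    """Return every substring of n contiguous characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def shared_ngrams(a, b, n):
    """Return the set of character n-grams that occur in both strings."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))

# The correctly spelled word and the misspelling from the slide.
shared = shared_ngrams("simplistic", "simpilstc", 3)
```

A word-level representation would treat the two strings as entirely different tokens, while the character-level one still finds common material.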
  • 17. 3. Syntactic Features
      • Employing syntactic information
      • Idea
          – Authors tend to use similar syntactic patterns unconsciously.
          – Therefore, syntactic information is considered a more reliable authorial fingerprint than lexical information.
          – This type of information requires robust and accurate NLP tools able to perform syntactic analysis of texts.
              • Extracting syntactic measures is a language-dependent procedure.
              • Such features produce noisy datasets because of unavoidable errors made by the parser.
  • 18. Methods
      • Rewrite Rules
          – Rewrite-rule frequencies are extracted from a full parse tree of each sentence.
          – Rewrite rules describe how the parts of a sentence are built syntactically.
          – Consider the following rewrite rule: A : PP → P : PREP + PC : NP
          – It means that an adverbial prepositional phrase is constituted by a preposition followed by a noun phrase as a prepositional complement.
          – This information describes “how the words are combined to form phrases or other structures”.
          – Experimental results have shown that this type of measure performs better than lexical and character features.
              • It needs an accurate, fully automated parser able to provide a detailed syntactic analysis of sentences.
  • 19. Methods (Cont’d)
      • Phrase (Chunk) Analysis
          – Example sentence: “Another attempt to exploit syntactic information was proposed by Stamatatos.”
          – This sentence would be analyzed as follows:
          – NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed] PP[by Stamatatos]
          – Here NP, VP, and PP stand for noun phrase, verb phrase, and prepositional phrase, respectively.
          – This type of information is simpler than rewrite rules.
          – It can be extracted automatically with relatively high accuracy.
          – The extracted measures refer to noun phrase counts, verb phrase counts, length of noun phrases, length of verb phrases, and so on …
  • 20. Methods (Cont’d)
      • Part-of-Speech (POS)
          – A POS tagger is a tool that assigns a tag of morpho-syntactic information to each word-token based on contextual information.
          – Several researchers have used POS tag frequencies or POS tag n-gram frequencies to represent style.
          – POS tag information provides only a hint of the structural analysis of sentences, since it is not clear:
              • How the words are combined to form phrases
              • How the phrases are combined into higher-level structures
  • 21. 4. Semantic Features
      • Low-Level vs. High-Level
          – The previous methods work at the context level (low-level), not the semantic level (high-level).
          – NLP tools can be applied successfully to low-level tasks such as:
              • Sentence splitting
              • POS tagging
              • Text chunking
              • Partial parsing
          – More complicated tasks cannot yet be handled adequately by current NLP technology for unrestricted text, such as:
              • Full syntactic parsing
              • Semantic analysis
              • Pragmatic analysis
          – As a result, very few attempts have been made to exploit high-level features for stylometric purposes.
  • 22. Semantic Features Tools
      • Producing semantic dependency graphs
          – Gamon, Michael. "Linguistic correlates of style: authorship classification with deep linguistic analysis features." In Proceedings of the 20th International Conference on Computational Linguistics, p. 611. Association for Computational Linguistics, 2004.
      • Extracting semantic measures based on WordNet
          – McCarthy, Philip M., Gwyneth A. Lewis, David F. Dufty, and Danielle S. McNamara. "Analyzing writing styles with Coh-Metrix." In Proceedings of the Florida Artificial Intelligence Research Society International Conference (FLAIRS), pp. 764-769. 2006.
      • Using the theory of Systemic Functional Grammar (SFG)
          – Argamon, Shlomo, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan. "Stylistic text classification using functional lexical features." Journal of the American Society for Information Science and Technology 58, no. 6 (2007): 802-822.
  • 23. 5. Application-Specific Features
      • Application-Specific Features
          – One can define application-specific measures to better represent the nuances of style in a given text domain.
          – Structural measures can be defined to quantify authorial style in a special domain, such as e-mail messages and online-forum messages.
          – Structural measures include:
              • The use of greetings and farewells in the messages
              • Types of signatures
              • Use of indentation
              • Paragraph length
              • And so on …
          – Other types of application-specific features can be defined only for certain natural languages, such as Greek.
  • 24. Attribution Methods
      • Profile-Based Approaches
          – Probabilistic models
          – Compression models
          – Common n-grams and variants
      • Instance-Based Approaches
          – Vector space models
          – Similarity-based models
          – Meta-learning models
      • Hybrid Approaches
          – Average methods
  • 25. Attribution Methods (Cont’d)
      • Profile-Based Approaches
          – Treat the training texts cumulatively (per author).
              • All the available training texts of an author are concatenated into one big file (the author’s profile).
              • The stylometric measures extracted from the concatenated file may be quite different from those of each original training text.
      • Instance-Based Approaches
          – Treat the training texts individually: each training text is a separate instance of its author’s style.
  • 26. 1. Profile-Based Approaches
  • 27. 1.1. Probabilistic Models
  • 28. 1.2. Compression Models
      • Compression Models
          – Such methods do not produce a concrete vector representation of the author’s profile.
      • Steps
          – Initially, a compression algorithm is applied to the concatenated training texts xa of each candidate author a, producing a compressed file of size C(xa).
          – Then the unseen text x is appended to each xa, and the compression algorithm is called again to obtain C(xa + x).
          – The difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) − C(xa), indicates the similarity of the unseen text to each candidate author: the smaller the difference, the more similar the texts.
          – These models are applied only to character sequences, not word sequences.
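The steps above can be sketched with an off-the-shelf compressor. The slides do not name one, so the use of Python's `zlib` here is an assumption, as are the function names and the toy profiles.

```python
import zlib

def compressed_size(text):
    """C(.): size in bits of the zlib-compressed UTF-8 encoding of the text."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def compression_distance(unseen, profile):
    """d(x, xa) = C(xa + x) - C(xa); smaller values mean greater similarity."""
    return compressed_size(profile + unseen) - compressed_size(profile)

def attribute(unseen, profiles):
    """Return the author whose concatenated profile yields the smallest distance."""
    return min(profiles, key=lambda author: compression_distance(unseen, profiles[author]))

# Toy profiles, invented for illustration; each stands for one author's concatenated texts.
profiles = {
    "A": "the quick brown fox jumps over the lazy dog " * 20,
    "B": "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20,
}
guess = attribute("the quick brown fox jumps over the lazy dog", profiles)
```

One caveat of this particular choice: zlib only looks back over a 32 KB window, so for long profiles a compressor with a larger context would be a better fit.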
  • 29. 1.3. Common n-grams (CNG)
  • 30. 1.3. Common n-grams (CNG) (Cont’d)
      • Parameters
          – The CNG method has two important parameters that should be tuned:
              • The profile size L: how many strings constitute the profile.
              • The character n-gram length n: how long the strings in the profile are.
          – Keselj et al. (2003) reported their best results for 1,000 ≤ L ≤ 5,000 and 3 ≤ n ≤ 5.
          – The CNG distance function performs well when the training corpus is relatively balanced.
          – But it fails in imbalanced cases where at least one author’s profile is shorter than L.
      • CNG variant
          – Proposed to solve the problem of class imbalance.
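A minimal sketch of a CNG-style profile and dissimilarity follows. The slides do not reproduce the distance function; the relative-difference formula used here, the sum over all n-grams in either profile of ((f1 − f2) / ((f1 + f2) / 2))², is the one usually attributed to Keselj et al. (2003) and should be read as an assumption, as should the function names.

```python
from collections import Counter

def cng_profile(text, n, size):
    """Profile: the `size` (L) most frequent character n-grams with normalized frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return {g: c / total for g, c in Counter(grams).most_common(size)}

def cng_dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over the union of both profiles."""
    total = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)  # missing n-grams count as 0
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total

same = cng_dissimilarity(cng_profile("abcabc", 2, 10), cng_profile("abcabc", 2, 10))
different = cng_dissimilarity(cng_profile("aaaa", 2, 10), cng_profile("bbbb", 2, 10))
```

Every n-gram present in only one profile contributes the maximum value of 4, which suggests why a profile much shorter than L distorts the comparison in imbalanced corpora.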
  • 31. 2. Instance-Based Approaches
  • 32. 2.1. Vector Space Models
      • Definition
          – Each text can be considered a vector in a multivariate space.
          – Then, a variety of powerful statistical and machine learning algorithms can be used to build a classification model, including:
              • Discriminant Analysis
              • SVM
              • Decision Trees
              • Neural Networks
              • Genetic Algorithms
              • Memory-based Learners
              • Classifier Ensemble Methods
              • And so on.
          – Some of these algorithms can effectively handle high-dimensional, noisy, and sparse data, allowing more expressive representations of texts.
          – The effectiveness of these methods is diminished by class imbalance.
  • 33. 2.2. Similarity-based Models
    • Idea
      – Pairwise similarity measures are calculated between the unseen text and all the training texts,
      – and the most likely author is then estimated with a nearest-neighbor algorithm.
    • Example
      – Compression Model
        • Each training text is compressed into a separate file using an off-the-shelf algorithm.
        • C(x) is the bit-wise size of the compressed file x.
        • The difference C(x + y) − C(x) indicates the similarity of a training text x to the unseen text y.
  • 34. 2.3. Meta-learning Models
    • Definition
      – More complex algorithms designed specifically for authorship attribution.
      – The main goal is to use meta-data to understand how automatic learning can become flexible in solving different kinds of learning problems,
        • and hence to improve the performance of existing learning algorithms.
      – The most interesting approach of this kind is the unmasking method.
  • 35. Unmasking Method
    • Unmasking Method
      – For each unseen text, an SVM classifier is built to discriminate it from the training texts of each candidate author.
      – Thus, for n candidate authors, n classifiers are built for each unseen text.
      – Then, in an iterative procedure, a predefined number of the most important features of each classifier is removed and the drop in accuracy is measured.
      – At the beginning, all the classifiers have more or less the same very high accuracy.
      – After a few iterations, the accuracy of the classifier that discriminates between the unseen text and the true author drops sharply, while the accuracy of the other classifiers remains relatively high.
      – This happens because the differences between the unseen text and the other authors are manifold, so removing a few features does not affect the accuracy dramatically, whereas the unseen text differs from the true author’s texts in only a few features.
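The iterative feature-removal loop can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions, not the method as published: a nearest-centroid classifier replaces the SVM, leave-one-out accuracy replaces cross-validation, and feature importance is taken to be the gap between class centroids. The names centroid, loo_accuracy and unmasking_curve are hypothetical, and each class is assumed to hold at least two chunk vectors.

```python
def centroid(vectors, feats):
    """Mean value of each active feature over a list of vectors."""
    return {f: sum(v[f] for v in vectors) / len(vectors) for f in feats}

def loo_accuracy(xs, ys, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier
    (an assumed stand-in for the SVM) separating xs (0) from ys (1)."""
    data = [(v, 0) for v in xs] + [(v, 1) for v in ys]
    correct = 0
    for i, (v, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        cents = [centroid([u for u, l in rest if l == c], feats) for c in (0, 1)]
        dists = [sum((v[f] - c[f]) ** 2 for f in feats) for c in cents]
        pred = 0 if dists[0] <= dists[1] else 1  # ties go to class 0
        correct += int(pred == label)
    return correct / len(data)

def unmasking_curve(xs, ys, n_iter=3, k=1):
    """Accuracy before any removal and after each of n_iter rounds in
    which the k features with the largest centroid gap are dropped."""
    feats = list(range(len(xs[0])))
    curve = []
    for _ in range(n_iter + 1):
        curve.append(loo_accuracy(xs, ys, feats))
        cx, cy = centroid(xs, feats), centroid(ys, feats)
        ranked = sorted(feats, key=lambda f: abs(cx[f] - cy[f]), reverse=True)
        feats = [f for f in feats if f not in ranked[:k]]
    return curve
```

Here xs would hold feature vectors for chunks of the unseen text and ys vectors for chunks of one candidate author. A curve that collapses after a few rounds suggests the same author, while a curve that stays high suggests a different one.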
  • 36. 3. Hybrid Approaches
    • Hybrid Approaches
      – Methods that borrow elements from both profile-based and instance-based approaches.
    • Example
      – All the training text samples are represented separately, as happens with the instance-based approaches.
      – The representation vectors of each author’s texts are averaged feature-wise to produce a single profile vector for each author, as happens with the profile-based approaches.
      – The distance of the profile of an unseen text from the profile of each author is then calculated by a weighted feature-wise function.
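A minimal Python sketch of this hybrid scheme follows. The slides do not say which weighted feature-wise function is used, so a weighted absolute difference is assumed here; the names author_profile, weighted_distance and attribute_hybrid are hypothetical.

```python
def author_profile(vectors):
    """Feature-wise average of one author's text vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def weighted_distance(u, p, weights):
    """Weighted feature-wise distance (assumed: weighted absolute difference)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, u, p))

def attribute_hybrid(unseen_vec, training, weights):
    """training maps each author to a list of per-text feature vectors.
    Returns the author whose averaged profile is closest to unseen_vec."""
    profiles = {a: author_profile(vs) for a, vs in training.items()}
    return min(profiles,
               key=lambda a: weighted_distance(unseen_vec, profiles[a], weights))
```

The per-text vectors are the instance-based part and the averaging step is the profile-based part. The weights could, for example, emphasise features that vary little within an author.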

Editor's Notes

  1. Lemmatisation (or lemmatization), in linguistics, is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Inflection = intonation, tone.
  2. Oriental = of the East, for example China. Nuance: (1) a subtle or slight degree of difference, as in meaning, feeling, or tone; a gradation. (2) Expression or appreciation of subtle shades of meaning, feeling, or tone: a rich artistic performance, full of nuance.
  3. Adverbial = adverbial phrase; prepositional = relating to a preposition.
  4. Pragmatic = realistic.
  5. Nuances = fine details, tone, implied meaning; farewells = goodbyes; indentation = indent.
  6. When the important features are removed, the accuracy of the classifier that points to the correct class degrades the most, because the important features are the most influential in that classifier.