In this paper, we present two approaches to automatically classifying the gender of blog authors. The first is a feature-engineering system incorporating two novel feature classes, variable-length character sequence patterns and thirteen new word classes, along with an added class of surface features. The second is the first application of a memory variant of Recurrent Neural Networks, the Bidirectional Long Short-Term Memory network (BLSTM), to this task. We report results on two blog data sets: a well-explored corpus used by the previous state-of-the-art models, and a corpus roughly 20 times larger. For the first system, a voting ensemble of machine learning classifiers improves accuracy over previous feature-mining systems on the former data set. For the second, we show that the accuracy obtained with such deep LSTMs is comparable to the current state-of-the-art deep learning system for gender classification. Finally, we carry out a comparative study of both systems on the two data sets.
Gender Classification of Blog Authors: With Feature Engineering and Deep Learning using LSTM Networks
1. Gender Classification of Blog Authors: With Feature
Engineering and Deep Learning using LSTM Networks
● Vijay Prakash Dwivedi
● Saurav Jha
● Deepak Kumar Singh
● Ranvijay
Computer Science and Engineering Department
Motilal Nehru National Institute of Technology Allahabad
Date: 14 December, 2017
2. Objective
● Automatically classify the gender of blog authors
● Extend the current feature set for classification
● Study and build a non-feature based model
3. Overview of the problem
● Writing styles of authors are largely affected by their gender.
● Prediction can be done based on these characteristics
4. Our Contribution
● Two approaches for the task
● First: We propose 2 novel feature classes
● Second: First-ever application of time-sequence based LSTM
Networks
5. Novel Feature Classes
1. Variable length
Character Sequence
Pattern
2. 13 new Word Factors
1. Variable length Character Sequence Pattern Features
• A character n-gram refers to a contiguous sequence of
n characters from a given sequence of text.
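As a quick illustration, character n-grams of a fixed length, and variable-length collections of them, can be extracted as follows (a minimal sketch; the length range 2 to 4 is illustrative, not the setting used in the paper):

```python
def char_ngrams(text, n):
    """Return all contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def variable_length_ngrams(text, n_min=2, n_max=4):
    """Collect n-grams over a range of lengths; these form the candidate
    patterns for the variable-length character sequence feature class."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(char_ngrams(text, n))
    return grams
```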
6. Novel Feature Classes
1. Variable length
Character Sequence
Pattern
2. 13 new Word Factors
2. Word Factor Analysis and Classes
• Finding groups of similar words differentiating gender
writing styles
8. Added Surface Features
3. Surface Features
● Normalized count of sentences: The count of sentences in the blog.
● Normalized count of words: The count of words in the blog.
● Normalized count of characters: The total count of characters averaged
over all sentences in the blog.
● Normalized count of alphabets: The average number of alphabets per
sentence present in the blog.
● Normalized count of digits: The average count of digits per sentence in the
blog.
● Normalized count of special characters: ’@’, ’#’, ’$’, ’%’, ’&’, ’*’, ’-’, ’=’, ’+’,
’>’, ’<’, ’[’, ’]’, ’{’, ’}’, ’/’.
● Normalized count of punctuation marks: The ratio of count of punctuations
to the total count of characters in the blog.
● Count of short words: The count of words with four or fewer characters in the blog.
● Average word length: The ratio of the sum total of characters in all words
to the count of words in the blog.
● Average sentiment score: The positive and negative sentiment scores of the
blog, averaged over all words, based on the SentiWordNet 3.0 lexical
resource [18].
● Lexical richness: The lexical richness of the reference sentence based on
Yule’s K index.
● Features based on
linguistic patterns and
morphology of the text
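Most of the surface features above reduce to simple counting; the sketch below computes a representative subset, including Yule's K for lexical richness (the sentence splitting and normalization choices here are illustrative, not the exact preprocessing used in the paper):

```python
import re
import string
from collections import Counter

def yules_k(words):
    """Yule's K: 1e4 * (sum_m m^2 * V_m - N) / N^2, where V_m is the
    number of word types occurring exactly m times and N the token count."""
    N = len(words)
    if N == 0:
        return 0.0
    freq = Counter(w.lower() for w in words)
    vm = Counter(freq.values())
    return 1e4 * (sum(m * m * v for m, v in vm.items()) - N) / (N * N)

def surface_features(blog):
    """Compute a few of the surface features; per-sentence averages
    serve as the normalization."""
    sentences = [s for s in re.split(r"[.!?]+", blog) if s.strip()]
    words = blog.split()
    n_sent = max(len(sentences), 1)
    return {
        "chars_per_sentence": len(blog) / n_sent,
        "letters_per_sentence": sum(c.isalpha() for c in blog) / n_sent,
        "digits_per_sentence": sum(c.isdigit() for c in blog) / n_sent,
        "punct_ratio": sum(c in string.punctuation for c in blog) / max(len(blog), 1),
        "short_word_count": sum(len(w.strip(string.punctuation)) <= 4 for w in words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "yules_k": yules_k(words),
    }
```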
9. Existing Feature Classes
4. POS Sequence Patterns
5. Stylistic Features
6. F-Measure
7. Gender Preferential
Features
4. POS Sequence Pattern Features:
● Originally proposed by Mukherjee and Liu, 2010
● Mining POS Sequences of varying length (2 to 6) satisfying minsup and
minadherence constraints
● We use the same POS sequence pattern mining algorithm as stated by
Mukherjee and Liu [2] in their work.
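A simplified sketch of the mining step: enumerate POS n-grams of length 2 to 6 and keep those whose document support meets minsup. This omits the minadherence pruning of the full algorithm by Mukherjee and Liu, so it is only an approximation of that procedure:

```python
from collections import Counter

def mine_pos_patterns(pos_blogs, min_len=2, max_len=6, minsup=0.3):
    """pos_blogs: list of POS-tag sequences, one per blog.
    Returns every POS n-gram whose fraction of supporting documents
    is at least minsup (minadherence constraint omitted here)."""
    n_docs = len(pos_blogs)
    doc_freq = Counter()
    for tags in pos_blogs:
        seen = set()  # count each pattern once per document
        for n in range(min_len, max_len + 1):
            for i in range(len(tags) - n + 1):
                seen.add(tuple(tags[i:i + n]))
        doc_freq.update(seen)
    return {p for p, c in doc_freq.items() if c / n_docs >= minsup}
```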
5. Stylistic Features:
● Capture the author’s writing style using three types of features: parts of
speech (POS), words, and, in the blog context, words such as ‘lol’, ‘hmm’,
and smileys.
6. F-measure:
● F = 0.5 × [(f.noun + f.adj + f.prep + f.art) − (f.pron + f.verb + f.adv + f.int) + 100]
● Male, introverted, and academically educated authors score a higher
F-measure, preferring a more formal style.
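With f.x denoting the relative frequency (in percent) of POS class x, the F-measure follows directly from the formula above; a minimal sketch:

```python
def f_measure(freq):
    """Formality F-measure from relative POS frequencies (percentages).
    freq maps each POS class name to its percentage of tokens."""
    formal = freq["noun"] + freq["adj"] + freq["prep"] + freq["art"]
    contextual = freq["pron"] + freq["verb"] + freq["adv"] + freq["int"]
    return 0.5 * (formal - contextual + 100)
```

When formal and contextual classes are equally frequent, F is exactly 50; formal styles push it higher.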
10. Existing Feature Classes
4. POS Sequence Patterns
5. Stylistic Features
6. F-Measure
7. Gender Preferential
Features
7. Gender Preferential Features:
● Women’s language makes more frequent use of emotionally
intensive adverbs and adjectives like ‘so’, ‘terribly’, ‘awfully’, and
‘dreadfully’, and is more heavily punctuated, while men’s language more
often expresses a sense of ‘independence’.
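These cues can be turned into per-blog ratios; a minimal sketch, where the intensifier list is only the illustrative subset named above (the actual feature lexicon, following Corney et al., 2002, is larger):

```python
import re

# Illustrative subset; the real gender-preferential lexicon is larger.
INTENSIFIERS = {"so", "terribly", "awfully", "dreadfully"}

def gender_preferential_counts(blog):
    """Ratio of emotionally intensive adverbs/adjectives and of
    punctuation marks, both normalized by word count."""
    words = re.findall(r"[a-z']+", blog.lower())
    n = max(len(words), 1)
    return {
        "intensifier_ratio": sum(w in INTENSIFIERS for w in words) / n,
        "punct_per_word": sum(not c.isalnum() and not c.isspace() for c in blog) / n,
    }
```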
11. Deep Learning based Model
Why Deep Learning?
- Ability to extract abstract features on their own
- Have yielded state-of-the-art results on various machine learning applications.
- Recurrent Neural Networks (RNNs), and especially Long Short-Term Memory (LSTM)
networks, perform well on sequential data such as text and speech.
12. LSTM Cell
4 components to capture sequential information
• Input Gate i_t
• Output Gate o_t
• Forget Gate f_t
• Memory State C_t
Fig. Block Diagram of LSTM Cell
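The four components interact through the standard LSTM update equations; a one-unit (scalar) sketch of a single time step, with an illustrative weight layout rather than the paper's actual implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w):
    """One scalar LSTM step. w maps each gate name ('i', 'f', 'o', 'g')
    to (input weight, recurrent weight, bias). Standard equations:
      i_t, f_t, o_t = sigmoid gates; g_t = tanh candidate
      C_t = f_t * C_{t-1} + i_t * g_t
      h_t = o_t * tanh(C_t)
    """
    def gate(name, act):
        wx, wh, b = w[name]
        return act(wx * x_t + wh * h_prev + b)
    i_t = gate("i", sigmoid)   # input gate
    f_t = gate("f", sigmoid)   # forget gate
    o_t = gate("o", sigmoid)   # output gate
    g_t = gate("g", math.tanh) # candidate memory
    c_t = f_t * c_prev + i_t * g_t   # memory state C_t
    h_t = o_t * math.tanh(c_t)       # hidden output
    return h_t, c_t
```

A bidirectional LSTM simply runs two such recurrences, one over the sequence left-to-right and one right-to-left, and concatenates their hidden outputs.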
13. Bidirectional LSTM and Model Architecture
Fig. BLSTM Model Architecture
• BLSTM Cells help to
learn bidirectional
sequential information
15. Data sets
1. Gender Classification Blog Dataset introduced by Mukherjee and
Liu, 2010
2. The Blog Authorship Corpus
http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
16. Results
● Trained and tested on the
same dataset used by
LM-10, for ease of
comparison
RESULTS COMPARISON OF THE NON-DEEP LEARNING SYSTEM
BASED ON FEATURES EXTRACTION
* LM-10: Mukherjee and Liu, 2010
17. Results
● Trained and tested on the
same dataset used by
LM-10
EFFECT OF FEATURE SELECTION SETTINGS
* LM-10: Mukherjee and Liu, 2010
18. Results
● Deep Learning BLSTM
Model trained on The
Blog Authorship Corpus
● Graph showing the
increasing trend in
accuracy achieved for
different data set sizes as
we train on more data.
Results of the BLSTM Model
19. Results
● Deep Learning BLSTM
Model trained on The
Blog Authorship Corpus
PERFORMANCE COMPARISON WITH OTHER DEEP LEARNING
MODELS
•RCNN: Recurrent Convolutional Neural Network
•WRCNN: Windowed RCNN
both by Bartle and Zheng, 2015
21. References
● E. Wilson, A. Kenny, and V. Dickson-Swift, “Using blogs as a qualitative health research tool,” International Journal of Qualitative Methods, vol. 14, no. 5, p. 1609406915618049,
2015. [Online]. Available: https://doi.org/10.1177/1609406915618049
● A. Mukherjee and B. Liu, “Improving gender classification of blog authors,” in EMNLP, 2010.
● S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler, “Automatically profiling the author of an anonymous text,” Commun. ACM, vol. 52, no. 2, pp. 119–123, Feb. 2009. [Online].
Available: http://doi.acm.org/10.1145/1461928.1461959
● J. Schler, M. Koppel, S. Argamon, and J. W. Pennebaker, “Effects of age and gender on blogging,” in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs,
2006.
● S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler, “Mining the blogosphere: Age, gender and the varieties of self-expression,” First Monday, vol. 12, 2007.
● X. Yan and L. Yan, “Gender classification of weblog authors,” in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006.
● S. Nowson, J. Oberlander, and A. J. Gill, “Weblogs, genres, and individual differences,” 2005.
● J. D. Burger, J. C. Henderson, G. Kim, and G. Zarrella, “Discriminating gender on twitter,” in EMNLP, 2011.
22. References
● A. Bartle and J. Zheng, “Gender classification with deep learning,” 2015.
● S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in AAAI, 2015.
● P. Liu, X. Qiu, and X. Huang, “Recurrent neural network for text classification with multi-task learning,” in IJCAI, 2016.
● R. Johnson and T. Zhang, “Supervised and semi-supervised text categorization using lstm for region embeddings,” in ICML, 2016.
● J. Y. Lee and F. Dernoncourt, “Sequential short-text classification with recurrent and convolutional neural networks,” in HLT-NAACL, 2016.
● Y. Zhou, B. Xu, J. Xu, L. Yang, and C. Li, “Compositional recurrent neural networks for chinese short text classification,” 2016 IEEE/WIC/ACM International Conference on Web
Intelligence (WI), pp. 137–144, 2016.
● I. Kanaris, K. Kanaris, and I. Houvardas, “Words vs. character n-grams for anti-spam filtering,” 2006.
● D. H. Fusilier, M. Montes-y-Gómez, P. Rosso, and R. Guzmán-Cabrera, “Detection of opinion spam with character n-grams,” in CICLing, 2015.
23. References
● C. K. Chung and J. W. Pennebaker, “Revealing dimensions of thinking in open-ended self-descriptions: An automated meaning extraction method for natural language.” Journal of
research in personality, vol. 42 1, pp. 96–132, 2008.
● S. Baccianella, A. Esuli, and F. Sebastiani, “Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining,” in LREC, 2010.
● H. Baayen, H. van Halteren, and F. Tweedie, “Outside the cave of shadows: using syntactic annotation to enhance authorship attribution,” Literary and Linguistic Computing, vol. 11,
no. 3, pp. 121–132, 1996.
● F. Heylighen and J.-M. Dewaele, “Variation in the contextuality of language: An empirical measure,” 2002.
● M. Corney, O. de Vel, A. Anderson, and G. Mohay, “Gender-preferential text mining of e-mail discourse,” in 18th Annual Computer Security Applications Conference, 2002.
Proceedings, 2002, pp. 282–289.
● J. Schmidhuber, “Deep learning in neural networks: An overview,” vol. 61, pp. 85–117, 2015.
● D. Chen and C. D. Manning, “A fast and accurate dependency parser using neural networks,” in EMNLP, 2014.
● D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
24. References
● J. Schmidhuber, “My first deep learning system of 1991 + deep learning timeline 1962-2013,” CoRR, vol. abs/1312.5548, 2013.
● S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, 1997.
● Z. Huang, W. Xu, and K. Yu, “Bidirectional lstm-crf models for sequence tagging,” CoRR, vol. abs/1508.01991, 2015.
● C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith, “Transition-based dependency parsing with stack long short-term memory,” in ACL, 2015.
● N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning
Research, vol. 15, pp. 1929–1958, 2014.