In this paper, we present two approaches to automatically classifying the gender of blog authors. The first is a feature-engineering system incorporating two novel feature classes, variable-length character sequence patterns and thirteen new word classes, along with an added class of surface features. The second is the first application of a memory variant of Recurrent Neural Networks, the Bidirectional Long Short-Term Memory network (BLSTM), to this task. We report results on two blog data sets: a well-explored corpus used by the previous state-of-the-art models, and a corpus roughly 20 times larger. For the first system, a voting ensemble of machine learning classifiers improves accuracy over previous feature-mining systems on the former data set. For the second, we show that the accuracy obtained with such deep LSTMs is comparable to the current state-of-the-art deep learning system for gender classification. Finally, we carry out a comparative study of both systems on the two data sets.
Gender Classification of Blog Authors: With Feature Engineering and Deep Learning using LSTM Networks
1. Gender Classification of Blog Authors: With Feature
Engineering and Deep Learning using LSTM Networks
● Vijay Prakash Dwivedi
● Saurav Jha
● Deepak Kumar Singh
● Ranvijay
Computer Science and Engineering Department
Motilal Nehru National Institute of Technology Allahabad
Date: 14 December, 2017
2. Objective
● Automatically classify the gender of blog authors
● Extend the current feature set for classification
● Study and build a non-feature based model
3. Overview of the problem
● Writing styles of authors are largely affected by their gender.
● Prediction can be done based on these characteristics
4. Our Contribution
● Two approaches for the task
● First: We propose 2 novel feature classes
● Second: First-ever application of time-sequence based LSTM
Networks
5. Novel Feature Classes
1. Variable length
Character Sequence
Pattern
2. 13 new Word Factors
1. Variable length Character Sequence Pattern Features
• A character n-gram refers to a contiguous sequence of
n characters from a given sequence of text.
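As a quick illustration, character n-grams of a fixed length, and variable-length collections of them, can be extracted as follows (a minimal sketch; the length range 2 to 4 is illustrative, not the setting used in the paper):

```python
def char_ngrams(text, n):
    """Return all contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def variable_length_ngrams(text, n_min=2, n_max=4):
    """Collect n-grams over a range of lengths; these form the candidate
    patterns for the variable-length character sequence feature class."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(char_ngrams(text, n))
    return grams
```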
6. Novel Feature Classes
1. Variable length
Character Sequence
Pattern
2. 13 new Word Factors
2. Word Factor Analysis and Classes
• Finding groups of similar words differentiating gender
writing styles
8. Added Surface Features
3. Surface Features
● Normalized count of sentences: The count of sentences in the blog.
● Normalized count of words: The count of words in the blog.
● Normalized count of characters: The total count of characters averaged
over all sentences in the blog.
● Normalized count of alphabets: The average number of alphabets per
sentence present in the blog.
● Normalized count of digits: The average count of digits per sentence in the
blog.
● Normalized count of special characters: ’@’, ’#’, ’$’, ’%’, ’&’, ’*’, ’-’, ’=’, ’+’,
’>’, ’<’, ’[’, ’]’, ’{’, ’}’, ’/’.
● Normalized count of punctuation marks: The ratio of count of punctuations
to the total count of characters in the blog.
● Count of short words: The count of words with four or fewer characters in the blog.
● Average word length: The ratio of the sum total of characters in all words
to the count of words in the blog.
● Average sentiment score: The positive and negative sentiment scores of the
blog, averaged over all words, based on the SentiWordNet 3.0 lexical
resource [18].
● Lexical richness: The lexical richness of the reference sentence based on
Yule’s K index.
● Features based on
linguistic patterns and
morphology of the text
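Most of the surface features above reduce to simple counting; the sketch below computes a representative subset, including Yule's K for lexical richness (the sentence splitting and normalization choices here are illustrative, not the exact preprocessing used in the paper):

```python
import re
import string
from collections import Counter

def yules_k(words):
    """Yule's K: 1e4 * (sum_m m^2 * V_m - N) / N^2, where V_m is the
    number of word types occurring exactly m times and N the token count."""
    N = len(words)
    if N == 0:
        return 0.0
    freq = Counter(w.lower() for w in words)
    vm = Counter(freq.values())
    return 1e4 * (sum(m * m * v for m, v in vm.items()) - N) / (N * N)

def surface_features(blog):
    """Compute a few of the surface features; per-sentence averages
    serve as the normalization."""
    sentences = [s for s in re.split(r"[.!?]+", blog) if s.strip()]
    words = blog.split()
    n_sent = max(len(sentences), 1)
    return {
        "chars_per_sentence": len(blog) / n_sent,
        "letters_per_sentence": sum(c.isalpha() for c in blog) / n_sent,
        "digits_per_sentence": sum(c.isdigit() for c in blog) / n_sent,
        "punct_ratio": sum(c in string.punctuation for c in blog) / max(len(blog), 1),
        "short_word_count": sum(len(w.strip(string.punctuation)) <= 4 for w in words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "yules_k": yules_k(words),
    }
```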
9. Existing Feature Classes
4. POS Sequence Patterns
5. Stylistic Features
6. F-Measure
7. Gender Preferential
Features
4. POS Sequence Pattern Features:
● Originally proposed by Mukherjee and Liu, 2010
● Mining POS Sequences of varying length (2 to 6) satisfying minsup and
minadherence constraints
● We use the same POS sequence pattern mining algorithm as stated by
Mukherjee and Liu [2] in their work.
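A simplified sketch of the mining step: enumerate POS n-grams of length 2 to 6 and keep those whose document support meets minsup. This omits the minadherence pruning of the full algorithm by Mukherjee and Liu, so it is only an approximation of that procedure:

```python
from collections import Counter

def mine_pos_patterns(pos_blogs, min_len=2, max_len=6, minsup=0.3):
    """pos_blogs: list of POS-tag sequences, one per blog.
    Returns every POS n-gram whose fraction of supporting documents
    is at least minsup (minadherence constraint omitted here)."""
    n_docs = len(pos_blogs)
    doc_freq = Counter()
    for tags in pos_blogs:
        seen = set()  # count each pattern once per document
        for n in range(min_len, max_len + 1):
            for i in range(len(tags) - n + 1):
                seen.add(tuple(tags[i:i + n]))
        doc_freq.update(seen)
    return {p for p, c in doc_freq.items() if c / n_docs >= minsup}
```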
5. Stylistic Features:
● Capture the author’s writing style using three types of features: parts of
speech (POS), words, and, in the blog context, words such as ‘lol’, ‘hmm’,
and smileys.
6. F-measure:
● F = 0.5 × [(f.noun + f.adj + f.prep + f.art) − (f.pron + f.verb + f.adv + f.int) + 100]
● Male, introverted, and academically educated authors score a higher
F-measure, preferring a more formal style.
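With f.x denoting the relative frequency (in percent) of POS class x, the F-measure follows directly from the formula above; a minimal sketch:

```python
def f_measure(freq):
    """Formality F-measure from relative POS frequencies (percentages).
    freq maps each POS class name to its percentage of tokens."""
    formal = freq["noun"] + freq["adj"] + freq["prep"] + freq["art"]
    contextual = freq["pron"] + freq["verb"] + freq["adv"] + freq["int"]
    return 0.5 * (formal - contextual + 100)
```

When formal and contextual classes are equally frequent, F is exactly 50; formal styles push it higher.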
10. Existing Feature Classes
4. POS Sequence Patterns
5. Stylistic Features
6. F-Measure
7. Gender Preferential
Features
7. Gender Preferential Features:
● Women’s language makes more frequent use of emotionally
intensive adverbs and adjectives like ‘so’, ‘terribly’, ‘awfully’, and
‘dreadfully’, and is more heavily punctuated, while men’s language more
often expresses a sense of ‘independence’.
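These cues can be turned into per-blog ratios; a minimal sketch, where the intensifier list is only the illustrative subset named above (the actual feature lexicon, following Corney et al., 2002, is larger):

```python
import re

# Illustrative subset; the real gender-preferential lexicon is larger.
INTENSIFIERS = {"so", "terribly", "awfully", "dreadfully"}

def gender_preferential_counts(blog):
    """Ratio of emotionally intensive adverbs/adjectives and of
    punctuation marks, both normalized by word count."""
    words = re.findall(r"[a-z']+", blog.lower())
    n = max(len(words), 1)
    return {
        "intensifier_ratio": sum(w in INTENSIFIERS for w in words) / n,
        "punct_per_word": sum(not c.isalnum() and not c.isspace() for c in blog) / n,
    }
```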
11. Deep Learning based Model
Why Deep Learning?
- Ability to extract abstract features on their own
- Have yielded state-of-the-art results on various machine learning applications.
- Recurrent Neural Networks (RNNs), and especially Long Short-Term Memory (LSTM)
networks, perform well on sequential data such as text and speech.
12. LSTM Cell
4 components to capture sequential information
• Input Gate i_t
• Output Gate o_t
• Forget Gate f_t
• Memory State C_t
Fig. Block Diagram of LSTM Cell
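The four components interact through the standard LSTM update equations; a one-unit (scalar) sketch of a single time step, with an illustrative weight layout rather than the paper's actual implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w):
    """One scalar LSTM step. w maps each gate name ('i', 'f', 'o', 'g')
    to (input weight, recurrent weight, bias). Standard equations:
      i_t, f_t, o_t = sigmoid gates; g_t = tanh candidate
      C_t = f_t * C_{t-1} + i_t * g_t
      h_t = o_t * tanh(C_t)
    """
    def gate(name, act):
        wx, wh, b = w[name]
        return act(wx * x_t + wh * h_prev + b)
    i_t = gate("i", sigmoid)   # input gate
    f_t = gate("f", sigmoid)   # forget gate
    o_t = gate("o", sigmoid)   # output gate
    g_t = gate("g", math.tanh) # candidate memory
    c_t = f_t * c_prev + i_t * g_t   # memory state C_t
    h_t = o_t * math.tanh(c_t)       # hidden output
    return h_t, c_t
```

A bidirectional LSTM simply runs two such recurrences, one over the sequence left-to-right and one right-to-left, and concatenates their hidden outputs.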
13. Bidirectional LSTM and Model Architecture
Fig. BLSTM Model Architecture
• BLSTM Cells help to
learn bidirectional
sequential information
15. Data sets
1. Gender Classification Blog Dataset introduced by Mukherjee and
Liu, 2010
2. The Blog Authorship Corpus
http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
16. Results
● Trained and tested on the
same dataset used by
LM-10, for ease of
comparison
RESULTS COMPARISON OF THE NON-DEEP LEARNING SYSTEM
BASED ON FEATURES EXTRACTION
* LM-10: Mukherjee and Liu, 2010
17. Results
● Trained and tested on the
same dataset used by
LM-10
EFFECT OF FEATURE SELECTION SETTINGS
* LM-10: Mukherjee and Liu, 2010
18. Results
● Deep Learning BLSTM
Model trained on The
Blog Authorship Corpus
● Graph showing the
increasing trend in
accuracy achieved for
different data set sizes as
we train on more data.
Results of the BLSTM Model
19. Results
● Deep Learning BLSTM
Model trained on The
Blog Authorship Corpus
PERFORMANCE COMPARISON WITH OTHER DEEP LEARNING
MODELS
•RCNN: Recurrent Convolutional Neural Network
•WRCNN: Windowed RCNN
both by Bartle and Zheng, 2015
21. References
● E. Wilson, A. Kenny, and V. Dickson-Swift, “Using blogs as a qualitative health research tool,” International Journal of Qualitative Methods, vol. 14, no. 5, p. 1609406915618049,
2015. [Online]. Available: https://doi.org/10.1177/1609406915618049
● A. Mukherjee and B. Liu, “Improving gender classification of blog authors,” in EMNLP, 2010.
● S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler, “Automatically profiling the author of an anonymous text,” Commun. ACM, vol. 52, no. 2, pp. 119–123, Feb. 2009. [Online].
Available: http://doi.acm.org/10.1145/1461928.1461959
● J. Schler, M. Koppel, S. Argamon, and J. W. Pennebaker, “Effects of age and gender on blogging,” in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs,
2006.
● S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler, “Mining the blogosphere: Age, gender and the varieties of self-expression,” First Monday, vol. 12, 2007.
● X. Yan and L. Yan, “Gender classification of weblog authors,” in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006.
● S. Nowson, J. Oberlander, and A. J. Gill, “Weblogs, genres, and individual differences,” 2005.
● J. D. Burger, J. C. Henderson, G. Kim, and G. Zarrella, “Discriminating gender on twitter,” in EMNLP, 2011.
22. References
● A. Bartle and J. Zheng, “Gender classification with deep learning,” 2015.
● S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in AAAI, 2015.
● P. Liu, X. Qiu, and X. Huang, “Recurrent neural network for text classification with multi-task learning,” in IJCAI, 2016.
● R. Johnson and T. Zhang, “Supervised and semi-supervised text categorization using lstm for region embeddings,” in ICML, 2016.
● J. Y. Lee and F. Dernoncourt, “Sequential short-text classification with recurrent and convolutional neural networks,” in HLT-NAACL, 2016.
● Y. Zhou, B. Xu, J. Xu, L. Yang, and C. Li, “Compositional recurrent neural networks for chinese short text classification,” 2016 IEEE/WIC/ACM International Conference on Web
Intelligence (WI), pp. 137–144, 2016.
● I. Kanaris, K. Kanaris, and I. Houvardas, “Words vs. character n-grams for anti-spam filtering,” 2006.
● D. H. Fusilier, M. Montes-y-Gómez, P. Rosso, and R. Guzmán-Cabrera, “Detection of opinion spam with character n-grams,” in CICLing, 2015.
23. References
● C. K. Chung and J. W. Pennebaker, “Revealing dimensions of thinking in open-ended self-descriptions: An automated meaning extraction method for natural language.” Journal of
research in personality, vol. 42 1, pp. 96–132, 2008.
● S. Baccianella, A. Esuli, and F. Sebastiani, “Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining,” in LREC, 2010.
● H. Baayen, H. van Halteren, and F. Tweedie, “Outside the cave of shadows: using syntactic annotation to enhance authorship attribution,” Literary and Linguistic Computing, vol. 11,
no. 3, pp. 121–132, 1996.
● F. Heylighen and J.-M. Dewaele, “Variation in the contextuality of language: An empirical measure,” 2002.
● M. Corney, O. de Vel, A. Anderson, and G. Mohay, “Gender-preferential text mining of e-mail discourse,” in 18th Annual Computer Security Applications Conference, 2002.
Proceedings, 2002, pp. 282–289.
● J. Schmidhuber, “Deep learning in neural networks: An overview,” vol. 61, pp. 85–117, 2015.
● D. Chen and C. D. Manning, “A fast and accurate dependency parser using neural networks,” in EMNLP, 2014.
● D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
24. References
● J. Schmidhuber, “My first deep learning system of 1991 + deep learning timeline 1962-2013,” CoRR, vol. abs/1312.5548, 2013.
● S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, 1997.
● Z. Huang, W. Xu, and K. Yu, “Bidirectional lstm-crf models for sequence tagging,” CoRR, vol. abs/1508.01991, 2015.
● C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith, “Transition-based dependency parsing with stack long short-term memory,” in ACL, 2015.
● N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning
Research, vol. 15, pp. 1929–1958, 2014.