Identifying the medical persona behind a social media post is of paramount importance for drug marketing and pharmacovigilance. In this work, we propose multiple approaches to infer the medical persona associated with a social media post. We pose this as a supervised multi-label text classification problem. The main challenge is to identify the hidden cues in a post that are indicative of a particular persona. We first propose a large set of manually engineered features for this task. We then propose multiple neural network based architectures that extract useful features from these posts using pre-trained word embeddings. Our experiments on thousands of blogs and tweets show that the proposed approach yields 7% and 5% gains in F-measure over the manual feature engineering based approach for blogs and tweets respectively.
1. Medical Persona Classification in Social Media
Nikhil Pattisapu (1), Manish Gupta (1,2), Ponnurangam Kumaraguru (3), Vasudeva Varma (1)
(1) IIIT Hyderabad, (2) Microsoft India, (3) IIIT Delhi
Advances in Social Network Analysis and Mining 2017
ASONAM 2017 1 / 30
3. Motivation
What is Medical Persona?
User groups and content providers of Web 2.0 applications in
healthcare. Some examples:
Patient
Caretaker
Consultant
Journalist
Pharmacist
Researcher
Other
4. Motivation
Pharmaceutical firms use medical social media for drug
marketing and pharmacovigilance.
Figure: Sample post from drugs.com describing a patient’s experiences
with the drug Keppra.
5. Motivation
Use cases
A few use cases for identifying medical persona are listed below.
To gather information about drug usage, adverse events,
benefits and side effects from patients.
To find out the kind of informational assistance sought by
caretakers and make such information readily available.
To identify key opinion leaders in a drug or disease area.
To find out if a doctor has patients who can take part in a
clinical trial.
6. Motivation
Use cases
To gather information on conversations between pharmacists
and others to identify drug dosage, interactions and
therapeutic effects.
To acquire or collaborate on technologies invented by
researchers that can be a part of the drug pipeline.
To gather information about journalists’ survey on quality of
life of patients.
7. Problem Definition
Given a social media post, identify the medical personae associated
with it.
We pose this as a multi-label text classification problem, where our
label set is {Patient, Caretaker, Consultant, Journalist, Pharmacist,
Researcher, Other}.
There are two primary reasons for casting this as a multi-label
classification task (as opposed to single-label):
There might be posts involving conversations between
multiple personae. For example, a blog describing
patient-consultant conversation.
A post might be of ambiguous nature and hence can
potentially be mapped to more than one label by a human
annotator.
8. Related Work
This problem is primarily related to two problems that are
thoroughly studied in the literature:
Authorship Attribution - The task of determining the author
of a particular document
Automatic Genre Identification (AGI) - The task of classifying
documents based on genres (which includes their form,
structure, functional trait, communicative purpose, targeted
audience and narrative style) rather than the content, topics
or subjects that the documents span.
9. Related Work
State-of-the-art Methods
For both authorship attribution and AGI, supervised algorithms
based on extensive feature engineering have been proposed. The
top features include:
Word n-grams
Character n-grams
Common words
Function words
Part-of-speech tags
Document statistics (e.g. document length)
HTML tags
Stylistic features
Acronyms
Hashtag and reply mentions
10. Related Work
Why can’t existing methods be trivially adapted?
Different features need to be explored for medical domain.
As opposed to most methods proposed in literature, our task
is of closed-set multi-label type.
Each persona comprises many users and is itself internally
heterogeneous.
11. Dataset
Figure: Dataset Collection pipeline (Query → Blog/Tweet Search API →
Noise Filtering & Deduplication → Human Annotation → Labeled Blogs/Tweets)
Our dataset consists of both blogs as well as tweets.
Examples of queries include drug names - minocycline, qvar,
gilenya
When using only drugs as queries resulted in a lot of irrelevant
content, drug-disease pairs (e.g. acne minocycline) were used
as queries instead.
We used 50 queries and retrieved 50 blogs and 30 tweets per
query.
Noisy posts and retweets were removed.
12. Dataset
Figure: Dataset Statistics
1581 blogs and 1025 tweets were annotated
The inter-annotator agreement between 4 annotators was
found to be 0.708 for blogs and 0.70 for tweets.
The label cardinality of blogs and tweets was 1.18 and 1.24
respectively.
The maximum label cardinality of a blog was 2 and that of a
tweet was 3.
13. Approach
Overview
We first transform the multi-label task into one or more
single-label tasks using:
Binary label transformation
Label powerset transformation
We then use the following approaches to solve this task:
N-gram approach
Feature Engineering
Averaged Word Vectors
CNN-LSTM
14. Approach
Label transformation method
Binary Relevance Method
We train an individual classifier for each label.
Given an unseen sample, the combined model then predicts all
labels for this sample for which the respective classifiers
predict a positive result.
Label Powerset Method
We train one binary classifier for every label combination
attested in the training set.
For an unseen example, prediction is done using a voting
scheme.
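The two transformations can be sketched as follows. This is a toy illustration (invented posts and labels, not the paper's pipeline); scikit-learn's `OneVsRestClassifier` implements the binary relevance scheme, and label powerset is shown by mapping each label combination to a single class:

```python
# Toy sketch of binary relevance vs. label powerset transformations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

posts = [
    "my doctor prescribed keppra for my seizures",
    "as a consultant i often recommend low doses",
    "our lab studies adverse events in epilepsy drugs",
    "caring for my father on this medication is hard",
]
labels = [{"Patient"}, {"Consultant"}, {"Researcher"}, {"Caretaker"}]

X = TfidfVectorizer().fit_transform(posts)

# Binary relevance: one binary classifier per label; predicted labels
# are those whose classifier fires positive.
Y = MultiLabelBinarizer().fit_transform(labels)
br = OneVsRestClassifier(LinearSVC()).fit(X, Y)

# Label powerset: every label combination seen in training becomes
# one class of a single multi-class classifier.
lp_classes = [str(sorted(s)) for s in labels]
lp = LinearSVC().fit(X, lp_classes)
```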
15. Approach
N-gram approach (Baseline)
Each document is represented as a TF-IDF vector over the
entire vocabulary.
An SVM is trained to classify the document into one or more
of the pre-defined personae.
Both word n-grams and character n-grams are used.
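A minimal sketch of such a baseline on invented toy posts; the paper does not specify its n-gram ranges here, so the ranges below are assumptions:

```python
# Toy n-gram baseline: word and character TF-IDF n-grams are
# concatenated and fed to a linear SVM. Data and ranges are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

vectorizer = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
])
model = make_pipeline(vectorizer, LinearSVC())

posts = ["my seizures stopped on keppra",
         "i prescribe this to patients",
         "our study measures drug efficacy",
         "caring for mom is exhausting"]
personae = ["Patient", "Consultant", "Researcher", "Caretaker"]
model.fit(posts, personae)
```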
Averaged Word Vectors

document_vector(d_i) = ( sum_j word_embedding(w_ij) ) / len(d_i)    (1)

where w_ij is the jth word of document d_i.
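Equation (1) can be written directly in code; the 3-dimensional embeddings below are toy values, not any of the pre-trained sets used in the paper:

```python
# Equation (1): a document vector is the average of its words'
# pre-trained embeddings. The embeddings here are toy placeholders.
import numpy as np

embeddings = {
    "i":    np.array([0.1, 0.0, 0.2]),
    "have": np.array([0.0, 0.3, 0.1]),
    "acne": np.array([0.5, 0.4, 0.0]),
}

def document_vector(tokens, embeddings):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

doc_vec = document_vector(["i", "have", "acne"], embeddings)
# componentwise averages: approximately [0.2, 0.233, 0.1]
```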
16. Approach
Word Embedding Details
ID | Training Source        | Training Algorithm | #Dim | #Entries  | Domain
1  | Medical Tweets (ADR)   | Word2Vec           | 200  | 1,344,629 | Medical
2  | Twitter                | GloVe              | 200  | 1,193,515 | Generic
3  | Web crawl 1            | GloVe              | 300  | 2,196,018 | Generic
4  | Web crawl 2            | GloVe              | 300  | 1,917,495 | Generic
5  | PubMed, PMC, Wikipedia | Word2Vec           | 200  | 5,443,656 | Medical

Table: Pre-trained Word Embedding Details
17. Approach
Feature Engineering
For this task, we manually engineered a total of 89 features,
distributed across 6 feature types.
Document Level features (4)
Capture generic properties of a post
Examples - Number of sentences, average sentence length,
average word length
Pharmacist blogs are lengthier than Patient blogs.
POS features (33)
Capture the distribution of different Parts-of-Speech in the
document.
Example - Number of Adjectives
A Consultant is 1.6 times more likely to use adjectives than a
journalist.
18. Approach
Feature Engineering
List lookup features (7)
Include the average frequency of terms which occur in the
document as well as in a particular list.
Example - List of abusive words.
The terms MD, Dr., MBBS, FRCS, consultation fee, were
found to be more frequent in consultant blogs than others.
Syntactic features (7)
Capture the presence or absence of various classes of terms.
Example - date, person, location, organization, time, money,
and percentage amounts.
Researcher blogs contain more percentage mentions than
others.
19. Approach
Feature Engineering
Semantic features (35)
Consist of many medical domain-specific features
Examples - number of disease mentions, drug mentions,
chemical mentions, organ mentions
The distribution across these features gives significant clues
about the persona.
These features were extracted using MetaMap.
Tweet specific features (3)
Consist of features specific to tweets only
Examples - number of hashtags
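A few of these features can be sketched as follows; the sentence and token splitting here is a deliberate simplification of whatever tokenization the paper actually used:

```python
# Sketch of document-level and tweet-specific features. Splitting on
# "." and whitespace is a simplification for illustration only.
def document_features(post):
    sentences = [s for s in post.split(".") if s.strip()]
    words = post.split()
    return {
        "num_sentences": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "num_hashtags": sum(1 for w in words if w.startswith("#")),
    }

feats = document_features("I take minocycline for #acne. It works well.")
```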
20. Approach
CNN Architecture
For experiments related to tweets, we use the following CNN
architecture
Figure: CNN architecture (pre-trained word embedding layer → convolution
layer → max-pooling layer → softmax/sigmoid), shown for the input
"I am suffering pneumonia"
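A minimal numpy sketch of this forward pass, with random untrained weights; the layer sizes are illustrative assumptions, not the tuned values:

```python
# CNN forward pass for one tweet: embedding lookup -> 1-D convolution
# -> max-over-time pooling -> per-label sigmoid scores.
# All weights are random placeholders, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)
d, k, n_filters, n_labels = 8, 3, 4, 7   # embed dim, filter width, etc.

tokens = ["i", "am", "suffering", "pneumonia"]
E = {t: rng.standard_normal(d) for t in tokens}          # embedding layer
X = np.stack([E[t] for t in tokens])                     # (seq_len, d)

W = rng.standard_normal((n_filters, k * d))              # conv filters
windows = np.stack([X[i:i + k].ravel()
                    for i in range(len(tokens) - k + 1)])
conv = np.maximum(windows @ W.T, 0)                      # ReLU feature maps
pooled = conv.max(axis=0)                                # max-over-time

V = rng.standard_normal((n_labels, n_filters))           # output layer
scores = 1 / (1 + np.exp(-V @ pooled))                   # sigmoid per label
```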
21. Approach
CNN-LSTM Architecture
For experiments related to blogs, we use the following CNN-LSTM
architecture
Figure: CNN-LSTM architecture (pre-trained word embedding layer →
per-sentence convolution and max-pooling layers → LSTM sequential
layer → softmax/sigmoid), shown for the inputs "I treated a patient",
"He was suffering fever", "Hygiene highly impacts dengue"
22. Evaluation Metrics
Each evaluation metric is defined on a per-instance basis and is
subsequently averaged over all instances to obtain the aggregate
value.
Let l and pr be the true label set and the predicted label set for
document d.
Exact Match = 1 if l = pr, 0 otherwise    (2)
Jaccard Similarity = |l ∩ pr|/|l ∪ pr| (3)
Precision = |l ∩ pr|/|pr| (4)
Recall = |l ∩ pr|/|l| (5)
F-Score = 2 · Precision · Recall / (Precision + Recall) (6)
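Equations (2)-(6) computed on the label sets of a single document:

```python
# Per-instance metrics over true (l) and predicted (pr) label sets.
def instance_metrics(true_labels, pred_labels):
    l, pr = set(true_labels), set(pred_labels)
    inter, union = len(l & pr), len(l | pr)
    precision = inter / len(pr) if pr else 0.0
    recall = inter / len(l) if l else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return {
        "exact_match": 1 if l == pr else 0,   # eq. (2)
        "jaccard": inter / union if union else 1.0,  # eq. (3)
        "precision": precision,               # eq. (4)
        "recall": recall,                     # eq. (5)
        "f_score": f,                         # eq. (6)
    }

m = instance_metrics({"Patient", "Caretaker"}, {"Patient"})
# precision 1.0, recall 0.5, F ~0.667, Jaccard 0.5, exact match 0
```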
23. Evaluation Metrics
Hamming Loss = ( sum_{j=1}^{|L|} xor(l_j, pr_j) ) / |L|    (7)
Hamming Score = 1 − Hamming Loss (8)
where l_j and pr_j denote the jth elements of l and pr respectively.
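Equations (7)-(8) in code, using the 7-element label set from the problem definition:

```python
# Hamming loss: fraction of the |L| label positions on which the
# true and predicted indicator vectors disagree; Hamming score = 1 - loss.
LABELS = ["Patient", "Caretaker", "Consultant", "Journalist",
          "Pharmacist", "Researcher", "Other"]

def hamming_loss(true_set, pred_set):
    disagreements = sum((lab in true_set) != (lab in pred_set)
                        for lab in LABELS)
    return disagreements / len(LABELS)

loss = hamming_loss({"Patient", "Caretaker"}, {"Patient", "Consultant"})
score = 1 - loss   # Hamming score, eq. (8); here loss = 2/7
```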
24. Experimental Details
Throughout this work, we conduct 10-fold cross-validation
experiments.
For extracting semantic features we use MetaMap.
For tuning the hyperparameters of the CNN and CNN-LSTM models,
we used a grid search over the entire hyperparameter space,
which includes:
Number of convolution filters
Filter sizes
Activation Functions (ReLU and sigmoid)
Size of hidden layer
Number of epochs
We select the configuration which maximizes the F-Score on a
hold-out validation set.
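The search loop can be sketched as follows; `train_and_score` is a hypothetical placeholder for training a model and returning its F-score on the hold-out validation set, and the grid values are illustrative, not the ones actually tried:

```python
# Grid search over an illustrative CNN hyperparameter grid.
from itertools import product

grid = {
    "n_filters": [64, 128],
    "filter_size": [3, 4, 5],
    "activation": ["relu", "sigmoid"],
    "hidden_size": [50, 100],
    "epochs": [5, 10],
}

def train_and_score(config):
    # Hypothetical stand-in for "train the model, return validation F-score".
    return (config["n_filters"] + config["hidden_size"] * config["epochs"]) % 97 / 100

best_config, best_f = None, -1.0
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    f = train_and_score(config)
    if f > best_f:
        best_config, best_f = config, f
```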
27. Analysis
Feature Analysis
Feature Group  | Best Feature (Blogs)                      | Best Feature (Tweets)
Document       | # characters (3)                          | # characters (8)
Syntactic      | # Money mentions (2)                      | # Money mentions (6)
List lookup    | # matching words with consultant list (1) | # matching words with patient word list (29)
Semantic       | # Inorganic chemical (38)                 | # research activity (34)
POS            | # Foreign word (163)                      | # Personal Pronoun (116)
Tweet specific | -                                         | # hashtags (9)

Table: Feature Analysis for Blogs and Tweets based on the χ2 metric.
The number in parentheses indicates feature rank (lower is better).
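A sketch of χ2-based feature ranking on toy binary data; the 2x2 χ2 statistic below is hand-rolled for illustration and is not necessarily the exact implementation used in the paper:

```python
# Rank binary features by their chi-square association with a label.
import numpy as np

def chi2_score(feature, label):
    # 2x2 contingency table of binary feature vs. binary label.
    obs = np.array([[np.sum((feature == i) & (label == j))
                     for j in (0, 1)] for i in (0, 1)], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row @ col / obs.sum()
    return float(((obs - exp) ** 2 / exp).sum())

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
# "informative" mostly tracks the label; "noise" is independent of it.
informative = (y ^ (rng.random(200) < 0.1)).astype(int)
noise = rng.integers(0, 2, 200)

features = {"informative": informative, "noise": noise}
ranked = sorted(features, key=lambda name: -chi2_score(features[name], y))
```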
28. Analysis and Conclusion
The averaged word2vec model (for blogs) and the CNN model (for
tweets) outperform the other approaches.
The CNN-LSTM model fails to outperform the averaged word2vec
method, mainly due to its high number of trainable model
parameters.
Word embeddings with superior medical concept coverage do
not perform better than others. [Maybe coverage is not very
crucial for this task.]
Word embeddings trained purely on medical text (like
PubMed articles) do not outperform others.
Lack of diversity of personae in the training data
Most of the data is generated by a few personae (like researchers
for PubMed)
29. Future Work
The current features are limited to a post's content; we would
like to explore other features,
such as social features, for example, number of followers on Twitter.
We wish to experiment with distant supervision based
methods to obtain automatically labeled examples for
data-hungry models like the CNN-LSTM.
30. Thank You!
For any queries, please contact nikhil.pattisapu@research.iiit.ac.in