Predicting Question Quality in Community Question Answering Sites
Juan M. Caicedo C. (AndrewID: jcaicedo, jcaicedo@cs.cmu.edu)
Seshadri Sridharan (AndrewID: seshadrs, seshadrs@andrew.cmu.edu)
Aarti Singh (Project mentor)
Abstract
We present a model to predict question quality, an abstract measure, using only the content-level features available at the time a new question is posted. We first predict asker satisfaction using preexisting labels, and then predict different aspects of question quality using human annotations. For the former task, we use features from question text and community interaction to improve on the baseline model of Liu et al. For the latter, we hypothesize that question content and community response can independently model question quality, and enrich the content based model using co-training.
1 Introduction
Community Question Answering (CQA) web sites allow users to post questions, submit answers and interact with other users by voting, writing comments or through other mechanisms of participation. These sites have become a popular source for seeking information of different kinds, both on general topics, as in Yahoo Answers or Answer Bag, and on more specialized ones, as in Quora and StackOverflow. One key element of the success of a CQA site is the quality of the content generated by the community of users. In particular, the quality of the questions directly affects the relevance of the content, the willingness of the community to participate and the likelihood that visitors to the site will want to engage in the process. For this reason, we think it is important to understand the factors that affect the quality of questions and, if possible, to be able to assess their quality automatically.
Detecting the quality of questions also benefits the users of a CQA site. First, askers can know in advance whether the questions they ask will be graded as high quality. This would allow them to learn to ask better questions and, ideally, improve the satisfaction that they get from the site. Second, the moderators of a CQA site can monitor the quality of recently posted questions; this would allow them to detect and improve those of low quality and to highlight the high quality questions so that they receive more attention from the community. We call the first application the online scenario, when the question is being asked, and the second the offline scenario, after the question has been posted and the community has started to participate.
Although several problems in CQA have been addressed with diverse machine learning techniques, predicting question quality poses challenges that have not been covered in much detail. The main difficulty arises from the nature of the two possible applications. In the offline scenario, machine learning algorithms can use features extracted from the community reaction to the question, which is a reliable indicator of the quality of the content, whereas in the online scenario this information is not available and the algorithms have to rely on the asker's profile and the text of the question, which requires NLP techniques to extract informative features about its quality.
In this project we present a model for predicting question quality in the online scenario. First, we extend the existing work on predicting asker satisfaction [3] and test its applicability on a different dataset. Second, we improve it by using richer linguistic features extracted from the question content. Then, after showing the high predictability of the models on this task, we move to
the related problem of predicting question quality. For this task we use manually labeled questions
to train the models again. To overcome the problem of labeling a large set of questions, we use
co-training to generate more training instances that allow us to improve both models.
2 Related Work
Interest in CQA sites has also increased within research areas related to information retrieval. Much of that work has focused on content ranking and recommendation, content analysis, social network analysis, user modeling and quality evaluation. [1] and [3] present an overview of the research done in those areas. Here we discuss the work most closely related to our tasks.
A framework for automatically classifying high-quality content in social media sites is described in [1], where quality is modeled in terms of the content itself, the relations between the user and the generated content, and its usage statistics. However, they treat the features extracted from the content and the features derived from the community as complementary, and they do not study the differences between the online and offline scenarios.
The problem of predicting user satisfaction is studied in [3]. It can be argued that modeling user
satisfaction can be used as an approximate measure of the quality of the content they create. We
believe that the satisfaction of an asker depends on the response from the community generated by
his or her questions, which depends in turn on the quality of the question itself. Liu et al. present a
prediction model that uses features based on the content and the community structure, and evaluate it in the two scenarios that we are considering. We extend their work by using richer text-based features
and exploiting additional interactions from the community. In [5], Shah and Pomerantz present a
study where they train a classifier that accurately predicts the quality of an answer based on human
judgment. We take a subset of the criteria used by the human judges that participated in their study,
and we use it to assess the quality of the questions.
3 Predicting Question Quality
The problem of assessing the quality of a question is subjective, since it depends on several factors that can vary with the context of the evaluation. For this reason, we address two related tasks: (1) predict asker satisfaction as an indicator of question quality and (2) predict the quality assessments assigned by humans.
Predicting Asker Satisfaction: We define an asker to be satisfied if he selects one of the posted answers he received as the best one for his question and, additionally, this answer has at least 2 votes. We base this definition on the one proposed by Liu et al. [3] and adapt it to the data that we have for this task. We add the constraint on the number of votes to also take into account the judgment of the community. Thus, we have a binary classification task in which we have to predict whether the asker of a question was satisfied or not.
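As a concrete illustration, the labeling rule can be sketched as follows; the field names are hypothetical placeholders for the fields in the data dump, not the actual Stack Exchange schema.

```python
def is_satisfied(question, answers, min_votes=2):
    """Label a question as 'satisfied': the asker accepted an answer and
    that accepted answer received at least `min_votes` votes.
    `question` and `answers` are plain dicts with hypothetical field names."""
    accepted_id = question.get("accepted_answer_id")
    if accepted_id is None:
        return False  # the asker never selected a best answer
    for answer in answers:
        if answer["id"] == accepted_id:
            return answer["score"] >= min_votes  # require community agreement
    return False
```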
Predicting Question Quality based on human assessments: We use human judgments of five aspects of question quality: readability, conciseness, detail, politeness and appropriateness. They are a subset of the criteria used by [5] to measure answer quality on a CQA site. The questions were annotated with a value on a 1 to 5 scale for each of the selected criteria, and we aggregate these values to define an indicator of the overall question quality. Under this aggregated measure, we define a high quality question as one with values greater than or equal to 3 for at least 3 of the criteria. This is again a binary classification task, where our target label is this aggregated measure.
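The aggregation rule can be illustrated with the following sketch; the dictionary of ratings is a hypothetical representation of a single annotation.

```python
CRITERIA = ["readability", "conciseness", "detail", "politeness", "appropriateness"]

def is_high_quality(ratings, threshold=3, min_criteria=3):
    """Aggregate the per-criterion ratings (1-5) into the binary quality label:
    high quality iff at least `min_criteria` criteria are rated >= `threshold`."""
    return sum(ratings[c] >= threshold for c in CRITERIA) >= min_criteria

# Readable, detailed and polite, but terse and off-topic: still labeled high quality.
example = {"readability": 4, "conciseness": 2, "detail": 3, "politeness": 5, "appropriateness": 2}
print(is_high_quality(example))  # True
```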
4 Task 1: Predicting Asker Satisfaction
4.1 Experimental Setup
In this section we present the experimental setup for the asker satisfaction task. We describe the dataset, features, classification algorithms and the evaluation metrics used for each of the experiments.
Online Features

Question Content Features: Title length; Content punctuation density; Text spacing density; Content body length; Code block counts and total length; Time (hour) posted; Tag count.

Extended Question Content Features: Text misspelling ratio; Text capitalization ratio; Text blacklisted word count; Words per sentence; Uppercase word length ratio; Number of sentences; Text similarity with the text of questions where the user is satisfied; Similarity of the sequence of POS tags with the questions where the user is satisfied.

Asker Profile Features: Answers to questions ratio; Answers received; Membership age; Solved question count; Average past score; Recent past score.

Offline Features

Community Interaction Features: Question favourite count; Question's community score; Number of question revisions; Question new tag change count.

Community Answers Features: Answers count; Answers score max; Answers score total; Best answer body length; Best answer body spacing density; Accepted count; Accepted ratio; Answers to question ratio; Answers reputation.

Answerers Profile Features: Average answerer membership age; Most voted / most reputed answerer's accepted answer count; Most voted / most reputed answerer's answer acceptance ratio; Most voted / most reputed answerer's reputation; Most voted / most reputed answerer's question solved count.

Table 1: Feature classes and their features
Dataset: Our dataset is based on the Stack Exchange network of CQA sites (http://stackexchange.com/). It contains 2.2 million questions and 4.8 million answers, along with the complete information for the questions, the answers posted, the answer selected by the asker, the user information (askers and answerers) and the community response (votes, comments and modifications) for 35 of their sites. We selected StackOverflow, a site dedicated to computer programming. For our experiments we randomly selected 5,000 questions that were at least 2 months old, together with their 10,902 answers. There are 1,734 (34.68%) questions where the user is satisfied; this distribution is similar to that of the full dataset (33.72%).
Features: We use two sets of features corresponding to the two scenarios of our task. For the online scenario, we extract features from the text of the question and the profile of the asker; for the offline scenario, we add features from the answers posted and from the reaction of the community to the question. As our baseline model we use the features proposed in [3]; we extend them by adding richer features extracted from the text. Table 1 presents the list of the features used.
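To illustrate how a few of the question content features in Table 1 can be computed, the following sketch uses simplified definitions of our own; it is not necessarily the exact formulation used in the experiments.

```python
import re
import string

def content_features(title, body):
    """Compute a handful of the question content features from Table 1
    (simplified definitions, for illustration only)."""
    words = body.split()
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return {
        "title_length": len(title),
        "body_length": len(body),
        "punctuation_density": sum(c in string.punctuation for c in body) / max(len(body), 1),
        "spacing_density": body.count(" ") / max(len(body), 1),
        "num_sentences": len(sentences),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "uppercase_word_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
    }
```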
Algorithms: We trained three classifiers, based on decision trees, logistic regression and Naive Bayes. We chose these algorithms because they have been used successfully on CQA-related problems. We are also particularly interested in decision trees for their readability, given the intended application of our task. We used the algorithm implementations provided in the Scikit-learn toolkit [4].
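A minimal sketch of how the three classifiers can be instantiated with Scikit-learn follows; apart from the maximum tree depth and the regularization penalties discussed below, the settings shown are defaults, not necessarily the ones used in our experiments.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

classifiers = {
    "decision_tree": DecisionTreeClassifier(max_depth=8),  # depth chosen in Section 4.2
    "logreg_l1": LogisticRegression(penalty="l1", solver="liblinear"),
    "logreg_l2": LogisticRegression(penalty="l2"),
    "naive_bayes": MultinomialNB(),  # requires non-negative feature values
}
```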
Metrics: We report the overall accuracy of the classifiers, along with the averaged measures of
precision, recall and the F1 score over the two classes. We perform 10-fold cross-validation over
5,000 training instances.
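The evaluation can be reproduced with Scikit-learn's cross-validation utilities, roughly as in the sketch below, where X and y stand for the feature matrix and the satisfaction labels.

```python
from sklearn.model_selection import cross_validate

def evaluate(clf, X, y, folds=10):
    """10-fold cross-validation reporting accuracy, precision, recall and F1."""
    scores = cross_validate(clf, X, y, cv=folds,
                            scoring=["accuracy", "precision", "recall", "f1"])
    return {m: scores[f"test_{m}"].mean()
            for m in ["accuracy", "precision", "recall", "f1"]}
```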
4.2 Experiment Results
We present in this section the results of the asker satisfaction prediction task. First, we evaluated the performance of the algorithms while varying the number of training instances and, for decision trees, while varying their parameters. We then evaluated the best algorithm, logistic regression, using the different sets of features for each of the scenarios. Finally, we report the features with the highest predictability according to their information gain.
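The information gain ranking can be approximated, for instance, with Scikit-learn's mutual information estimator; this is a sketch of one way to produce such a ranking, not necessarily the exact procedure we used.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, feature_names):
    """Rank features by their estimated mutual information with the label."""
    gains = mutual_info_classif(X, y, discrete_features="auto", random_state=0)
    order = np.argsort(gains)[::-1]
    return [(feature_names[i], gains[i]) for i in order]
```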
Algorithm evaluation: Since we are mainly interested in the online scenario, we compared the algorithms using the corresponding set of features. In the case of decision trees, we first evaluated the maximum depth of the learned tree in order to choose an appropriate value for our task. Figure 1 shows how the complexity of the tree affects the accuracy on the training and test sets. We chose a maximum depth of 8 for the rest of our experiments: the algorithm performs poorly below depth 8 and starts to over-fit beyond it. The error bars in this and the upcoming figures correspond to the standard error of the sample mean.

Features          Accuracy  Precision  Recall  F1
Baseline offline  0.8177    0.8028     0.6411  0.7126
Offline           0.8175    0.7994     0.6451  0.7135
Baseline online   0.6841    0.5822     0.3747  0.4544
Online            0.6887    0.5886     0.3912  0.4692
Question only     0.6607    0.5337     0.2976  0.3811
Table 2: Classification results for the different sets of features
Although decision trees perform poorly in this scenario (66.06% accuracy), they achieve better results in the offline scenario, where a tree of depth 8 was 87% accurate. This can be explained by examining the features with the highest information gain, presented in Table 3.
We compared all the algorithms while varying the number of training instances (Figure 2). The logistic regression learner performed best overall; using L1 regularization we achieve the highest accuracy (70%). This suggests that some of the features may be redundant and that further experiments with feature selection are worth performing. For the Naive Bayes learner, we normalized the feature values in order to adjust their scales and to use a single smoothing value for all features when there are zero values.
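The normalization step for Naive Bayes can be expressed as a pipeline along the following lines; min-max scaling is one way to bring the features to a common non-negative scale, so that a single alpha smooths all features comparably. This is a sketch, not the exact preprocessing used in our experiments.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB

# Scale every feature to [0, 1] so one smoothing value (alpha) applies to all of them.
nb_model = make_pipeline(MinMaxScaler(), MultinomialNB(alpha=1.0))
```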
Different feature sets: We evaluated the logistic regression classifier (L1) using five sets of features: two sets based on [3], taken as the baselines for the two scenarios; two sets with the extended features; and one set containing only the features extracted from the question content. Table 2 presents these results. Using the new features, the classifiers are slightly better than the baselines.
Most informative features: Of the new text-based features we added to our satisfaction prediction model, only the misspelling ratio appears (with a weak value of 0.008) among the top information gain features presented in Table 3. The features in the baseline, such as the asker information, are much more predictive than the ones derived from the text. This could be a reason why we observe a decrease in performance when supplementing the baseline with richer textual features.
We also observed that all algorithms were better at predicting the unsatisfied questions than the satisfied ones. This can be attributed to the skewed class distribution: there are 1.85 times as many unsatisfied questions as satisfied ones. It is also possible that there exist questions for which users simply never selected an answer, despite having received one.
Another interesting observation is that the seniority and success of users seem to reveal the most information about asker satisfaction: the total number of questions solved by the asker and the asker's membership age are the two most important features in the online scenario. Importantly, this accords with the general patterns of success and satisfaction we would expect in CQA sites.
Figure 1: Decision tree accuracy on the training and test sets as the maximum tree depth varies.
Figure 2: Performance on the online satisfaction prediction task: accuracy, precision, recall and F1 as the number of training instances varies, for Decision Tree, Logistic Regression (L1), Logistic Regression (L2) and Multinomial NB.
Online
Feature                                        Information Gain
Asker's total number of questions solved       0.0453
Asker's membership age                         0.0424
Asker's answers to questions ratio             0.0338
Asker's average past question score            0.0307
Asker's recent questions score                 0.0166
Question code length                           0.0112
Question text unigram "entity"                 0.0083
Question text misspelling ratio                0.0080
Question text bigram "the top"                 0.0059
Question tag "android"                         0.0057

Offline
Feature                                        Information Gain
Best answer score                              0.5449
Highest answerer reputation                    0.1139
Community score for question                   0.11168
Answerer's reputation                          0.10283
Top value of answerers' accepted answer count  0.09611
Top value of answerers' question solved count  0.09234
Top value of answerers' answer count           0.09218
Answerers' answer accepted count               0.09153
Reputation of most voted answerer              0.09129
Top answerer's answer count                    0.08176

Question content only
Feature                                        Information Gain
Question code length                           0.01122
Question text unigram "entity"                 0.00838
Question text misspelling ratio                0.00809
Question text bigram "the top"                 0.00594
Question tag "android"                         0.00571
Question body length                           0.00515
Question text unigram "url"                    0.00481
Question text bigram "using this"              0.00458
Question text bigram "reference to"            0.00455
Question text bigram "any ideas"               0.00444

Table 3: Feature sets ranked by information gain
5 Task 2: Predicting Question Quality based on human assessment
Our main interest in the question quality task is to evaluate the classifiers in a semi-supervised setting as a solution to training-data scarcity. We continue the experiments performed for the previous task, but this time we evaluate the classifiers on the dataset of annotated questions. In addition to evaluating the different algorithms using the features available in each scenario (online and offline), we apply co-training in order to expand the set of training examples. In the following sections we present these experiments and their results.
Definition of Question Quality
The merit or quality of a question is a highly subjective factor that is difficult to quantify. Since it cannot be measured directly or derived from any available feature, we had to hand annotate it. We define the quality of a question as a combination of five metrics: conciseness, politeness, readability, detail and relevance, rated on a scale of 1 to 5.
In addition to these metrics, we also annotated each question with a quality label that represents our judgment of the question's quality. We used this measure to understand the importance of the five metrics and their combined influence on question quality. We found that conciseness, readability, politeness and detail are reliable estimators of question quality. We looked at the different patterns of values the metrics took; the most consistent one was that high-quality questions had at least three metrics with values greater than or equal to 3, and we used this rule as our label for question quality. This rough labeling rule was adopted because high quality questions occurred in different forms: they were concise and readable but not detailed, detailed and polite but not concise, readable and polite but not relevant, et cetera. In addition, our question-quality labeling rule also correlated considerably with our hand-annotated quality label.
5.1 Experimental Setup
We use the algorithms, features and evaluation metrics defined for the asker satisfaction task, but we apply them to the dataset of manually labeled questions. This dataset contains 172 instances, of which 127 (73.83%) are labeled as high quality and the rest (45, 26.16%) as low quality. Another difference in this task is that we perform 4-fold cross-validation, since the number of training instances is much smaller.
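The stratified 4-fold split can be made explicit as in the sketch below, where X, y and clf are placeholders for the feature matrix (a NumPy array), the quality labels and a classifier.

```python
from sklearn.model_selection import StratifiedKFold

def stratified_cv_scores(clf, X, y, folds=4, seed=0):
    """4-fold CV in which every fold preserves the ~74%/26% class distribution."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return scores
```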
Co-Training: To increase the number of training examples, we apply co-training as presented in [2], with one adjustment for our task: since the target classes are not balanced, we ensure that the classifiers add instances in proportions that preserve the class distribution. We train two classifiers, each using one of the two feature sets of the online and offline scenarios. We assign the following values to the four parameters of the co-training algorithm (a simplified sketch of the resulting loop is given after the list):
• p and n: the numbers of positive and negative examples labeled by the classifiers and added to the training pool in each iteration. We set these values to p = 3 and n = 1.
• k: the number of iterations. We set this parameter to k = 100.
• u: the number of unlabeled examples that are labeled by the classifiers in each iteration. We perform our experiments using different values.
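The sketch below outlines the co-training loop under these parameter settings, following Blum and Mitchell [2] with the class-balance adjustment described above. The labels are assumed to be 0/1 (1 = high quality), X_on/X_off and U_on/U_off are hypothetical names for the two feature views of the labeled and unlabeled questions, and the real implementation also records the test-set metrics after every iteration.

```python
import numpy as np
from sklearn.base import clone

def co_train(base_online, base_offline, X_on, X_off, y, U_on, U_off,
             p=3, n=1, k=100, u=75, seed=0):
    """Co-training over the online/offline feature views (after Blum & Mitchell)."""
    rng = np.random.default_rng(seed)
    X_on, X_off, y = list(X_on), list(X_off), list(y)
    remaining = list(range(len(U_on)))            # indices of still-unlabeled questions
    clf_on, clf_off = clone(base_online), clone(base_offline)

    for _ in range(k):
        if len(remaining) < u:
            break
        pool = list(rng.choice(remaining, size=u, replace=False))
        clf_on.fit(np.array(X_on), np.array(y))
        clf_off.fit(np.array(X_off), np.array(y))

        for clf, view in [(clf_on, U_on), (clf_off, U_off)]:
            probs = clf.predict_proba(np.array([view[i] for i in pool]))[:, 1]
            pos = [pool[j] for j in np.argsort(probs)[::-1][:p]]   # p most confident positives
            neg = [pool[j] for j in np.argsort(probs)[:n]]         # n most confident negatives
            for idx, label in [(i, 1) for i in pos] + [(i, 0) for i in neg]:
                if idx in remaining:              # avoid adding the same question twice
                    X_on.append(U_on[idx]); X_off.append(U_off[idx]); y.append(label)
                    remaining.remove(idx)

    clf_on.fit(np.array(X_on), np.array(y))
    clf_off.fit(np.array(X_off), np.array(y))
    return clf_on, clf_off
```

With p = 3 and n = 1, each classifier adds examples in roughly the 3:1 ratio of high- to low-quality questions, which preserves the class distribution of the labeled set.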
For the evaluation, we use 4-fold cross-validation as follows: we partition the set of labeled instances into 4 subsets, maintaining the same class distribution in each subset. We select 3 subsets for training the classifiers and keep the remaining subset for testing them after each iteration. The subsets selected for training are extended with the new instances labeled by the classifiers, while the test subset is never modified. In each iteration we evaluate the accuracy, precision, recall and F1 score of each classifier.
5.2 Experiment Results
Question quality and feature sets: We evaluated the same classifiers that we used for the asker satisfaction task, but now for predicting the label assigned during annotation; we also compared them using the two sets of features (online and offline). The averaged results are presented in Figure 3. Again, the best results are obtained with the logistic regression learner using L1 regularization, with an accuracy of 0.74418 and an F1 score of 0.83851. However, for this task the classifiers are more accurate and the features from the online scenario have higher predictability.
Co-training: We evaluated the overall improvement across iterations and the effect of varying the parameter u (the number of unlabeled instances that the classifiers label) on the performance of each classifier. Figure 4 shows the F1 score of each classifier at each iteration for five experiments running co-training with different values of u.
We note that the accuracy of both classifiers improved; however, the classifier for the offline scenario, which was initially the weaker of the two, improved by larger margins, to the point of matching the performance of the other classifier (when u = 75 and u = 100). Regarding the number of iterations, the improvements generally occur within the first 50. At that point the training set contains between 213 and 255 instances, depending on the value of u; this variation is the effect of the random sampling over the set of unlabeled data, which is controlled by this parameter.
Feature set   Initial F1   u     Max. F1   Iteration with Max. F1   Gain
Offline       0.77662      25    0.83277   30                       0.05615
                           50    0.82055   15                       0.04393
                           75    0.85222   80                       0.07560
                           100   0.83854   17                       0.06192
                           200   0.81701   84                       0.04040
                           500   0.81256   22                       0.03594
Online        0.83852      25    0.86227   76                       0.02375
                           50    0.85710   11                       0.01858
                           75    0.86456   27                       0.02604
                           100   0.85415   70                       0.01563
                           200   0.85130   8                        0.01278
                           500   0.86388   8                        0.02536
Table 4: Maximum F1 values achieved in co-training for different values of the parameter u.
6 Conclusions
From the above experiments, we have learnt vital insights about the performance of the algorithms
in the asker satisfaction and question quality prediction. We realized the effect of the skew in asker
satisfaction in the classification accuracy values. We have also seen certain well known CQA trends
manifest in the feature analysis. The offline features clearly predict the asker satisfaction better. Our
approach to look at a question as having offline and online views has been successful, giving us
insights into question quality.
We characterized question quality, an abstract measure, as a combination of particular aspects. Our approach to quantifying question quality showed the importance of the individual contributing metrics, besides revealing its high subjectivity, and we were able to train classifiers that predicted the values of the annotations.
We used co-training to expand the training data; this increased the predictive performance of both classifiers, but mainly of the one based on the offline set of features. Furthermore, we learnt that the quality and consistency of the annotations are a limitation of this technique in this scenario.
While we experimented with different learning algorithms and feature sets, we found that both asker satisfaction and question quality can be modeled similarly. Through our experiments, we showed that question quality prediction can be improved by defining more specific features and by expanding the training set using semi-supervised learning.
References
[1] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. Finding high-quality content in social media. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM '08), page 183, 2008.
[2] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Workshop on Computational Learning Theory (COLT), Morgan Kaufmann Publishers, 1998.
[3] Yandong Liu, Jiang Bian, and Eugene Agichtein. Predicting information seeker satisfaction in community question answering. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08), page 483, 2008.
[4] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[5] C. Shah and J. Pomerantz. Evaluating and predicting answer quality in community QA. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10), pages 411–418, 2010.