Assignment 6
Analyzing response time in Q&A websites
Question and Answer (Q&A) sites such as StackOverflow, Yahoo! Answers, Naver, Quora,
LiveQnA, and WikiAnswers are becoming increasingly popular with the growth of the Web.
They are large collaborative production and social computing platforms aimed at
crowdsourcing knowledge by allowing users to post and answer questions. They not only
give experts a platform to share their knowledge and gain recognition but also help novice
users solve their problems effectively. StackOverflow is one such community-driven Q&A
website, used by more than a million software developers who post and answer questions
related to computer programming. It is governed by a reputation system that rewards
users with reputation points, badges, extra privileges on the website, etc. according to the
usefulness of their posts. The usefulness of a question or an answer is largely determined by
the number of votes it receives.
In such a crowdsourced system driven by a reputation mechanism, the time a question takes
to receive its first answer (its response time) plays an important role and can largely
determine the popularity of the website. People who post questions want to know when they
can expect a response. In this assignment, we investigate whether, besides several other
factors, the tags of a question correlate strongly with its response time.
Tagging involves askers selecting appropriate keywords (e.g., android, jquery, c#) to
broadly identify the domains to which their questions relate. There are also mechanisms
by which other users can subscribe to tags, search via tags, mark tags as favorites, etc. As a
result, tags should play a crucial role in how questions are answered and hence in determining
their response time.
Input Dataset:
http://gaming.stackexchange.com/
(dataset: https://archive.org/download/stackexchange/gaming.stackexchange.com.7z) is
a sister site of StackOverflow where questions related to gaming are discussed. We have
attached the data dump of the website up to 26th September 2014. Download and unzip the
dataset, and you will find the following files:
● Badges.xml
● Comments.xml
● PostHistory.xml
● PostLinks.xml
● Posts.xml
● Tags.xml
● Users.xml
● Votes.xml
Information about all the posts (questions and answers) and tags can be found in “Posts.xml”
and “Tags.xml” files respectively. Examples from each of the files are given below.
Typical Question
<row Id="7" PostTypeId="1" AcceptedAnswerId="10" CreationDate="2014-05-14T00:11:06.457"
Score="1" ViewCount="185" Body="&lt;p&gt;As a researcher and instructor, I'm looking for
open-source books (or similar materials) that provide a relatively thorough overview of data
science from an applied perspective. To be clear, I'm especially interested in a thorough
overview that provides material suitable for a college-level course, not particular pieces or
papers.&lt;/p&gt;&#xA;" OwnerUserId="36" LastEditorUserId="97"
LastEditDate="2014-05-16T13:45:00.237" LastActivityDate="2014-05-16T13:45:00.237"
Title="What open-source books (or other materials) provide a relatively thorough overview of
data science?" Tags="&lt;education&gt;&lt;open-source&gt;" AnswerCount="3"
CommentCount="4" FavoriteCount="1" ClosedDate="2014-05-14T08:40:54.950" ></row>
Typical Answer
<row Id="10" PostTypeId="2" ParentId="7" CreationDate="2014-05-14T00:53:43.273" Score="8"
Body="&lt;p&gt;One book that's freely available is &quot;The Elements of Statistical
Learning&quot; by Hastie, Tibshirani, and Friedman (published by Springer): &lt;a
href=&quot;http://statweb.stanford.edu/~tibs/ElemStatLearn/&quot;&gt;see Tibshirani's
website&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Another fantastic source, although it isn't a book,
is Andrew Ng's Machine Learning course on Coursera. This has a much more applied-focus
than the above book, and Prof. Ng does a great job of explaining the thinking behind several
different machine learning algorithms/situations.&lt;/p&gt;&#xA;" OwnerUserId="22"
LastActivityDate="2014-05-14T00:53:43.273" CommentCount="1" />
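The answer row above points back to its question through ParentId, so the response time of a question is the gap between its own CreationDate and the earliest CreationDate among its answers. The following is a minimal sketch of that calculation, not a prescribed solution. It assumes CreationDate values are in the ISO-8601 form used by the Stack Exchange dumps (e.g. 2014-05-14T00:11:06.457), and the class and method names (ResponseTime, addAnswer, secondsToFirstAnswer) are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical helper; names are illustrative and not part of the assignment.
final class ResponseTime {
    // Earliest answer time seen so far for each question Id.
    private final Map<Integer, LocalDateTime> firstAnswer = new HashMap<>();

    /** Record one answer row (PostTypeId="2"), keeping the earliest date per ParentId. */
    void addAnswer(int parentId, String creationDate) {
        // Assumption: dates look like 2014-05-14T00:53:43.273 (ISO-8601, no zone).
        LocalDateTime t = LocalDateTime.parse(creationDate);
        firstAnswer.merge(parentId, t, (a, b) -> a.isBefore(b) ? a : b);
    }

    /** Seconds from the question's CreationDate to its first answer; empty if unanswered. */
    OptionalDouble secondsToFirstAnswer(int questionId, String questionCreationDate) {
        LocalDateTime first = firstAnswer.get(questionId);
        if (first == null) {
            return OptionalDouble.empty();
        }
        LocalDateTime asked = LocalDateTime.parse(questionCreationDate);
        return OptionalDouble.of(Duration.between(asked, first).toMillis() / 1000.0);
    }
}
```

Unanswered questions yield an empty result here; whether to drop them or treat them separately in the analysis is left open by the assignment.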
Typical Tag
<row Id="3" TagName="bigdata" Count="46" ExcerptPostId="66" WikiPostId="65" />
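Every record in these files is a single row element whose data sits in attributes, so a streaming parser is enough to read them without loading a whole file into memory. Below is a hedged sketch using the JDK's StAX API (javax.xml.stream). The PostsReader and Post names are hypothetical, only a few attributes are kept, and it assumes the Tags attribute decodes to the form <tag1><tag2> as in the question example.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reader; names are illustrative and not part of the assignment.
final class PostsReader {
    /** Minimal view of one row of Posts.xml. */
    static final class Post {
        final int id;
        final int postTypeId;      // 1 = question, 2 = answer
        final Integer parentId;    // null for questions
        final String creationDate;
        final List<String> tags;   // empty for answers

        Post(int id, int postTypeId, Integer parentId, String creationDate, List<String> tags) {
            this.id = id;
            this.postTypeId = postTypeId;
            this.parentId = parentId;
            this.creationDate = creationDate;
            this.tags = tags;
        }
    }

    /** Split a decoded Tags value such as "<education><open-source>" into tag names. */
    static List<String> splitTags(String raw) {
        List<String> tags = new ArrayList<>();
        if (raw == null) {
            return tags;
        }
        for (String part : raw.split(">")) {
            if (part.startsWith("<") && part.length() > 1) {
                tags.add(part.substring(1));
            }
        }
        return tags;
    }

    /** Stream over the document and collect one Post per row element. */
    static List<Post> read(Reader in) {
        List<Post> posts = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT && "row".equals(r.getLocalName())) {
                    String parent = r.getAttributeValue(null, "ParentId");
                    posts.add(new Post(
                            Integer.parseInt(r.getAttributeValue(null, "Id")),
                            Integer.parseInt(r.getAttributeValue(null, "PostTypeId")),
                            parent == null ? null : Integer.valueOf(parent),
                            r.getAttributeValue(null, "CreationDate"),
                            splitTags(r.getAttributeValue(null, "Tags"))));
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return posts;
    }
}
```

The XML parser decodes entities such as &lt; and &gt; before the attribute value is returned, which is why splitTags works on literal angle brackets.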
Output Deliverables:
A. Feature Calculation
You should use Java to parse these XML files and, for each question, calculate the
response time and the following tag-based features:
1. tag_popularity: We define the popularity of a tag t as its frequency, i.e., the number of
questions that contain t as one of their tags. For each question, you should compute the
average popularity of all its tags.
2. num_pop_tags: We consider a tag to be popular if its frequency is more than 20. Here
you should count the number of popular tags each question contains. The box plot for this
feature will have at most 6 boxes (0 to 5 popular tags), as each question can contain at most 5 tags.
3. num_subs_ans: We define an “active subscriber” of a tag t to be a user who has posted
“sufficient” answers in the “recent past” to questions containing t. We say that a user has
posted “sufficient” answers when the number of their answers is greater than 5, and by
“recent past” we mean answers posted after 7th January 2014. After computing the number
of active subscribers for every tag, you should compute, for each question, the average
number of active subscribers over its tags.
4. percent_subs_ans: For each tag, you should also compute the ratio of the number of
“active subscribers” to the total number of subscribers, where the total number of
subscribers is the number of users who have posted at least one answer to a
question containing that tag. After computing the ratio for every tag, you should
compute, for each question, the average of this ratio over its tags.
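The four features above all reduce to per-tag statistics that are averaged (or counted) over a question's tags. The sketch below is one possible way to organise that, not a reference solution. The TagFeatures class and its method names are hypothetical, "after 7th Jan 2014" is assumed to mean a calendar date strictly later than 2014-01-07, and a tag with no answerers is assumed to contribute a ratio of 0.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical accumulator; names are illustrative and not part of the assignment.
final class TagFeatures {
    static final int POPULAR_THRESHOLD = 20;  // popular: frequency "more than 20"
    static final int SUFFICIENT_ANSWERS = 5;  // sufficient: "greater than 5" answers
    static final LocalDate CUTOFF = LocalDate.of(2014, 1, 7); // assumption: strictly after this date

    private final Map<String, Integer> frequency = new HashMap<>();                 // tag -> #questions
    private final Map<String, Set<Integer>> answerers = new HashMap<>();            // tag -> all answering users
    private final Map<String, Map<Integer, Integer>> recentCounts = new HashMap<>(); // tag -> user -> #recent answers

    /** First pass: call once per question with its tags. */
    void addQuestion(List<String> tags) {
        for (String t : tags) {
            frequency.merge(t, 1, Integer::sum);
        }
    }

    /** Second pass: call once per answer with the tags of its parent question. */
    void addAnswer(int userId, LocalDate date, List<String> questionTags) {
        for (String t : questionTags) {
            answerers.computeIfAbsent(t, k -> new HashSet<>()).add(userId);
            if (date.isAfter(CUTOFF)) {
                recentCounts.computeIfAbsent(t, k -> new HashMap<>()).merge(userId, 1, Integer::sum);
            }
        }
    }

    /** Number of users with more than SUFFICIENT_ANSWERS recent answers for this tag. */
    int activeSubscribers(String tag) {
        int n = 0;
        for (int count : recentCounts.getOrDefault(tag, Map.of()).values()) {
            if (count > SUFFICIENT_ANSWERS) {
                n++;
            }
        }
        return n;
    }

    double tagPopularity(List<String> tags) {
        return tags.stream().mapToInt(t -> frequency.getOrDefault(t, 0)).average().orElse(0);
    }

    long numPopTags(List<String> tags) {
        return tags.stream().filter(t -> frequency.getOrDefault(t, 0) > POPULAR_THRESHOLD).count();
    }

    double numSubsAns(List<String> tags) {
        return tags.stream().mapToInt(this::activeSubscribers).average().orElse(0);
    }

    double percentSubsAns(List<String> tags) {
        return tags.stream().mapToDouble(t -> {
            int total = answerers.getOrDefault(t, Set.of()).size();
            return total == 0 ? 0.0 : (double) activeSubscribers(t) / total;
        }).average().orElse(0);
    }
}
```

Two passes over Posts.xml are implied: one to register every question's tags, and one to attribute each answer to the tags of its parent question before the per-question features are read out.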
B. Feature Analysis
To analyze the question features and their correlation with response time, you should
construct plots of the response time against the values of the different features. You should
distribute the feature values into ten equal bins and then use gnuplot to produce the following
two plots:
1. Box plots that capture the median and the 25th and 75th percentiles of the response time
distributions, as well as the minimum and maximum values, and
2. Cumulative distribution function (CDF) plots of the response time.
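Both plots need summary numbers computed in Java before gnuplot can draw them. The sketch below shows one way to assign a feature value to a bin, compute the five values a box plot needs, and produce an empirical CDF. It rests on assumptions the assignment does not state: "ten equal bins" is read as equal-width bins over the feature's range, percentiles use linear interpolation between order statistics, and all names (ResponseTimeStats and its methods) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; names are illustrative and not part of the assignment.
final class ResponseTimeStats {
    /** Index (0..bins-1) of the equal-width bin over [min, max] that holds value. */
    static int binIndex(double value, double min, double max, int bins) {
        if (max == min) {
            return 0;
        }
        int i = (int) ((value - min) / (max - min) * bins);
        return Math.min(i, bins - 1); // the maximum itself belongs to the last bin
    }

    /** Percentile p in [0, 1] of an ascending, non-empty array (linear interpolation). */
    static double percentile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns {min, 25th percentile, median, 75th percentile, max} of a non-empty sample. */
    static double[] fiveNumberSummary(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        return new double[] {
            s[0], percentile(s, 0.25), percentile(s, 0.50), percentile(s, 0.75), s[s.length - 1]
        };
    }

    /** Empirical CDF as "x F(x)" lines, one per observation, for a two-column data file. */
    static List<String> cdfLines(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length; i++) {
            lines.add(s[i] + " " + (double) (i + 1) / s.length);
        }
        return lines;
    }
}
```

Writing one whitespace-separated line per bin (bin position followed by the five summary values) gives a data file that gnuplot's candlesticks style can draw, since that style takes the box and whisker limits as separate columns. The CDF lines can be plotted directly as x against F(x).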