The document describes a self-training framework for detecting exploratory discourse in online conversations. It involves initially training a classifier on a small set of annotated data, then using the classifier to annotate additional unlabeled data and adding it to the training set. This allows the classifier to be retrained and improved without requiring manual annotation of large amounts of data. The framework is evaluated on chat data from an Open University conference, and a feature-based self-training approach is shown to improve performance over supervised classifiers and other baselines. Applications for visualizing discourse and participation are also discussed.
A self training framework for exploratory discourse detection final
1. A self-training framework for
exploratory discourse detection
Zhongyu Wei
SoLAR symposium, Open University, UK, 26 June 2012
PhD student, SEEM, The Chinese University of Hong Kong, Hong Kong
SocialLearn intern, Open University, UK
zywei@se.cuhk.edu.hk
5. How many points in the webinar
triggered learning/knowledge-building?
This person contributes a lot during the
chat.
This part appears to have very good
content that will provoke deeper learning
Data in this study taken from a 2 day OU conference in Elluminate & Cloudworks:
6. Exploratory dialogue analysis
Exploratory dialogue
…represents a joint, coordinated form of co-reasoning in language, with speakers sharing knowledge, challenging ideas, evaluating evidence and considering options…
Category | Description | Example
Challenge | Identifies that something may be wrong and in need of correction | I disagree. Freemind is a superb piece of software to use...
Evaluation | Has a descriptive quality | That's a really interesting approach
Extension | Builds on or provides resources that support discussion | I've embedded helen's slide share over in cloudworks http://link.com
Reasoning | The process of thinking an idea through. | Why intranet only? What meaning CLOSED in
Mercer, N. (2004). Sociocultural discourse analysis: analysing classroom talk as a social mode of thinking. Journal of Applied Linguistics, 1(2), 137-168.
7. Low exploratory dialogue
Time Contribution
3:12 PM LOL
3:12 PM It's not looking good.
3:13 PM Sorry, had to do that.
3:13 PM jaaa
3:13 PM Ouch!
3:13 PM It was a vuvuzela.
3:13 PM I though that was you @Alistair
3:13 PM I've taken away the vuvuzela from you now!
3:13 PM LOL
8. Higher exploratory dialogue
Time Contribution
2:42 PM I hate talking. :-P My question was whether "gadgets" were just
basically widgets and we could embed them in various web sites,
like Netvibes, Google Desktop, etc.
2:42 PM Thanks, that's great! I am sure I understood everything, but looks
inspiring!
2:43 PM Yes why OU tools not generic tools?
2:43 PM Issues of interoperability
2:43 PM The "new" SocialLearn site looks a lot like a corkboard where you
can add various widgets, similar to those existing web start pages.
2:43 PM What if we end up with as many apps/gadgets as we have social
networks and then we need a recommender for the apps!
2:43 PM My question was on the definition of the crowd in the wisdom of
crowds we acsess in the service model?
9. Exploratory dialogue detection
Problem Statement
Given an online chat session S = {d0, d1, …, dn}, where dk denotes the kth dialogue, classify each dk as exploratory or non-exploratory.
Solution from learning analytics
Sociocultural discourse analysis method
Manual
High precision and low recall
Category | Cue phrases
Challenge | But if, have to respond, my view
Evaluation | Good example, good point
Extension | More links, for example
Reasoning | That is why, next step
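A minimal sketch of how a cue-phrase detector could work, using only the phrases listed in the table above. The function names and the case-insensitive substring matching rule are assumptions made for illustration; the slides do not say how the matching was implemented.

```python
# Cue phrases copied from the table above.
CUE_PHRASES = {
    "Challenge": ["but if", "have to respond", "my view"],
    "Evaluation": ["good example", "good point"],
    "Extension": ["more links", "for example"],
    "Reasoning": ["that is why", "next step"],
}


def match_cue_categories(post: str) -> list[str]:
    """Return the categories whose cue phrases occur in the post.

    Assumption: plain case-insensitive substring matching.
    """
    text = post.lower()
    return [
        category
        for category, phrases in CUE_PHRASES.items()
        if any(phrase in text for phrase in phrases)
    ]


def is_exploratory_by_cue(post: str) -> bool:
    """Flag a post as exploratory when at least one cue phrase matches."""
    return bool(match_cue_categories(post))
```

Substring matching of this kind only fires on posts that contain a listed phrase, which fits the high-precision, low-recall behaviour noted on the slide.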
10. Exploratory dialogue classification
[Diagram: Dialogue → Exploratory Discourse Classifier → Exploratory / Non-Exploratory]
Dialogue is represented by a feature vector.
{I think she is right} → {I, think, she, is, right, I-think, think-she, she-is, is-right, I-think-she, think-she-is, she-is-right}
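The feature vector in the example can be reproduced with a short n-gram extractor. This is a sketch: the function name, whitespace tokenization and the `max_n` parameter are assumptions, chosen so that the output matches the slide's example.

```python
def ngram_features(post: str, max_n: int = 3) -> list[str]:
    """Return all n-grams of length 1..max_n, joined with hyphens.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    tokens = post.split()
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append("-".join(tokens[start:start + n]))
    return features


features = ngram_features("I think she is right")
```

For the five-word example this yields 5 unigrams, 4 bigrams and 3 trigrams, the same 12 features shown above.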
11. Exploratory dialogue classification
Instance-based supervised classifier training
[Diagram: labelled Exploratory and Non-Exploratory dialogues → Classifier Training → Exploratory Discourse Classifier]
Feature-based supervised classifier training
[Diagram: labelled Exploratory and Non-Exploratory dialogues → Feature Generation → Exploratory / Non-Exploratory Feature List → Classifier Training → Exploratory Discourse Classifier]
12. An example of feature list
Feature | Exploratory | Non-Exploratory
what-is | 0.9992 | 0.0008
good-point | 0.9995 | 0.0005
your-audio-should | 0.001 | 0.999
thank-you | 0.004 | 0.996
my-name | 0.07 | 0.93
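The slides do not say how the values in the feature list were obtained. One plausible way to produce a table of this shape, shown here purely as an assumption, is to estimate for each feature the share of labelled posts containing it that fall in each class, with additive smoothing. The function name, the labels "E" and "N", and the smoothing value are all hypothetical.

```python
from collections import Counter


def feature_class_probs(labeled_posts, smoothing=1.0):
    """Estimate (P(exploratory | feature), P(non-exploratory | feature)).

    labeled_posts: iterable of (features, label), label is "E" or "N".
    Assumption: document-frequency counts with additive smoothing; this is
    an illustrative estimator, not necessarily the one used in the study.
    """
    counts = {"E": Counter(), "N": Counter()}
    for features, label in labeled_posts:
        counts[label].update(set(features))  # count each feature once per post
    probs = {}
    for feature in set(counts["E"]) | set(counts["N"]):
        e = counts["E"][feature] + smoothing
        n = counts["N"][feature] + smoothing
        probs[feature] = (e / (e + n), n / (e + n))
    return probs
```

Each row sums to 1, as in the slide's table, and features seen in only one class still keep a small probability for the other class because of the smoothing term.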
13. A self-training framework
[Diagram: Annotated data → Classifier training → Classifier]
Step 1: Train the initial classifier on annotated data.
Annotated data is time-consuming to obtain.
14. A self-training framework
[Diagram: Annotated data → Classifier training → Classifier; the classifier labels Unlabeled data to produce Pseudo-annotated data, which passes through Instance Selection and is combined with the Annotated data]
Step 2: Classify unlabeled data, select high-confidence instances and combine them with the annotated data.
Step 3: Re-train the classifier on the augmented training set.
15. A self-training framework
[Diagram: the self-training loop (Annotated data → Classifier training → Classifier → Pseudo-annotated data → Instance Selection → Annotated data) followed by Test data → Exploratory Discourse Detection → Results]
Step 4: Obtain the final classifier when a stopping criterion is met: no improvement on the validation dataset, a set number of iterations reached, or no class label changes.
Step 5: Detect exploratory dialogues on the test data.
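Steps 1 to 4 can be sketched as a short loop. This is an illustration under stated assumptions, not the study's implementation: a toy Naive Bayes classifier stands in for the Max Entropy models, the labels "E" and "N" and all function names are hypothetical, and only two stopping rules are implemented (an iteration cap and no new confident instances). The validation-based stop is omitted.

```python
import math
from collections import Counter


def train_nb(data):
    """Train a toy Naive Bayes model on (tokens, label) pairs."""
    word_counts = {"E": Counter(), "N": Counter()}
    doc_counts = Counter()
    for tokens, label in data:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = set(word_counts["E"]) | set(word_counts["N"])
    return word_counts, doc_counts, vocab


def predict_nb(model, tokens):
    """Return (label, confidence); confidence is the posterior of the label."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in ("E", "N"):
        total = sum(word_counts[label].values())
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        for token in tokens:
            # Add-one smoothing with one extra slot for unseen tokens.
            score += math.log((word_counts[label][token] + 1) / (total + len(vocab) + 1))
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def self_train(annotated, unlabeled, threshold=0.8, max_iter=5):
    """Self-training: train, pseudo-label, keep confident items, re-train."""
    train = list(annotated)
    pool = list(unlabeled)
    model = train_nb(train)                          # Step 1
    for _ in range(max_iter):                        # Step 4: iteration cap
        scored = [(tokens, *predict_nb(model, tokens)) for tokens in pool]
        confident = [(t, label) for t, label, conf in scored if conf > threshold]
        if not confident:                            # Step 4: nothing new to add
            break
        train.extend(confident)                      # Step 2
        pool = [t for t, _, conf in scored if conf <= threshold]
        model = train_nb(train)                      # Step 3
    return model
```

Step 5 then amounts to calling `predict_nb(model, tokens)` on each test post.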
16. A self-training framework
[Diagram: the same self-training loop as on the previous slide]
Self-training can introduce noisy instances.
17. KNN based Instance Selection approach
K nearest neighbors classification
Blue stands for “exploratory”
Gray stands for “non-exploratory”
K = 1: the nearest neighbor is “exploratory”
K = 2: the nearest neighbors are “exploratory”
K = 5: the majority of the nearest neighbors are “non-exploratory”
18. KNN based Instance Selection approach
Pseudo-annotated instances P = {p1, p2, …, pn}
pk = (lk, ck), where lk is the pseudo label and ck is the confidence value
Form a candidate list: choose instances with ck > r
For each pk in the candidate list, identify its K nearest neighbors and update the pseudo label of pk by KNN
Obtain the new pseudo-annotated instances P-updated
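The procedure above can be sketched as follows. Several details are assumptions because the slide leaves them open: neighbors are drawn from the annotated data, similarity is Jaccard overlap of token sets, and the new label is a majority vote. All function names and the default values of `r` and `k` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed metric)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_label(tokens, labeled, k):
    """Majority label among the k most similar labelled instances."""
    neighbors = sorted(
        labeled, key=lambda item: jaccard(tokens, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def knn_instance_selection(pseudo, labeled, r=0.8, k=3):
    """pseudo: list of (tokens, pseudo_label, confidence).

    Keeps instances with confidence > r, then replaces each pseudo label
    with the KNN vote. Returns the updated list of (tokens, label) pairs.
    """
    candidates = [tokens for tokens, _, conf in pseudo if conf > r]
    return [(tokens, knn_label(tokens, labeled, k)) for tokens in candidates]
```

An odd `k` avoids tied votes in the two-class setting.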
20. Data source: OU online
conference
4 sessions including 2634 posts.
21. Annotation
2 annotators with one morning of training.
Four categories are given.
Kappa value (binary) is 0.5978 (moderate).
Only posts with consistent labels are collected.
Session | Total # | Exploratory # | Non-Exploratory #
OU_22AM | 529 | 380 | 149
OU_22PM | 661 | 508 | 153
OU_23AM | 456 | 310 | 146
22. Experiment Setup
Baseline:
CP: Cue phrase based method
MEGE: Supervised Max Entropy GE (Generalized Expectation)
approach (feature based)
ME: Supervised Max Entropy approach (instance based)
SMEGE: Self-training Max Entropy GE approach (feature based)
SME: Self-training Max Entropy approach (instance based)
Experiment Setup
Use one session for training, one session for testing and one session for validation.
During the self-training process, examples that include cue phrases are added to the training dataset at the first stage.
Pseudo samples are added with the same ratio of exploratory to non-exploratory as in the training dataset.
Confidence value: 0.8
Feature threshold: 0.65
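A sketch of how pseudo samples could be added while keeping the class ratio of the training data, using the confidence value 0.8 from the setup. The function name, the "E"/"N" labels, the strict comparison with the confidence value and the rounding-down rule are assumptions.

```python
def select_with_ratio(pseudo, train_labels, confidence=0.8):
    """Pick confident pseudo samples in the class ratio of train_labels.

    pseudo: list of (item, label, conf) with label "E" or "N".
    Returns (item, label) pairs, most confident first within each class.
    """
    n_e = train_labels.count("E")
    n_n = train_labels.count("N")
    if n_e == 0 or n_n == 0:
        raise ValueError("training data must contain both classes")

    available = {"E": [], "N": []}
    for item, label, conf in sorted(pseudo, key=lambda p: p[2], reverse=True):
        if conf > confidence:
            available[label].append((item, label))

    # Largest multiple of the training ratio that both classes can supply.
    scale = min(len(available["E"]) / n_e, len(available["N"]) / n_n)
    take_e = int(scale * n_e)  # rounded down, so the ratio is approximate
    take_n = int(scale * n_n)
    return available["E"][:take_e] + available["N"][:take_n]
```

With a 2:1 training ratio, five confident exploratory samples and two confident non-exploratory samples, this keeps four and two respectively.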
28. Time line Visualization
[Figure: timeline chart; x-axis: time of day, 9:28 to 12:05; y-axis: Average Exploratory …, from -60 to 80]
Example posts highlighted on the chart:
1. anybody else with poor audio?
2. is anyone else having difficulty hearing this?
3. background noise makes it difficult to hear

1. Sheffield, UK not as sunny as yesterday - still warm
2. Greetings from Hong Kong
3. Morning from Wiltshire, sunny here!

1. See you!
2. bye for now!
3. bye, and thank you
4. Bye all for now
29. Time line Visualization
[Figure: timeline chart; x-axis: time of day, 9:28 to 12:05; y-axis: Average Exploratory …, from -60 to 80]
Time | User Id | Content
11:46 AM | User_2 | added to which 2M often drops to 10% of that in peak times
11:47 AM | User_3 | I really disagree - ECDL was the starting point for many many first time users
11:47 AM | User_1 | online basics won't load in final third first
11:47 AM | User_1 | mobile won't work round her
11:47 AM | User_1 | and satlellite costs 40 a month for 1 gig data transfer
11:47 AM | User_4 | I think the issue about the skills needed to really embrace technologies is a huge one and with web 2.0 technologies things are becoming more complicated, as I say often you dont just get this stuff by attending a workshop, you have to participate and appropriate them to your interests and context and network of others.
11:47 AM | User_5 | We use myguide on mobile broadband for outreach. Works OK, but not great and thats in city centre boardering 3G/GPS.
30. User Visualization
Contribution Distribution of Users
[Figure: scatter plot; x-axis: Total Message Count, 0 to 60; y-axis: Exploratory Message Count, 0 to 50]
Time | User Id | Content
11:42 AM | User_1 | because although some people can get 'online' the feed is so poor that many pages won't load. eg myguide
11:42 AM | User_1 | how much time and money was spent getting everyone to use a mobile phone?
11:43 AM | User_1 | nothing. because it was perceived to be useful, therefore there is no need to spend time and money on digitalinclusion, until the access to the internet works
11:44 AM | User_1 | in order to get a 2meg connection to everyone we need fibre to the final third
31. User Visualization
Contribution Distribution of Users
[Figure: scatter plot; x-axis: Total Message Count, 0 to 60; y-axis: Exploratory Message Count, 0 to 50]
Time | User Id | Content
9:51 AM | User_6 | Hello Im a tutor at Saudi arabia branch
9:51 AM | Moderator | hello Saudi Arabia!
9:51 AM | User_6 | hi
9:52 AM | Moderator | Welcome Ashawa - did we meet in Kuwait a couple of years ago?
9:52 AM | User_6 | no actually
9:52 AM | Moderator | @ashawa - maybe next time
9:52 AM | User_6 | yes I wish
32.
33.
This step appears to have very good content that will provoke deeper learning
This step appears to have some content that will provoke deeper learning
This step appears to have little content that will provoke deeper learning
34. Conclusion
Extended our previously proposed self-training framework for exploratory discourse detection in synchronous text chat (Elluminate conference sessions).
Proposed a K Nearest Neighbors based instance selection method.
Applied the proposed approach to the SocialLearn platform.
35. Future Work
Text analytics:
Integrate KNN instance selection method into the
self-training framework
Explore other features for exploratory dialogue
classification: inter-dialogue features, global
features.
Build a more reliable dataset for sub-category classification: challenge, evaluation, reasoning, extension.
36. Future Work
Visual analytics:
Investigate how these can be rendered most
usefully for educators and learners
Investigate user feedback when deployed
Different users will appreciate different levels of
detail
Purdue Signals experience suggests that complex
underlying analytics should be usefully distilled into
very simple feedback
But as analytics literacy grows, will users value
more powerful insights?
37. Acknowledgments
Thanks for the guidance and consideration of Dr.
He Yulan, Dr. Simon and Dr. Rebecca.
Thanks for the consideration from all the other
colleagues in Knowledge Media Institute.
38.
39. Zhongyu Wei
The Chinese University of Hong Kong, Hong Kong
http://www.se.cuhk.edu.hk/~zywei/
Yulan He
The Open University, UK
http://people.kmi.open.ac.uk/yulan/
Simon Buckingham Shum
The Open University, UK
http://oro.open.ac.uk/view/person/sjb72.html
Rebecca Ferguson
The Open University, UK
http://oro.open.ac.uk/view/person/rf2656.html
Editor's notes
Here is an example from Elluminate, a web conferencing tool that supports chat alongside video, slides and presentations. Every day, hundreds of recordings are made and uploaded.
The middle panel shows the chat text for this recording, and the left panel lists all the users in the chat room. A recording can run for hours, so reading all of the content is very time-consuming. Which parts are critical and worth reading? Wouldn't it be wonderful if someone could point out the most important parts, and the users who are worth focusing on? That is what we want to show you.