Putting the world to work for ITS
1. Putting the world to work for ITS:
Open community authoring of targeted worked example problems
Aleahmad, Aleven and Kraut
6/27/2008 ITS 2008
2. Current situation in tutoring systems
• Development is very laborious
• (e.g. estimates of 200-300 hrs for 1 hr instruction)
• Small groups with much effort per person
• Distribute the development
• Open source
• Open content
• How to make a “Wikipedia” for ITS?
4. Towards a collaborative community
Cycle: Generate, Evaluate, Use, Improve
• Generate: Volunteers submit new material
• Evaluate: Others rate and critique
• Use: Link resources into tutoring systems or create new ones
• Improve: Others make the contribution better
5. Broad research questions
• If you make it, will they come?
• Can the wheat be separated from the chaff?
• How to structure and support authoring?
• For quality
• For diversity to engage students
– Contextualization, personalization, and provision of choices can improve student motivation and engagement in learning (Cordova and Lepper, 1996)
– Personalization improves performance gains and even at start (Anand and Ross, 1987; Ku and Sullivan, 2002; López and Sullivan, 1992)
6. Overview of the study
• Web site where people contribute worked example problems
• In registering, contributors indicated their professional status
• Tested a mechanism to increase quality and diversity
• Asked some authors to target their worked example to a specific person
• Increase their effort?
• Increase diversity/adaptivity of corpus?
7. Task
• Artifact: Worked example problem
– Leads to better and more efficient learning when added to interactive tutoring (McLaren et al., 2006; Schwonke et al., 2007)
– Instruct and foster self-explanation (Renkl and Atkinson, 2002)
– Customizability – both to the student and the interaction
• Domain: Pythagorean Theorem
– Most difficult skill on the Massachusetts Comprehensive Assessment System curriculum standards (ASSISTment data)
8. Whole worked example
Problem statement:
Zack and Slater want to build a bike jump. They have two parts of the ramp constructed but they need to know the length of the final piece of the jump. They have two parts of the ramp built, one is 3 ft long and the other is 4 ft long and they are constructed as shown in the diagram. What is the length of the missing section that Zack and Slater still need to construct?
Solution steps (Work + Explanation):
Work: 3^2 + 4^2
Explanation: The unknown is the hypotenuse which is represented by c in the equation. Therefore I input both a and b into the equation first. Following the equation I square both of these numbers.
Work: = 9 + 16 = 25
Explanation: These two numbers are added together first because of the parenthesis.
Work: Square root of 25 is 5 and this is the solution.
Explanation: To complete the equation I take the square root of 25 which is five. This problem also demonstrates the common Pythagoras triangle.
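The arithmetic in the worked example above can be checked with a few lines of Python. This is only a minimal sketch: the function name `hypotenuse` is illustrative and is not part of the study's authoring tool.

```python
import math

def hypotenuse(a: float, b: float) -> float:
    """Return c for a right triangle with legs a and b, from a^2 + b^2 = c^2."""
    squares_sum = a ** 2 + b ** 2   # 3^2 + 4^2 = 9 + 16 = 25
    return math.sqrt(squares_sum)   # square root of 25

# Ramp pieces from the problem statement: 3 ft and 4 ft.
missing_section_ft = hypotenuse(3, 4)
print(missing_section_ft)  # 5.0
```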
10. Open authoring hypotheses
• H1: Identifying the good from the bad contributions is easy. We expect that all contributions are good, easily fixed, or easily filtered.
• H2: Math teachers submit the best contributions.
11. Student profiles
• Goal of realism
• Varied on social and cognitive attributes
• 16 profiles
• 4 Hobbies x 4 Homes
• 4 realistic skill profiles distributed
• 2 genders distributed
12. Profile hypotheses
Profiles in experimental condition versus generic control condition
• H3: Student profiles lead to tailored contributions.
• H4: Student profiles increase the effort of authors.
• H5: Student profiles lead to higher quality contributions.
13. Participants and contributions
• Participation URL posted on web sites (educational and otherwise) offering $4-12
• 1427 people registered, of which 570 used the tool to submit 1130 contributions
• After machine filtering, 281 participants were left having submitted 551 contributions
Participation        Math teachers  Other teachers  Amateurs
Registered           131            170             1126
Also contributed     70             72              428
Also passed vetting  26             35              220
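The participation figures above can be cross-checked and turned into per-group rates. The sketch below copies the counts from the table; the function name `funnel_rates` and the choice to report two rates per group are illustrative assumptions, not something taken from the study.

```python
# Counts copied from the participation table (participants, not submissions).
registered = {"Math teachers": 131, "Other teachers": 170, "Amateurs": 1126}
contributed = {"Math teachers": 70, "Other teachers": 72, "Amateurs": 428}
passed_vetting = {"Math teachers": 26, "Other teachers": 35, "Amateurs": 220}

# The row totals should match the totals quoted in the bullets.
assert sum(registered.values()) == 1427
assert sum(contributed.values()) == 570
assert sum(passed_vetting.values()) == 281

def funnel_rates(group: str) -> tuple[float, float]:
    """Share of registrants who contributed, and share of contributors who passed vetting."""
    return (contributed[group] / registered[group],
            passed_vetting[group] / contributed[group])

for group in registered:
    contrib_rate, pass_rate = funnel_rates(group)
    print(f"{group}: {contrib_rate:.0%} contributed, {pass_rate:.0%} of those passed vetting")
```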
15. Quality ratings
Human experts rated the machine vetted submissions
Numerical value  Rating category  Definition
0                Useless          No use in teaching and it would be easier to write a new one than improve this one.
1                Easy fix         Has some faults, but they are obvious and can be fixed easily, in under 5 minutes.
2                Worthy           Worthy of being given to a student who matches on the difficulty and subject matter. Assume that the system knows what's in the problem and what is appropriate for each student, based on their skills and interests.
3                Excellent        Excellent example to provide to some student. Again, assume that the system knows what's in the problem and what is appropriate for each student, based on their skills and interests.
16. Quality rating examples
• Excellent statement with poor solution (1124)
• Worthy statement with excellent solution (337)
19. Quality by contributor expertise
Statement quality
Teacher status  Sign. diffs  Mean quality  Std Err
Math teacher    A            1.80          0.12
Other teacher   B            1.54          0.09
Not teacher     B            1.48          0.09
Solution quality
Teacher status  Sign. diffs  Mean quality  Std Err
Math teacher    A B          0.70          0.10
Other teacher   B            0.53          0.08
Not teacher     B            0.76          0.03
21. Tailoring to social attributes
Probabilities of authoring matching an attribute
G = GENERIC; N = with profiles not mentioning attribute; M = with profiles mentioning attribute
Attribute       G     N     M     F-test (G-M)  F-test (N-M)
Female pronoun  5%    4%    16%   9.68*         12.82**
Male pronoun    19%   14%   19%   0.004         1.19
Sports word     9%    9%    24%   18.01**       11.89**
TV word         4%    4%    10%   8.36*         2.63†
Music word      2%    2%    9%    6.92*         8.93**
Home word       14%   n/a   20%   3.60*         n/a
†p<.10 *p<.05 **p<.001
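According to the speaker notes, a pronoun attribute counts as matched when that pronoun is present in the problem statement. A minimal sketch of such a presence check follows. The word lists, the function names `mentions` and `match_rate`, and the two sample statements are all made up for illustration; the study's actual word lists and coding procedure are not given in the slides.

```python
import re

# Hypothetical word lists: the study's actual lists are not given in the slides.
ATTRIBUTE_WORDS = {
    "female_pronoun": ["she", "her", "hers", "herself"],
    "male_pronoun": ["he", "him", "his", "himself"],
}

def mentions(attribute: str, statement: str) -> bool:
    """True if any word for the attribute appears as a whole word, ignoring case."""
    words = "|".join(map(re.escape, ATTRIBUTE_WORDS[attribute]))
    return re.search(rf"\b(?:{words})\b", statement, flags=re.IGNORECASE) is not None

def match_rate(attribute: str, statements: list[str]) -> float:
    """Fraction of problem statements that mention the attribute."""
    return sum(mentions(attribute, s) for s in statements) / len(statements)

# Made-up problem statements, not taken from the study corpus.
statements = [
    "Maria wants to know how long her ladder must be.",
    "The ramp is 3 ft long and 4 ft high.",
]
print(match_rate("female_pronoun", statements))  # 0.5
```

Whole-word matching matters here: a plain substring test would count "he" inside "The" or "her".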
24. Tailoring to cognitive attributes
Correspondence of verbal and math skill levels with the authoring interface
Verbal skill in profile
Verbal skill shown  Sign. diffs  Mean reading level of contribution  Std Err
High                A            3.78                                0.24
Medium              A B          3.56                                0.32
Low                 B            2.93                                0.33
GENERIC             B            3.20                                0.16
General math skill in profile
Math skill shown  Sign. diffs  Probability of using 3-4-5 triangle  Std Err
High              A            16%                                  0.05
Medium            A B          26%                                  0.05
Low               B            27%                                  0.04
GENERIC           A B          21%                                  0.03
26. Effects of profiles
On effort
• Problem statements in profile condition were 25% longer
• No significant difference in time spent (median 5 minutes each on statement and solution)
On quality
• No main effect of profiles on quality
• No interaction with teacher status either
28. Recap of Hypotheses
Hypothesis 1: Quality control is easy
Short answer: Yes
Long answer: Filtering trivial; rating by experts takes less than a minute
Hypothesis 2: Math teachers contribute the best worked examples
Short answer: Partly
Long answer: Amateurs and non-math teachers wrote okay problem statements and amateurs wrote better solutions
Hypothesis 3: Profiles lead to tailoring
Short answer: Yes
Long answer: Every aspect of profiles was tailored to
Hypothesis 4: Profiles increase effort
Short answer: Inconclusive
Long answer: A quarter longer problem statement, but no difference in time
Hypothesis 5: Profiles lead to higher quality contributions
Short answer: No
Long answer: No difference in machine filtering or human rated quality
29. Current and future work
Cycle: Generate, Evaluate, Use, Improve
• Generate: Volunteers submit new material
• Evaluate: Others rate and critique
• Use: Link resources into tutoring systems or create new ones
• Improve: Others make the contribution better
32. Acknowledgements
• Thanks to ASSISTment project, Ken Koedinger and Sara Kiesler for data and feedback
• Work supported by IES and NSF
• It’s going to take a lot of connected work to build a scalable shared ITS for the world
• Let’s talk more about how
• http://OpenEducationResearch.org
33. Gratis participants
• Still 93 submissions from 92 participants
• Of these 38 submissions from 21 participants pass machine vetting
• 41% pass rate of machine vetting compared to 49% rate in experiment
• Not significantly different by Fisher's Exact Test (p=0.16)
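The comparison above can be sketched in pure Python. The slide does not spell out the 2x2 table behind the test, so the table here is an assumption: 38 of 93 gratis submissions passing versus 551 of 1130 experiment submissions passing, which reproduces the quoted 41% and 49% rates. The function `fisher_exact_two_sided` is an illustrative implementation of the usual two-sided rule (sum the probabilities of all tables no more likely than the observed one), not the authors' code.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k in the top-left cell with margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum every table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))

# Assumed table: gratis 38 passed / 55 failed, experiment 551 passed / 579 failed.
p_value = fisher_exact_two_sided(38, 93 - 38, 551, 1130 - 551)
print(f"pass rates: {38/93:.0%} vs {551/1130:.0%}, p = {p_value:.2f}")
```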
Editor's notes
[Insert a graphic to start this off]
Developing ITS is expensive and it’s done in small groups. Lots of work by skilled experts in the groups. Let’s figure out how to distribute it. Open source (Linux) and open content (Wikipedia) show us it can be done. No large scale collaboration systems for ITS authoring. The goal here is something of a Wikipedia for tutoring. Then next slide, Wikipedia on PT.
But Wikipedia itself is not the right model. E.g. this Wikipedia entry is geared to people who already know the math and want more details. No learning by doing. No doing in Wikipedia at all. If you put such information into Wikipedia, you get a note to move it to Wikiversity and Wikibooks. But if you look at those, they have hardly any content. Wikis are awkward for instructional material because they attempt to be canonical. But students learn in diverse ways and at different rates. Let’s allow divergence of resources so that materials can be tailored specifically to each student.
The work here is part of a larger study into a working collaborative community. The vision is for a model of development that is cheaper than existing methods, leads people to think more about learning, and can evolve to be the best. [walk through the cycle] The study I’m going to tell you about is in the Generation phase of the cycle, where people submit new material. [save for end: Here Improve leads to Generate because each improvement is actually a new artifact that then gets evaluated on its own. We can come back to full cycle at the end.]
To build such a system requires an understanding of the social context, so we begin by studying it empirically. [Emphasize why we want these things] E.g. if we made it, would enough people use it? Would picking out the good stuff be feasible? How can the system foster quality adaptive materials?
So to examine these, we created a prototype authoring system that works over the web. Participants show up to the site and contribute new worked example problems. We wanted to see the quality of what people contributed and how hard it is to pick out the good stuff. We also wanted to see if we could increase the quality and diversity of the contributions by manipulating the authoring tool. So some participants were asked to target their worked example to a specific person.
In more detail… ARTIFACT: Worked examples improve learning, particularly when coupled with interactive tutoring. Not very different from a simple inner loop (which by VanLehn’s Law simple may be good enough). DOMAIN: PT most difficult in ASSISTment data at the time. Perhaps machines could make exercise text more efficiently than volunteers, but not drawings.
To get an idea of what was made, here is a worked example that one of the participants submitted.
This is the tool they used to make them.
Specific hypotheses in separating wheat from chaff are that
Let me describe the experimental condition. Half the participants, randomly assigned, were asked to help “the student above” in understanding the Pythagorean Theorem. They would see one of 16 different student profiles at the top of the authoring tool. [go slowly through pics, read them out] [use pointer to contrast the features. Flip back and forth.]
We expect these differences would do these three things.
Even when I took away the money, people contributed at roughly the same quality levels. (wait for them to ask for the final slide) Of the 1130, we filtered out the contributions that didn’t follow the form. Calling it “machine filtering” because it is a simple SQL query without human intervention. So it’s very easy to do a first pass quality filter. Here we compare depth of participation across three types of participants: math teachers, other teachers, and amateurs.
Easy to filter.
After the machine filter, the remaining submissions were coded for quality by two geometry teachers. Here are the ratings and definitions they used [read them]. Three components of each problem were rated: problem statement (Statement), the work shown (Work), and the explanation of the work (Explanation). Median time to rate was ~40 seconds and they agreed at alpha=0.8. So it’s pretty easy for people to accurately rate the quality. Overall then, separating the wheat from the chaff in a production system will be feasible.
[read out the legend] [put screenshots into this the way I did with the Wikipedia page] [figure out what I want them to get from the examples. Scrolling back and forth is impossible. Cut down to two. Zoom into parts to talk about. Include where the three components differed and point that out. Explain the color codes.]
Here is the quality distribution of all the original 1130 contributions after machine filtering and human ratings. Whole here is the value from averaging the statement and the solution. See in the Filtered column: over half filtered instantly by SQL query. Other columns are the 551 human rated. In general the statements were of higher quality than the solutions. Over 300 were worthy without modification. We see that solutions were the most difficult parts to author well. And there were effects by expertise…
As predicted in H2, math teachers did write the best problem statements. See the A and B groupings of significant differences. Surprisingly, their solutions weren’t any better than the amateurs’. Amateurs did slightly though not significantly better than math teachers. Comparing amateurs to teachers all together, amateurs did significantly better. The take-away from this is that non-professional educators produce valuable contributions, which can exceed those of professionals. And educational content systems can benefit from opening the channels of contribution to all comers.
Here are the results of the student profiles manipulation. Focus on these columns [explain]. Most remarkable is the use of gender pronouns. Pronoun attributes mean presence of that pronoun in the problem statement. The generic condition is like a normal authoring tool: 19% of problems discuss males but only 5% discuss females. When you show a student profile that is female, she pronouns are included in 16% of those problems. Though this is still less than the 19% males, which is the same rate even when you show a male. Clearly males are the default mindset. Another strong effect came from including the sports hobby: discussion of sports went from 9% to 24%. Same pattern for all the other social attributes in the profile (well except favorite color).
To give you an idea of the tailoring. [read out loud] Here they used the 3-4-5 Pythagorean Triple.
Another drawing on the profile details. [read out loud]
So when shown a student profile, people tailor their contributions to the social attributes shown. What about the cognitive attributes? We expect the difficulty measures in contributions for high skill profiles to differ from the low. And that’s in fact what we see. Comparing the reading level of the contributions, High verbal profiles were significantly different from low, by almost a grade level. Also significantly different from generic. Same situation with math skill, measured by probability of making the problem around the 3-4-5 triangle, the simplest Pythagorean triple. So in the Generic condition, 21% of the problems used the integers 3, 4 and 5.
Here’s another example, one of my favorites. The student was High in verbal, “top of their English class”. The author customized not just the difficulty but the engagement of the content.
The last two hypotheses were not confirmed. No clear effect on effort. While problem statements in the profile condition were 25% longer, this may not be a good measure. Another measure, time spent, had no significant difference. Profiles had no effect on quality. There was no difference in quality between the conditions.
More parts of the design to study.
Right now I’m running a second web study of how people evaluate and improve the problems from the study described here.
I plan to develop a production web site in the fall for educators to create, use, improve, and discuss worked example problems. Part of this will be how to motivate contributions (in the form of original works, improvements, feedback, etc.). If the system grows enough, I look forward to classroom studies in which students are involved in making, rating and improving problems. I also intend to provide open data APIs and linking in with other projects. I think this collaborative system will best be built collaboratively.