What's in a textbook

Description of the research projects on Intelligent and Adaptive textbooks that I have been involved in

  1. Sergey Sosnovsky: What’s in a textbook?
  2. Architecture of an AES [diagram labels: Instructional Content, Metadata, Domain Model, User Model, Adaptation Model, Interaction, Adaptation]
  3. Math-Bridge: Rich Adaptive and Intelligent Textbooks. Reference: Sosnovsky, S., Dietrich, M., Andrès, E., Goguadze, G., Winterstein, S., Libbrecht, P., Siekmann, J., & Melis, E. (2014). Math-Bridge: Bridging the gaps in European remedial mathematics with technology-enhanced learning. In T. Wassong, D. Frischemeier, P. R. Fischer, R. Hochmuth, & P. Bender (Eds.), Mit Werkzeugen Mathematik und Stochastik lernen – Using Tools for Learning Mathematics and Statistics (pp. 437-451). Berlin/Heidelberg, Germany: Springer.
  4. Intelligent Problem Solving Support
  5. Personalized Course Generation
  6. Adaptive Navigation
  7. Metadata annotation: error-prone; time-consuming; difficult; limited tool support; often many authors; often non-expert authors • The Math-Bridge metadata schema has more than 30 elements • The Math-Bridge content collection contains more than 10 000 learning objects • About 50 people were involved in preparing this collection
  8. The Burden of Authoring • Learning content authoring has always been tedious, expertise-demanding, and poorly supported • Content and knowledge authoring for adaptive intelligent systems requires a lot of extra effort • Information and knowledge already existing in the system should become not an authoring burden but a vehicle for authoring support [diagram labels: Instructional Content; Metadata; Authoring for e-Learning; Authoring for Adaptive e-Learning; Authoring for Adaptive e-Learning as It Should Be]
  9. Semantic Gap Detection. Four main steps: (1) Conversion of Metadata to OWL2, (2) Detection of Ontology Inconsistencies, (3) Isolation of Causing Axioms, (4) Generation of Verbal Explanations. Reference: Sosnovsky, S., & Alpizar-Chacon, I. (2014). Semantic gap detection in metadata of adaptive learning environments. In Proceedings of ICALT'2014: 14th International Conference on Advanced Learning Technologies (pp. 548-552). IEEE Computer Society.
  10. Math-Bridge Metadata Schema
  11. Step 1: Conversion of Metadata to OWL2 [diagram labels: OMDoc, XSLT Stylesheet, OWL2]
  12. Step 2: Detection of Ontology Inconsistencies [diagram elements: owl:ObjectProperty activemath:hasDomainPrerequisite with rdfs:domain and rdfs:range; intro_bikers_slope (rdf:type activemath:Text); ex_tour_de_fr (rdf:type activemath:Example); classes activemath:KnowledgeItem, activemath:ConceptItem, activemath:SateliteItem; verdict: Inconsistent!]
  13. Step 3: Isolation of Causing Axioms
  14. Step 4: Generation of Verbal Explanations
  15. The Scale of the Problem [diagram labels: Interaction, Adaptation]
  16. Textbooks as a source of (extractable) knowledge • Focus (narrow, cohesive domain) • Quality (created by domain experts) • Purpose (content explains domain knowledge to a novice) • Textbook features: structure (sections / subsections); order (easy to complex); formatting (of content and headers); additional structural elements (indices, tables of contents) • Extractable knowledge: topics/subtopics (underlying content, textual labels); pedagogical relations (prerequisites <-> outcomes); text types/roles and relations (header vs important vs regular; same format = same role); meaningful labels (glossary of curated meaningful terms, set of important domain categories) • If automatically extracted and formally represented, these elements will form the model of the textbook and the model of the domain as the author understands it
  17. Linking Textbooks to Ontologies (Project 1) • A topic-based model of an HTML-based Java textbook is automatically extracted and mapped to a central ontology already linked to a set of Java exercises • The mapping serves as a bridge to jointly interpret the learner’s reading and exercise attempts in terms of the ontology and adapt access to textbook pages accordingly. Reference: Sosnovsky, S., Hsiao, I-H., & Brusilovsky, P. (2012). Adaptation “in the wild”: Ontology-based personalization of open-corpus learning material. In A. Ravenscroft, S. Lindstaedt, C. Delgado Kloos, & D. Hernández-Leo (Eds.), Proceedings of EC-TEL'2012: 7th European Conference on Technology Enhanced Learning (pp. 425-431). Berlin/Heidelberg, Germany: Springer.
  18. Linking Textbooks to Textbooks (Project 2) • Several LDA-based techniques are used to interlink sections from a set of HTML-based textbooks in a domain • A manual mapping by experts is used as the gold standard. Reference: Guerra, J., Sosnovsky, S., & Brusilovsky, P. (2013). When one textbook is not enough: Linking multiple textbooks using probabilistic topic models. In D. Hernández-Leo, T. Ley, R. Klamma, & A. Harrer (Eds.), Proceedings of EC-TEL'2013: 8th European Conference on Technology Enhanced Learning (pp. 125-138). Berlin/Heidelberg, Germany: Springer.
  19. Interlingua: linking textbooks across languages (Project 3) [diagram: textbooks in DE, EN, and FR; a semantic model of the textbook, consisting of a chapter / section / subsection hierarchy and index entries of the form term -> page#; a statistics ontology]. Reference: Alpizar-Chacon, I., van der Hart, M., Wiersma, Z., Theunissen, L., & Sosnovsky, S. (2020). Transformation of PDF Textbooks into Interactive Educational Resources. In Proceedings of the Workshop on Intelligent Textbooks at AIEd'2020 (pp. 4-16). Online, July 6, 2020.
  20. Relevant Content in One’s Mother Tongue (Project 3)
  21. intextbooks (Isaac Alpizar Chacon). Reference: Alpizar-Chacon, I., & Sosnovsky, S. (2020). Knowledge models from PDF textbooks. New Review of Hypermedia and Multimedia (in press).
  22. Model extraction from PDF textbooks • PDF as the most common and challenging format • 4 stages, 9 steps, 39 rules. Reference: Alpizar-Chacon, I., & Sosnovsky, S. (2020). Order out of Chaos: Construction of Knowledge Models from PDF Textbooks. In Proceedings of DocEng’2020: The 20th ACM Symposium on Document Engineering (Article No. 8, pp. 1–10). New York, NY, USA: ACM Press.
  23. Example Rule: REPEATED_LINES. (1) Select a sample of pages $P_s = \{p_a, p_b, \ldots, p_m\}$ with $P_s \subset P$. (2) If the first line(s) are identical across $P_s$, a header is detected and removed from all pages $p \in P$. (3) If the last line(s) are identical across $P_s$, a footer is detected and removed from all pages $p \in P$.
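The REPEATED_LINES rule can be illustrated with a short sketch. This is not the intextbooks implementation: the function names, the evenly spaced sampling strategy, and the use of exact string equality are assumptions made for illustration. Pages are modeled as lists of text lines.

```python
from typing import List


def count_shared_lines(sample: List[List[str]], from_top: bool) -> int:
    """Count leading (or trailing) lines that are identical across all sampled pages."""
    shortest = min(len(page) for page in sample)
    count = 0
    for offset in range(shortest):
        idx = offset if from_top else -1 - offset
        if len({page[idx] for page in sample}) != 1:
            break
        count += 1
    return count


def remove_repeated_lines(pages: List[List[str]], sample_size: int = 3) -> List[List[str]]:
    """Detect headers/footers on a sample P_s and strip them from every page in P."""
    if sample_size < 2:
        raise ValueError("at least two sampled pages are needed for a comparison")
    if len(pages) < 2:
        return [list(page) for page in pages]

    # Assumption: an evenly spaced sample; the slide does not say how P_s is chosen.
    step = max(1, len(pages) // sample_size)
    sample = pages[::step][:sample_size]

    header = count_shared_lines(sample, from_top=True)
    footer = count_shared_lines(sample, from_top=False)
    # Never let the detected header and footer overlap on the shortest sampled page.
    footer = min(footer, min(len(page) for page in sample) - header)

    return [page[header:len(page) - footer] for page in pages]
```

Exact equality will miss footers that contain a running page number; a fuller version would presumably normalize digits before comparing lines.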
  24. 2. Role labeling of fragments
  25. 2. Role labeling of fragments: text fragments are grouped by style (Style 1 to Style 4), and the styles are mapped to roles (body text, chapter, subchapter).

      Style | Font Family     | Font Size | Font Face | Font Color | Occurrences
      1     | Liberation Sans | 35        | Bold      | Blue       | 3
      2     | Liberation Sans | 18        | Bold      | Blue       | 1
      3     | Liberation Sans | 9         | -         | Black      | 153
      4     | Liberation Sans | 9         | Bold      | Black      | 2
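One plausible way to turn such a style table into roles is sketched below. The heuristic is an assumption and not taken from the slides: the most frequent style is treated as body text, and larger font sizes are ranked as chapter, subchapter, and so on. The `Style` type and `label_styles` function are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Style(NamedTuple):
    family: str
    size: float
    face: str
    color: str


def label_styles(fragment_styles: List[Style]) -> Dict[Style, str]:
    """Assign a role to every distinct style found in a list of fragment styles."""
    counts = Counter(fragment_styles)

    # Assumed heuristic: the most frequent style is the body text.
    body, _ = counts.most_common(1)[0]
    labels = {body: "body text"}

    # Assumed heuristic: styles larger than the body text are headings, ranked by size.
    larger = sorted((s for s in counts if s.size > body.size),
                    key=lambda s: s.size, reverse=True)
    names = ["chapter", "subchapter"]
    for rank, style in enumerate(larger):
        labels[style] = names[rank] if rank < len(names) else f"subchapter level {rank}"

    # Everything else (e.g. bold text at body size) gets no structural role here.
    for style in counts:
        labels.setdefault(style, "other")
    return labels
```

Applied to the four styles in the table, this heuristic labels Style 3 as body text, Style 1 as chapter, and Style 2 as subchapter, and leaves Style 4 without a structural role.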
  26. 3. Processing Table of Contents
  27. 3. Processing Table of Contents [annotated example of a TOC section; labels: Textbook Part, Chapter, Subchapter, Subchapter level 2, individual page numbers for each section]
  28. 3. Processing Index
  29. 3. Processing Index [annotated example of an index section; labels: multi-column layout, reading order, index term + page number, "see" case, multiline term, nested term, range of page numbers]
  30. 4. Textbook model: Structure (sections), Content (words, lines, etc.), Domain Knowledge (terms)
  31. Potential Problems of These Models (textbook-level and model-level) • Variability: structure, labels, order, focus, coverage • Subjectivity: same domain + different authors = different textbooks => different models • Quality: completeness, granularity, consistency • Lack of semantics: more structure than knowledge, lack of links, cohesiveness of topics and index terms
  32. ...nevertheless • They are automatically extracted models of high-quality resources and underlying domains • Their individual quality might not be enough, but they can be aggregated • Linking models to existing ontologies should help filter out less relevant terms and extend them with additional semantic information • Interlinking multiple models within the same domain should improve the coverage
  33. Evaluation 1 (Accuracy of model extraction) • Domains: Statistics, Computer Science, History, Literature
  34. Evaluation 1 (Accuracy of model extraction): Results (averages over all domains) • Text extraction: our approach 93.85%, PDFBox 89.72%, PdfAct 84.19% • TOC recognition: precision 99.92%, recall 99.92% • Index recognition: precision 98.56%, recall 98.13%
  35. Evaluation 2 (Value of Extracted Models – Semantic Linking of Textbooks) [diagram: chapter/subchapter outlines of Book#1 and Book#2]
  36. Evaluation 2 (Value of Extracted Models – Semantic Linking of Textbooks): Method • Ground truth: average of manual linking of two textbooks by three experts in statistics • Measure: NDCG (normalized discounted cumulative gain) at 1, 3, and 5 • Baselines: TFIDF model, LDA model
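For readers unfamiliar with the measure, NDCG@k can be computed as in the sketch below. It uses the common linear-gain, log2-discount formulation; the slides do not state which variant was used, so that choice is an assumption. The input is the list of ground-truth relevance scores of the candidate sections, in the order the model ranked them.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances: Sequence[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists candidates in decreasing relevance scores 1.0; placing relevant sections lower in the list pushes the value toward 0.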
  37. Evaluation 2 (Value of Extracted Models – Semantic Linking of Textbooks): Results [bar chart: NDCG@1, NDCG@3, and NDCG@5 on a 0 to 0.9 axis for TFIDF, LDA, TFIDF+LDA, and our model]
  38. Model linking to [target shown as an image on the slide]. Reference: Alpizar-Chacon, I., & Sosnovsky, S. (2019). Expanding the Web of Knowledge: One Textbook at a Time. In Proceedings of ACM Hypertext’2019: 30th International Conference on Hypertext and Social Media (pp. 9-18). New York, NY, USA: ACM Press.
  39. 1. Construction of the Glossary: a) Index parsing, b) Term recognition, c) Glossary creation • Preparation for the next phase [example: the index entry "Distribution 85" with nested entries "Gamma 106" and "Normal 92" yields the glossary terms Distribution (85), Gamma Distribution (106), and Normal Distribution (92), with candidate labels such as "Distribution Gamma" and "Distribution Normal"]
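The glossary step can be sketched as follows. The data layout (a flat list of entries with an optional parent term) and the `build_glossary` function are assumptions made for illustration; only the idea of generating several candidate labels for nested index terms comes from the slide.

```python
from typing import Dict, List, Optional, Tuple

# (term, parent term or None, page numbers): an assumed flat form of a parsed index
IndexEntry = Tuple[str, Optional[str], List[int]]


def build_glossary(entries: List[IndexEntry]) -> Dict[str, dict]:
    """Turn parsed index entries into glossary terms with candidate labels."""
    glossary: Dict[str, dict] = {}
    for term, parent, pages in entries:
        if parent is None:
            labels = [term]
        else:
            # A nested term is ambiguous on its own, so both word orders are kept
            # as candidate labels for later matching.
            labels = [f"{term} {parent}", f"{parent} {term}"]
        glossary[labels[0]] = {"candidate_labels": labels, "pages": pages}
    return glossary


index = [
    ("Distribution", None, [85]),
    ("Gamma", "Distribution", [106]),
    ("Normal", "Distribution", [92]),
]
glossary = build_glossary(index)
```

Keeping every candidate label, rather than choosing one early, leaves the decision to the later matching phase.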
  40. 2.a Core set construction • We use index terms to query DBpedia => find matching resources • DBpedia resources can have categories (e.g., Statistics) • Categories form a hierarchy (e.g., Statistics / Statistical_models / ...) • In the beginning, we select the target top category (define the domain); this is the only manual input required • The algorithm looks 2 more levels deeper • If a query retrieves only 1 DBpedia resource and it belongs to one of the target categories (dct:subject), this resource becomes part of the core set • The dbo:abstract values of all core set resources are concatenated to form the domain context (used at Step 2.c)
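The core set logic can be sketched without a live DBpedia endpoint by using small in-memory dictionaries. Everything below is illustrative: the function names, the dictionary layout, and the sample resource and category names are assumptions, and a real system would obtain this data through DBpedia queries.

```python
from typing import Dict, List, Set


def target_categories(top: str, subcategories: Dict[str, List[str]],
                      extra_levels: int = 2) -> Set[str]:
    """Collect the top category plus its descendants down to `extra_levels` levels."""
    targets, frontier = {top}, {top}
    for _ in range(extra_levels):
        frontier = {child for cat in frontier
                    for child in subcategories.get(cat, [])} - targets
        targets |= frontier
    return targets


def core_set(query_results: Dict[str, List[str]],
             resource_categories: Dict[str, List[str]],
             targets: Set[str]) -> Dict[str, str]:
    """Keep terms whose query returned exactly one resource inside a target category."""
    core = {}
    for term, resources in query_results.items():
        if len(resources) == 1 and targets & set(resource_categories.get(resources[0], [])):
            core[term] = resources[0]
    return core


# Hypothetical toy data standing in for DBpedia query results.
subcats = {"Statistics": ["Statistical_models"],
           "Statistical_models": ["Regression_models"],
           "Regression_models": ["Too_deep"]}
targets = target_categories("Statistics", subcats)
results = {"linear regression": ["Linear_regression"],
           "normal": ["Normal_distribution", "Normal_(album)"],
           "euro coin": ["Euro_coins"]}
categories = {"Linear_regression": ["Regression_models"], "Euro_coins": ["Coins"]}
core = core_set(results, categories, targets)
```

In this toy run, the ambiguous term "normal" is left out of the core set and deferred to the candidate set, and "euro coin" is dropped because its only resource lies outside the target categories.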
  41. 2.b Candidate set construction • If a query retrieves several DBpedia resources, they form the candidate set of the term • Context is gathered for every candidate resource: the dbo:abstract of this resource plus the dbo:abstract values of all resources linked to it • Context helps during the next step
  42. 2.c Resource disambiguation • For each resource from a candidate set, cosine similarity is computed between the context of the resource and the domain context • The resource with the highest cosine similarity (and > threshold) is matched to the term • Newly obtained resources help to extend the domain context • Step 2.c repeats until no more new terms can be matched
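A single pass of the disambiguation step can be sketched as below. The bag-of-words representation, the tokenizer, the threshold value of 0.1, and the sample resource names are assumptions; the slides only state that cosine similarity between contexts is compared against a threshold.

```python
import math
import re
from collections import Counter
from typing import Dict, Optional


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for whatever text representation is really used."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(candidates: Dict[str, str], domain_context: str,
                 threshold: float = 0.1) -> Optional[str]:
    """Return the candidate whose context is most similar to the domain context.

    `candidates` maps a resource name to its gathered context text.
    Returns None when no candidate exceeds the (assumed) threshold.
    """
    domain = bag_of_words(domain_context)
    best, best_score = None, threshold
    for resource, context in candidates.items():
        score = cosine(bag_of_words(context), domain)
        if score > best_score:
            best, best_score = resource, score
    return best
```

To mirror the iterative behaviour described on the slide, a caller would append the context of each newly matched resource to `domain_context` and rerun the loop until a pass produces no new matches.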
  43. 3. Model Enrichment • Abstract • Wikipedia link • Categories • Relation to other terms • Multilingual information • … [example for the term "standard score": abstracts in EN ("In statistics, the standard score is the (signed) number of standard deviations an observation or…"), FR ("En probabilités et statistiques, une variable centrée réduite est une variable aléatoire…"), and DE ("Unter Standardisierung oder z-Transformation versteht man in der mathematischen Statistik eine …"); Wikipedia link http://en.wikipedia.org/wiki/Standard_score; dct:subject Statistical Ratios; rdf:type yago:WikicatStatisticalRatios; related term t-statistics via dct:subject]
  44. TEI Textbook Model: Structure (sections), Content (words, lines, titles, etc.), Domain Knowledge (terms) + RDFa attributes (slide footer date: 03-12-2020)
  45. Evaluation: Linking to DBpedia • Question: Are the index terms linked to the right DBpedia resources? • Task: validate the resource disambiguation procedure • BL1 (random baseline): a random resource in the candidate list is selected as the right resource • BL2 (default sense baseline): the most linked/popular resource in the candidate list is selected as the right resource • Ground truth was created manually [results chart for Statistics#1, Statistics#2, and Information Retrieval]
  46. Evaluation: Aggregation of Models • Question: Would aggregation of additional textbooks move the model closer to the ideal domain model (all relevant resources)? • Ground truth: constructed based on the Glossary of statistical terms (> 1000 terms) • Task: compare the matching between textbooks and DBpedia with the “ideal” matching between the Glossary and DBpedia [results chart: average single textbook, average 5 textbooks, 10 textbooks]
  47. Transformation of PDF textbooks into interactive HTML: Structure (sections), Content (words, lines, titles, etc.), Domain Knowledge (terms) + RDFa attributes. Reference: Alpizar-Chacon, I., van der Hart, M., Wiersma, Z., Theunissen, L., & Sosnovsky, S. (2020). Transformation of PDF Textbooks into Interactive Educational Resources. In Proceedings of the Workshop on Intelligent Textbooks at AIEd'2020 (pp. 4-16). Online, July 6, 2020.
  48. PDF to HTML converter • Several open libraries available: pdf2htmlEX, PDFMiner, pdf2html, Xpdf, etc. • pdf2htmlEX: preserves the layout perfectly across very different types of documents; produces the same structure across different documents; fast, stable, and scalable
  49. TEI-HTML synchronizer
  50. TEI-HTML synchronizer
  51. Validation • Goal: test the accuracy of the matching algorithm for the TEI-HTML synchronization • Data: 70 university-level textbooks • Domains: statistics, computer science, web programming, literature, history • Evaluation metric: percentage of words that were matched between the TEI and HTML representations • Results: 87-90%
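The evaluation metric can be illustrated with a small sketch. The project's actual matching algorithm is not described on the slide, so the alignment below uses Python's standard `difflib.SequenceMatcher` as a stand-in, and taking the TEI word count as the denominator is also an assumption.

```python
from difflib import SequenceMatcher
from typing import List


def matched_word_percentage(tei_words: List[str], html_words: List[str]) -> float:
    """Percentage of TEI words that can be aligned, in order, with HTML words."""
    if not tei_words:
        return 0.0
    # autojunk=False keeps frequent words from being ignored in long documents.
    matcher = SequenceMatcher(a=tei_words, b=html_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(tei_words)


tei = "the mean of a sample".split()
html = "the mean of sample".split()
score = matched_word_percentage(tei, html)
```

An order-preserving alignment fits the use case better than a plain set overlap, because synchronization has to map each word occurrence in the TEI model to a specific position in the HTML.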
  52. Current Work (1): Extraction of accurate domain models from textbook indices • Index entries have different roles (different domain specificity): they introduce core domain terms, e.g. <hypotheses testing>; introduce related domain terms, e.g. <factorial>, <sample space>; or serve various pedagogical purposes (examples, use-cases, data, etc.), e.g. <Euro coin>, <Bovine Spongiform Encephalopathy>
  53. Current Work (1): Extraction of accurate domain models from textbook indices • Approach: (1) use DBpedia to infer the domain specificity of matched index terms; (2) utilise the DBpedia structure (categories and resources) and associated textual content; (3) integrate indices from multiple textbooks to discover a “better” domain model • Domains: (1) Statistics, (2) Classic Philosophy
  54. Current Work (2): From tables of contents to topics • Add rules for filtering out non-topical sections / TOC entries • Explore how the hierarchy, order, and labels of topics can help domain model extraction • Create a global table of contents of the domain from multiple textbooks • Personalised textbook generation
  55. Current Work (3): Assessment generation • Use the rich intextbooks models (structured textual content annotated with domain models, linked to DBpedia, linked to other textbooks) to generate self-assessment questions on demand, targeting a specific subset of the model/content (adaptive assessment generation)
  56. Thank you! • https://github.com/intextbooks/ITCore • https://intextbooks.science.uu.nl • Contact: Isaac Alpizar-Chacon <i.alpizarchacon@uu.nl>