SlideShare a Scribd company logo
1 of 18
Model of the Month Club
Meeting 1:
What is a model in DH?
Example: Underwood et al., Understanding Genre
Essentially, all models are wrong, but some
are useful.
--George Box
statistician
1919-1913
What’s a model? (broadest definition)
A model is a simplified representation of something, and in
principle models can be built out of words, balsa wood, or
anything you like. In practice... statistical models are often
equations that describe the probability of an association
between variables.
Ted Underwood, Seven ways humanists are using computers to understand text.
http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
What kinds of models are we looking at in this
workshop?
Ted Underwood, Seven ways humanists are using computers to understand text.
http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
What kinds of model are we looking at today?
Ted Underwood, Seven ways humanists are using computers to understand text.
http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
Understanding Genre in a Collection of a Million
Volumes (Underwood et al.)
Problem: Classification
Why is this a problem?
1) HathiTrust has poor genre metadata.
2)Volumes are generically heterogeneous.
Desired result: provide a way of sorting HathiTrust text data to
make it useful for literary scholars
Classification as a form of machine learning
In general:
Data → training data → predictive classifier (model) → prediction → evaluation
This project:
Text → coded text → regularized logistic regression → prediction → 93.9% accurate
+
hidden Markov smoothing
Data: Text
We began by obtaining full text of all public-domain English-language
works in HathiTrust between 1700 and 1922. Organizing a group of five readers,
we asked them to label individual pages in a total of 414 books; this produced
our training data. We transformed the text of all the books into counts of fea-
tures on each page; most of these features were words that we counted, but we
also recorded other details of page structure.
Underwood: Understanding Genre Interim Report
Background: Bag of words
Training data: Coded text
223 volumes were tagged by five people, with assigned volume lists over-
lapping so that almost all the pages in the volumes were read by at least two
readers (and some by three). This strategy allowed us to make tentative esti-
mates of human dissensus, which were invaluable. But it was a relatively slow
process, because it required coordination. The remaining 191 volumes were
simply tagged at the page level by the PI. In cases where we had three readers,
we resolved human disagreements by voting. In other cases, we accepted the
more general genre tag, or the tag produced by more experienced readers.
But:
Selection of volumes: was probably the most questionable aspect of our methodology,
and an area we will give more attention as we expand into the twentieth century.
Classification: Feature engineering
We used 1062 features in our models. 1036 of them were words, or word cate-
gories; a full list is available on Github: https://github.com/tedunderwood/
genre/blob/master/data/biggestvocabulary.txt. In general, we selected
features by grouping pages into the categories we planned to classify. We took
the top 500 words from each category, and then grouped the words from all
categories into a master list that we could limit to the top N most frequent
words. This ensured that our list contained words like “8vo” and “ibid” that
might be uncommon in the whole corpus, but extremely dispositive as clues
about a particular class of pages. We normalized everything to lowercase (after
counting certain forms of capitalization as “structural features”) and truncated
final apostrophe-s.
Classification: Regularized Logistic Regression
Once we had designed this overall workflow, it was possible to plug dif-
ferent classification algorithms into the page-level classification step of the
process. We tried a range of algorithms here, including random forests and
support vector machines. We also tried a range of different ensemble strate-
gies, including strategies that combine multiple algorithms, before settling on
an ensemble of regularized logistic models, trained by comparing each genre
to all the other genres collectively.
Regularized Logistic Regression
name for a kind of classification algorithm: a set of assumptions & mathematical
processes designed to predict the likelihood that a given set of features occurring
on a single page mean that page belongs to a specific genre
in general:
calculating the odds that, given the presence of a particular feature/set of
features, a certain class is likely compared to the odds of instance being that
class without those features
does not need or assume linear relationship between variables
does not assume a distribution
Example
http://courses.washington.edu/css490/2012.Winter/lecture_slides/05b_logistic_regression.pdf
+ a Hidden Markov Model
assumes probability affected by immediate prior in a sequence & that this is a
hidden state (something external influencing instance probability)
used in this case to try to incorporate the fact that the genre of the volume has
something to do with the volume of the page
From project:
There are a variety of clever approaches that might be tried to coordi-
nate page-level predictions with knowledge of volume structure. We trained
a hidden Markov model, which is is a relatively simple approach. The model
contains information about the probability of transition from one genre to
another, so it is in a sense a model of volume structure. But in practice, its
main effect is to smooth out noisy single-page errors—for instance, it was good
at catching a few isolated pages misclassified as nonfiction in the middle of a
Evaluation
Result
https://sharc.hathitrust.org
/genre
Discussion

More Related Content

What's hot

Development of Semantic Web based Disaster Management System
Development of Semantic Web based Disaster Management SystemDevelopment of Semantic Web based Disaster Management System
Development of Semantic Web based Disaster Management SystemNIT Durgapur
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked datavafopoulos
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 
Annotated Text feature in Elasticsearch
Annotated Text feature in ElasticsearchAnnotated Text feature in Elasticsearch
Annotated Text feature in ElasticsearchYann Cluchey
 
Linked data 101: Getting Caught in the Semantic Web
Linked data 101: Getting Caught in the Semantic Web Linked data 101: Getting Caught in the Semantic Web
Linked data 101: Getting Caught in the Semantic Web Morgan Briles
 

What's hot (6)

Development of Semantic Web based Disaster Management System
Development of Semantic Web based Disaster Management SystemDevelopment of Semantic Web based Disaster Management System
Development of Semantic Web based Disaster Management System
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 
Annotated Text feature in Elasticsearch
Annotated Text feature in ElasticsearchAnnotated Text feature in Elasticsearch
Annotated Text feature in Elasticsearch
 
Linked data 101: Getting Caught in the Semantic Web
Linked data 101: Getting Caught in the Semantic Web Linked data 101: Getting Caught in the Semantic Web
Linked data 101: Getting Caught in the Semantic Web
 
semantic web & natural language
semantic web & natural languagesemantic web & natural language
semantic web & natural language
 

Viewers also liked

TIK Kelas 9 Bab 3
TIK Kelas 9 Bab 3TIK Kelas 9 Bab 3
TIK Kelas 9 Bab 3Dianovitaw
 
Dorce school projects
Dorce school projectsDorce school projects
Dorce school projectsEUROPAGES
 
Facebook Marketing 101
Facebook Marketing 101Facebook Marketing 101
Facebook Marketing 101Angela Frizell
 
Brain Fog 1.16.16 Final
Brain Fog  1.16.16 FinalBrain Fog  1.16.16 Final
Brain Fog 1.16.16 FinalCarrie Reynoso
 
Genkei snacks continued
Genkei snacks continuedGenkei snacks continued
Genkei snacks continuedRaquel808
 
Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"
Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"
Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"ec2e2n
 
UNIRATE24 партнерская программа
UNIRATE24 партнерская программаUNIRATE24 партнерская программа
UNIRATE24 партнерская программаEBC
 
BuildBlock Insulating Concrete Form (ICF) Construction in 22 Steps
BuildBlock Insulating Concrete Form (ICF) Construction in 22 StepsBuildBlock Insulating Concrete Form (ICF) Construction in 22 Steps
BuildBlock Insulating Concrete Form (ICF) Construction in 22 StepsBrian Corder
 
TIK Kelas 9 Bab 5
TIK Kelas 9 Bab 5TIK Kelas 9 Bab 5
TIK Kelas 9 Bab 5nvitaeka
 
Organizational effectiveness
Organizational effectivenessOrganizational effectiveness
Organizational effectivenessgaurav jain
 
Enterprise workshops jira security and permissions management atlassian deck
Enterprise workshops jira security and permissions management atlassian deckEnterprise workshops jira security and permissions management atlassian deck
Enterprise workshops jira security and permissions management atlassian deckAtlassian
 
Kansen voor Wikipedia-fotodag in de Spoozone in Tilburg
Kansen voor Wikipedia-fotodag in de Spoozone in TilburgKansen voor Wikipedia-fotodag in de Spoozone in Tilburg
Kansen voor Wikipedia-fotodag in de Spoozone in TilburgOlaf Janssen
 
CT site visit report
CT site visit report CT site visit report
CT site visit report Yap Xin
 
Computer Network notes (handwritten) UNIT 1
Computer Network notes (handwritten) UNIT 1Computer Network notes (handwritten) UNIT 1
Computer Network notes (handwritten) UNIT 1NANDINI SHARMA
 
Construction Materials site visit report
Construction Materials site visit report Construction Materials site visit report
Construction Materials site visit report Batoul Alshamali
 

Viewers also liked (20)

Black berry curve 9300
Black berry curve 9300Black berry curve 9300
Black berry curve 9300
 
TIK Kelas 9 Bab 3
TIK Kelas 9 Bab 3TIK Kelas 9 Bab 3
TIK Kelas 9 Bab 3
 
Dorce school projects
Dorce school projectsDorce school projects
Dorce school projects
 
Hitler
HitlerHitler
Hitler
 
Facebook Marketing 101
Facebook Marketing 101Facebook Marketing 101
Facebook Marketing 101
 
Brain Fog 1.16.16 Final
Brain Fog  1.16.16 FinalBrain Fog  1.16.16 Final
Brain Fog 1.16.16 Final
 
Genkei snacks continued
Genkei snacks continuedGenkei snacks continued
Genkei snacks continued
 
Daniel Ruiz CV
Daniel Ruiz CVDaniel Ruiz CV
Daniel Ruiz CV
 
Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"
Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"
Aleksandra Gejdel Odbiorca Ogólny "Kobiety w świecie chemii i nauki"
 
Psaradiko
PsaradikoPsaradiko
Psaradiko
 
Contes des landes
Contes des landesContes des landes
Contes des landes
 
UNIRATE24 партнерская программа
UNIRATE24 партнерская программаUNIRATE24 партнерская программа
UNIRATE24 партнерская программа
 
BuildBlock Insulating Concrete Form (ICF) Construction in 22 Steps
BuildBlock Insulating Concrete Form (ICF) Construction in 22 StepsBuildBlock Insulating Concrete Form (ICF) Construction in 22 Steps
BuildBlock Insulating Concrete Form (ICF) Construction in 22 Steps
 
TIK Kelas 9 Bab 5
TIK Kelas 9 Bab 5TIK Kelas 9 Bab 5
TIK Kelas 9 Bab 5
 
Organizational effectiveness
Organizational effectivenessOrganizational effectiveness
Organizational effectiveness
 
Enterprise workshops jira security and permissions management atlassian deck
Enterprise workshops jira security and permissions management atlassian deckEnterprise workshops jira security and permissions management atlassian deck
Enterprise workshops jira security and permissions management atlassian deck
 
Kansen voor Wikipedia-fotodag in de Spoozone in Tilburg
Kansen voor Wikipedia-fotodag in de Spoozone in TilburgKansen voor Wikipedia-fotodag in de Spoozone in Tilburg
Kansen voor Wikipedia-fotodag in de Spoozone in Tilburg
 
CT site visit report
CT site visit report CT site visit report
CT site visit report
 
Computer Network notes (handwritten) UNIT 1
Computer Network notes (handwritten) UNIT 1Computer Network notes (handwritten) UNIT 1
Computer Network notes (handwritten) UNIT 1
 
Construction Materials site visit report
Construction Materials site visit report Construction Materials site visit report
Construction Materials site visit report
 

Similar to Temple University Digital Scholarship Center: Model of the Month Club: September 2015

Semantic engagement handouts
Semantic engagement handoutsSemantic engagement handouts
Semantic engagement handoutsSTIinnsbruck
 
Artificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdf
Artificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdfArtificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdf
Artificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdfNho Vĩnh
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisJonathan Stray
 
Semantic engagement
Semantic engagementSemantic engagement
Semantic engagementSTIinnsbruck
 
Probabilistic Topic Models
Probabilistic Topic ModelsProbabilistic Topic Models
Probabilistic Topic ModelsSteve Follmer
 
Concepts as Action-Oriented as 'Search'
Concepts as Action-Oriented as 'Search'Concepts as Action-Oriented as 'Search'
Concepts as Action-Oriented as 'Search'mahmad
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)Svitlana volkova
 
RDFa Semantic Web
RDFa Semantic WebRDFa Semantic Web
RDFa Semantic WebRob Paok
 
GLIT 6757 (PEI) Winter 2012: Seminar 2
GLIT 6757 (PEI) Winter 2012: Seminar 2GLIT 6757 (PEI) Winter 2012: Seminar 2
GLIT 6757 (PEI) Winter 2012: Seminar 2Michele Knobel
 
GLIT mississauga, Seminar 2
GLIT  mississauga, Seminar 2GLIT  mississauga, Seminar 2
GLIT mississauga, Seminar 2Michele Knobel
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Artificial Intelligence - A modern approach 3ed
Artificial Intelligence - A modern approach 3edArtificial Intelligence - A modern approach 3ed
Artificial Intelligence - A modern approach 3edRohanMistry15
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...Lê Anh Đạt
 

Similar to Temple University Digital Scholarship Center: Model of the Month Club: September 2015 (20)

Semantic engagement handouts
Semantic engagement handoutsSemantic engagement handouts
Semantic engagement handouts
 
RM Q1_Q15_Q16_Q17.docx
RM Q1_Q15_Q16_Q17.docxRM Q1_Q15_Q16_Q17.docx
RM Q1_Q15_Q16_Q17.docx
 
Artificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdf
Artificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdfArtificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdf
Artificial_Intelligence__A_Modern_Approach_Nho Vĩnh Share.pdf
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
Semantic engagement
Semantic engagementSemantic engagement
Semantic engagement
 
Word Embedding In IR
Word Embedding In IRWord Embedding In IR
Word Embedding In IR
 
Probabilistic Topic Models
Probabilistic Topic ModelsProbabilistic Topic Models
Probabilistic Topic Models
 
Concepts as Action-Oriented as 'Search'
Concepts as Action-Oriented as 'Search'Concepts as Action-Oriented as 'Search'
Concepts as Action-Oriented as 'Search'
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)
 
RDFa Semantic Web
RDFa Semantic WebRDFa Semantic Web
RDFa Semantic Web
 
GLIT 6757 (PEI) Winter 2012: Seminar 2
GLIT 6757 (PEI) Winter 2012: Seminar 2GLIT 6757 (PEI) Winter 2012: Seminar 2
GLIT 6757 (PEI) Winter 2012: Seminar 2
 
GLIT mississauga, Seminar 2
GLIT  mississauga, Seminar 2GLIT  mississauga, Seminar 2
GLIT mississauga, Seminar 2
 
GCRD 6353: Seminar 2
GCRD 6353: Seminar 2GCRD 6353: Seminar 2
GCRD 6353: Seminar 2
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Artificial Intelligence - A modern approach 3ed
Artificial Intelligence - A modern approach 3edArtificial Intelligence - A modern approach 3ed
Artificial Intelligence - A modern approach 3ed
 
Study_Report
Study_ReportStudy_Report
Study_Report
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...
 
Oss swot
Oss swotOss swot
Oss swot
 

Recently uploaded

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 

Recently uploaded (20)

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 

Temple University Digital Scholarship Center: Model of the Month Club: September 2015

  • 1. Model of the Month Club Meeting 1: What is a model in DH? Example: Underwood et al., Understanding Genre
  • 2. Essentially, all models are wrong, but some are useful. --George Box statistician 1919-1913
  • 3. What’s a model? (broadest definition) A model is a simplified representation of something, and in principle models can be built out of words, balsa wood, or anything you like. In practice... statistical models are often equations that describe the probability of an association between variables. Ted Underwood, Seven ways humanists are using computers to understand text. http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
  • 4. What kinds of models are we looking at in this workshop? Ted Underwood, Seven ways humanists are using computers to understand text. http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
  • 5. What kinds of model are we looking at today? Ted Underwood, Seven ways humanists are using computers to understand text. http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
  • 6. Understanding Genre in a Collection of a Million Volumes (Underwood et al.) Problem: Classification Why is this a problem? 1) HathiTrust has poor genre metadata. 2)Volumes are generically heterogeneous. Desired result: provide a way of sorting HathiTrust text data to make it useful for literary scholars
  • 7. Classification as a form of machine learning In general: Data → training data → predictive classifier (model) → prediction → evaluation This project: Text → coded text → regularized logistic regression → prediction → 93.9% accurate + hidden Markov smoothing
  • 8. Data: Text We began by obtaining full text of all public-domain English-language works in HathiTrust between 1700 and 1922. Organizing a group of five readers, we asked them to label individual pages in a total of 414 books; this produced our training data. We transformed the text of all the books into counts of fea- tures on each page; most of these features were words that we counted, but we also recorded other details of page structure. Underwood: Understanding Genre Interim Report
  • 10. Training data: Coded text 223 volumes were tagged by five people, with assigned volume lists over- lapping so that almost all the pages in the volumes were read by at least two readers (and some by three). This strategy allowed us to make tentative esti- mates of human dissensus, which were invaluable. But it was a relatively slow process, because it required coordination. The remaining 191 volumes were simply tagged at the page level by the PI. In cases where we had three readers, we resolved human disagreements by voting. In other cases, we accepted the more general genre tag, or the tag produced by more experienced readers. But: Selection of volumes: was probably the most questionable aspect of our methodology, and an area we will give more attention as we expand into the twentieth century.
  • 11. Classification: Feature engineering We used 1062 features in our models. 1036 of them were words, or word cate- gories; a full list is available on Github: https://github.com/tedunderwood/ genre/blob/master/data/biggestvocabulary.txt. In general, we selected features by grouping pages into the categories we planned to classify. We took the top 500 words from each category, and then grouped the words from all categories into a master list that we could limit to the top N most frequent words. This ensured that our list contained words like “8vo” and “ibid” that might be uncommon in the whole corpus, but extremely dispositive as clues about a particular class of pages. We normalized everything to lowercase (after counting certain forms of capitalization as “structural features”) and truncated final apostrophe-s.
  • 12. Classification: Regularized Logistic Regression Once we had designed this overall workflow, it was possible to plug dif- ferent classification algorithms into the page-level classification step of the process. We tried a range of algorithms here, including random forests and support vector machines. We also tried a range of different ensemble strate- gies, including strategies that combine multiple algorithms, before settling on an ensemble of regularized logistic models, trained by comparing each genre to all the other genres collectively.
  • 13. Regularized Logistic Regression name for a kind of classification algorithm: a set of assumptions & mathematical processes designed to predict the likelihood that a given set of features occurring on a single page mean that page belongs to a specific genre in general: calculating the odds that, given the presence of a particular feature/set of features, a certain class is likely compared to the odds of instance being that class without those features does not need or assume linear relationship between variables does not assume a distribution
  • 15. + a Hidden Markov Model assumes probability affected by immediate prior in a sequence & that this is a hidden state (something external influencing instance probability) used in this case to try to incorporate the fact that the genre of the volume has something to do with the volume of the page From project: There are a variety of clever approaches that might be tried to coordi- nate page-level predictions with knowledge of volume structure. We trained a hidden Markov model, which is is a relatively simple approach. The model contains information about the probability of transition from one genre to another, so it is in a sense a model of volume structure. But in practice, its main effect is to smooth out noisy single-page errors—for instance, it was good at catching a few isolated pages misclassified as nonfiction in the middle of a