Stylometry of literary papyri
Holger Essler, Jeremi K. Ochab
Institute of Physics
Jagiellonian University
DATeCH 2019
10th May 2019 Brussels
Questions & Aims
How can we correct, improve, or disambiguate
uncertain or non-specific metadata?
› Author
› Genre
› Place
› Date
› Keywords
› …
› Can we extract them from text?
Data
Metadata
Processing
Data
https://github.com/DCLP/idp.data/tree/dclp/DCLP
Data
14624
metadata
279 transcriptions (paraliterary)
298 transcriptions
748 transcriptions
Data: metadata
14624
metadata
298 transcriptions
748 transcriptions
• Greek
• known author
• >50 words
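The slide lists three criteria next to the transcription counts. Assuming they act as filters on the transcriptions, the selection could be sketched as below. The `Transcription` record and its field names are hypothetical and are not the DCLP schema; counting words by whitespace splitting is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transcription:
    # Hypothetical record layout, for illustration only.
    text: str
    language: str
    author: Optional[str]  # None when the author is unknown


def select_transcriptions(items: List[Transcription], min_words: int = 50) -> List[Transcription]:
    """Keep Greek texts with a known author and more than `min_words` words."""
    return [
        t for t in items
        if t.language == "Greek"
        and t.author is not None
        and len(t.text.split()) > min_words  # whitespace tokenisation (assumption)
    ]


sample = [
    Transcription("λόγος " * 60, "Greek", "Author A"),   # kept
    Transcription("λόγος " * 60, "Greek", None),         # unknown author
    Transcription("λόγος " * 10, "Greek", "Author B"),   # too short
    Transcription("verbum " * 60, "Latin", "Author C"),  # not Greek
]
selected = select_transcriptions(sample)
```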
Data: metadata
14624
metadata
298 transcriptions
www.trismegistos.org
/place/2722
/authorwork/3062
Data: metadata
14624
metadata
298 transcriptions
• Author
• Place (written): region, nome
• Place (found): region, nome
• Date: not before, not after
• Keywords, e.g.: prose, gospel, literature, Christian
• Material, e.g.: papyrus, pottery
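One possible in-memory shape for these fields is sketched below. The class and attribute names are hypothetical, and representing dates as integer years is an assumption; the slides do not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Place:
    region: Optional[str] = None
    nome: Optional[str] = None


@dataclass
class PapyrusMetadata:
    # Hypothetical container mirroring the fields listed on the slide.
    author: Optional[str] = None
    place_written: Place = field(default_factory=Place)
    place_found: Place = field(default_factory=Place)
    not_before: Optional[int] = None  # year (assumption: integer years)
    not_after: Optional[int] = None
    keywords: List[str] = field(default_factory=list)
    material: Optional[str] = None

    def date_span(self) -> Optional[int]:
        """Width of the dating interval in years, or None if a bound is missing."""
        if self.not_before is None or self.not_after is None:
            return None
        return self.not_after - self.not_before


# Placeholder values, not taken from the corpus.
record = PapyrusMetadata(
    author="Example Author",
    not_before=100,
    not_after=250,
    keywords=["prose", "literature"],
    material="papyrus",
)
```

A wide `date_span()` is one simple way to flag records whose dating is "uncertain or not specific" in the sense of the opening question.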
Data: metadata
Figure labels: Philodemus; single-text authors
Data: cleaning
14624
298 transcriptions
http://papyri.info/docs/leiden_plus
Data: cleaning
14624
298 transcriptions
Manually tagged for:
• line break, paragraph, gap, hand-shift markers
• spelling corrections
• unclear reading
• deletions
• expanded abbreviations
• orthographic regularisation
• diacritical marks
• certainty or reason for above
Data: cleaning
298 transcriptions
Two strategies:
› diversifying: retain <orig> and <hi>, omit <reg> and <ex>
› normalising: omit <orig> and <hi>, retain <reg> and <ex>
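The two cleaning strategies can be sketched as a text extractor that skips a configurable set of tags. This is a minimal illustration, not the authors' pipeline: the sample XML is invented, and "omitting" a tag is assumed here to mean dropping the element together with its enclosed text while keeping the text that follows it.

```python
import xml.etree.ElementTree as ET

# Tag sets taken from the slide; how "omit" treats enclosed text is an assumption.
DIVERSIFYING_OMIT = {"reg", "ex"}   # keeps <orig>, <hi>
NORMALISING_OMIT = {"orig", "hi"}   # keeps <reg>, <ex>


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]


def extract_text(elem: ET.Element, omit: set) -> str:
    """Concatenate text under `elem`, skipping the content of omitted tags."""
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child.tag) not in omit:
            parts.append(extract_text(child, omit))
        parts.append(child.tail or "")  # text after the child is always kept
    return "".join(parts)


# Invented fragment: a regularised spelling and an expanded abbreviation.
sample = "<ab><choice><reg>καί</reg><orig>κε</orig></choice> θ<ex>εό</ex>ς</ab>"
root = ET.fromstring(sample)
diversified = extract_text(root, DIVERSIFYING_OMIT)
normalised = extract_text(root, NORMALISING_OMIT)
```

The diversifying variant keeps the scribe's spelling and the unexpanded abbreviation, so word forms fragment into more types; the normalising variant merges them.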
Methods
Distance-based clustering
Community detection in networks
Clustering quality measures
Distance-based clustering
Compute text similarity » word frequencies
› Burrows's Delta » Manhattan (taxicab) distance of z-scored frequencies
› Eder's, Argamon's, …
› cosine delta
Hierarchically cluster (unsupervised)
› single, complete, …
› Ward linkage
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
Burrows, J.F. (2002). "Delta": A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
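The two distances named above can be sketched in plain Python. The frequency table is invented, and using the sample standard deviation for the z-scores is an assumption. Burrows's Delta is taken as the mean absolute difference of z-scores (the Manhattan distance divided by the number of words), and cosine delta as one minus the cosine similarity of the z-score vectors.

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(freqs):
    """freqs: one row per text, one column per word (relative frequencies)."""
    columns = list(zip(*freqs))
    mus = [mean(c) for c in columns]
    sigmas = [stdev(c) for c in columns]  # sample std. dev. (assumption)
    return [
        [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(row, mus, sigmas)]
        for row in freqs
    ]


def burrows_delta(za, zb):
    """Mean absolute difference of z-scores (Manhattan distance / n)."""
    return sum(abs(a - b) for a, b in zip(za, zb)) / len(za)


def cosine_delta(za, zb):
    """1 - cosine similarity of the z-score vectors (non-zero vectors assumed)."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm_a = sqrt(sum(a * a for a in za))
    norm_b = sqrt(sum(b * b for b in zb))
    return 1.0 - dot / (norm_a * norm_b)


# Invented relative frequencies of three frequent words in three texts.
freqs = [
    [0.05, 0.03, 0.01],
    [0.04, 0.03, 0.02],
    [0.01, 0.06, 0.02],
]
z = z_scores(freqs)
d01 = burrows_delta(z[0], z[1])
d02 = burrows_delta(z[0], z[2])
```

The resulting pairwise distance matrix is what a hierarchical clustering routine (single, complete, or Ward linkage) would take as input.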
Community detection in networks
› Louvain (modularity)
› Infomap
› OSLOM
› …
M.E.J. Newman, The Structure and Function of Complex Networks, SIAM Review 45 (2003) 167–256.
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
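Louvain searches for a partition that maximises modularity. The sketch below only scores a given partition of a weighted, undirected graph; it does not perform the search. The toy graph is invented, and treating texts as nodes with similarity-derived edge weights is an assumption about how such a network would be built.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition.

    edges: iterable of (u, v, weight) for an undirected graph without self-loops.
    community: dict mapping each node to its community label.
    """
    total = 0.0                    # total edge weight m
    internal = defaultdict(float)  # weight of edges inside each community
    degree = defaultdict(float)    # summed node strengths per community
    for u, v, w in edges:
        total += w
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(
        internal[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree
    )


# Toy graph: two triangles joined by a single bridge edge.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the behaviour a modularity optimiser exploits.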
Clustering quality measures
Many different indices:
› Jaccard, Dunn, silhouette, Davies-Bouldin, …
› Rand index » adjusted
› mutual information » normalised » adjusted » standardised
› some selection bias remaining (number and size of clusters)
Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080.
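As an illustration of a chance-corrected index, the adjusted Rand index can be computed from pair counts as sketched below. This covers only the "Rand index » adjusted" branch of the list; the adjusted mutual information is not implemented here. The label lists are invented, and at least two items are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items (n >= 2)."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


authors = ["A", "A", "B", "B"]             # invented ground truth
perfect = adjusted_rand_index(authors, [1, 1, 0, 0])
crossed = adjusted_rand_index(authors, [0, 1, 0, 1])
```

A value of 1 means identical partitions up to relabeling, values near 0 mean agreement no better than chance, and negative values mean worse than chance.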
Results
Results
It is hard!
Results
› Best network clustering
» modularity optimisation: adjusted mutual information (AMI) = 0.22 (very low)
» number of clusters: 7
› Which similarity measure
» Burrows's Delta: AMI < 0.1 (terrible)
» cosine delta: AMI = 0.25 (very low, ~0.6 in novels)
» number of clusters: 15–25 (close)
Roger Guimera, Marta Sales-Pardo, and Luís A Nunes Amaral. 2004. Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004), 025101.
Results
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital
Scholarship in the Humanities 32, 1 (2017), 50–64.
Conclusions
› Results
o clustering depends on text regularisation
o trade-off between sparseness and distinctiveness of features(?)
› Problems
o imbalanced data
o texts too small
› Outlook
o N-grams + SVD to circumvent sparseness
o augment texts preserved by medieval transmission
o supervised ML to narrow down: genre/text type, dates, places, …
o documentary papyri
J Rybicki
Institute of English Studies
Jagiellonian University
Grants: 2017/26/E/HS2/01019
M Eder
J Byszuk
H Essler
S Pielström
References:
› J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
› computationalstylistics.github.io
› https://github.com/computationalstylistics/stylometry_of_papyri
Thank
you!
Questions?

Session6 02.jeremi ochab

  • 1. Stylometry of literary papyri Holger Essler, Jeremi K. Ochab Institute of Physics Jagiellonian University DATeCH 2019 10th May 2019 Brussels
  • 3. Questions&Aims How can we correct/improve/disambiguate uncertain or not specific metadata? › Author › Genre › Place › Date › Keywords › …
  • 4. Questions&Aims How can we correct/improve/disambiguate uncertain or not specific metadata? › Author › Genre › Place › Date › Keywords › … › Can we extract them from text?
  • 5. Questions&Aims How can we correct/improve/disambiguate uncertain or not specific metadata? › Author › Genre › Place › Date › Keywords › … › Can we extract them from text?
  • 15. Data: metadata 14624 metadata 748 transcriptions • Greek • known author • >50 words
  • 16. Data: metadata 14624 metadata 298 transcriptions 748 transcriptions • Greek • known author • >50 words
  • 19-25. Data: metadata
      14624 metadata records, 298 transcriptions
      • Author
      • Place (written): region, nome
      • Place (found): region, nome
      • Date: not before, not after
      • Keywords, e.g.: prose, gospel, literature, Christian
      • Material, e.g.: papyrus, pottery
      Figure labels (slide 21): Philodemus; single-text authors
  • 27-29. Data: cleaning
      14624 metadata records, 298 transcriptions
      Manually tagged for:
      • line break, paragraph, gap, hand-shift markers
      • spelling corrections
      • unclear reading
      • deletions
      • expanded abbreviations
      • orthographic regularisation
      • diacritical marks
      • certainty or reason for above
      Two strategies:
      › diversifying: retain <orig> and <hi>, omit <reg> and <ex>
      › normalising: omit <orig> and <hi>, retain <reg> and <ex>
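
The two cleaning strategies can be illustrated with a minimal sketch. It assumes TEI-style XML without namespaces (real files are likely namespaced, which this sketch ignores) and reads "omitting" a tag as dropping the element together with its content. The function names and the sample fragment are invented for illustration and are not taken from the authors' code.

```python
import xml.etree.ElementTree as ET

# Tags whose content is dropped under each strategy (as listed on the slide).
STRATEGIES = {
    "diversifying": {"reg", "ex"},   # keep <orig>, <hi>
    "normalising": {"orig", "hi"},   # keep <reg>, <ex>
}

def extract_text(elem, omit):
    """Return the text of elem, skipping elements whose tag is in omit."""
    if elem.tag in omit:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(extract_text(child, omit))
        # Tail text follows the child but belongs to the parent, so keep it
        # even when the child itself is omitted.
        parts.append(child.tail or "")
    return "".join(parts)

def clean(xml_string, strategy):
    """Apply one strategy and collapse whitespace."""
    root = ET.fromstring(xml_string)
    return " ".join(extract_text(root, STRATEGIES[strategy]).split())

# Invented sample: a regularised spelling and an expanded abbreviation.
SAMPLE = (
    "<ab>καὶ <choice><reg>τιμή</reg><orig>τειμή</orig></choice> "
    "<expan>θ<ex>εο</ex>ῦ</expan></ab>"
)

print(clean(SAMPLE, "diversifying"))
print(clean(SAMPLE, "normalising"))
```

The diversifying reading keeps the scribe's spelling and the unexpanded abbreviation, while the normalising reading keeps the editor's regularised and expanded forms.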
  • 30. Methods
      › Distance-based clustering
      › Community detection in networks
      › Clustering quality measures
  • 32-35. Distance-based clustering
      Compute text similarity
      » word frequencies
      › Burrows's delta
        » taxi (Manhattan) distance of z-scored frequencies
      › Eder's, Argamon's, …
      › cosine delta
      Hierarchically cluster (unsupervised)
      › single, complete, …
      › Ward linkage
      References:
      Burrows, J.F. (2002). "Delta": A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
      J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
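
As a rough illustration of the distances named on this slide, the sketch below z-scores relative word frequencies and compares two texts with Burrows's delta (mean absolute difference, i.e. a scaled taxi distance) and cosine delta (one minus the cosine of the z-score vectors). The function names, the `n_mfw` parameter, and the use of the sample standard deviation are assumptions made for this example; it is not the authors' pipeline.

```python
import math
from collections import Counter
from statistics import mean, stdev

def z_scores(texts, n_mfw=100):
    """texts: dict name -> token list (at least two texts).
    Returns dict name -> z-scores over the n_mfw most frequent corpus words."""
    corpus_counts = Counter()
    for tokens in texts.values():
        corpus_counts.update(tokens)
    vocab = [w for w, _ in corpus_counts.most_common(n_mfw)]
    freqs = {}
    for name, tokens in texts.items():
        counts = Counter(tokens)
        freqs[name] = [counts[w] / len(tokens) for w in vocab]
    columns = list(zip(*freqs.values()))
    mus = [mean(col) for col in columns]
    sigmas = [stdev(col) for col in columns]
    # Words with identical frequency in every text carry no information.
    keep = [i for i, s in enumerate(sigmas) if s > 0]
    return {name: [(f[i] - mus[i]) / sigmas[i] for i in keep]
            for name, f in freqs.items()}

def burrows_delta(za, zb):
    """Mean absolute difference of two z-score vectors."""
    return sum(abs(a - b) for a, b in zip(za, zb)) / len(za)

def cosine_delta(za, zb):
    """One minus the cosine similarity of two z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm_a = math.sqrt(sum(a * a for a in za))
    norm_b = math.sqrt(sum(b * b for b in zb))
    return 1.0 - dot / (norm_a * norm_b)
```

A pairwise matrix of either distance is what a hierarchical clustering routine (single, complete, or Ward linkage) would then take as input.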
  • 36-39. Community detection in networks
      › Louvain (modularity)
      › Infomap
      › OSLOM
      › …
      References:
      M.E.J. Newman, The Structure and Function of Complex Networks, SIAM Review 45 (2003) 167–256.
      Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
      J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
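
Louvain works by searching for a partition that maximises modularity. The sketch below only evaluates the modularity Q of a given partition of an undirected weighted network, which is the quantity being optimised; it does not perform the optimisation. The edge-list format and function name are assumptions for illustration, and self-loops are assumed absent.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition.

    edges: iterable of (u, v, weight) for an undirected graph, each edge once.
    community: dict mapping node -> community label.
    Uses Q = sum_c [ L_c / m - (d_c / (2 m))^2 ], where m is the total edge
    weight, L_c the weight inside community c, and d_c its total degree.
    """
    m = 0.0
    internal = defaultdict(float)
    degree = defaultdict(float)
    for u, v, w in edges:
        m += w
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single edge.
EDGES = [(0, 1, 1), (1, 2, 1), (0, 2, 1),
         (3, 4, 1), (4, 5, 1), (3, 5, 1),
         (2, 3, 1)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "all" for n in range(6)}

print(modularity(EDGES, two_groups))
print(modularity(EDGES, one_group))
```

In a stylometric setting the nodes would be texts and the edge weights would derive from their pairwise similarity.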
  • 40-43. Clustering quality measures
      Many different indices:
      › Jaccard, Dunn, silhouette, Davies-Bouldin, …
      › Rand index
        » adjusted
      › mutual information
        » normalised
        » adjusted
        » standardised
      › some selection bias remaining (number and size of clusters)
      Reference:
      Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080.
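
To make "adjusted" concrete, here is a minimal sketch of the adjusted Rand index, one of the chance-corrected measures listed above. The results in this talk are reported as AMI (adjusted mutual information); the adjusted Rand index is shown instead only because it is shorter to write out. The function name is chosen for this example, and at least two items are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    # Pairs of items that share a cluster in both labelings.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    # Pairs that share a cluster in each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:   # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

authors = ["Philodemus", "Philodemus", "Homer", "Homer"]
print(adjusted_rand_index(authors, [1, 1, 0, 0]))   # same grouping, other labels
print(adjusted_rand_index(authors, [0, 1, 0, 1]))   # grouping cuts across authors
```

A score of 1 means the clustering reproduces the reference grouping exactly, values near 0 are what random labelings give, and negative values indicate less agreement than expected by chance.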
  • 46-47. Results
      › Best network clustering
        » modularity optimisation: AMI=0.22 (very low)
        » number of clusters: 7
      › Which similarity measure
        » Burrows's delta: AMI<0.1 (terrible)
        » cosine delta: AMI=0.25 (very low, ~0.6 in novels)
        » number of clusters: 15-25 (close)
      Reference:
      Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral. 2004. Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004), 025101.
  • 51-52. Results
      Reference:
      Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
  • 53-55. Conclusions
      › Results
        o clustering depends on text regularisation
        o trade-off between sparseness and distinctiveness of features(?)
      › Problems
        o imbalanced data
        o texts too small
      › Outlook
        o N-grams + SVD to circumvent sparseness
        o augment texts preserved by medieval transmission
        o supervised ML to narrow down: genre/text type, dates, places, …
        o documentary papyri
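
The "N-grams" item in the outlook could start from something like the sketch below. It assumes character n-grams (the slide does not say whether character or word n-grams are meant) and only builds a relative-frequency profile for one text; the SVD step is not shown. Function names are invented for this example.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after collapsing whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def relative_profile(text, n=3):
    """Relative frequency of each character n-gram in the text."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    if total == 0:          # text shorter than n characters
        return {}
    return {gram: c / total for gram, c in counts.items()}
```

Because short fragments share many more character n-grams than whole words, such profiles tend to be less sparse than word-frequency vectors, which is the motivation the slide gives.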
  • 56-57. Thank you! Questions?
      J Rybicki
      Institute of English Studies, Jagiellonian University
      Grants: 2017/26/E/HS2/01019
      M Eder, J Byszuk, H Essler, S Pielström
      References:
      › J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
      › computationalstylistics.github.io
      › https://github.com/computationalstylistics/stylometry_of_papyri