Beatrice Alex
balex@inf.ed.ac.uk
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Digital history and big data:
Text mining historical documents on trade in
the British Empire
Overview
What is text mining?
Text Mining in digital history
Trading Consequences
“Big data”
Visualisation
Challenge of noisy data
Collaborating with historians
Text Mining
Describes a set of linguistic, statistical and
machine learning techniques that model and
structure the information content of textual
resources.
Turns unstructured text into structured data (e.g.
relational database or linked data).
Is very useful for analysing large text collections
automatically.
Text Mining
TM methods often rely on a set of linguistic
pre-processing steps such as tokenisation, sentence
detection, part-of-speech tagging, lemmatisation
and syntactic parsing (chunking).
Currently our focus is on named entity
recognition, entity grounding and relation
extraction.
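As a rough illustration of the first two pre-processing steps, the sketch below approximates sentence detection and tokenisation with regular expressions. The function names and patterns are assumptions made for this example, not the project's actual pipeline, which is not shown in these slides.

```python
import re

# Hypothetical helpers: regex approximations of two pre-processing steps.
# Sentence boundary: whitespace after ., ! or ? and before a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# Token: a number (with thousands separators), a word, or one punctuation mark.
TOKEN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at the boundaries described above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenise(sentence):
    """Return numbers, words and punctuation marks in order of appearance."""
    return TOKEN.findall(sentence)


text = (
    "Text mining turns unstructured text into structured data. "
    "It is useful for large collections."
)
for sentence in split_sentences(text):
    print(tokenise(sentence))
```

Rule-based splitting like this breaks on abbreviations such as "Dr." and on noisy OCR output, which is one reason trained components are normally used for these steps.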
TM in Digital History
Goal: By analysing large amounts of digitised
data, help historians to discover novel patterns
and explore hypotheses.
Methods: linguistic text analysis, named entity
recognition, geo-grounding and relation extraction
to transform the text into structured data.
A sea change from the methods used in ‘traditional’
history.
“Traditional” Historical
Research
Cinchona plantations in George King’s A Manual of
Cinchona Cultivation in India (1880).
Global Fats Supply 1894-98
Trading Consequences
Digging into Data II project (till Dec. 2013)
Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex,
Dr. Claire Grover, Clare Llewellyn, Richard Tobin,
James Reid, Nicola Osborne, Ian Fieldhouse
Trading Consequences
What does archival text say about the economic
and environmental consequences of global
commodity trading during the nineteenth century?
Scope: global, but with focus on Canadian natural
resources.
Example questions:
‣ What were the routes and volumes of international trade in
resource commodities in the nineteenth century?
‣ What were the local environmental consequences of this
demand for these resources?
Document Collections
Big data for historians:
Mined Information
Example sentence:
Extracted entities:
commodity: cassia bark
date: 1871
location: Padang
location: America
quantity + unit: 6,127 piculs
Normalised and grounded entities:
commodity: cassia bark
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
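The normalisation and grounding step above can be sketched as follows. This is a minimal illustration only: the gazetteer is a two-entry dictionary built from the coordinates shown on this slide, and the function names are hypothetical. A real grounding component would query a full gazetteer and choose between candidate places.

```python
import re

# Toy gazetteer holding only the two place names and coordinates shown above.
# "country" is None where the slide gives n/a.
GAZETTEER = {
    "Padang": {"lat": -0.94924, "long": 100.35427, "country": "ID"},
    "America": {"lat": 39.76, "long": -98.50, "country": None},
}


def normalise_date(mention):
    """Pull a four-digit year out of a date mention, if there is one."""
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", mention)
    return {"year": int(match.group(1))} if match else None


def normalise_quantity(mention):
    """Split a mention such as '6,127 piculs' into a number and a unit."""
    match = re.fullmatch(r"\s*([\d,]+(?:\.\d+)?)\s+(\S+)\s*", mention)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    return {"value": value, "unit": match.group(2)}


def ground_location(mention):
    """Look the place name up in the gazetteer; None means ungrounded."""
    return GAZETTEER.get(mention)


print(normalise_date("1871"))
print(normalise_quantity("6,127 piculs"))
print(ground_location("Padang"))
```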
Extracted entity attributes and relations:
origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
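To show how relations like these become queryable structured data, the sketch below loads them into an in-memory relational table. The table layout and column names are assumptions made for this example, and the origin and destination attributes are folded into the relation type for brevity. The project's actual schema is not shown in these slides.

```python
import sqlite3

# Relations taken from the slide; table and column names are hypothetical.
RELATIONS = [
    ("cassia bark", "date", "1871"),
    ("cassia bark", "origin location", "Padang"),
    ("cassia bark", "destination location", "America"),
]


def build_database(relations):
    """Load (commodity, relation type, value) triples into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relation (commodity TEXT, rel_type TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", relations)
    conn.commit()
    return conn


def locations_for(conn, commodity):
    """Return every location linked to a commodity, with its role."""
    rows = conn.execute(
        "SELECT rel_type, value FROM relation "
        "WHERE commodity = ? AND rel_type LIKE '%location' "
        "ORDER BY rel_type",
        (commodity,),
    )
    return rows.fetchall()


conn = build_database(RELATIONS)
print(locations_for(conn, "cassia bark"))
```

Once many documents are processed this way, the same query pattern can list all places linked to a commodity across a whole collection.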
Commodity Ontology
Improved Search &
Visualisations
Noisy Data
Optical character recognition (OCR) output contains many errors,
and the structure of the page layout is often lost.
Sophistication of the OCR engine and scanning equipment.
Quality of the original print and paper.
Use of historical language.
Information in page margins (header, page numbers, etc.).
Information in tables.
Language of the text.
Fixing Noisy Data
Text normalisation and correction:
End-of-line soft hyphen removal
Remove all token-splitting hyphens using a dictionary-based
approach.
“False f”-to-s conversion
Convert all false f characters to s using a corpus.
Example: the number of words unrecognised by the
spell checker was reduced from 61 to 21 -> 67%; on average a 12%
reduction in word error rate in a random sample (Alex et
al., 2012).
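The two corrections above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of Alex et al. (2012): the word list is a tiny stand-in for a real dictionary, the false-f step uses that same word list where the slide mentions a corpus, and both function names are hypothetical.

```python
import re

# Tiny stand-in word list; a real run would use a large dictionary or corpus.
LEXICON = {"consequences", "trading", "same", "ship", "first", "fish", "of"}


def dehyphenate(text, lexicon=LEXICON):
    """Rejoin words split across a line break by an end-of-line hyphen."""

    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in lexicon:
            return joined  # soft hyphen: the joined form is a known word
        # Otherwise keep the hyphen but still undo the line break.
        return left + "-" + right

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)


def fix_false_f(token, lexicon=LEXICON):
    """Replace 'f' with 's' only if the token is unknown and a variant is known."""
    if token.lower() in lexicon or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of f positions (acceptable for short tokens).
    for mask in range(1, 2 ** len(positions)):
        chars = list(token)
        for bit, pos in enumerate(positions):
            if (mask >> bit) & 1:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            return candidate
    return token


print(dehyphenate("trading conse-\nquences"))
print([fix_false_f(t) for t in ["fame", "fhip", "firft", "fish", "of"]])
```

Checking the original token against the word list first keeps genuine f-words such as "fish" and "of" unchanged.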
How Noisy Is Too Noisy?
Extract from document 10.2307/60238580 in FCOC:
qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT'papua}X3
sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj
ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx'papnaoSB q}Bq naABSjj
qS;H °1 ssbui s.uauuaqsu aqx
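One simple way to put a number on "too noisy" is the share of tokens that a dictionary does not recognise. The sketch below is an assumption made for illustration, not the project's quality measure. The word list is deliberately tiny, and any cut-off value would need calibrating against manually checked samples.

```python
import re

# Small illustrative word list; a real check would use a full dictionary.
KNOWN_WORDS = {
    "the", "structure", "of", "page", "layout", "is", "lost", "and", "in", "to",
}


def unknown_word_ratio(text, known_words=KNOWN_WORDS):
    """Share of alphabetic tokens not found in the word list (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 1.0  # nothing readable counts as fully noisy
    unknown = sum(1 for token in tokens if token not in known_words)
    return unknown / len(tokens)


clean = "the structure of the page layout is lost"
# First line of the extract shown above.
noisy = "qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT'papua}X3"

print(unknown_word_ratio(clean))
print(unknown_word_ratio(noisy))
```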
The Users (Historians)
Involvement of historians:
Everything is based on the use cases and built on users’
hypotheses/research questions.
They are responsible for identification of relevant collections
and are involved in the ontology development.
They provide feedback for us to improve the technology iteratively:
partners at York use the prototype for their research and
track errors; a workshop at CHESS 2013 with a group of
independent historians.
Clarity on the text mining accuracy is IMPORTANT.
Summary
Text mining historic documents in Trading
Consequences.
Processing “big data”.
Power of visualising structured data.
Fixing noisy data.
Importance of two-way collaboration between
technology experts and users in digital history.
Thank you
Questions? Fire away or contact me at:
balex@inf.ed.ac.uk
