Computational History and the Transformation of Public Discourse in Finland, 1640–1910 (COMHIS)
1. Consortium partners:
• National Library of Finland, Centre for Preservation and Digitisation
• University of Helsinki, Faculty of Humanities
• University of Turku, Dept of Information Technology
• University of Turku, Dept of Cultural History
More info at http://goo.gl/tMH4RE
2. Researchers:
• National Library of Finland, Centre for Preservation and Digitisation
Kimmo Kettunen (PI), Mika Koistinen, Teemu Ruokolainen
• University of Helsinki, Faculty of Humanities
Mikko Tolonen (PI), Leo Lahti, Jani Marjanen, Hege Roivainen,
Ville Vaara
• University of Turku, Dept of Information Technology
Tapio Salakoski (PI), Filip Ginter, Aleksi Vesanto
• University of Turku, Dept of Cultural History
Hannu Salmi (Consortium PI), Asko Nivala, Heli Rantala, Reetta
Sippola
4. COMHIS Overview
Work package 1: Publishing Trends and the Development of Public Discourse
WP 1.1 Large-scale Analysis of Library Catalogue Metadata Collections
WP 1.2 Intellectual Geography and Transcending of National Borders
Work package 2: Viral Texts and Social Networks of Finnish Public Discourse in
Newspapers and Journals 1771–1910
WP 2.1 Improving the Quality of Newspaper Digital Archives
WP 2.2 Virality of Newspaper and Journal Discourse in Nineteenth-Century Finland:
Cultural Rhizomes and Social Networks
Work package 3: Data Analytical Ecosystem for Newspapers and Historical
Document Collections
WP 3.1 Quantitative Tools for Bibliographic Library Catalogue Metadata Collections
and Finnish Book Production (1488–1910)
WP 3.2 Machine learning methods for text mining
WP 3.3 Text Reuse and Paraphrasing in Finnish Newspapers and Journals, 1771–1910
WP 3.4 Open Source Statistical Workflows
5. National Library of Finland
The NLF has a large digitized newspaper and journal
collection covering 1771–1920 (and later)
• http://digi.kansalliskirjasto.fi
Newspapers
• Digitized: 4,501,147 pages
• Free use (up to 1920): 2,954,424 pages (65%)
• Material under copyright (1921 onward): 1,546,723 pages (35%)
Journals
• Digitized: 6,378,717 pages
• Free use (up to 1920): 2,161,748 pages (33%)
• Material under copyright (1921 onward): 4,216,969 pages (67%)
12. Text reuse
• How much did newspapers and journals share each other's content?
• We have found 8 million clusters of repeated text in the corpus of
Finnish newspapers and journals, 1771–1910, comprising a total of
49 million occurrences (hits)
• Different forms of text reuse: advertisements, notices, news,
anecdotes, poems, etc.
• Long-term reuse
• Viral chains, explosive replication
14. Finding text reuse
• We use a program called NCBI BLAST
• Used to compare and align biological sequences, like protein sequences
• Finds all similar sub-sequence pairs
• Our data is just text, not protein sequences
• We had to encode our data as protein sequences
• 23 amino acids available
• We formed a mapping from the 23 most common letters to the available amino acids
• We encoded the data using this mapping and discarded characters that had no
equivalent
• “This is an example sentence” → “DSCHCHBEGBNQFGHGEDGEG” (spaces are dropped, as are x and c,
which have no equivalent)
• BLAST outputs all similar sub-sequences from our data
• We formed clusters by assigning all sub-sequences that overlap sufficiently to the same cluster
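The encoding and clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the 23-letter target alphabet, the frequency-based mapping, the `(document, start, end)` span format, the overlap measure and the `min_overlap` threshold are all assumptions made for the example, and the pairwise overlap check is quadratic, so it only suits toy inputs. BLAST itself is not invoked here; `hits` stands in for the pairs of similar sub-sequences it would report.

```python
from collections import Counter

# Assumption: 20 standard amino-acid codes plus B, Z and X. The slides do not
# say which 23 codes were actually used.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYBZX"


def build_mapping(corpus, alphabet=AMINO_ACIDS):
    """Map the most frequent letters of `corpus` to the target alphabet."""
    counts = Counter(ch for ch in corpus.lower() if ch.isalpha())
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return dict(zip(ranked[:len(alphabet)], alphabet))


def encode(text, mapping):
    """Encode text, dropping every character that has no equivalent."""
    return "".join(mapping[ch] for ch in text.lower() if ch in mapping)


def overlap_ratio(a, b):
    """Overlap of two (doc, start, end) spans, relative to the shorter span."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    shorter = min(a[2] - a[1], b[2] - b[1])
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def cluster_hits(hits, min_overlap=0.8):
    """Group spans into clusters using a simple union-find structure.

    `hits` is a list of (span_a, span_b) pairs reported as similar.
    """
    spans = sorted({span for pair in hits for span in pair})
    parent = {s: s for s in spans}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    def union(a, b):
        parent[find(a)] = find(b)

    # Two spans aligned with each other belong to the same cluster.
    for a, b in hits:
        union(a, b)
    # Spans in the same document that overlap enough are treated as one passage.
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            if overlap_ratio(a, b) >= min_overlap:
                union(a, b)

    clusters = {}
    for s in spans:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())
```

In use, `build_mapping` would be run once over the whole corpus, `encode` applied to each page before it is handed to BLAST, and `cluster_hits` applied to the reported pairs afterwards.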
16. Publications
• Kimmo Kettunen, Tuula Pääkkönen: “Measuring Lexical Quality of a Historical Finnish Newspaper
Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means”, LREC
2016.
• Kimmo Kettunen, Eetu Mäkelä, Juha Kuokkala, Teemu Ruokolainen, Jyrki Niemi: “Modern Tools
for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection
1771-1910”, LWDA 2016: 124-135.
• Tuula Pääkkönen, Jukka Kervinen, Kimmo Kettunen, Asko Nivala, Eetu Mäkelä: “Exporting Finnish
Digitized Historical Newspaper Contents for Offline Use”, D-Lib Magazine 22(7/8) (2016).
• Mikko Tolonen, Jani Marjanen, Niko Ilomäki, Hege Roivainen and Leo Lahti, “Printing in a
Periphery: a Quantitative Study of Finnish Knowledge Production, 1640-1828”, Proceedings of
Digital Humanities 2016, long papers, Kraków, Poland, July, 2016
• Mikko Tolonen, Leo Lahti and Niko Ilomäki, “A Quantitative Analysis of History in the ESTC
catalogue”, Liber Quarterly, 25(2), pp. 87–116, 2016. DOI: http://doi.org/10.18352/lq.10112
• Aleksi Vesanto, Asko Nivala, Tapio Salakoski, Hannu Salmi and Filip Ginter: “A System for
Identifying and Exploring Text Repetition in Large Historical Document Corpora”, In Proceedings
of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May
2017 (Linköping 2017), 330–333, http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf
• Aleksi Vesanto, Asko Nivala, Heli Rantala, Tapio Salakoski, Hannu Salmi and Filip Ginter: “Applying
BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910”, Proceedings of
the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017
(Linköping 2017), 54–58, http://www.ep.liu.se/ecp/133/010/ecp17133010.pdf