Today I will be arguing that it may hurt us in the long run to cordon off open access from proprietary digital resources. Next Wednesday I'll be attending a meeting with owners of eighteenth-century data, most notably Gale Cengage and the ESTC, one proprietary, one open access, and I need your advice and commentary on this presentation. I'll be presenting to them what I present to you today, so your feedback will help us tremendously in negotiations. The goal?
We are allegedly in the middle of a data deluge, but my question is, deluge OF WHAT? Any text originally produced before about 1830, when modern printing practices firmly took hold, isn't exactly data. We may have great metadata, we may have great page images, but if we don't have clean plain text, this part of our cultural record will eventually be lost. REALLY? The image of me here on YouTube comes from a recent interview at the Digital Humanities Conference, and I'm sure 99% of the participants at that conference would think I was crazy to say "we don't have any data." But really, it will be the crunchable data, the texts that can be mined, that will be saved. I imagine myself a voice crying in the wilderness on this one.
Dino has just described for us NINES, the Networked Infrastructure for Nineteenth-Century Electronic Scholarship, and I will be discussing 18thConnect, a similar scholarly community and research environment for the eighteenth century that has been spawned by NINES. Like NINES, 18thConnect will serve to guide and sustain the remastering of our cultural heritage in digital media and to insist that access to it is an intellectual right.
18thConnect will perform the same functions performed by NINES: it will provide peer review of digital projects created by scholars. The Blake and Whitman archives were the first projects to be peer-reviewed by NINES.
18thConnect has a similarly stellar project with which to open its doors, the Old Bailey Online, along with its forthcoming sister project, Plebeian Lives.
NINES puts a sign of peer review on every member site, at the same time providing immediate access to the NINES online finding aid from any member site.
NINES is a federated publisher, aggregating electronic scholarship while leaving it all in the hands of its developers.
In order to provide a comprehensive research environment for scholars, and in order to make these digital projects interoperable within the universe of scholarship and primary materials, NINES also takes in proprietary databases, some of which you can see listed here on the right.
NINES and 18thConnect are aggregators of data, not possessors of it. JSTOR, Project MUSE, ProQuest, and Alexander Street Press give us metadata. Only users whose libraries subscribe to these proprietary resources get access to them.
The 18thConnect/NINES interface sends the user directly to the live records, updated and controlled by the proprietor, via a permanent identification number associated with all metadata and plain-text files.
We ask those who participate in NINES to give us plain text files of all their digital resources. The NINES SOLR indexer crawls those files, tokenizes the words, and then, when the word is searched, returns a portion of its context, as you can see here. But no resource is reconstructable from that word index, so proprietary data is safe.
The index contains every use of the word "tiger" (300 of them, say) as tokens all in a row (this is how I'm picturing it), each linked to its own snippet. You'd have to sort the whole index of 400,000 items to recreate the texts in it.
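To make the point concrete, here is a minimal sketch of how such a snippet-returning word index might work. This is illustrative only, not NINES's actual SOLR implementation; all names and the snippet size are my own assumptions.

```python
# Sketch of a word index that returns short context snippets, in the
# spirit of the SOLR indexing described above. Hypothetical names and
# structure; not the actual NINES indexer.
import re
from collections import defaultdict

def tokenize(word):
    # Lowercase and strip punctuation, keeping only letter runs.
    return re.findall(r"[a-z]+", word.lower())

def build_index(docs, context=3):
    """Map each token to (doc_id, snippet) pairs. A snippet is a few
    words of surrounding context: enough for a search result, but not
    enough to reconstruct the source text from any one word's entries."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.split()
        for i, w in enumerate(words):
            for tok in tokenize(w):
                snippet = " ".join(words[max(0, i - context):i + context + 1])
                index[tok].append((doc_id, snippet))
    return index

docs = {"clarissa": "her curiosity was raised by the letter she received"}
idx = build_index(docs)
print(idx["curiosity"])  # → [('clarissa', 'her curiosity was raised by')]
```

Because each token maps only to scattered snippets, recovering a full text would require re-sorting the entire index, which is why proprietary data can safely sit behind such a search.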
But for certain resources that scholars would really like to use, getting plain text that has been keyed is impossible, and so a big debate among members of the Executive Council is whether to take in OCR output, often called "dirty" OCR.
Plain text files for eighteenth-century texts are very problematic.
In fact, the OCR running behind these images is bad enough to have spawned a massive effort to actually transcribe, that is, type, these texts. The ECCO Text Creation Partnership has so far re-keyed 2,418 texts. Unfortunately, ECCO has not yet incorporated these 2,418 typed texts back into its database of OCR, so the corrections do not yet benefit all scholars, only those at participating institutions.
As can be seen in a similar agreement between ProQuest and the Text Creation Partnership, the typing and encoding work done by the participating university libraries is not distributable, except to other institutions that are similarly typing and encoding, until a period of exclusivity expires.
That period is five years, but in the case of ECCO it is not very clear when the clock starts, and libraries may be unable to openly distribute the expensively typed and encoded texts for up to ten years. They also can never distribute ECCO's page images, or any images that they don't own.
The information flow goes roughly like this: there are now over 400,000 entries in the Eighteenth Century Short Title Catalogue (the ESTC, since expanded into the "English Short Title Catalogue"); about 200,000 of those texts were microfilmed, in roughly 11,000 rolls of film; and 138,000 texts have been digitized by Gale from that set of films and put online in ECCO, with no immediate plans to keep going: digitizing from the microfilms has stopped.
One gets only a few more hits on the word "curiosity" in Clarissa when searching with the medial long "s" (which reads as an "f") rather than the modern "s", and a few more hits per page.
This is nonetheless a great improvement over Google's OCR, which will tell you that the 1784 edition of Richardson's _Clarissa_ never uses the word "Curiosity," and, when you look at a page where you know the word is used, shows you something quite different: "Cmiofity."
Things look better in Google Books when you search an 1820 version of the text, but this is still not optimal. Anna Barbauld's 1820 edition of the novel, having lost the long "s", does much better, giving us 7 returns; but again, I'm looking at one page containing three instances. The problems specific to 18th-century type did not disappear until the 1830s, with the advent of what print historians call "modern type," when the punch began to be situated in its matrix according to mathematical principles.
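The long-s problem just described can be partly worked around at search time rather than at OCR time. Here is a small, hypothetical sketch of the idea: build a search pattern in which every modern "s" also matches "f", since OCR commonly renders the long "s" (ſ) that way. The function name and sample line are my own; this is a sketch of the general technique, not any tool's actual behavior, and it will not rescue deeper garbling like "Cmiofity."

```python
# Illustrative sketch: matching a modern search term against OCR text
# in which the eighteenth-century long 's' has been misread as 'f'.
# Hypothetical helper; not an actual 18thConnect or Gale feature.
import re

def long_s_pattern(term):
    """Build a regex in which every 's' also matches 'f', because OCR
    often renders the long 's' (ſ) as 'f'."""
    body = "".join("[sf]" if c == "s" else re.escape(c)
                   for c in term.lower())
    return re.compile(body, re.IGNORECASE)

ocr_line = "her curiofity was raifed by the letter"

print(bool(long_s_pattern("curiosity").search(ocr_line)))  # → True
print(bool(re.search("curiosity", ocr_line)))              # → False
```

A plain search for "curiosity" misses the OCR'd line entirely, while the loosened pattern finds it, which is roughly why searching with the medial "f" turns up extra hits in the Gale interface.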
18thConnect has just been awarded a grant from NCSA (the National Center for Supercomputing Applications) and I-CHASS (the Institute for Computing in Humanities, Arts, and Social Sciences)
for supercomputer time which we will use to OCR 138,000 texts provided by Gale Cengage from the ECCO catalogue. We will meet with Gale on July 15, 2009, to discuss how this will work.
18thConnect is just at the beginning of a process that the Old Bailey has successfully completed. The Old Bailey Online project double-keyed everything before 1834, and only on proceedings published after that date did they work by correcting OCR.
Part-of-speech tagging would enhance the capacity, but so would an improved OCR program that can handle the movement of letters up and down a page. Gamera is of course designed to read musical notation, which moves up and down a page, and it is for this reason that we are developing it rather than OCRopus, the child of Google's release of Tesseract to the open-source development world, although we will try to incorporate some of the methods used in creating OCRopus.
Now I will depict for you the future of 18thConnect, based on the reality of NINES. As Dino has shown, you log into NINES to get to your own “My NINES” page.
Here is a mock-up of a future “My 18thConnect Page,” and every member of ASECS, BSECS, and ISECS – national and international societies for eighteenth-century studies – will be given one of these pages.
I'm relying on some work by Brad Pasanek to imagine this future, so I wanted to cite his database, The Mind is a Metaphor.
Here are his results, but pretend they came from a search in 18thConnect. Clicking on the title of the first return would send one to the document in Gale, unless one's library did not subscribe. Clicking on the second return would take us to the ESTC record.
Snippets of text would be returned, and scholars whose libraries don't subscribe would be able to get the names of holding libraries from the ESTC should they need to see the text. Let me now conclude.
There are many things to praise and to critique about ECCO, the most thorough analysis having occurred recently at an MLA panel, the talks of which are available on YouTube.
I myself have attacked the ECCO catalogue in videos about Open Access resources.
And we want to do all we can to encourage projects such as Benjamin Pauley's attempt to connect editions in Google Books to ESTC numbers.
But in the future we will only know the eighteenth century through the texts that can be culled from the avalanche of data, and only machine-readable, crunchable data will come to the top. We need to work with commercial providers in order to make the best data possible.
We will re-OCR and then automatically tag as much of Gale’s ECCO data as possible, returning it to them better than we found it so that they can create their own tools.
We’ll use the tags that we generate to populate our own finding aids, giving scholars the opportunity to know and find what’s there, all the while helping commercial firms add value to their corpuses. Only this, I believe, will preserve for us our precious cultural heritage.