3. Collaboration
Constantino Malagón
Associate professor of Computer Engineering
Universidad Nebrija, Spain
Justo Hidalgo
Vice-President, Denodo Technologies
Co-Founder of 24symbols
Yonsoo Kim
Assistant Professor of Spanish
School of Languages & Cultures, Purdue University
4. Collaboration
Javier Polanco - Developer
Undergraduate student at Nebrija University
Now, Computer Engineer
Carlos Martínez – Web Designer
Undergraduate student at Nebrija University
Eric Herrera – Website Testing
Undergraduate student at Purdue University
6. Introduction
Our first idea: Help researchers in
Humanities
1. Medieval documents (MMEDIS.com)
First: Transcription
Then: Search, Access, Context
2. Finally, a web portal (HIT)
7. Medieval
Document
MSS 120
Author: Gilbertus Anglicus
8.
9. Medieval documents
Automatic transcription
Abbreviations in medieval medical
documents
International Conference of Frontiers in
Handwriting Recognition (ICFHR2012)
Main peer-reviewed conference
10. Medieval documents
Search and access
Hispanic Seminary:
El Corpus de Textos Médicos Españoles:
http://www.hispanicseminary.org/t&c/med/index.htm
Keyword: “medicina”
12. HIT Web portal
Visualization tool
Expand the document type to any published, digitized
format, not just medieval texts
Expand the type and number of sources, databases and
repositories
Expand the contextual information
21. Access
API (Application Programming Interface)
– These sources provide a set of rules and
programmatic “doors” that let us interact with them
– Example: Amazon, Google Books
– Amazon, give me all info you have about the book
with ISBN=“XXXX”
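As a rough illustration of the API route, the sketch below builds a lookup by ISBN and pulls a few fields out of the JSON answer. It assumes the public Google Books volumes endpoint and its `items` / `volumeInfo` response layout; the function names and the sample record are hypothetical, and this is not the code used in HIT.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public Google Books "volumes" search.
BASE_URL = "https://www.googleapis.com/books/v1/volumes"


def build_isbn_query(isbn: str) -> str:
    """Return the request URL asking for the book with this ISBN."""
    return BASE_URL + "?" + urlencode({"q": "isbn:" + isbn})


def extract_book_info(response: dict) -> list:
    """Keep only title and authors from each item in the response."""
    books = []
    for item in response.get("items", []):
        info = item.get("volumeInfo", {})
        books.append({
            "title": info.get("title", ""),
            "authors": info.get("authors", []),
        })
    return books


def fetch_book(isbn: str) -> list:
    """Call the API over the network (not exercised in the sample below)."""
    with urlopen(build_isbn_query(isbn)) as reply:
        return extract_book_info(json.load(reply))


# Hand-written placeholder shaped like the assumed response, so no network is needed.
sample = {"items": [{"volumeInfo": {"title": "Sample title",
                                    "authors": ["Sample Author"]}}]}
print(build_isbn_query("XXXX"))
print(extract_book_info(sample))
```

Keeping the URL building and the response parsing in separate functions means each source can be swapped or tested without touching the network call.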
22. Access
Screen Scraping
– We need to “scrape” the web page and create
structure out of it
– Example: Wikipedia
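Screen scraping can be sketched as follows, using only Python's standard `html.parser` module. The page fragment, the class name `TableRowScraper`, and the helper `scrape_fields` are invented for illustration; a real page would need rules written for its own markup.

```python
from html.parser import HTMLParser


class TableRowScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # cells of the row being read, or None
        self._cell = None     # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Join the text pieces and collapse whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_fields(html: str) -> dict:
    """Turn two-column rows (label, value) into a dictionary."""
    scraper = TableRowScraper()
    scraper.feed(html)
    return {row[0]: row[1] for row in scraper.rows if len(row) == 2}


# Invented page fragment standing in for a downloaded article.
page = """
<table class="infobox">
  <tr><th>Name</th><td>Gilbertus <b>Anglicus</b></td></tr>
  <tr><th>Occupation</th><td>Physician</td></tr>
  <tr><td>A row with a single cell is skipped</td></tr>
</table>
"""
print(scrape_fields(page))
```

Unlike the API case, nothing here is guaranteed by the source: if the page layout changes, the extraction rules have to change with it.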
35. Conclusions
The proof of concept has shown:
- How to access heterogeneous, web-based
data sources
- How to integrate those data pieces in a single
data model
36. Conclusions
The proof of concept has shown:
- How to execute search methods among those
sources
- How to visualize this info in a meaningful,
useful way
37. Future work
To expand the list of repositories
Portal personalization
Adapt HIT to all kinds of devices
Addition of semantic capabilities
HIT is an innovative management system that allows one to administer texts, analyze information, and collate images or other sources in a comprehensive web portal, especially custom made for CLA faculties and students. With HIT, one can access and search information on the Internet held on different public repositories in the field of arts, humanities and social sciences (such as Anthropology, Communication, English, Spanish, History, Philosophy, Political Science, Sociology, Visual and Performing Arts, etc.) in a unified way.
This project was developed in collaboration with Constantino Malagón Luque, associate professor of Artificial Intelligence in the Department of Computer Science at Nebrija University (Madrid, Spain); Justo Hidalgo, Vice President, Product Management and Consulting, at Denodo Technologies and co-founder of the company 24symbols; and myself, Yonsoo Kim, professor of Spanish. What does a professor of Spanish have to do with technology and science?
Constantino and I co-founded a research team called MMEDIS (Medieval Medicine Documents Identification System), in which researchers from different disciplines work toward an automatic transcription program based on Artificial Intelligence. We have two essential reasons for carrying out the MMEDIS project. First, we aim to analyze how medicine shaped and affected the lives of medieval people. This interest stems from my research on Teresa de Cartagena, a converted Jewish nun who became deaf and wrote religious treatises. I intend to investigate the physical disabilities and diseases that inflicted pain on people in medieval Europe. Second, we plan to study and transcribe handwritten documents more efficiently than traditional paleographic transcription of manuscripts allows.
Originally composed in Latin by Gilbertus Anglicus (Gilbert the Englishman), the Compendium of Medicine was a primary text of the medical revolution in thirteenth-century Europe. Composed mainly of medicinal recipes, it offered advice on diagnosis, medicinal preparation, and prognosis. In the fifteenth century it was translated into Middle English to accommodate a widening audience for learning and medical "secrets." Faye Marie Getz, for example, provides a critical edition of the Middle English text, with an extensive introduction to the learned, practical, and social components of medieval medicine and a summary of the text in modern English, in her book Healing and Society in Medieval England: A Middle English Translation of the Pharmaceutical Writings of Gilbertus Anglicus. With this type of manuscript, once it is transcribed, people do not go back to the original because of all the intensive work that the manuscript required.
Transcription is tedious work, and only specialists in paleography can read the manuscript. But the problem does not end here…
We also need to decode all the abbreviations in medieval medical documents. We have submitted an article based on our study of the handwriting recognition process.
It will take some time before our transcription project works efficiently. However, we realized that some websites on the internet already offer manually transcribed manuscripts. For example, the Hispanic Seminary has published 55 texts online: Spanish Medical Texts [55 texts / 2,642,403 tokens], prepared by Francisco Gago Jover, Mª Teresa Herrera, and Mª Estela González de Fauve. All these texts are available, but you cannot access them if you do not know where to find them on the internet.
The significance and originality of the HIT project is to exemplify that knowledge should be presented beyond two-dimensional spaces such as paper (encyclopedia) or as keyword search websites (Wikipedia, Google, Yahoo, etc.). Knowledge has to be obtainable in infinitely explorative and proliferating ways in the mashup, reaching its maximum complexity. The true potential of this project is almost limitless because its integrated knowledge system can be used for research or self-learning in any field. Instead of a mere input-output model, any search and reading will lead to contextualized and integrated learning.
The HIT project will contribute to innovation in the humanities in three key ways: (a) in the user interface, by producing a means by which users are able to interact with this integrated knowledge, as one can see below; (b) in allowing the integration of Purdue library databases (ComDisDome, Historical Abstracts with Full Text, ITER, JSTOR, MUSE, Patrologia Latina Database, etc.); (c) in the integration of valuable humanities content located on various external sources or repositories, to produce original and valuable knowledge. As a consequence, with the integrated knowledge management system, the text itself is presented with its context, that is, the humanistic knowledge that makes up the learning environment. Reading will consist of unfolding multiple and unique textual units onto the screen, units that will be created in accordance with each reader’s focus or interest.
HIT can address the two major problems of contemporary digital humanities: the overload of useless information and the lack of textual context. The world of electronic communication is a world of textual overabundance in which the written texts on offer go far beyond the reader’s ability to take advantage of them. Researchers have often denounced the uselessness of the information overload on the web. Ideally, then, one should know where, why, and how to gather the most accurate and reliable texts on the internet. This is precisely what HIT will do by organizing and synthesizing data and texts, all of them available in one single search. What we are going to demonstrate today is only a PROOF OF CONCEPT. However, our initial project was based on these concepts. We researched and selected information available on the internet and filtered it down to what was needed and trustworthy. We surveyed professors from different fields to find out which websites they consider most reliable. HIT analyzes not only external repositories but also internal ones, such as the Purdue Library’s databases and catalogs.
The other problem facing current digital humanities is that texts, content, or information are usually provided without taking their context into account. Reading in front of the computer screen is generally a discontinuous reading process that seeks, using keywords or thematic headings, the fragment that the reader wishes to find: an article in an electronic periodical, a passage in a book, or some information on a website. This is done without necessarily knowing the identity or coherence of the entire text from which the fragment was extracted. In a certain sense, one might say that in the digital world all textual entities are like databases that offer fragments, the reading of which in no way implies a perception of the work or the body of works from which they came. This explains the confusion of the contemporary reader.
For example, when we search for a keyword, the HIT platform will at the same time make available the original source from which the fragment was extracted, along with, for example, a location map, images, notes, and references. This is my idea: to integrate all this information and make it flexible for everyone.
Constantino Malagón, Professor of Computer Engineering, Universidad Antonio de Nebrija, Spain; Justo Hidalgo, Vice-President, Product Management and Consulting at Denodo Technologies, and Co-Founder of 24symbols. Both have to work hard to fulfill my request.
The function and development of the HIT web portal
The HIT project will be constructed according to the architecture image shown below. I will explain its four layers, starting from the very bottom of the image.
Acquisition Layer: The different data sources that provide early modern age documents in digitized form, their transcriptions, plus any other useful internal or web-based external repositories, will be accessed by the Data Acquisition Layer, as shown at the bottom of the figure. One of the critical assets of this component is that the web data extraction module can extract web data in a structured manner, thereby converting the web into a “virtual database.”
Processing Layer: This platform makes it easier and more powerful to combine, mash up, and transform data from heterogeneous databases and sources. Specifically, the proposed architecture will be able to perform syntactic tasks (i.e., transformations and combinations based on the structure of the extracted content, such as unifying the names of authors according to whether we want a structure of the kind {surname, first_name} or {first_name surname}) and semantic tasks (i.e., transformations and combinations based on the meaning of the extracted content). From this layer on, Justo Hidalgo, from Denodo Technologies, will develop the software. The HIT interface will be built following the most relevant industry standards, such as JDBC, ODBC, SOAP/WSDL, and REST, for both data access and publishing.
Categorization Layer: The categorization module, on top of the data combination layer, sorts information previously stored or delivered in real time and assigns each piece of information to a set of categories.
Final View: Finally, a basic presentation layer is built to allow researchers to visualize the overall mashup and categorization results.
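The syntactic task mentioned for the Processing Layer, unifying author names that arrive as {surname, first_name} from one source and {first_name surname} from another, might look like the minimal sketch below. The record fields (`author`, `source`), the source labels, and both function names are assumptions made for illustration; the actual platform is not shown in these notes.

```python
def normalize_author(name: str) -> str:
    """Rewrite 'Surname, First' as 'First Surname'; leave other forms alone."""
    name = " ".join(name.split())          # collapse stray whitespace
    if "," in name:
        surname, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {surname}".strip()
    return name


def merge_records(records: list) -> dict:
    """Group records from different sources under one normalized author key."""
    merged = {}
    for record in records:
        author = normalize_author(record["author"])
        entry = merged.setdefault(author, {"author": author, "sources": []})
        entry["sources"].append(record["source"])
    return merged


# Two hypothetical records describing the same author in different formats.
records = [
    {"author": "Anglicus, Gilbertus", "source": "library_catalog"},
    {"author": "Gilbertus  Anglicus", "source": "web_page"},
]
print(merge_records(records))
```

Once both spellings map to the same key, the two records land in a single entry, which is the kind of combination the later layers can then categorize and display.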
The platform is built as a series of components, by following the best practices in software engineering, which simplify the development and integration of all the resources. This is shown in the following image.
In order to do that, we need:
- To extend the list of repositories. By repositories we mean two kinds of data sources:
  - Structured: for example, any database, which has tables, fields, records, and values. This includes any sources from the Purdue Library. These are called core sources.
  - Unstructured or semi-structured: these include web pages or plain text files, for example Wikipedia. These are called context sources because they provide contextual information based on the author, the document, or whatever we choose.
  We have the survey of databases frequently used by different faculty members at CLA. Our first step will be to develop the first rating categories, structured and unstructured, from the list (see attached file).
- To extend the list of functionalities:
  - To develop the application for mobile devices: Android and Apple iOS.
  - To adapt the web design to the Purdue standards.
  - The results screen should be more interactive (as in iGoogle, you should be able to move the different panels and show or hide some of them). To do that, we have to develop the system using the very latest web technologies, such as HTML5.
- The HIT mashup will be stored at Purdue University with a domain name like http://cla.purdue.edu/hit
- To secure the system. We need users to authenticate with their own Purdue account (user and password), using secure protocols such as HTTPS.
- To develop a results-caching module. This module will make our system faster.
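One possible shape for the planned results-caching module is sketched below: results are kept for a fixed time, so a repeated search does not query every repository again. The class name `ResultCache`, the five-minute lifetime, and the stand-in search function are all assumptions for illustration, not part of the existing system.

```python
import time


class ResultCache:
    """Keep search results for a limited time so repeated queries skip the sources."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock            # injectable so the expiry can be tested
        self._store = {}              # query -> (time stored, results)

    def get_or_fetch(self, query: str, fetch):
        """Return cached results for query, calling fetch(query) only on a miss."""
        key = query.strip().lower()   # treat 'Medicina' and 'medicina ' alike
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        results = fetch(query)
        self._store[key] = (now, results)
        return results


calls = []


def slow_search(query):
    """Stand-in for querying every repository; records each real call."""
    calls.append(query)
    return [f"result for {query}"]


cache = ResultCache(ttl_seconds=300)
cache.get_or_fetch("medicina", slow_search)
cache.get_or_fetch("Medicina ", slow_search)   # served from the cache
print(len(calls))
```

A time limit is used here because the external sources can change; how long results may safely be kept would depend on each repository.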
The HIT system is being developed jointly by some members of MMEDIS and the new HIT team members.