Session5 04.evangelos varthis

Implementation of a Databaseless Web REST API
for the Unstructured Texts of Migne’s Patrologia Graeca
with Searching Capabilities and Additional Semantic and Syntactic Expandability
• Evagelos Varthis, evagelosvar@gmail.com
• Department of Archives, Library Science and Museology, Ionian University, Corfu, Greece
• Marios Poulos, mpoulos@ionio.gr
• Department of Archives, Library Science and Museology, Ionian University, Corfu, Greece
• Ilias Yarenis, yarenis@ionio.gr
• Department of History, Ionian University, Corfu, Greece
• Sozon Papavlasopoulos, sozon@ionion.gr
• Department of Archives, Library Science and Museology, Ionian University, Corfu, Greece

Table of Contents
¨ Migne’s Patrologia Graeca Overview
¨ Comparison of RDBMS and NoSQL systems
¨ Description of our proposed system
¤ Transformation Process
¤ Overall REST API Architecture
¨ Extensions and future work

Patrologia Graeca: Overview
¨ Patrologia Graeca is a large collection of 161 volumes (bound as 166)
¨ Contains mostly works by the Eastern Christian Fathers
¨ The works were written over a period of 1400 years (1st century to 14th century)
¨ Contains a vast amount of information related to philosophy, psychology, theology, music, politics, everyday life, etc.

Patrologia Graeca: on the Web
¨ In general, there is a lack of systems for searching Patrologia Graeca or other large corpora on the Web.
¤ Patrologia Graeca has been converted to textual form (roughly 20%) by TLG, which offers limited searching through Musaios.
¤ The Perseus Library has textual data from Patrologia Graeca, but it is less comprehensive and offers no searching.

Patrologia Graeca: Searching on Large Corpora
The motivation for this paper is based on the following questions:
¨ How can one find the fragments of a large collection that contain a specific word or term, for further processing on the Web? Patrologia Graeca, for example, has nearly one million unique words. The common practice is to:
¤ 1) Download the collection, and then
¤ 2) Write a program, or use a ready-made one, to search the locally downloaded collection
¨ Who is interested in this kind of service and would follow the above procedure?
¤ Scholars interested in a specific corpus who also know programming and NLP (very few)
¤ NLP researchers who build treebanks, ontologies, etc.
¨ Who is interested in searching large corpora directly on the Web?
¤ In our view, scholars, ordinary users and researchers in various fields would benefit from such a service

RDBMS vs NoSQL
¨ What options exist for searching large corpora on the Web?
¤ Relational Database Management Systems (RDBMS)
  Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.
¤ NoSQL databases
  CouchDB, MongoDB, Dynamo, etc.

RDBMS vs NoSQL: Comparative Characteristics
RDBMS | NoSQL
People to input the textual data | People to input the textual data
Specific structure (data tables) | Schema-agnostic nature (key-value, graph, wide-column, or document models)
Mostly SQL language | Object-oriented programming for interacting
Difficulty in modifying the structure | Ease in modifying the structure
Not easily scalable (possible vertical scalability) | Easily scalable (horizontal scalability)
High traffic stability problems | Better response to high traffic
Increased complexity | Less complexity
License fees in some cases | License fees in some cases
Overall high cost to build and maintain | Less costly to build and maintain

Description of our System: Overview of our Web REST API System
¨ Automatically transforms the textual data into a structure similar to a NoSQL database, but completely static. Nobody has to intervene in the transformation or input the textual data by hand.
¨ The final system consists of a set of JSON files (which represent objects) served by a Web server or by S3 storage
Some benefits that follow from the above (some coincide with those of NoSQL systems):
¤ Simplicity
¤ Fast searching
¤ Scalability
¤ Extensibility
¤ Easy replication and mirroring of the system
¤ Independence from the user interface
¤ No need for specialized management
¤ In general, cost-free

The auto-transformation
[Figure: Unstructured works → Building fragment-files → Building word-files → Building stem-files → Web REST API]

The auto-transformation
The auto-transformation is achieved in 3 stages by a bash shell script, which is executed in parallel for better performance:
¨ Splitting each work into a set of JSON fragment-files. Each fragment-file contains:
¤ A paragraph, or
¤ A specific number of text lines, or
¤ A specific number of words, or
¤ …
¨ Creating the word-files, i.e. JSON files named after the word being searched. Each word-file contains the links to the JSON fragment-files that contain that word.
¨ Creating a set of two-letter stem-files from the set of word-files.
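The three stages above can be sketched roughly as follows. This is only an illustration in Python, not the authors' bash script: the function names, the paragraph-based splitting rule, the fragment id scheme and the `fragments/`, `words/`, `list/` directory layout are all assumptions made for the example.

```python
import json
import re
import unicodedata
from collections import defaultdict
from pathlib import Path


def split_into_fragments(text: str) -> list[str]:
    """Stage 1 (assumed rule): one fragment per blank-line-separated paragraph."""
    # NFC keeps accented Greek letters as single code points, so that \w+
    # below does not break words at combining accents.
    text = unicodedata.normalize("NFC", text)
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def build_index(fragments: dict[str, str]) -> dict[str, list[str]]:
    """Stage 2: map every word to the sorted ids of the fragments containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for frag_id, body in fragments.items():
        for word in re.findall(r"\w+", body):
            index[word].add(frag_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_stems(words) -> dict[str, list[str]]:
    """Stage 3: group the indexed words by their first two letters."""
    stems: dict[str, list[str]] = defaultdict(list)
    for word in sorted(words):
        stems[word[:2]].append(word)
    return dict(stems)


def write_api(works: dict[str, str], out_dir: Path) -> None:
    """Write all three kinds of JSON files under a hypothetical directory layout."""
    fragments = {}
    for work_id, text in works.items():
        for n, body in enumerate(split_into_fragments(text), start=1):
            fragments[f"{work_id}-{n:05d}"] = body  # hypothetical id scheme
    index = build_index(fragments)
    stems = build_stems(index)
    for folder, items in (("fragments", fragments), ("words", index), ("list", stems)):
        target = out_dir / folder
        target.mkdir(parents=True, exist_ok=True)
        for name, payload in items.items():
            (target / f"{name}.json").write_text(
                json.dumps(payload, ensure_ascii=False), encoding="utf-8"
            )
```

Because the output is just a directory of static files, it could be copied as-is to a Web server or to S3 storage, which is what makes the approach databaseless.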

Splitting each work into JSON fragment files
[Figure: a work split into JSON fragment files, fragment 1 to fragment 5]

JSON fragment-file

Building the JSON word-files
[Figure: fragments 1 to 5 and the word-files built from them]
γάρ:
{
fragment 1
fragment 4
fragment 5
}
καί:
{
fragment 1
fragment 2
fragment 3
fragment 4
fragment 5
}

JSON word-file

Building the stem-files
¨ We classify the word-files into two-letter stem-files
¨ Otherwise the search through the Web browser would be unfocused
¨ We know what words to search for
¨ We know what words there are in the corpus
[Figure: the set of JSON word-files (δικαστήριο, δικάζω, κατά, καί) grouped into stem-files]
δι:
{
δικαστήριο
δικάζω
...
}
κα:
{
καί
κατά
...
}

REST API Resources
¨ myServer.org/api/words/word_id (verb: GET)
¤ Working example: http://patrologia.tk/api/words/Ναρσὴς
¤ The response answers the question: which fragments contain the word (e.g. Ναρσὴς)?
¤ It is a list of links to the fragments that contain the specific word
¨ myServer.org/api/list/list_id (verb: GET)
¤ Working example: http://patrologia.tk/api/list/Να
¤ The response is a list of the words that begin with the two-letter stem
¨ myServer.org/api/fragments/fragment_id (verb: GET)
¤ Working example: http://patrologia.tk/api/fragments/fragment-@-#-01635.frg
¤ The response is the actual text fragment
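A client for the three resources above might look like the following sketch. The helper names `resource_uri` and `get_json` are hypothetical, and two things are assumed rather than taken from the slides: that the server accepts percent-encoded identifiers, and that every response body is JSON. Encoding matters in practice because an unencoded `#` in a fragment id would be read by most HTTP clients as the start of a URL fragment and never sent to the server.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# The three resource names listed on the slide.
RESOURCES = {"words", "list", "fragments"}


def resource_uri(base: str, resource: str, item_id: str) -> str:
    """Build the GET URI for a word, a two-letter stem list, or a fragment."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource}")
    # safe="" also encodes "/", "@" and "#", and turns Greek into UTF-8 escapes.
    return f"{base.rstrip('/')}/api/{resource}/{quote(item_id, safe='')}"


def get_json(base: str, resource: str, item_id: str, timeout: float = 10.0):
    """Fetch one resource and decode it, assuming the body is JSON."""
    with urlopen(resource_uri(base, resource, item_id), timeout=timeout) as response:
        return json.load(response)
```

For instance, `resource_uri("http://patrologia.tk", "list", "Να")` yields `http://patrologia.tk/api/list/%CE%9D%CE%B1`.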

Overview of the REST API Architecture

Architecture of the Web browser-based Interface

Extensions and future work
¨ Extensions easily implemented
¤ Searching for more than one word
¤ Searching individually on various corpora using the same interface
¤ Searching concurrently on various corpora using the same interface
¤ Simplified search enrichment at word level
¤ Citation service for fragments
¨ Future work
¤ Syntactic enrichment and semantic enrichment
¤ Ranking the results
¤ Terminal-based processing of the system

Searching for more than one word
¨ We do not search the whole corpus; instead we compare ONLY the fragment-links of the words being searched
¨ This is a trivial task, accomplished by comparing the JSON word-files using JavaScript
γάρ:
{
fragment 1
fragment 4
fragment 5
}
καί:
{
fragment 1
fragment 2
fragment 3
fragment 4
fragment 5
}
τόν:
{
fragment 5
}
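The comparison of fragment-links amounts to a set intersection. The slides do this in JavaScript in the browser; the sketch below shows the same idea in Python, with a hypothetical function name and with the word-files represented as plain lists of fragment names.

```python
def fragments_with_all(word_files: dict[str, list[str]], words: list[str]) -> list[str]:
    """Return the fragments listed in the word-file of every requested word."""
    if not words:
        return []
    common = set(word_files.get(words[0], []))
    for word in words[1:]:
        # A word with no word-file contributes an empty set, so nothing matches.
        common &= set(word_files.get(word, []))
    return sorted(common)


# The three word-files shown on the slide.
word_files = {
    "γάρ": ["fragment 1", "fragment 4", "fragment 5"],
    "καί": ["fragment 1", "fragment 2", "fragment 3", "fragment 4", "fragment 5"],
    "τόν": ["fragment 5"],
}
print(fragments_with_all(word_files, ["γάρ", "καί", "τόν"]))
```

Only the small word-files are compared, so the cost depends on how many fragments contain each word rather than on the size of the corpus.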

Searching individually in various corpora
Searching individually in various collections by using the same interface:
¨ Patrologia Graeca texts
¨ Ancient Greek texts
¨ Modern Greek texts
¨ Textual collections in other languages
We change only the base URI of the specific transformed collection, e.g.
myHomer.org/api/words/Ναρσὴς
myPatrologia.org/api/words/Ναρσὴς
etc.

Searching concurrently in various corpora
Getting concurrent results from various corpora by using the same interface
¨ E.g. which fragments in the Homer texts and in the Patrologia Graeca texts contain the word-term «Θεός» (God)?
¨ Thanks to this ability and to the simplicity of the system, we can further process the interconnected results
¨ The responses are Web URIs, which can easily be saved and processed offline with NLP methods to build syntactic and semantic enrichment.
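Querying several corpora at once can be sketched as one request per base URI, issued in parallel. In the sketch below the function name `search_corpora` is hypothetical, and the `fetch` callable is an assumption: it stands for whatever function retrieves the fragment links of a word from one base URI.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def search_corpora(
    bases: dict[str, str],
    word: str,
    fetch: Callable[[str, str], list[str]],
) -> dict[str, list[str]]:
    """Ask every corpus for the same word in parallel.

    `bases` maps a corpus label to its base URI; `fetch(base, word)` is
    assumed to return the list of fragment links for that word.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(bases))) as pool:
        futures = {label: pool.submit(fetch, base, word) for label, base in bases.items()}
        # result() re-raises any error that occurred inside a worker thread.
        return {label: future.result() for label, future in futures.items()}
```

Called with `{"Homer": "http://myHomer.org", "Patrologia": "http://myPatrologia.org"}` and the word «Θεός», it would return one list of fragment URIs per corpus, ready to be saved for offline processing.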

Simplified Semantic Interconnections at word level
¨ The user can also navigate from the results for one word to the results for another by clicking the word of interest inside a fragment

Citation service for fragments
Researchers can cite a specific location of interest by its URI:
¨ myServer.org/api/fragments/fragment_id

Syntactic and Semantic enrichment
¨ The REST API can be extended, independently of the auto-transformation presented earlier, to offer more resources such as:
¨ http://myServer.org/api/syntactic/words/word_id
¨ http://myServer.org/api/semantic/words/word_id
/api/syntactic/Ναρσής
{
Entity : Noun,
Gender: Male,
Case : Nominative
…
}
/api/semantic/Ναρσής
{
Entity : Person,
Profession : Military General,
Nationality: Armenian
…
}

Syntactic and Semantic enrichment
¨ The syntactic extension acts in a similar way to treebanks
¤ Describes the grammatical forms
¤ Lemmas
¤ Synonyms, etc.
¨ The semantic extension
¤ Describes entities such as persons, places, cities, etc.
¤ Interconnects resources related to the fragments to enrich the text, such as the scanned page of Patrologia Graeca with its corresponding comments
¨ However, this is not an easy task: detailed design is required, and it is left for a future communication

Ranking the results
¨ Searching via the lemma of the words
¤ or
¨ Searching with regex-style patterns
¤ and
¨ Ranking the results using tf-idf weighting and cosine similarity (Vector Space Model)
¤ The tf-idf weights are generally easy to compute during our transformation
¤ The vectors have a large dimension (about one million), but with a proper technique the similarity can be computed quickly on the Web
¨ This is strongly considered for future implementation
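Since ranking is left as future work, the following is only a sketch of how tf-idf weighting and cosine similarity could fit the fragment model. It assumes one common tf-idf variant (relative term frequency times log(N/df)) and a query vector with weight 1.0 per term. All function names are hypothetical. Vectors are stored as sparse dictionaries, so a fragment only carries the words it actually contains rather than a million dimensions.

```python
import math
from collections import Counter


def tf_idf_vectors(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Compute one sparse tf-idf vector per fragment (tokens given per fragment)."""
    n = len(docs)
    # Document frequency: in how many fragments each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rank(query_terms: list[str], vectors: dict[str, dict[str, float]]) -> list[str]:
    """Return fragment ids with non-zero similarity to the query, best first."""
    query = {term: 1.0 for term in query_terms}  # assumed uniform query weights
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]
```

With this variant a word that occurs in every fragment, such as καί in the earlier slides, gets weight zero and does not influence the ranking.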

Terminal-based processing
¨ In the paper we also describe how to search from a Linux terminal without downloading the full transformed collection.
¨ A variety of software tools can be built, in a similar way to the tools for S3 storage.
¨ Creating an easy and intuitive software tool is also strongly considered for future implementation
