Implementation of a Databaseless Web REST API
for the Unstructured Texts of Migne’s Patrologia Graeca
with Searching capabilities and additional Semantic and Syntactic
expandability
• Evagelos Varthis, evagelosvar@gmail.com
• Department of Archives, Library Science and Museology, Ionian University, Corfu Greece
• Marios Poulos, mpoulos@ionio.gr
• Department of Archives, Library Science and Museology, Ionian University, Corfu Greece
• Ilias Yarenis, yarenis@ionio.gr
• Department of History, Ionian University, Corfu Greece
• Sozon Papavlasopoulos, sozon@ionion.gr
• Department of Archives, Library Science and Museology, Ionian University, Corfu Greece
Table of Contents
¨ Migne’s Patrologia Graeca Overview
¨ Comparison of RDBMS and NoSQL systems
¨ Description of our proposed system
¤ Transformation Process
¤ Overall REST API Architecture
¨ Extensions and future work
Patrologia Graeca: Overview
¨ Patrologia Graeca is a large collection of 161 volumes (bound as 166)
¨ Contains mostly works by the Eastern Christian Fathers
¨ The works were written over a period of about 1,400 years (1st to 14th century)
¨ Contains a vast amount of information on philosophy, psychology, theology, music, politics, everyday life, etc.
Patrologia Graeca: on the Web
¨ In general, there is a lack of systems for searching Patrologia Graeca or other large corpora directly on the Web.
¤ TLG has converted roughly 20% of Patrologia Graeca into text form and offers limited searching through Musaios.
¤ The Perseus Library holds textual data of Patrologia Graeca, but it is less comprehensive and offers no searching.
Patrologia Graeca:
Searching on Large Corpora
The motivation for this paper rests on the following questions:
¨ How can we find the fragments of a large collection that contain a specific word or term, for further processing on the Web? For example, the Patrologia Graeca texts contain nearly one million unique words. The common practice is to:
¤ 1) Download the collection, and then
¤ 2) Write a program, or use a ready-made one, to search the locally downloaded collection
¨ Who is interested in this kind of service and would follow the above procedure?
¤ Scholars with an interest in a specific corpus who also know programming and NLP (very few)
¤ NLP researchers who build treebanks, ontologies, etc.
¨ Who is interested in searching large corpora directly on the Web?
¤ In our view, scholars, ordinary users, and researchers in various fields would benefit from such a service
RDBMS vs NoSQL
¨ What options exist for searching large corpora on the Web?
¤ Relational Database Management Systems (RDBMS)
▪ Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.
¤ NoSQL databases
▪ CouchDB, MongoDB, Dynamo, etc.
RDBMS vs NoSQL: Comparative Characteristics
RDBMS | NoSQL
People to input the textual data | People to input the textual data
Specific structure (data tables) | Schema-agnostic nature (key-value, graph, wide-column, or document models)
Mostly SQL language | Object-oriented programming for interacting
Difficulty in modifying the structure | Ease in modifying the structure
Not easily scalable (possible vertical scalability) | Easily scalable (horizontal scalability)
High-traffic stability problems | Better response to high traffic
Increased complexity | Less complexity
License fees in some cases | License fees in some cases
Overall high cost to build and maintain | Less costly to build and maintain
Description of our System:
Overview of our Web REST API System
¨ Automatically transforms the textual data into a structure similar to a NoSQL database, but completely static. No people are needed to carry out the transformation or to input the textual data.
¨ The final system consists of a set of JSON files (which represent objects) served by a Web server or S3 storage
Some benefits that follow from the above (some coincide with those of NoSQL systems)
¤ Simplicity
¤ Fast searching
¤ Scalability
¤ Extensibility
¤ Easy replication and mirroring of the system
¤ Independence of the User Interface
¤ No need for specialized management
¤ Generally very low cost
Description of our System:
The auto-transformation
[Diagram: Unstructured works → Building fragment-files → Building word-files → Building stem-files → Web REST API]
Description of our System:
The auto-transformation
The auto-transformation is carried out in 3 stages by a bash shell script, which runs in parallel for better performance:
¨ Splitting each work into a set of JSON fragment-files. Each fragment-file contains:
¤ A paragraph, or
¤ A specific number of text lines, or
¤ A specific number of words, or
¤ …
¨ Creating the word-files, i.e. JSON files named after the word being searched. Each word-file contains the links to the JSON fragment-files that contain that word.
¨ Creating a set of two-letter stem-files from the set of word-files.
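The first two stages can be illustrated with a minimal Python sketch. The slides state that the real transformation is a bash shell script, so this is not the authors' code; the function names, the fragment-id scheme (`work-00001`), the splitting rule (a fixed number of lines), and the JSON layout of a word-file are all assumptions made for illustration.

```python
import json
import re
import unicodedata
from collections import defaultdict


def split_into_fragments(text, lines_per_fragment=3):
    """Stage 1: split a work into fragments of a fixed number of non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [
        "\n".join(lines[i:i + lines_per_fragment])
        for i in range(0, len(lines), lines_per_fragment)
    ]


def build_word_index(fragments, work_id="work"):
    """Stage 2: map each word to the sorted ids of the fragments that contain it."""
    index = defaultdict(set)
    for number, fragment in enumerate(fragments, start=1):
        fragment_id = f"{work_id}-{number:05d}"  # hypothetical id scheme
        # NFC keeps accented Greek letters as single code points,
        # so that \w+ does not break a word at a combining accent.
        normalized = unicodedata.normalize("NFC", fragment)
        for word in re.findall(r"\w+", normalized):
            index[word].add(fragment_id)
    return {word: sorted(ids) for word, ids in index.items()}


def word_file_json(word, index):
    """Serialize one word-file; the field names are an assumed layout."""
    return json.dumps({"word": word, "fragments": index[word]}, ensure_ascii=False)
```

Each word-file is computed once, at transformation time, so answering a query later needs no search at all: the server only has to return the file whose name matches the word.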
Description of our System:
Splitting each work in JSON fragment files
[Figure: a work split into JSON fragment files, fragment 1 to fragment 5]
Description of our System:
JSON fragment-file
Description of our System:
Building the JSON word-files
[Figure: fragments 1 to 5 are indexed into JSON word-files]
γάρ → { fragment 1, fragment 4, fragment 5 }
καί → { fragment 1, fragment 2, fragment 3, fragment 4, fragment 5 }
Description of our System:
JSON word-file
Description of our System:
Building the stem-files
¨ We classify the word-files into two-letter stem-files
¨ Otherwise, the search through the Web browser would be unfocused
¨ We know what words to search
¨ We know what words there are in the corpus
[Figure: the set of JSON word-files (δικαστήριο, δικάζω, καί, κατά) is grouped into JSON two-letter stem-files]
δι → { δικαστήριο, δικάζω, ... }
κα → { καί, κατά, ... }
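The grouping into stem-files can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the sketch assumes that the stem is simply the first two characters of the word as written, with no accent folding, which the slides do not specify.

```python
from collections import defaultdict


def build_stem_files(words):
    """Group words under their first two letters (the two-letter stem)."""
    stems = defaultdict(list)
    for word in sorted(set(words)):
        stems[word[:2]].append(word)
    return dict(stems)


stem_files = build_stem_files(["δικαστήριο", "δικάζω", "καί", "κατά"])
```

A browser interface can then fetch one stem-file after the user types two letters and offer only words that actually occur in the corpus.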
Overall REST API Architecture
REST API Resources
¨ myServer.org/api/words/word_id (verb: GET)
¤ Working example: http://patrologia.tk/api/words/Ναρσὴς
¤ The response answers the question: which fragments contain the word (e.g. Ναρσὴς)?
¤ A list of links to the fragments that contain the specific word is returned
¨ myServer.org/api/list/list_id (verb: GET)
¤ Working example: http://patrologia.tk/api/list/Να
¤ A list of the words that begin with the two-letter stem is returned
¨ myServer.org/api/fragments/fragment_id (verb: GET)
¤ Working example: http://patrologia.tk/api/fragments/fragment-@-#-01635.frg
¤ The actual text fragment is returned
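A client addresses the three resources by building a URI from a base address, a resource name, and an identifier. The Python sketch below shows one way to do this; the helper name `resource_uri` is hypothetical. It percent-encodes the identifier, which is an assumption about how a client would send Greek words, and it matters for identifiers such as `fragment-@-#-01635.frg`, because an unencoded `#` would start the URI fragment part rather than belong to the path.

```python
from urllib.parse import quote


def resource_uri(base, resource, identifier):
    """Build the URI of a words, list, or fragments resource.

    The identifier is percent-encoded as UTF-8, including '@' and '#'.
    """
    return f"{base.rstrip('/')}/api/{resource}/{quote(identifier, safe='')}"


stem_uri = resource_uri("http://myServer.org", "list", "\u039d\u03b1")  # stem "Να"
fragment_uri = resource_uri(
    "http://myServer.org", "fragments", "fragment-@-#-01635.frg"
)
```

Since every resource is a static JSON file, a plain GET on such a URI is the whole protocol; no query parameters or request bodies are involved.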
Overview of the REST API Architecture
Architecture of the Web browser-based Interface
Web browser-based Interface
Extensions and future work
¨ Extensions easily implemented
¨ Searching for more than one word
¨ Searching individually on various corpora using the same interface
¨ Searching concurrently on various corpora using the same interface
¨ Simplified searching enrichment at word level
¨ Citation Service for fragments
¨ Future work
¨ Syntactic enrichment and Semantic enrichment
¨ Ranking the results
¨ Terminal based processing of the system
Searching for more than one word
¨ We do not search the whole corpus; instead we compare ONLY the fragment-links
¨ This is a trivial task, accomplished by comparing the JSON files using JavaScript
[Figure: word-files compared for a multi-word search]
γάρ → { fragment 1, fragment 4, fragment 5 }
καί → { fragment 1, fragment 2, fragment 3, fragment 4, fragment 5 }
τόν → { fragment 5 }
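Comparing the fragment-links amounts to a set intersection. The slides do this in JavaScript inside the browser; the sketch below uses Python instead and assumes that each word-file has already been fetched and reduced to a plain list of fragment links. The function name is hypothetical.

```python
def fragments_with_all_words(word_files):
    """Return the fragment links shared by every word-file.

    word_files maps each searched word to its list of fragment links.
    """
    link_sets = [set(links) for links in word_files.values()]
    if not link_sets:
        return []
    return sorted(set.intersection(*link_sets))


common = fragments_with_all_words({
    "γάρ": ["fragment 1", "fragment 4", "fragment 5"],
    "καί": ["fragment 1", "fragment 2", "fragment 3", "fragment 4", "fragment 5"],
    "τόν": ["fragment 5"],
})
# common == ["fragment 5"]
```

The cost depends only on the length of the link lists for the searched words, not on the size of the corpus.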
Searching individually in various corpora
Searching individually in various collections by using the same interface
¨ Patrologia Graeca texts
¨ Ancient Greek texts
¨ Modern Greek texts
¨ Textual collections in other languages
We change only the base callable URI of the specific transformed collection, e.g.
myHomer.org/api/words/Ναρσὴς
myPatrologia.org/api/words/Ναρσὴς
etc.
Searching concurrently in various corpora
Getting concurrent results from various corpora by using the same interface
¨ E.g. which fragments in the Homer texts and the Patrologia Graeca texts contain the word-term «Θεός» (God)?
¨ Thanks to this ability and the simplicity of the system, we can further process the interconnected results
¨ The response consists of Web URIs, which can easily be saved and processed offline with NLP methods to build syntactic and semantic enrichment.
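A concurrent query over several collections can be sketched as one request per base URI, issued in parallel. In the Python sketch below, `search_corpora` is a hypothetical helper, and `fetch_links` is a caller-supplied function that performs the actual GET on `<base>/api/words/<word>` and returns the fragment links; it is injected so that the sketch does not assume any particular HTTP client or response layout.

```python
from concurrent.futures import ThreadPoolExecutor


def search_corpora(word, base_uris, fetch_links):
    """Query every collection for the same word in parallel.

    fetch_links(base_uri, word) must return a list of fragment links.
    Returns a dict mapping each base URI to its list of links.
    """
    with ThreadPoolExecutor(max_workers=len(base_uris) or 1) as pool:
        futures = {
            base: pool.submit(fetch_links, base, word) for base in base_uris
        }
        return {base: future.result() for base, future in futures.items()}
```

Because the result keeps the links grouped by collection, they can be saved as they are and processed offline later.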
Simplified Semantic Interconnections at word level
¨ The user can also navigate from the results for one word to the results for another by clicking the word of interest in a specific fragment
Citation service for fragments
Researchers can refer to a specific location of interest
¨ myServer.org/api/fragments/fragment_id
Future work
Syntactic and Semantic enrichment
¨ The REST API can be extended independently of the auto-transformation presented earlier to offer more resources, such as:
¨ http://myServer.org/api/syntactic/words/word_id
¨ http://myServer.org/api/semantic/words/word_id
/api/syntactic/Ναρσής
{
Entity : Noun,
Gender : Male,
Case : Nominative
…
}
/api/semantic/Ναρσής
{
Entity : Person,
Profession : Military General,
Nationality : Armenian
…
}
Syntactic and Semantic enrichment
¨ The syntactic extension acts in a similar way to treebanks
¤ Describes the grammatical forms
¤ Lemmas
¤ Synonyms, etc.
¨ The semantic extension
¤ Describes entities such as persons, places, cities, etc.
¤ Interconnects resources related to the fragments to enrich the text, such as the scanned page of Patrologia Graeca with its corresponding comments
¨ However, this is not an easy task; a detailed design is required and is left for a future communication
Ranking the results
¨ Searching via the lemma of the words
¤ or
¨ Searching with regex-style patterns
¤ and
¨ Ranking the results by using tf-idf weighting and cosine similarity (Vector Space Model).
▪ In general, the tf-idf weights are easy to compute during our transformation
▪ The vectors have a large dimension (1 million); however, the similarity can be computed quickly on the Web with a proper technique.
¨ This is strongly considered for future implementation
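The ranking idea can be made concrete with a small Python sketch. It is not part of the presented system, which lists ranking as future work. It assumes the common weighting tf × log(N / df) and stores each vector sparsely as a dictionary, so only the words that occur in a fragment take up space even though the vocabulary has about one million entries. All names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(fragments):
    """Compute a sparse tf-idf vector for every fragment.

    fragments maps a fragment id to its list of words.
    Weight of a word = term count * log(N / document frequency).
    """
    total = len(fragments)
    document_frequency = Counter()
    for words in fragments.values():
        document_frequency.update(set(words))
    return {
        fragment_id: {
            word: count * math.log(total / document_frequency[word])
            for word, count in Counter(words).items()
        }
        for fragment_id, words in fragments.items()
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts word -> weight)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The dot product only visits the words of one fragment, which is one reading of the "proper technique" that keeps the similarity fast despite the large dimension.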
Terminal based processing
¨ In the paper we also describe how to search from a Linux terminal without downloading the full transformed collection.
¨ A variety of software tools can be built, in a similar way to the tools for S3 storage.
¨ Creating an easy and intuitive software tool is also strongly considered for future implementation
¨END

Weitere ähnliche Inhalte

Was ist angesagt?

Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma APIKyle Banerjee
 
Document Oriented Access to Graphs
Document Oriented Access to GraphsDocument Oriented Access to Graphs
Document Oriented Access to GraphsNeo4j
 
Live DBpedia querying with high availability
Live DBpedia querying with high availabilityLive DBpedia querying with high availability
Live DBpedia querying with high availabilityRuben Verborgh
 
Designing a RESTful web service
Designing a RESTful web serviceDesigning a RESTful web service
Designing a RESTful web serviceFilip Blondeel
 
Expressive Querying of Semantic Databases with Incremental Query Rewriting
Expressive Querying of Semantic Databases with Incremental Query RewritingExpressive Querying of Semantic Databases with Incremental Query Rewriting
Expressive Querying of Semantic Databases with Incremental Query RewritingAlexandre Riazanov
 
Design Beautiful REST + JSON APIs
Design Beautiful REST + JSON APIsDesign Beautiful REST + JSON APIs
Design Beautiful REST + JSON APIsStormpath
 
Web data from R
Web data from RWeb data from R
Web data from Rschamber
 
RESTful JSON web databases
RESTful JSON web databasesRESTful JSON web databases
RESTful JSON web databaseskriszyp
 
The Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked LascauxThe Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked LascauxRuben Verborgh
 
The Future is Federated
The Future is FederatedThe Future is Federated
The Future is FederatedRuben Verborgh
 
Restful webservice
Restful webserviceRestful webservice
Restful webserviceDong Ngoc
 

Was ist angesagt? (20)

Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
Document Oriented Access to Graphs
Document Oriented Access to GraphsDocument Oriented Access to Graphs
Document Oriented Access to Graphs
 
REST API Design
REST API DesignREST API Design
REST API Design
 
Live DBpedia querying with high availability
Live DBpedia querying with high availabilityLive DBpedia querying with high availability
Live DBpedia querying with high availability
 
Designing a RESTful web service
Designing a RESTful web serviceDesigning a RESTful web service
Designing a RESTful web service
 
Expressive Querying of Semantic Databases with Incremental Query Rewriting
Expressive Querying of Semantic Databases with Incremental Query RewritingExpressive Querying of Semantic Databases with Incremental Query Rewriting
Expressive Querying of Semantic Databases with Incremental Query Rewriting
 
Design Beautiful REST + JSON APIs
Design Beautiful REST + JSON APIsDesign Beautiful REST + JSON APIs
Design Beautiful REST + JSON APIs
 
RDFa Tutorial
RDFa TutorialRDFa Tutorial
RDFa Tutorial
 
WebServices
WebServicesWebServices
WebServices
 
Web data from R
Web data from RWeb data from R
Web data from R
 
JavaCro'15 - Elasticsearch as a search alternative to a relational database -...
JavaCro'15 - Elasticsearch as a search alternative to a relational database -...JavaCro'15 - Elasticsearch as a search alternative to a relational database -...
JavaCro'15 - Elasticsearch as a search alternative to a relational database -...
 
RESTful JSON web databases
RESTful JSON web databasesRESTful JSON web databases
RESTful JSON web databases
 
The Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked LascauxThe Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked Lascaux
 
Dos and donts
Dos and dontsDos and donts
Dos and donts
 
From ontology to wiki
From ontology to wikiFrom ontology to wiki
From ontology to wiki
 
SMWCon Fall 2015 FForms
SMWCon Fall 2015 FFormsSMWCon Fall 2015 FForms
SMWCon Fall 2015 FForms
 
The Future is Federated
The Future is FederatedThe Future is Federated
The Future is Federated
 
J s o n
J s o nJ s o n
J s o n
 
Restful webservice
Restful webserviceRestful webservice
Restful webservice
 
Data programing
Data programingData programing
Data programing
 

Ähnlich wie Session5 04.evangelos varthis

Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!Alex Kursov
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBArangoDB Database
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and SparkAudible, Inc.
 
The Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New TechnologiesThe Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New TechnologiesDave Lewis
 
Semantic Annotation: The Mainstay of Semantic Web
Semantic Annotation: The Mainstay of Semantic WebSemantic Annotation: The Mainstay of Semantic Web
Semantic Annotation: The Mainstay of Semantic WebEditor IJCATR
 
Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?mikaelbarbero
 
How can we develop an ideal website.pptx
How can we develop an ideal website.pptxHow can we develop an ideal website.pptx
How can we develop an ideal website.pptxPradeepK199981
 
Devoxx 2008 - REST in Peace
Devoxx 2008 - REST in PeaceDevoxx 2008 - REST in Peace
Devoxx 2008 - REST in Peacestevenn
 
Corrib.org - OpenSource and Research
Corrib.org - OpenSource and ResearchCorrib.org - OpenSource and Research
Corrib.org - OpenSource and Researchadameq
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...DESTIN-Informatique.com
 
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکیDeep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکیEhsan Asgarian
 

Ähnlich wie Session5 04.evangelos varthis (20)

Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
 
C04 07 1519
C04 07 1519C04 07 1519
C04 07 1519
 
The Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New TechnologiesThe Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New Technologies
 
Semantic Annotation: The Mainstay of Semantic Web
Semantic Annotation: The Mainstay of Semantic WebSemantic Annotation: The Mainstay of Semantic Web
Semantic Annotation: The Mainstay of Semantic Web
 
80068
8006880068
80068
 
Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?
 
How can we develop an ideal website.pptx
How can we develop an ideal website.pptxHow can we develop an ideal website.pptx
How can we develop an ideal website.pptx
 
Devoxx 2008 - REST in Peace
Devoxx 2008 - REST in PeaceDevoxx 2008 - REST in Peace
Devoxx 2008 - REST in Peace
 
Unit 2
Unit 2Unit 2
Unit 2
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elastic search
Elastic searchElastic search
Elastic search
 
Corrib.org - OpenSource and Research
Corrib.org - OpenSource and ResearchCorrib.org - OpenSource and Research
Corrib.org - OpenSource and Research
 
Semantic web
Semantic web Semantic web
Semantic web
 
unit 1.pptx
unit 1.pptxunit 1.pptx
unit 1.pptx
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...
 
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکیDeep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی
Deep dive to ElasticSearch - معرفی ابزار جستجوی الاستیکی
 

Mehr von IMPACT Centre of Competence

Mehr von IMPACT Centre of Competence (20)

Session6 01.helmut schmid
Session6 01.helmut schmidSession6 01.helmut schmid
Session6 01.helmut schmid
 
Session1 03.hsian-an wang
Session1 03.hsian-an wangSession1 03.hsian-an wang
Session1 03.hsian-an wang
 
Session7 03.katrien depuydt
Session7 03.katrien depuydtSession7 03.katrien depuydt
Session7 03.katrien depuydt
 
Session7 02.peter kiraly
Session7 02.peter kiralySession7 02.peter kiraly
Session7 02.peter kiraly
 
Session6 04.giuseppe celano
Session6 04.giuseppe celanoSession6 04.giuseppe celano
Session6 04.giuseppe celano
 
Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session6 02.jeremi ochab
Session6 02.jeremi ochabSession6 02.jeremi ochab
Session6 02.jeremi ochab
 
Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 
Session1 04.florian fink
Session1 04.florian finkSession1 04.florian fink
Session1 04.florian fink
 

Kürzlich hochgeladen

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Kürzlich hochgeladen (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Session5 04.evangelos varthis

  • 1. Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne’s Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability • Evagelos Varthis, evagelosvar@gmail.com • Department of Archives, Library Science and Museology, Ionian University, Corfu Greece • Marios Poulos, mpoulos@ionio.gr • Department of Archives, Library Science and Museology/Ionian University, Corfu Greece • Ilias Yarenis, yarenis@ionio.gr • Department of History, Ionian University, Corfu Greece • Sozon Papavlasopoulos, sozon@ionion.gr • Department of Archives, Library Science and Museology, Ionian University, Corfu Greece
  • 2. Table of Contents
      ◦ Migne’s Patrologia Graeca Overview
      ◦ Comparison of RDBMS and NoSQL systems
      ◦ Description of our proposed system
          ▪ Transformation Process
          ▪ Overall REST API Architecture
      ◦ Extensions and future work
  • 3. Patrologia Graeca: Overview
      ◦ Patrologia Graeca is a large collection of 166 volumes (bound as 161)
      ◦ Contains mostly works by Eastern Christian Fathers
      ◦ The works were written over a period of 1400 years (1st century to 14th century)
      ◦ Contains a vast amount of information related to philosophy, psychology, theology, music, politics, everyday life, etc.
  • 4. Patrologia Graeca: on the Web
      ◦ In general, there is a lack of systems for searching Patrologia Graeca or other large corpora on the Web.
          ▪ Roughly 20% of Patrologia Graeca has been converted to textual form by TLG, which offers limited searching through Musaios.
          ▪ The Perseus Library has textual data of Patrologia Graeca, but it is less comprehensive and offers no searching.
  • 5. Patrologia Graeca: Searching on Large Corpora
      The motivation for this paper is based on the following questions:
      ◦ How can one find the fragments of a large collection that contain a specific word-term, for further processing on the Web? The Patrologia Graeca texts, for example, have nearly one million unique words.
          ▪ 1) Download the collection, which is the common practice, and then
          ▪ 2) Write a program, or use a ready-made one, to search the locally downloaded collection
      ◦ Who is interested in this kind of service and would follow the above procedure?
          ▪ Scholars interested in a specific corpus who know programming and NLP (very few)
          ▪ Researchers in NLP who build treebanks, ontologies, etc.
      ◦ Who is interested in searching large corpora directly on the Web?
          ▪ In our view, scholars, ordinary users and researchers in various fields would benefit from such a service
  • 7. RDBMS vs NoSQL
      ◦ What options exist for searching large corpora on the Web?
          ▪ Relational Database Management Systems (RDBMS): Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.
          ▪ NoSQL databases: CouchDB, MongoDB, Dynamo, etc.
  • 8. RDBMS vs NoSQL: Comparative Characteristics

      RDBMS                                               | NoSQL
      ----------------------------------------------------|----------------------------------------------------
      People to input the textual data                    | People to input the textual data
      Specific structure (data tables)                    | Schema-agnostic nature (key-value, graph, wide column, or document models)
      Mostly SQL language                                 | Object-oriented programming for interacting
      Difficulty in modifying the structure               | Ease in modifying the structure
      Not easily scalable (possible vertical scalability) | Easily scalable (horizontal scalability)
      High traffic stability problems                     | Better response to high traffic
      Increased complexity                                | Less complexity
      License fees in some cases                          | License fees in some cases
      Overall high cost to build and maintain             | Less costly to build and maintain
  • 10. Description of our System: Overview of our Web REST API System
      ◦ Automatically transforms the textual data into a structure similar to a NoSQL database, but completely static. No people are needed to carry out the transformation or to input the textual data.
      ◦ The final system consists of a set of JSON files (which represent objects) served by a Web server or S3 storage.
      Some benefits that follow from the above (some coincide with those of NoSQL systems):
          ▪ Simplicity
          ▪ Fast searching
          ▪ Scalability
          ▪ Extensibility
          ▪ Easy replication and mirroring of the system
          ▪ Independence from the user interface
          ▪ No need for specialized management
          ▪ In general, no cost
  • 11. Description of our System: The auto-transformation (diagram): Unstructured Works → Building fragment-files → Building word-files → Building stem-files → Web REST API
  • 12. Description of our System: The auto-transformation
      The auto-transformation is achieved in 3 stages by a bash shell script, which is executed in parallel for better performance:
      ◦ Splitting each work into a set of JSON fragment-files. Each fragment-file contains:
          ▪ A paragraph, or
          ▪ A specific number of text lines, or
          ▪ A specific number of words, or
          ▪ …
      ◦ Creating the word-files, i.e. JSON files named after the search word. Each word-file contains the links to the JSON fragment-files that contain that word.
      ◦ Creating a set of two-letter stem-files from the set of word-files.
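The first stage can be sketched as follows. This is a minimal illustration in Python rather than the authors' bash script, and it assumes the "specific number of words" splitting option. The file-naming scheme and the JSON field names (`id`, `text`) are hypothetical, since the transcript does not show the actual fragment-file layout.

```python
import json


def split_into_fragments(text, words_per_fragment=50):
    """Split a work into consecutive chunks of at most `words_per_fragment` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_fragment])
        for i in range(0, len(words), words_per_fragment)
    ]


def build_fragment_files(work_id, text, words_per_fragment=50):
    """Return a mapping of file name -> JSON document, one entry per fragment.

    The naming scheme and the "id"/"text" fields are assumptions made
    for illustration, not the authors' actual format.
    """
    files = {}
    fragments = split_into_fragments(text, words_per_fragment)
    for n, fragment in enumerate(fragments, start=1):
        name = f"{work_id}-{n:05d}.json"
        record = {"id": name, "text": fragment}
        # ensure_ascii=False keeps the Greek text readable in the file.
        files[name] = json.dumps(record, ensure_ascii=False)
    return files
```

Each returned JSON string would then be written to its own static file, so that a Web server can serve it without any database behind it.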
  • 13. Description of our System: Splitting each work into JSON fragment-files (diagram): a work is split into JSON fragment-files fragment 1, fragment 2, fragment 3, fragment 4, fragment 5
  • 14. Description of our System: JSON fragment-file
  • 15. Description of our System: Building the JSON word-files (diagram): fragments 1 to 5 are indexed into word-files
      ◦ γάρ: { fragment 1, fragment 4, fragment 5 }
      ◦ καί: { fragment 1, fragment 2, fragment 3, fragment 4, fragment 5 }
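Building the word-files amounts to building an inverted index from words to fragment links. The sketch below is an illustration in Python under stated assumptions: words are tokenized with a simple `\w+` pattern after Unicode NFC normalization, and case and accents are kept as they appear. The authors' actual tokenization rules are not given in the slides.

```python
import re
import unicodedata
from collections import defaultdict

# Assumption: a "word" is a maximal run of letters/digits after NFC normalization.
WORD_RE = re.compile(r"\w+")


def build_word_files(fragments):
    """Map each word to the sorted list of fragment names that contain it.

    `fragments` maps a fragment name (its link) to the fragment text.
    """
    index = defaultdict(set)
    for name, text in fragments.items():
        normalized = unicodedata.normalize("NFC", text)
        for word in WORD_RE.findall(normalized):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


# Toy data mirroring the diagram: γάρ occurs in fragments 1, 4 and 5.
fragments = {
    "fragment 1": "καί γάρ",
    "fragment 2": "καί",
    "fragment 3": "καί",
    "fragment 4": "γάρ καί",
    "fragment 5": "καί γάρ τόν",
}
word_files = build_word_files(fragments)
```

Each entry of `word_files` would be saved as one JSON file named after the word, which is what makes a lookup a plain static file request.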
  • 16. Description of our System: JSON word-file
  • 17. Description of our System: Building the stem-files
      ◦ We classify the word-files into two-letter stem-files
      ◦ Without them, searching through the Web browser would be unfocused
      ◦ With them, we know which words to search for
      ◦ We also know which words exist in the corpus
      Diagram: the set of JSON word-files (δικαστήριο, δικάζω, κατά, καί) is grouped into stem-files:
          ▪ δι: { δικαστήριο, δικάζω, ... }
          ▪ κα: { καί, κατά, ... }
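Grouping the word-files into two-letter stem-files can be sketched as below. This is an illustrative Python version, assuming that the stem is simply the first two characters of the word as stored. How the authors treat accents, breathings or one-letter words is not stated in the slides.

```python
from collections import defaultdict


def build_stem_files(words, stem_length=2):
    """Group words by their first `stem_length` characters.

    Returns a mapping of stem -> sorted list of words starting with it.
    Words shorter than `stem_length` are filed under the whole word.
    """
    stems = defaultdict(list)
    for word in sorted(set(words)):
        stems[word[:stem_length]].append(word)
    return dict(stems)


# The words shown in the diagram.
stem_files = build_stem_files(["δικαστήριο", "δικάζω", "κατά", "καί"])
```

A browser interface can then fetch one small stem-file as the user types two letters and offer only words that really occur in the corpus.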
  • 18. JSON two-letter stem-file
  • 19. Overall REST API Architecture
  • 20. REST API Resources
      ◦ myServer.org/api/words/word_id (verb: GET)
          ▪ Working example: http://patrologia.tk/api/words/Ναρσὴς
          ▪ The response answers the question: which fragments contain the word (e.g. Ναρσὴς)?
          ▪ A list of links to the fragments that contain the word is returned
      ◦ myServer.org/api/list/list_id (verb: GET)
          ▪ Working example: http://patrologia.tk/api/list/Να
          ▪ A list of the words that begin with the two-letter stem is returned
      ◦ myServer.org/api/fragments/fragment_id (verb: GET)
          ▪ Working example: http://patrologia.tk/api/fragments/fragment-@-#-01635.frg
          ▪ The actual text fragment is returned
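A client for the three resources above might look like the following sketch. It uses only the Python standard library. The helper names `resource_uri` and `get_json` are hypothetical, and the code assumes that every response is a UTF-8 JSON document, which follows from the system being a set of served JSON files but is not confirmed in detail by the slides. Percent-encoding matters here because identifiers contain Greek letters and characters such as `#`, which would otherwise be read as a URI fragment delimiter.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# The three resource collections listed on the slide.
RESOURCES = {"words", "list", "fragments"}


def resource_uri(base_uri, resource, identifier):
    """Build a percent-encoded URI such as <base>/words/<word_id>."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    # safe="" also encodes "/", "@" and "#" inside the identifier.
    return f"{base_uri.rstrip('/')}/{resource}/{quote(identifier, safe='')}"


def get_json(uri, timeout=10):
    """GET a resource and decode it, assuming a UTF-8 JSON body."""
    with urlopen(uri, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical base URI, following the slide's myServer.org placeholder.
uri = resource_uri("http://myServer.org/api", "list", "Να")
```

Calling `get_json(uri)` would then perform the actual GET request against a deployed server.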
  • 21. Overview of the REST API Architecture
  • 22. Architecture of the Web browser-based Interface
  • 25. Extensions and future work
      ◦ Extensions easily implemented
          ▪ Searching for more than one word
          ▪ Searching individually on various corpora using the same interface
          ▪ Searching concurrently on various corpora using the same interface
          ▪ Simplified search enrichment at word level
          ▪ Citation service for fragments
      ◦ Future work
          ▪ Syntactic enrichment and semantic enrichment
          ▪ Ranking the results
          ▪ Terminal-based processing of the system
  • 26. Searching for more than one word
      ◦ We do not search the whole corpus; instead we compare ONLY the fragment-links
      ◦ This is a trivial task, accomplished by comparing the JSON files using JavaScript
      Example word-files (only fragment 5 appears in all three):
          ▪ γάρ: { fragment 1, fragment 4, fragment 5 }
          ▪ καί: { fragment 1, fragment 2, fragment 3, fragment 4, fragment 5 }
          ▪ τόν: { fragment 5 }
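Comparing the fragment-links is a set intersection. The authors do this in JavaScript in the browser. The sketch below shows the same idea in Python for illustration, with a hypothetical function name and with the word-files represented as plain lists of fragment links.

```python
def fragments_containing_all(word_files):
    """Return the fragment links shared by every word-file.

    `word_files` maps each searched word to its list of fragment links.
    """
    link_sets = [set(links) for links in word_files.values()]
    if not link_sets:
        return []
    return sorted(set.intersection(*link_sets))


# The three word-files from the slide.
word_files = {
    "γάρ": ["fragment 1", "fragment 4", "fragment 5"],
    "καί": ["fragment 1", "fragment 2", "fragment 3", "fragment 4", "fragment 5"],
    "τόν": ["fragment 5"],
}
print(fragments_containing_all(word_files))  # ['fragment 5']
```

Only the small word-files are downloaded and compared. The fragment texts themselves are fetched afterwards, and only for the links that survive the intersection.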
  • 27. Searching individually in various corpora
      Searching individually in various collections by using the same interface:
      ◦ Patrologia Graeca texts
      ◦ Ancient Greek texts
      ◦ Modern Greek texts
      ◦ Textual collections in other languages
      We change only the base URI of the specific transformed collection, e.g. myHomer.org/api/words/Ναρσὴς, myPatrologia.org/api/words/Ναρσὴς, etc.
  • 28. Searching concurrently in various corpora
      Getting concurrent results from various corpora by using the same interface:
      ◦ e.g. which fragments in the Homer texts and the Patrologia Graeca texts contain the word-term «Θεός» (God)?
      ◦ Thanks to this ability and the simplicity of the system, we can further process the interconnected results
      ◦ The response consists of Web URIs that can easily be saved and processed offline, applying NLP methods to build syntactic and semantic enrichment
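Because every transformed collection exposes the same resources under a different base URI, a concurrent search only needs to issue the same request against each base. The Python sketch below illustrates this under assumptions: the base URIs are the slide's placeholders, the function name is hypothetical, and `fetch` is a caller-supplied function that performs the GET request and returns the list of fragment links.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote


def search_corpora(base_uris, word, fetch):
    """Look up `word` in several corpora at once.

    `base_uris` maps a corpus name to its API base URI.
    `fetch(uri)` must return the fragment links found at that URI.
    Returns a mapping of corpus name -> fragment links.
    """
    encoded = quote(word, safe="")
    uris = {
        name: f"{base.rstrip('/')}/words/{encoded}"
        for name, base in base_uris.items()
    }
    # One worker per corpus, so the requests run concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(uris))) as pool:
        futures = {name: pool.submit(fetch, uri) for name, uri in uris.items()}
        return {name: future.result() for name, future in futures.items()}


# Placeholder base URIs taken from the slides.
corpora = {
    "homer": "http://myHomer.org/api",
    "patrologia": "http://myPatrologia.org/api",
}
```

The result keeps the links grouped by corpus, so they can be saved and processed offline per collection.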
  • 29. Simplified Semantic Interconnections at word level
      ◦ The user can also navigate from the results for one word to the results for another by clicking the word of interest in a specific fragment
  • 30. Citation service for fragments
      Researchers can refer to a specific location of interest:
      ◦ myServer.org/api/fragments/fragment_id
  • 32. Syntactic and Semantic enrichment
      ◦ The REST API can be extended independently of the auto-transformation presented earlier to offer more resources, such as:
          ▪ http://myServer.org/api/syntactic/words/word_id
          ▪ http://myServer.org/api/semantic/words/word_id
      Examples:
          ▪ /api/syntactic/Ναρσής: { Entity : Noun, Gender: Male, Case : Nominative … }
          ▪ /api/semantic/Ναρσής: { Entity : Person, Profession : Military General, Nationality: Armenian … }
  • 33. Syntactic and Semantic enrichment
      ◦ The syntactic extension acts in a similar way to treebanks
          ▪ Describes the grammatical forms
          ▪ Lemmas
          ▪ Synonyms, etc.
      ◦ The semantic extension
          ▪ Describes entities such as persons, places, cities, etc.
          ▪ Interconnects resources related to the fragments to enrich the text, such as the scanned page of Patrologia Graeca that carries the corresponding comments
      ◦ However, this is not an easy task; a detailed design is required and is left for future communication
  • 34. Ranking the results
      ◦ Searching via the lemma of the words, or
      ◦ Searching with regex-style patterns, and
      ◦ Ranking the results using tf-idf weighting and cosine similarity (Vector Space Model)
          ▪ In general, the tf-idf weights are easy to compute during our transformation
          ▪ The vectors have a large dimension (1 million); however, the similarity can be computed quickly on the Web with a proper technique
      ◦ This is strongly considered for future implementation
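Since ranking is only planned, the following Python sketch is an illustration of the general technique, not the authors' design. It assumes one common tf-idf variant (term frequency divided by fragment length, times the natural logarithm of N over document frequency) and stores each vector sparsely as a dictionary, so the nominal dimension of about one million words never has to be materialized.

```python
import math
from collections import Counter


def tfidf_vectors(fragments):
    """Compute sparse tf-idf vectors.

    `fragments` maps a fragment name to its list of tokens.
    Returns a mapping of fragment name -> {word: weight}.
    """
    n = len(fragments)
    document_frequency = Counter()
    for tokens in fragments.values():
        document_frequency.update(set(tokens))
    vectors = {}
    for name, tokens in fragments.items():
        counts = Counter(tokens)
        vectors[name] = {
            word: (count / len(tokens)) * math.log(n / document_frequency[word])
            for word, count in counts.items()
        }
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Only the words that two vectors share contribute to the dot product, which is one reason the similarity can stay cheap despite the large vocabulary.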
  • 35. Terminal-based processing
      ◦ In the paper we also describe how to search from a Linux terminal without downloading the full transformed collection.
      ◦ A variety of software tools can be built, in a similar way to the tools for S3 storage.
      ◦ Creating an easy and intuitive software tool is also strongly considered for future implementation
  • 36. END