2. Project members
IJAZ UL HAQ (EJO) – GROUP LEADER
(ID:2820150066)
CHEN JIAYU
RON WONG SEE
ALEX
ZHENG YUE
GARY
PANG PENGFEI
3. Semantic Web
Experts expect online information to be organized in
smarter, more useful ways in the coming years, but they
dispute whether the improvements will match the vision
behind the Semantic Web.
Sir Tim Berners-Lee's vision is a web that allows software
agents to carry out sophisticated tasks for users, making
meaningful connections between bits of information so that
“computers can perform more of the tedious work
involved in finding, combining, and acting upon
information on the web.”
4. Project Description
Our project develops semantic search for a library.
Semantic Web concepts are used to re-engineer the
search process of library catalogue search systems
and make it more efficient.
An ontology is a higher-level view of a database schema;
it helps generate queries that extract data from the
database by selecting classes and relationships.
5. Applications of ontology
• Searching & browsing
• Decision support system
• Question answering system
• Recommendation
• Data integration
• Etc.
6. Semantic digital library
• An approach is proposed for managing, organizing and
populating an ontology for the document collections of a
digital library.
• Document metadata and content are inserted into a
knowledge base, which allows sophisticated querying
and searching.
• An ontology-based information retrieval model is
proposed first. It builds on the classic vector space
model and includes document annotation,
instance-based weighting and concept-based
ranking.
10. Apache Jena APIs
Jena is a programming toolkit, using
the Java programming language.
While there are a few command-line
tools to help you perform some key
tasks using Jena, mostly you use Jena
by writing Java programs.
11. Eclipse And MySQL
We used the Apache Jena APIs in Java (Eclipse) to load the
ontology file and to search the MySQL database using that
ontology. This project is not a complete search engine for
the book library; it only contrasts simple syntactic search
with semantic search. We stored data about Software
Engineering books (metadata, not the books themselves)
and ran the search through the ontology. That is what
semantic search is: searching documents or the web using
a predefined ontology.
12. Future Work…
In the future, a search engine should be smart enough to
satisfy everything the user's query asks for and to show
users what they actually want to see.
Our project should go on to search through the contents
of books, so that it shows users the books or articles
they want.
There are many semantic search engines; one of them is
Swoogle.
18. RDF
Resource description framework (RDF) is a W3C
standard for describing web resources, such as the
title, author, date, content and copyright information
of a web page.
19. 1 A framework for describing resources on
the Web
2 Provides a data model and a syntax
3 Designed to be read and understood by
computers
4 Written in XML
5 Not meant to be displayed to people
20. Use attributes and attribute values to describe
resources
Resource: http://www.w3school.com.cn/rdf
Attributes: Author, homepage
Attribute values: David, http://www.w3school.com.cn
23. RDF instance
Title             Artist        Country  Company      Price  Year
Empire Burlesque  Bob Dylan     USA      Columbia     10.90  1985
Hide your heart   Bonnie Tyler  UK       CBS Records  9.90   1988
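As a rough illustration of how table rows like these become RDF statements, the sketch below turns each row into (subject, predicate, object) triples in plain Python. The `cd` namespace is the one quoted in the notes of this deck; the subject URI scheme and the helper name `row_to_triples` are assumptions made only for this example.

```python
# Namespace quoted in the deck's notes; everything else here is illustrative.
CD = "http://www.recshop.fake/cd#"

rows = [
    {"title": "Empire Burlesque", "artist": "Bob Dylan", "country": "USA",
     "company": "Columbia", "price": "10.90", "year": "1985"},
    {"title": "Hide your heart", "artist": "Bonnie Tyler", "country": "UK",
     "company": "CBS Records", "price": "9.90", "year": "1988"},
]


def row_to_triples(row):
    """Turn one table row into (subject, predicate, object) triples."""
    # Assumed subject URI: a base path plus the URL-escaped title.
    subject = "http://www.recshop.fake/cd/" + row["title"].replace(" ", "%20")
    return [(subject, CD + column, value)
            for column, value in row.items() if column != "title"]


triples = [t for row in rows for t in row_to_triples(row)]
for s, p, o in triples:
    print(s, p, o)
```

Each row yields one triple per non-title column, so the resource (the CD) is described by attribute and attribute-value pairs, as on the previous slides.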
26. The proposed framework for
semantic annotation of Chinese
Web pages
From sentences to RDF
27. <div id="navfirst">
<ul id="menu">
<li id="h"><a href="/h.asp" title="HTML 系列教程">HTML 系列教程</a></li>
<li id="b"><a href="/b.asp" title="浏览器脚本教程">浏览器脚本</a></li>
<li id="s"><a href="/s.asp" title="服务器脚本教程">服务器脚本</a></li>
<li id="d"><a href="/d.asp" title="ASP.NET 教程">ASP.NET 教程</a></li>
<li id="x"><a href="/x.asp" title="XML 系列教程">XML 系列教程</a></li>
<li id="ws"><a href="/ws.asp" title="Web Services 系列教程">Web Services 系列教程
</a></li>
<li id="w"><a href="/w.asp" title="建站手册">建站手册</a></li>
</ul>
There are a large number of HTML documents on
the Web. These documents are written for human
readers, not for machine processing, so they carry no
semantic knowledge that a computer can use.
28. In general, semantic tagging is the process of
adding a standardized knowledge
representation to documents under the guidance
of a domain ontology. It is usually divided
into two steps.
First: type tagging (TT)
Second: relation extraction (RE)
31. 1、Data Preparation
1、Domain ontology
The domain ontology is the core data of
semantic annotation. It comprises the concepts
and attributes defined in the ontology and the
instance data stored in the ontology in advance.
32.
Concept  Object properties  Data type properties  Instance data  Total
422      87                 147                   2420           3096
Protege
Automatic program extraction
Domain expert manual extraction
33. 1、Data Preparation
2、Domain vocabulary
The domain vocabulary is built by
statistical methods. Its data come from the web pages
downloaded by a focused crawler; after sentence
splitting, the data are processed into a set of
natural-language text sentences.
34. 2、Identification stage
Explicit attribute type labeling algorithm (EPTT)
Input: the word segmentation of a sentence
Output: the set of annotated types and the
updated word segmentation.
35. 2、Identification stage
Begin:
Step1: Apply the identification rules to recognize
general-purpose entities in the sentence (such as
numbers, money and dates) and label their types.
Step2: Apply the annotation vocabulary list: words
in the sentence are matched exactly and the
corresponding types are labeled.
36. 2、Identification stage
Step3: Apply the N-gram segmentation
technique to approximately match the words in the
sentence against the words in the annotation
vocabulary list; on success, the corresponding types are labeled.
Step4: Adjust the result of sentence segmentation
to ensure that a word with a labeled type is
not split. If the segmentation process has
separated it, the parts are merged back
into one word.
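Steps 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the approximate N-gram match is scored, so the sketch uses the Dice coefficient over character bigrams with an arbitrary threshold of 0.6. The vocabulary, the type names and the function names are all hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def tag_types(tokens, vocabulary, threshold=0.6):
    """Label each token with a type: exact match first, then approximate match."""
    tagged = []
    for token in tokens:
        if token in vocabulary:  # Step 2: exact match against the vocabulary list
            tagged.append((token, vocabulary[token]))
            continue
        # Step 3 (assumed scoring): closest vocabulary word by bigram overlap
        best = max(vocabulary, key=lambda w: dice(token, w), default=None)
        if best is not None and dice(token, best) >= threshold:
            tagged.append((token, vocabulary[best]))
        else:
            tagged.append((token, None))  # no type found
    return tagged


# Hypothetical annotation vocabulary: word -> type
vocabulary = {"ontology": "Concept", "Protege": "Tool"}
result = tag_types(["ontology", "ontologies", "library"], vocabulary)
print(result)
```

Here "ontologies" is not in the vocabulary but shares most of its bigrams with "ontology", so it receives the same type, while "library" stays untagged.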
38. 3、 Assembly phase
Dependency grammar:
The words in a sentence are linked by direct syntactic
relations. Each relation is directed: usually one word
governs another. These governor and dependent
relationships reflect how the words in the sentence
relate to each other.
39. 3、 Assembly phase
1、Dependency pair: Relation(Gov,Dep)
Gov: governing word Dep: dependent word
Relation: grammatical relation
2、Dependency tree
The dependency pairs can be linked into a dependency tree,
with Gov as the parent node and Dep as the child node.
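A minimal sketch of how dependency pairs can be assembled into a tree, with Gov as parent and Dep as child. The example sentence, the relation labels (`nsubj`, `dobj`) and the function names are assumptions chosen for illustration, not the output of any particular parser.

```python
from collections import defaultdict


def build_tree(pairs):
    """pairs: list of (relation, gov, dep). Returns (root, children map)."""
    children = defaultdict(list)
    dependents = set()
    for relation, gov, dep in pairs:
        children[gov].append((relation, dep))  # Gov is the parent of Dep
        dependents.add(dep)
    # The root is a governor that is never itself a dependent.
    roots = [gov for _, gov, _ in pairs if gov not in dependents]
    return (roots[0] if roots else None), dict(children)


def render(node, children, depth=0, label="root"):
    """Return the tree as indented lines of the form relation(word)."""
    lines = ["  " * depth + f"{label}({node})"]
    for relation, dep in children.get(node, []):
        lines.extend(render(dep, children, depth + 1, relation))
    return lines


# Hypothetical pairs for the sentence "Jena reads ontologies"
pairs = [("nsubj", "reads", "Jena"), ("dobj", "reads", "ontologies")]
root, children = build_tree(pairs)
lines = render(root, children)
print("\n".join(lines))
```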
41. 3、 Assembly phase
3、Dependency forest
A sentence can be divided into a number of
clauses, each of which forms a dependency tree;
together these trees form the dependency forest of
the whole sentence.
42. Based on dependency tree Relation extraction algorithm(D
Grammatical relationship triple,GRT
3、 Assembly phase
43. Conclusion
1、Explicit type marking method (EPTT) and Relation extra
Method(DTRE) is effective.
2、The domain vocabulary list is still tagged
manually; the next step is to automate this with
machine learning methods.
48. Building a semantic search engine in library
--semantic search systems
Semantic search systems might combine a
range of techniques, ranging from statistics
based IR methods for ranking, database
methods for efficient indexing and query
processing, up to complex reasoning
techniques for making inferences!
49. History of Search Engine
The originator of search engines: Archie
The origin of modern search engines: Wanderer, Yahoo
The first search engines in the modern sense: Lycos, Infoseek
The first meta-search engine: Metacrawler
The first search engine to support natural language search: AltaVista
The belated king: Google
The number one Chinese search engine: Baidu
Concentrated, aggressive, plain, humble; starting a business (making money) as a fairy tale
Focus on technology and Chinese search
50. Usage Frequency of Internet Applications in China
Category              Application            Frequency
Information channels  Internet news          77.3%
Information channels  Search engine          74.8%
Information channels  Write a blog           19.1%
Communication tools   Instant messaging      69.8%
Communication tools   E-mail                 55.4%
Entertainment tools   Online music           68.5%
Entertainment tools   Online video           61.1%
Entertainment tools   Online games           47.0%
Life assistant        Job hunting            15.2%
Life assistant        Online education       24.0%
Life assistant        Online shopping        25.5%
Life assistant        Internet sales         4.3%
Life assistant        Online travel booking  3.9%
Life assistant        Internet banking       20.9%
Life assistant        Online stock trading   14.1%
52. Definition:
Information Retrieval (IR) is the process of finding specific
information in a data source to meet a need.
Traditional IR
• Looking up the pronunciation and meaning of a word in a
dictionary by its spelling
• Looking up contact information in a phone's contact list
• Looking up sentences containing a word in an electronic dictionary
IR in the electronic information age
Information retrieval is the field concerned with the
structure, analysis, organization, storage, searching and
access of information.
53.
1 Universal search: Search on the World Wide Web is the most
common application of information retrieval.
2 Vertical search: A special form of Web search in which the
search is restricted to specific topics.
3 Enterprise search: Finds the needed information among a large
number of distributed computer files in an intranet.
4 Desktop search: The personal edition of enterprise search. The
source is the collection of files stored on a personal computer,
including documents, source code, mail and web browsing history.
5 P2P search: Searches across the nodes of a network without a
centralized controller.
54. Key Issues In Information Retrieval
Relevance is a basic concept in information
retrieval (precision, recall).
A retrieval model is a formal representation of the
matching process between a query and a
document.
Evaluation: the quality of a ranked list of documents
depends on how well the list matches the
user's requirement.
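Precision and recall, mentioned above, can be computed as in the sketch below. The document identifiers and the function name are made up for the example; precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found.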
56. A search engine is the practical
application of information retrieval
technology to large-scale text collections.
Important issues in the design of a search engine
• All the problems of information retrieval:
• effective ranking algorithms
• evaluation
• user interaction
• Large-scale data brings many other
problems to a search engine:
• response time
• query throughput
• indexing speed
The most important issue is the performance of
the search engine.
58. preprocessing
Tokenizing is an important step in text
preprocessing.
Documents and queries must be transformed into tokens (morphemes) in the same
way.
A given text may have several possible segmentations, and the choice affects the
retrieval results.
Removal of stopwords
Stopwords are the words that appear most frequently in documents and
carry no real meaning of their own, for example function words such as “the”,
“of”, “to” and “for”.
The problem with using a stoplist is that if the user submits a query such as “to be or
not to be” or “down under”, some or all of the query words are removed and the
search engine is unlikely to return useful results.
Solution: use a very small stoplist in the indexing phase and a larger stoplist in
the query phase.
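The two-stoplist idea can be sketched as follows. The stoplist contents are invented for the example, and the fallback for queries that consist only of stopwords is an assumption of this sketch (the slide does not say how such queries are handled); it works only because the index kept those words.

```python
# Illustrative stoplists: a very small one for indexing, a larger one for queries.
INDEX_STOPLIST = {"the", "of"}
QUERY_STOPLIST = {"the", "of", "to", "for", "be", "or", "not", "a", "and"}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def index_terms(text):
    """Terms kept at indexing time: only the small stoplist is removed."""
    return [t for t in tokenize(text) if t not in INDEX_STOPLIST]


def query_terms(text):
    """Terms kept at query time: the larger stoplist is removed."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in QUERY_STOPLIST]
    # Assumed fallback: if the query would become empty, keep every word
    # that the index still contains.
    return kept or [t for t in tokens if t not in INDEX_STOPLIST]


print(index_terms("The history of search"))
print(query_terms("history of the search engine"))
print(query_terms("to be or not to be"))
```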
59. preprocessing
The task of stemming is to normalize words derived from the same stem.
For example, "fish", "fishes" and "fishing" are put into one equivalence class.
Stemming usually gives a slight improvement in ranking. Like stopword removal, it is optional.
Stemming every word can cause search problems, because unrelated words may be reduced to the same stem.
Information extraction recognizes more complex index terms,
but it usually requires more complex computation.
Named entity recognition can detect names, places, organizations, dates and so
on.
60. Index creation
1. Index terms
The text conversion module converts documents into index terms or features.
2. Document statistics
The document statistics component summarizes and records statistics about
the words in the documents.
3. Weight computing
Term weights reflect the relative importance of terms and are used to compute ranking
scores.
4. Inversion component
The inversion component is the key component of indexing; it converts the
document-term stream into a term-document stream.
5. Index allocation
The index allocation component distributes the index across computers, or nodes of a
network.
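The inversion step can be illustrated with a small sketch that turns documents (document to terms) into an inverted index (term to documents). The sample documents and the function name are assumptions for the example; the counts stored per document are the kind of statistics that the weighting step would use.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Hypothetical two-document collection
docs = {"d1": "semantic web search", "d2": "library search engine search"}
index = build_inverted_index(docs)
print(index["search"])
```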
63. 1 TF (term frequency) of a term: the number of
times the term occurs in a document.
2 DF_t (document frequency): the number of
documents in which term t appears.
3 DF is often higher than TF by several orders of
magnitude, so the effect of TF would be swamped by DF.
DF therefore has to be mapped to a smaller value. Assume
the corpus contains N documents; the IDF (inverse
document frequency) of term t is:
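A common definition of IDF, which we assume is the one intended here (N is the number of documents in the corpus and df_t is the document frequency of term t, as defined above), is:

```latex
\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}
```

With this definition a term that appears in every document gets idf = log 1 = 0, and rarer terms get larger values.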
64. TF-IDF
We want the weight of term t to obey the following rules:
(1) If t appears many times in only a few documents, its
weight is very high.
(2) If t appears few times in a document, or appears in
many documents, its weight is lower than in (1), and its
effect on the final relevance score is small.
(3) If t appears in all the documents, its weight is the
minimum.
TF and IDF are combined to form the term's final weight.
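The three rules can be checked with a small sketch. It assumes the common weighting tf × log(N / df), which is one of several TF-IDF variants and not necessarily the exact one used in the project; the toy documents and the function name are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: tf * log(N / df)}} for a dict of documents."""
    n = len(docs)
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # df: in how many documents each term occurs
    df = Counter(term for c in counts.values() for term in c)
    return {doc_id: {t: tf * math.log(n / df[t]) for t, tf in c.items()}
            for doc_id, c in counts.items()}


# Hypothetical corpus of three tiny documents
docs = {
    "d1": "jena jena jena ontology book",
    "d2": "ontology book",
    "d3": "library book",
}
weights = tf_idf(docs)
print(weights["d1"])
```

In document d1, "jena" (frequent, and found in one document only) gets the highest weight, "ontology" a lower one, and "book" (found in every document) gets weight 0, matching rules (1) to (3).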
65. Document Similarity
• Point distance
• sim(d1,d2) = |V(d1)-V(d2)|
• This value depends on the length of the
documents: in tf-idf, tf varies with the length of the document.
Cosine similarity normalizes by vector length, so it avoids this problem.
Calculate the cosine similarity between the query vector
and each document vector.
Sort the documents by similarity and
choose the K most similar documents.
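A minimal sketch of cosine ranking over sparse term-weight vectors. The vectors below are hand-written stand-ins for tf-idf weights, and the names `cosine` and `top_k` are chosen for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                    reverse=True)
    return ranked[:k]


# Hypothetical weight vectors
doc_vecs = {
    "d1": {"semantic": 1.0, "search": 1.0},
    "d2": {"semantic": 2.0, "search": 2.0, "library": 0.5},
    "d3": {"cooking": 3.0},
}
query = {"semantic": 1.0, "search": 1.0}
print(top_k(query, doc_vecs, 2))
```

Document d2 has larger weights than d1 but is not ranked higher, because the cosine measures only the angle between vectors and not their length.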
76. 1
What is Jena?
Jena is a Java framework for building applications for
the Semantic Web.
It provides interfaces and classes for creating RDF.
It also provides classes and interfaces for managing
OWL-based ontologies.
77. 2
What is the RDFS?
RDFS is the weakest ontology language supported by Jena. RDFS allows the ontologist to build a simple
hierarchy of concepts, and a hierarchy of properties. Consider the following trivial characterization.
A simple example:
78. 3
Jena API and ontology languages
Jena aims to provide a consistent programming interface for ontology application
development, independent of which ontology language you are using in your programs.
The Jena Ontology API is language-neutral : the Java class names are not specific to
the underlying language.
To represent the differences between the various representations, each of the ontology
languages has a profile.
OWL:
RDFS: null (RDFS does not define object properties)
79. 4
Ontology Model
Ontology model is an extended version of Jena's Model class. The base Model allows access to
the statements in a collection of RDF data.
OntModel extends the base Model by adding support for the kinds of constructs expected to be
in an ontology: classes (in a class hierarchy), properties (in a property hierarchy) and individuals.
All of the state information remains encoded as RDF triples stored in the RDF model.
The ontology API doesn't change the RDF representation of ontologies; it just adds a set of
convenience classes and methods that make it easier for you to write programs that manipulate
the underlying RDF triples.
82. 6
Create RDF Models: Resources
A simple example: a person resource
The resource “http://…/JohnSmith” represents a person
“John Smith” is the value of one of its properties
In Jena, a resource is represented by the Resource class and a
property by the Property class. The overall
model is represented by the Model class. A Model object can
contain multiple resources.
83. 7
Create RDF Models: Statements
Each arrow in a Model is a statement. A statement is composed of three
parts: subject, predicate and object.
Subject: where the arrow starts in the diagram. It represents the
resource being described.
Predicate: the arrow itself in the diagram. It represents a property of the resource.
Object: where the arrow points in the diagram. It represents the value of
the property. It can be a literal (text) or another resource.
84. 8
Output RDF
A model can be written to an output stream with the
write method of Model.
• model.write(OutputStream): can also be written as
model.write(OutputStream, null); uses the default output format.
• model.write(OutputStream, "RDF/XML-ABBREV"): writes the RDF in the abbreviated
XML syntax.
• model.write(OutputStream, "N-TRIPLE"): writes the RDF in N-Triples format.
85. 9
Input RDF
A model can be read from an input stream
with the read method of
Model.
86. 10
Operations on a
Model
Model.remove: deletes statements from the
model.
Model.add: adds statements to the model.
Model.intersection(Model model): intersection.
Creates a new Model containing the statements that
appear in both models.
Model.union(Model model): union. Creates a
new Model containing the statements that appear in
either of the two models.
Model.difference(Model model): difference. Creates
a new Model containing the statements that are in this
Model but not in the Model passed as the
parameter.
87. 11
Union operation on models
Both models contain the same property, “vcard:FN”.
After the union operation, the repeated “vcard:FN” value appears only once.
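The set semantics of these operations can be illustrated without Jena by treating a statement as a (subject, predicate, object) tuple and a model as a Python set of statements. This is not Jena's API, only a sketch of the behaviour described above; the URI, the e-mail address and the `vcard:TITLE` and `vcard:EMAIL` values are placeholders invented for the example.

```python
# A statement is a (subject, predicate, object) tuple; a model is a set of them.
JOHN = "http://example.org/JohnSmith"  # placeholder URI

model_a = {
    (JOHN, "vcard:FN", "John Smith"),
    (JOHN, "vcard:EMAIL", "john@example.org"),
}
model_b = {
    (JOHN, "vcard:FN", "John Smith"),
    (JOHN, "vcard:TITLE", "Librarian"),
}

union = model_a | model_b         # statements in either model
intersection = model_a & model_b  # statements in both models
difference = model_a - model_b    # statements only in model_a

print(len(union), intersection, difference)
```

The shared `vcard:FN` statement appears once in the union, because a model holds each statement at most once.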
88. 12
Reasoner
Jena ships with a set of general reasoning rules. They are
mainly defined over the characteristics of ontology
constructs and are used to check relationships between
concepts, such as relations between classes and the
transitivity, inverse and disjointness of properties.
These general rules cannot always meet the needs of
information retrieval in a specific domain. In that
situation we can define custom rules. Custom rules
supplement the general rules and serve the needs of
the individual application.
Rule: (?x work in ?y), (?y use ?z) -> (?x use ?z)
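What such a custom rule does can be sketched in plain Python as simple forward chaining over triples. This is not Jena's rule engine; the predicate spelling `workIn`, the sample facts and the function name are assumptions made for the example.

```python
def apply_rule(triples):
    """Apply (?x workIn ?y), (?y use ?z) -> (?x use ?z) until nothing new is derived."""
    facts = set(triples)
    while True:
        derived = {(x, "use", z)
                   for (x, p1, y) in facts if p1 == "workIn"
                   for (y2, p2, z) in facts if p2 == "use" and y2 == y}
        if derived <= facts:  # fixed point reached: no new facts
            return facts
        facts |= derived


# Hypothetical facts
facts = {("alice", "workIn", "lab"), ("lab", "use", "Jena")}
inferred = apply_rule(facts)
print(sorted(inferred))
```

The result contains the original facts plus the inferred statement that alice uses Jena, which no single input statement says directly.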
89. 13
How the reasoner works can be
summarized as follows:
(1) Create the reasoner from the
resources and ontology that have
been created or read in as RDF triples.
(2) Obtain the inference model object (InfGraph)
through the Model API and the Ontology API.
(3) Complete the semantics-based
information retrieval through concept
reasoning, and get the desired
results using the Ontology API and the
Model API.
92. 16
Persisting an ontology to a
database
A persistent model for any database is created
by the following steps:
1) load the database JDBC driver
2) create a database connection
3) create a ModelMaker for the database
4) create a model for the ontology
93. 17
Table name      Content
jena_g1t1_stmt  Ontology data
jena_g1t0_reif  Processed ontology data
jena_sys_stmt   System metadata
jena_graph      Name and unique identifier of each user graph
jena_long_lit   Long character constants that are not easy to store directly in a statement
jena_long_uri   Long URIs that are not easy to store directly in a statement
jena_prefix     URI prefixes
100. The Challenges of Semantic Search Engine.
Alex N. Mugire. 2820150025
Beijing Institute of Technology
Digital Library 2015
101. Introduction.
Today's big problem in the information society is information
overload, a problem amplified by the huge size of the
World Wide Web (WWW). The Web has given us access to
millions of resources, irrespective of their physical location and
language.
As the Web continues to grow, we expect search engines to
have a hard time
maintaining the quality of retrieval results. Moreover, they only
access static content and ignore the dynamic part of the web
(pages generated from databases or from updated data).
This presentation therefore explains some of the major
challenges.
102. Challenges
The Availability of Content.
Currently, there is little Semantic Web content
available. Existing web content should be upgraded
to Semantic Web content, including static Hypertext
Markup Language (HTML) pages, existing XML
content, dynamic content, multimedia and web
services.
103. Scalability of Semantic Web Content.
Once we have the Semantic Web content, we need to worry about
how to manage it in a scalable manner, that is how to organize it,
where to store it and how to find the right content. Effective
exploitation of the linked data requires infrastructure that scales
to a large and ever growing collection of interlinked data.
104. Heterogeneity.
Effective exploitation of the data web requires an
effective mechanism for
Finding the relevant data sources,
Integrating data sources and
Combining elements from different data sources.
105. Uncertainty.
Incomplete representation of the user's needs and of content meaning.
Users cannot completely specify their need, so required data
is missed.
Example: “Find action films directed by some Hong Kong film director and starring
Chinese martial actors”. A query like this leaves the search area uncertain.
The semantic information in the search space is also incomplete.
106. Multilinguality
This problem already exists in the current Web, and should also
be tackled in the Semantic Web. Any Semantic Web approach
should provide facilities to access information in several
languages, allowing the creation and access to semantic web
search content independently of the native language of content
providers and users.
Language statistics from eMarketer (source: Vilaweb.com):
English 68.4%, Japanese 5.9%, German 5.8%, Chinese 3.9%,
French 3.0%, Spanish 2.4%, Russian 1.9%, Italian 1.6%, Portuguese 1.4%,
Korean 1.3%, Other 4.6%
108. The development of a domain ontology.
Ontologies play a big role in enabling the Semantic Web.
Semantic Web communities develop ontologies for their domains.
Each community includes many experts in the same domain, each with his
or her own social challenges. What does this create?
It requires a technical team to control, manage and coordinate
the collaboration, and providing such support is itself a
challenge. Most of today's ontology development tools, such as
Protégé-2000, are personal ontology editors and lack these
functionalities.
109. Unnecessary Adverts Challenge.
This has become a common problem: some websites
force you to interact with unnecessary adverts
whenever you log in. These adverts can disrupt your
search query, and can even harm your computer if you install some of
the advertised apps. They get in the way of what the user
intends to search for and waste a lot of time.
The snapshot below illustrates the issue.
110.
111. Conspiracy and Hoax Websites.
Some people benefit from deceiving others and publish
false information on topics to suit their own interests. This
leads a user's search astray.
Measures to counter this challenge are badly needed
to save time and speed up semantic search.
112. Conclusion,
I have tried to identify some of the challenges that affect semantic web search
today and will continue to affect it as search tools
advance worldwide.
These challenges affect daily users of web search and even mislead
people in certain cases; web tools should be developed with
these challenges in mind.
Finally, every website should offer at least an English version
as a second language, to make searching for information
easier.
113. References
Miriam Fernandez | KMI, Open University, UK Thanh Tran | Institute
AIFB, KIT, DE Peter Mika Yahoo Research, Spain
Protégé-2000. Protégé Project, http://protege.stanford.edu/index.html.
2004.
V. Richard Benjamins, Jesús Contreras, Oscar Corcho and Asunción
Gómez-Pérez
Intelligent Software Components, S.A.
www.isoco.com (Spain)
Wikipedia https://en.wikipedia.org/wiki/Semantic_search
114. Thanks for your Attention!
Questions are welcome.
Editor's notes
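The Resource Description Framework (RDF) is a W3C standard for describing web resources, such as the title, author, modification date, content and copyright information of a web page.
The first line of this RDF document is the XML declaration. The XML declaration is followed by the root element of the RDF document: <rdf:RDF>.
The xmlns:rdf namespace specifies that elements with the prefix rdf come from the namespace "http://www.w3.org/1999/02/22-rdf-syntax-ns#".
The xmlns:cd namespace specifies that elements with the prefix cd come from the namespace "http://www.recshop.fake/cd#".
The <rdf:Description> element contains the description of the resource identified by the rdf:about attribute.
The elements <cd:artist>, <cd:country>, <cd:company> and so on are properties of this resource.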
There are a large number of HTML documents on the Web. These documents are written for human readers, not for machine processing, and carry no semantic knowledge that a computer can use. They need semantic annotation so that the knowledge in them is standardized and can be processed by machines.
In general, semantic annotation is the process of adding a standardized knowledge representation to documents under the guidance of a domain ontology; that is, the textual knowledge in a document is described in RDF. The process is usually divided into two steps.
1) Mark the words in the document that correspond to concepts in the ontology as instances of those concepts, usually represented as RDF resources.
2) Find the relations among these instances that correspond to properties in the ontology. Two related instances and the relation between them are usually expressed as (R1, P, R2), that is, an RDF statement.
Step one: type tagging (TT)
Step two: relation extraction (RE)
Data preparation: domain ontology and domain vocabulary
The domain ontology is the core data of semantic annotation. It comprises the meta-information defined in the ontology, such as concepts and properties, and the instance data stored in the ontology in advance.
--- In the identification stage it provides the type tagger with type annotation information and the annotation list.
--- In the assembly phase it is used to verify the validity of knowledge triples.
The domain vocabulary is built by statistical methods. Its data come from the set of web pages downloaded by a focused crawler; after sentence splitting, the data are processed into a set of natural-language text sentences.
Begin
Step1: Apply recognition rules to identify general-purpose entities in the sentence, such as numbers, money and dates, and label their types.
Step2: Apply the annotation vocabulary list to match the words in the sentence exactly, and label the corresponding types.
Step3: Apply the N-gram segmentation technique to approximately match the words in the sentence against the words in the annotation vocabulary list; label the corresponding types for successful matches.
Step4: Adjust the word segmentation of the sentence so that words that already carry a type label are not split; if the segmenter has split such a word, merge the parts back into one word. Convert numbers, dates, money and similar words in the sentence into the canonical forms used in the ontology, and build a lookup table between the original and the new forms.
The dependency pairs in a sentence are linked with Gov as the parent node and Dep as the child node, forming a dependency tree that describes the dependency relations of the sentence.