Tovek Presentation by Livio Costantini

Livio Costantini
Tovek’s Tools: Software to Access Unstructured Information
Auhofstrasse 25/2, 1130 Wien
E-mail: Livio.Costantini@Gmail.com
Tel.: 0043-1-8794274
Mobile: 0043-664-9919154

AGENDA

Distinctions between Data Retrieval and Text Retrieval (1/2)
The type of query
  Data Retrieval: direct and precise ("I want to know X"); the correct answer is there, and you know it.
  Text Retrieval: indirect and ambiguous ("I want to know about X"); a "correct" answer to your question may not even exist.
Query representation
  Data Retrieval: the formal search query and the user's information need are closely mapped. Deterministic relation.
  Text Retrieval: probabilistic relation between a formal query and the representation of an adequate answer.
Criterion for success
  Data Retrieval: correctness. Data retrieval systems (DBMS) should retrieve the correct answers.
  Text Retrieval: utility. As there are no or few "correct" answers, text retrieval systems ideally retrieve the most useful documents.
Representing data or information (effect of semantic indeterminacy)
  Data Retrieval: the ways of representing data are finite; there aren't too many variants for the term "ZIP code."
  Text Retrieval: the ways of representing documents are virtually unlimited, as language is ambiguous.

Distinctions between Data Retrieval and Text Retrieval (2/2)
A query's target area
  Data Retrieval: because there aren't many ways of representing data (a unit of information), the number of possible alternative queries for data is small, and the target area is also small.
  Text Retrieval: many ways of representing documents mean many more possible queries for a given document. The semantic target area is large, and in a large collection the number of documents retrieved can overwhelm.
Zero or no useful results
  Data Retrieval: the end-point of searching; it means that the data really doesn't exist in the database.
  Text Retrieval: a negative search result does not necessarily mean that there are no useful documents in the database.
Types of searches
  Data Retrieval: just one to support: exact matching.
  Text Retrieval: at least three types to support: sample ("give me a few documents about X"), exhaustive ("give me everything about X"), and existence ("are there any documents about X at all?").
Delegation of searching
  Data Retrieval: fairly easy to do; queries are straightforward and not too dependent on context.
  Text Retrieval: open to interpretation; it's difficult to know exactly what the query was intended to retrieve.

The Data Retrieval and Document Retrieval Models
The most prominent differences all arise from a more fundamental problem: the representation of indeterminacy. This indeterminacy results from the effects of semantic ambiguity and system ("corpus") size. The differences influence the design, use and management of the two kinds of system.
Semantic ambiguity is a measure of the number of different senses a word or phrase has.
System (corpus) size is the number of times that a given word or phrase is used to represent an item of information.

Generation of Text Retrieval Technology: Intellectual Text Processing
Definition and classification
Key Benefits
Criticisms

Generation of Text Retrieval Technology: Automatic Text Processing
Boolean Retrieval Model: AND; OR; NOT; proximity operator. The rank order of retrieved documents is arbitrary; no relevance is assigned to the retrieved documents.
Full Text Index (STAIRS - IBM): a data structure that stores a list of the occurrences and positions of each atomic search criterion (word), typically in the form of a hash table or binary tree, allowing full-text search.
Natural Language Processing: based on syntactic and morphological analysis, usually supported by a controlled dictionary. Automatic semantic network representation and free-text queries.
Probabilistic Approach: probabilistic models treat the process of document retrieval as a multistage random experiment. Similarities are thus represented as probabilities. Relevance is usually calculated by examining how many times a query term appears in a document, compensated by the frequency of the query term in the collection (term frequency–inverse document frequency; tf–idf).
Concept Retrieval: a search technology that makes it possible to search for subjects or concepts rather than individual words or phrases in documents. Retrieved documents are ranked by relevance. Usually the user is responsible for specifying the concept definition.
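
The full-text index and the Boolean model described above can be sketched in a few lines of Python. This is only an illustration: the function names (`tokenize`, `build_index`, `docs_with`) and the three sample documents are assumptions made for the example, not part of any product mentioned in the slides.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional full-text index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, word):
    """Return the set of documents that contain the word."""
    return set(index.get(word.lower(), {}))

# Hypothetical sample collection.
docs = {
    1: "Nuclear research reactor",
    2: "Export control of nuclear material",
    3: "Export statistics",
}
index = build_index(docs)

# Boolean operators are plain set operations; the result carries no ranking.
nuclear_and_export = docs_with(index, "nuclear") & docs_with(index, "export")
nuclear_or_export = docs_with(index, "nuclear") | docs_with(index, "export")
export_not_nuclear = docs_with(index, "export") - docs_with(index, "nuclear")
```

Because the results are unordered sets, the sketch also shows why the Boolean model assigns no relevance to what it retrieves.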

What is the goal of Text Retrieval?
Determining relevance
Capabilities: extract meaningful, useful information while withholding non-relevant information

Measuring Retrieval Effectiveness - Precision & Recall

Results Analysis of Precision and Recall of a Query
A document is considered relevant if it is judged useful by the user who originated the query.
Explanation
Low precision and low recall: the search engine has retrieved many irrelevant documents and has missed many important results.
High precision and low recall: the system was selective and has retrieved a good number of relevant documents, but has missed some important results.
High recall and low precision: the system has retrieved a good number of relevant results, but has also retrieved many irrelevant results in the process.
High precision and high recall: indicates good retrieval performance. Providing access to all and only those documents which are relevant is the criterion for the most efficient search engine.
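
As a minimal sketch, precision and recall can be computed with the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name and the document identifiers below are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents returned by the search
    relevant:  ids of the documents judged useful by the user
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 6 relevant ones exist, 2 overlap.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
```

Here half of the retrieved documents are relevant, but only a third of the relevant documents were found, so the query would fall into the low-recall cases listed above.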

The Problem
About 80% of information is held in unstructured textual documents. Imagine it as an iceberg: even if you can see the whole, you can become frustrated by the inability to see what is inside. Using standard tools and basic search engines (or not using them at all), you can find only the proverbial tip of the iceberg.

Verity Query Language (VQL)
Weights between 0.1 and 1.0 are assigned to each keyword or phrase based on its relative importance in meeting the search objective.
Understanding
Evidence operators: specify either a basic word search or an expanded word list based on the original search word. Operators: Word; Stem; Thesaurus; Wildcard; Soundex; Typo.
Proximity operators: a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. Operators: Phrase; Sentence; Paragraph; Near/n; Order; Before; After.
Relational operators: search in the document fields (metadata) defined in the collection (such as Title, Author, Published Date, etc.) for filtering. Numeric or textual searches are accepted depending on the format of the fields. Operators: Equal (=); Greater than or equal (>=); Less than or equal (<=), etc.; Contains; Ends; etc.
Concept operators: combine the meaning of search elements to identify a concept in a document. Retrieved documents are relevance-ranked. Operators: Accrue; And; Or; All; Any.

Evidence Operators: expand the keyword into a list of related words
Understanding
<STEM>: selects documents that include one or more variations of the search word you specify, e.g.: <STEM>export. Note: by default, words and phrases are stemmed.
<WORD>: selects documents that include one or more instances of only the word you specify, without locating stemmed variations, e.g.: <WORD>export. NB: this searches for documents that contain the word “export” but not “exporting”, “exported”, etc.
<CASE>: performs a case-sensitive search based on the case of the word or phrase specified, e.g. EMIS (acronym for electromagnetic isotope separation) and not emis (the past participle of the French verb émettre): <CASE> EMIS
Question mark (?): specifies one of any alphanumeric character, as in organi?ation, which locates organization and organisation.
Asterisk (*): specifies zero or more of any alphanumeric character, as in test*, which locates not only test and tests but also testimony, testosterone, etc.
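
The two wildcard characters can be illustrated by translating a pattern into a regular expression. This is a sketch of the matching rule as the slide states it, not the engine's actual implementation; the helper names and the small vocabulary are assumptions for the example.

```python
import re

def wildcard_to_regex(pattern):
    """Translate ? (exactly one alphanumeric) and * (zero or more) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z0-9]")
        elif ch == "*":
            parts.append("[A-Za-z0-9]*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def expand(pattern, vocabulary):
    """Return the vocabulary words that match the whole wildcard pattern."""
    rx = wildcard_to_regex(pattern)
    return [word for word in vocabulary if rx.fullmatch(word)]

# Hypothetical word list taken from an index.
vocabulary = ["organization", "organisation", "organic",
              "test", "tests", "testimony", "contest"]

spelling_variants = expand("organi?ation", vocabulary)
test_family = expand("test*", vocabulary)
```

The second result shows the trade-off noted on the slide: `test*` also pulls in unrelated words such as "testimony".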

Proximity Operators: specify the relative location of specific words
Understanding
<PHRASE>: selects documents that include a phrase you specify. A phrase is a grouping of two or more words that occur next to each other, e.g.: <PHRASE> (export, control) or “export control”
<SENTENCE>: selects documents that include all the words you specify in the same sentence, e.g.: nuclear <SENTENCE> research
<PARAGRAPH>: selects documents that include all the words you specify in the same paragraph, e.g.: nuclear <PARAGRAPH> proliferation
<NEAR/N>: selects documents containing all specified search terms within N words of each other, where N is an integer, e.g.: nuclear <NEAR/5> weapon
<ORDER>: specifies that search elements must occur in the same order as in the query statement. It is always placed in front of an operator, e.g.: ballistic <ORDER><NEAR/5> missile
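
A NEAR/N test, with and without the ORDER modifier, might look like the sketch below. It assumes that distance is the difference between word positions; the slides do not say exactly how the engine counts words, so treat that as an assumption, along with the function names and sample sentence.

```python
import re

def positions(tokens, word):
    """Return every position at which the word occurs."""
    return [i for i, token in enumerate(tokens) if token == word.lower()]

def near(text, first, second, n, ordered=False):
    """True if the two words occur within n word positions of each other.

    With ordered=True, `first` must also appear before `second`.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            gap = j - i
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= n:
                return True
    return False

# Hypothetical sentence: "weapon" is at position 1, "nuclear" at position 5.
text = "The weapon programme relied on nuclear material"

unordered_match = near(text, "nuclear", "weapon", 5)
ordered_match = near(text, "nuclear", "weapon", 5, ordered=True)
too_far = near(text, "nuclear", "weapon", 3)
```

The unordered query matches because the words are four positions apart, while the ordered one fails because "weapon" precedes "nuclear" in the text.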

Concept Operators: combine the meaning of search elements (words) to find a concept
Understanding
<ACCRUE>: selects documents that include at least one of the search elements you specify. The more search elements are present, the higher the score, e.g.: plutonium <ACCRUE> Pu, or plutonium, Pu. Documents with both terms are listed first.
<AND>: selects documents that include all the search elements you specify. Documents are relevance-ranked, e.g.: Germany <AND> hot cells
<OR>: selects documents that include at least one of the search elements you specify, e.g.: electromagnetic isotope separation <OR> EMIS <OR> calutron
<NOT>: the <NOT> modifier followed by a word or phrase excludes documents that contain that word or phrase, e.g.: missile <AND> <NOT> short range
Note: AND, OR and NOT are treated as operators by default and do not require angle brackets. To use them as literal words, enclose them in double quotes. All other operators must be enclosed in angle brackets.

Relational Operators: search in the metadata (such as Title, Date, etc.) defined in the collection
Understanding
Title: selects documents that include the search elements you specify in the Title.
Date: numeric or textual searches are accepted depending on the format of the field: Equal (=); Greater than or equal (>=); Less than or equal (<=), etc.; Contains; Ends; etc.
Sort option: the resulting documents can be sorted by score, date or title, in ascending or descending order.

Concept Retrieval - Fuzzy Logic Approach
Characteristic: the process of searching for subjects or concepts rather than individual words or phrases.
In building up a concept (Topic tree), an expert familiar with the subject of the search assigns weights to search terms.
A Topic tree provides a convenient means of encapsulating the knowledge of an expert in a hierarchical structure.
Advantages

Design a Topic Tree - Knowledge Elicitation Process
Extracting knowledge from subject-area experts
Roles: Subject Area Expert; Knowledge Engineer
The Knowledge Engineer extracts and organizes the knowledge of the Subject Area Expert and expresses it in a hierarchical format which can be used in a “Topic Tree” environment.

Topic Tree – An Introduction

The Importance of Topic Trees
Topic Trees are corporate intellectual property, or business rules, to be reused by employees.
Topic Trees are available to end users as a shared resource.
Topic Trees provide a convenient means of encapsulating the expert’s knowledge in a hierarchical structure.
Topic Trees include all the components of the Verity Query Language (Concept and Proximity Operators, Modifiers and Weights).
Topic Trees have the ability to understand the context of a text and retrieve documents related to a “topic” of interest.

Topic tree - Accrue Operator
The Accrue operator applies a “the more, the better” approach when assigned to a topic or to a search: the more of the children specified by a topic using the Accrue operator are found in a document, the more closely the document is considered to be related to your search. Documents that contain the largest number of highly weighted children are ranked highest in the result list.
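
The “the more, the better” behaviour can be sketched as follows. The slides do not give the formula the engine uses, so this example substitutes a simple probabilistic sum, 1 - Π(1 - w), over the matched children. That formula, the weights and the sample documents are all assumptions chosen only to show the ranking effect.

```python
def accrue_score(children, document_words):
    """Combine the weights of the matched children so that more matches score higher.

    children: {term: weight between 0 and 1}
    Uses 1 - product(1 - weight) over matched terms (an assumed formula).
    """
    miss = 1.0
    for term, weight in children.items():
        if term in document_words:
            miss *= 1.0 - weight
    return 1.0 - miss

# Hypothetical topic with two weighted children.
children = {"plutonium": 0.8, "pu": 0.5}

# Hypothetical documents, already reduced to sets of words.
documents = {
    "both": {"plutonium", "pu", "oxide"},
    "one": {"plutonium", "reactor"},
    "none": {"export", "control"},
}

ranking = sorted(documents,
                 key=lambda name: accrue_score(children, documents[name]),
                 reverse=True)
```

The document containing both children is ranked first, the one with a single highly weighted child second, and the one with no match last.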

Topic tree – Sentence; Any; Word; Stem Operators
The Word operator performs the basic search and selects documents that include one or more instances of the exact word specified as the search element.
The Stem operator widens the search to include the expanded word list based on the original search word. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem.
The Sentence operator indicates that the children of a sub-topic must be located within the same sentence in a document.
The Any operator retrieves a document that contains at least one of the search elements specified.

Topic Trees – A knowledge representation of the “Ferrari” concept
Topic trees are predefined queries in tree-like form that can be used for searching, mining and taxonomy classification.

Topic tree - Algorithm for Scoring
Rationale: a numerical score is assigned to each document in the search result list, representing how well the document meets the information need of the user who issued the search.
Weight: represents the relative contribution of a child (keyword) to the overall score produced by a Topic tree. The designer attributes importance weights to sub-concepts to reflect the fact that some words, phrases or other concepts are more important than others in expressing the overall concept.
Operators: used in conjunction with the weight of the child (keyword) to compute the score for each topic-node during the search.
Hierarchical Structure: interprets the relationships between the topic-nodes and determines the overall score of the topic tree. The position of each topic-node within the hierarchical structure influences the calculation of the score.
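
How weights, operators and hierarchy interact can be sketched with a small recursive scorer. The combination rules used here (minimum for AND, maximum for OR, a probabilistic sum for ACCRUE, and multiplying each node's result by its weight) are assumptions for illustration; the slides do not specify the engine's actual arithmetic. The `Node` class and the Ferrari-style tree are likewise hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    weight: float = 1.0
    operator: str = "WORD"  # "WORD" for a leaf keyword; "AND", "OR" or "ACCRUE" for a topic
    children: list = field(default_factory=list)

def score(node, words):
    """Score a topic-node bottom-up against the set of words in a document."""
    if node.operator == "WORD":
        raw = 1.0 if node.name in words else 0.0
    else:
        child_scores = [score(child, words) for child in node.children]
        if not child_scores:
            return 0.0
        if node.operator == "AND":
            raw = min(child_scores)
        elif node.operator == "OR":
            raw = max(child_scores)
        elif node.operator == "ACCRUE":
            miss = 1.0
            for s in child_scores:
                miss *= 1.0 - s
            raw = 1.0 - miss
        else:
            raise ValueError(f"unknown operator: {node.operator}")
    # The node's weight scales its contribution to the parent.
    return node.weight * raw

# Hypothetical topic tree.
tree = Node("ferrari-topic", 1.0, "ACCRUE", [
    Node("brand", 0.9, "OR", [Node("ferrari", 1.0), Node("maranello", 0.6)]),
    Node("racing", 0.5, "AND", [Node("formula", 1.0), Node("one", 1.0)]),
])

strong = score(tree, {"ferrari", "formula", "one"})
weak = score(tree, {"maranello"})
```

A document that satisfies both sub-topics scores higher than one that only matches a low-weight keyword, which is the effect the weights and the tree position are meant to produce.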

Topic tree - Quality Assurance procedures and Testing process
Quality Assurance: procedures to check the performance of the topic trees against a “representative” collection of reports, amongst which the reports dealing with the concepts covered by the topic trees have been identified in advance.
Enrich the original keywords: retrieved reports are examined for words that may serve as new keywords.
Proximity operator: excessively restrictive proximity conditions may prevent combinations of keywords from contributing to the retrieval of a document in the manner expected.
Keywords that are too general: consider whether some keywords should be eliminated or used with new or more restrictive proximity conditions.
Measuring Retrieval Effectiveness - Precision & Recall

Probabilistic Approach in Text Retrieval System
The probability that a specific document will be judged relevant to a specific query is based on the assumption that words are distributed differently in relevant and non-relevant documents. The probability formula is usually derived from Bayes' theorem.
Issues: Synonymy; Polysemy; Search keywords; Semantic sensitivity
Polysemy: the same word has multiple meanings, so a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably want different sets of documents.
Search keywords: search keywords must precisely match document terms; word substrings (stemming) might result in a "false positive match".
Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
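
One common way to write the Bayes-based relevance estimate is shown below. It is the textbook binary independence formulation, given as an illustration rather than as the formula used by any product in these slides. The symbols are assumptions introduced here: $R$ and $\bar{R}$ denote relevance and non-relevance, $d$ the document, $q$ the query, and $p_t$, $u_t$ the probabilities that term $t$ occurs in a relevant or a non-relevant document.

```latex
% Bayes' theorem applied to relevance
P(R \mid d, q) = \frac{P(d \mid R, q)\, P(R \mid q)}{P(d \mid q)}

% Ranking by the odds of relevance removes the common denominator
O(R \mid d, q) = \frac{P(R \mid d, q)}{P(\bar{R} \mid d, q)}
             = \frac{P(d \mid R, q)}{P(d \mid \bar{R}, q)} \cdot \frac{P(R \mid q)}{P(\bar{R} \mid q)}

% Assuming terms occur independently, documents can be ranked by
\mathrm{score}(d, q) = \sum_{t \in q \cap d} \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}
```

A term contributes a positive amount only when it is more likely to appear in relevant documents than in non-relevant ones ($p_t > u_t$), which is the "distributed differently" assumption stated on the slide.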

Tovek’s Tools - Enterprise Search Engine & Analytical System
Desktop & Client-Server Application
Tovek Index Manager – Collection Builder
Tovek Agent – Enterprise Search Engine
Tovek Editor – Create and Maintain Topic Trees
Tovek Info Rating – Context Analysis & Data Visualisation Tool
Tovek Harvester – Mine Document Context

Tovek Index Manager – Collection Builder

Agent - Enterprise Search Engine
Capabilities
Simple and Highly Structured Query: ability to accept all known legacy search methods, including keyword search with the support of Evidence, Proximity, Relational and Concept operators, alone or combined as Topic trees.
Query By Example: users can submit sample data as input, and the system returns references to related documents ranked by relevance.
Refine By Example: based on the results of the natural language retrieval, users can quickly refine their search to focus precisely on the context they require.
Automatic Clustering: analyzes large sets of documents, or even users' queries, and automatically groups together documents that have a high likelihood of being relevant to the same information need.

Tovek Agent - User Interface – Automatic Clustering
Ability to create a hierarchy of collections, which can be used individually or concatenated.

Tovek Agent - Selecting Collections
Find documents that satisfy specific criteria, e.g. nuclear, test.
Screenshot labels: document fields or metadata; selecting collections; documents found; total documents; result list; search pane and search elements.

Tovek Agent – Collection Fields
View / Fields on the result list heading

Tovek Agent – Query History & Query in Time
Query history (Tools / Query History): possibility to execute an old query

Tovek Agent – View document

Tovek Agent - Document Properties
Examine the matched (highlighted) words in the selected document.

Tovek Agent - Extract adjacent words
Capacity to extract highlighted words from selected documents, together with the words adjacent to (preceding or following) the highlighted ones.
Search criteria: President

Tovek Agent - Multiple languages search capability

Tovek Agent – Exporting documents (Menu Tools)

Tovek Agent - Export of selected documents
Ability to export selected documents from the result list in different formats (XML, HTML, text) for further analysis.

Tovek Query Editor
For advanced users, to construct more complex queries and create topic trees.

InfoRating
InfoRating is an analytical and data visualization tool that assists users in performing context analysis, together with a graphical representation of aggregated documents.
Provides a context analysis by matching an extracted list of documents against a set of queries.
Documents in the result list can be visualized in multiple ways.
Information is presented graphically in ways that make it easy to observe trends and general characteristics.
Organizes documents by the criteria and categories the user has requested; the conclusions are then delivered to the user.
Categorizes documents into navigable structures to assist the user in finding relevant information and in understanding the context of a collection.

Connection Chart
Shows the relationships between queries and documents, together with their scores.
Comments can be added to the queries and/or documents.
Screenshot labels: switches for the main pane; query pane; main pane; documents pane.

Cross Matrix
Upper panel: number of documents matching each possible pair of queries.
Lower panel: documents matching the selected element of the Cross Matrix.
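
The idea behind the Cross Matrix can be sketched as a pairwise intersection of query results. The query names and document ids are made up for the example, and the function is an illustration of the concept, not InfoRating's implementation.

```python
from itertools import combinations

def cross_matrix(matches):
    """Return {(query_a, query_b): documents matching both queries}.

    matches: {query_name: set of matching document ids}
    Each unordered pair of queries appears once.
    """
    return {(a, b): matches[a] & matches[b]
            for a, b in combinations(sorted(matches), 2)}

# Hypothetical results of three queries.
matches = {
    "nuclear": {1, 2, 4},
    "export": {2, 3, 4},
    "missile": {4, 5},
}

cells = cross_matrix(matches)                                 # lower panel: the documents
counts = {pair: len(docs) for pair, docs in cells.items()}    # upper panel: the counts
```

Selecting a cell such as `("export", "nuclear")` then corresponds to listing the documents stored for that pair.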

Summary Graph
Visualisation of the query results in combination with different fields (Source or Date), e.g. queries within weeks.

Harvester
Harvester Approach: the goal of Harvester is to automatically extract “relevant terms” (e.g. keywords) from a given corpus of information.
Generation of descriptors
Automatic assignment of keywords
Each keyword is assigned a weight (relevance): the tf–idf weight (term frequency–inverse document frequency).
Time dependent (keywords and descriptors)
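
A minimal tf–idf keyword extractor is sketched below. It uses one common variant of the weight (relative term frequency multiplied by the natural logarithm of N / document frequency); the slides do not say which variant Harvester applies, so the formula, function names and tiny corpus are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_keywords(corpus, top_n=3):
    """Return {doc_id: [(word, weight), ...]} with the top_n weighted words per document."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    result = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        weights = {w: (count / len(words)) * math.log(n_docs / df[w])
                   for w, count in tf.items()}
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        result[doc_id] = ranked[:top_n]
    return result

# Hypothetical three-document corpus.
corpus = {
    "a": "reactor fuel reactor safety",
    "b": "fuel export control",
    "c": "export licence control",
}
keywords = tfidf_keywords(corpus)
```

Words that are frequent in one document but rare in the rest of the corpus, such as "reactor" in document "a", receive the highest weights, while a word that occurs in every document would receive a weight of zero.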

Harvester – Show & Hide Cluster
Menu: Chart / Show Clusters; Chart / Hide All

Harvester – Part of a Cluster

Visualization of a “Descriptor” - Centrifuge - and the relation with Partner words
Screenshot labels: Word List; Word History; Descriptors; Words Neighborhood; Working Pane; Partner Words; Result List

Descriptors can be used as input queries in concert with Tovek Agent.

Visualization of a “Descriptor” - IAEA - and the relation with Partner words

Visualization of a “Descriptor” - Temelin - and the relation with Partner words
