Apache UIMA
   Introduction
Gestione delle Informazioni su Web (Web Information Management) - 2010/2011
                Tommaso Teofili
        tommaso [at] apache [dot] org
UIM ?

Unstructured Information Management

A wide topic: text, audio, video

  Different (possibly mixed) approaches
  (NLP, Machine Learning, IR, Ontologies,
  Automated reasoning, Knowledge Sources)

Apache UIMA
Apache Software Foundation

  Non-profit corporation

  “...provides organizational, legal, and financial
  support for a broad range of open source
  software projects...”

  “...collaborative and meritocratic development
  process...”

  “...pragmatic Apache License...”
Apache UIMA


Architectural framework to manage
unstructured data (Java, C++, ...)

Former IBM research project donated to ASF

OASIS Standard for unstructured
information management
Apache UIMA - Goals


“Our goal is to support a thriving community
of users and developers of UIMA
frameworks, tools, and annotators, facilitating
the analysis of unstructured content such as
text, audio and video”
Apache UIMA - bridging worlds
Apache UIMA - Overview


 UIMA supports the development, discovery,
 composition and deployment of multi-modal
 analytics for the analysis of unstructured
 information and its integration with search
 technologies
Apache UIMA -
 Multimodal Analysis
Multimodal analysis is the ability to
process a resource from several
“points of view”

Sample: a video stream for which we want to
extract subtitles and also automatically
recognize the actors involved

Here, though, we are mainly interested in text
Sample scenario
Content Management System containing free
text articles about movies

We want such articles to be automatically
enriched with metadata contained inside the
text (movies, directors, actors/actresses,
distribution) and linked to “similar” articles
(e.g., dealing with the same movies or actors)

So that we can search for “similar” articles
Sample scenario - articles
      about movies
Sample scenario

UIMA can help enrich articles with
metadata

Think of filling the instance variables of an
Article.java object with proper values

Then persisting it to a database to query
articles dealing with the same actors
Filling Article with metadata
Sample scenario - metadata
UIMA - Annotations
Apache UIMA -
       Annotation

The association of metadata, such as a label,
with a region of text (or other type of artifact).

For example, the label “Person” associated with a
region of text “Fred Center” constitutes an
annotation. We say “Person” annotates the span
of text from X to Y containing exactly “Fred
Center”
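The definition above can be illustrated with a minimal sketch in plain Java. SpanAnnotation and AnnotationDemo are hypothetical names made up for this example and are not the UIMA Annotation class; the sample sentence is also an assumption. The sketch only shows the idea of a label attached to a begin/end span.

```java
// Simplified stand-in for an annotation: a label attached to a [begin, end) span.
// NOT the UIMA Annotation class, only an illustration of the concept.
final class SpanAnnotation {
    final String label;
    final int begin; // inclusive character offset
    final int end;   // exclusive character offset

    SpanAnnotation(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

class AnnotationDemo {
    // Creates an annotation over the first occurrence of the given region.
    static SpanAnnotation annotate(String document, String label, String region) {
        int begin = document.indexOf(region);
        if (begin < 0) {
            throw new IllegalArgumentException("region not found: " + region);
        }
        return new SpanAnnotation(label, begin, begin + region.length());
    }

    public static void main(String[] args) {
        String doc = "Yesterday Fred Center joined the team.";
        SpanAnnotation person = annotate(doc, "Person", "Fred Center");
        System.out.println(person.label + " [" + person.begin + ", " + person.end + ") -> "
                + person.coveredText(doc));
    }
}
```

Storing offsets rather than a copy of the text keeps the annotation tied to its position in the artifact, which is what lets several annotations overlap the same region.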
Apache UIMA - Basic Steps

  Domain model definition

  Analysis pipeline definition

  Arrange components:

      Define components that read data from sources

      Add and customize analysis components: Patterns,
      Dictionaries, RegEx, External services, NLP, etc...

      Define components that write information to the target
      storage

  Analysis pipeline(s) execution
Defining domain model within
 UIMA using Type Systems

 Type System is the place where we describe which
 metadata we would like to extract

 Low representational gap

 Like almost everything in UIMA: described (and
 generated!) using XML

 Possible to define multiple Type Systems for different
 purposes
How does UIMA extract
     metadata?
Apache UIMA - Analysis
       Engines

 Basic UIMA building blocks

 Analyze a document

   Infer and record descriptive attributes
   (about documents/regions)

 Generating analysis results
Apache UIMA - AEs
Analysis Engines are described by a descriptor
(XML)

Can be Primitive (a single AE) or Aggregated (a
pipeline of AEs)

Analysis algorithms can be switched by changing
the descriptor instead of the code

Contain Type System definitions

Define Capabilities
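The difference between a primitive and an aggregated engine can be sketched in plain Java. ToyEngine, ToyAggregate and AggregateDemo are hypothetical names invented for this illustration; in UIMA itself the composition is declared in the XML descriptor rather than in code, so this is only a model of the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal engine contract: read the text, append results to a shared list.
interface ToyEngine {
    void process(String text, List<String> results);
}

// Aggregate engine: delegates to its children in the configured order.
class ToyAggregate implements ToyEngine {
    private final List<ToyEngine> delegates;

    ToyAggregate(List<ToyEngine> delegates) {
        this.delegates = delegates;
    }

    @Override
    public void process(String text, List<String> results) {
        for (ToyEngine engine : delegates) {
            engine.process(text, results);
        }
    }
}

class AggregateDemo {
    static List<String> analyze(String text) {
        // Primitive engine 1: records one result per whitespace-separated word.
        ToyEngine tokenizer = (t, r) -> {
            for (String word : t.trim().split("\\s+")) {
                r.add("token:" + word);
            }
        };
        // Primitive engine 2: builds on the results written by the previous engine.
        ToyEngine counter = (t, r) -> {
            long count = r.stream().filter(s -> s.startsWith("token:")).count();
            r.add("tokenCount=" + count);
        };
        ToyEngine pipeline = new ToyAggregate(List.of(tokenizer, counter));
        List<String> results = new ArrayList<>();
        pipeline.process(text, results);
        return results;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Apache UIMA"));
    }
}
```

Because the aggregate implements the same contract as its children, callers do not need to know whether they are running one engine or a whole pipeline.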
Apache UIMA -
AnalysisComponent API

 initialize : Performs (once) any startup tasks
 required by this component

 process : Process the resource to analyze
 generating analysis results (metadata)

 destroy : Frees all resources held; called only once,
 when the framework has finished using this component
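The call order of these three methods can be sketched as follows. LifecycleComponent is a hypothetical, simplified interface written for this example; the real AnalysisComponent methods take framework-specific arguments that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified lifecycle contract (not the real UIMA interface).
interface LifecycleComponent {
    void initialize();
    void process(String document);
    void destroy();
}

// Records every call so that the order of lifecycle events can be inspected.
class LoggingComponent implements LifecycleComponent {
    final List<String> log = new ArrayList<>();

    @Override
    public void initialize() {
        log.add("initialize");
    }

    @Override
    public void process(String document) {
        log.add("process:" + document);
    }

    @Override
    public void destroy() {
        log.add("destroy");
    }
}

class LifecycleDemo {
    // Drives a component the way a framework would:
    // set up once, process every document, tear down once.
    static void run(LifecycleComponent component, List<String> documents) {
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document);
            }
        } finally {
            component.destroy();
        }
    }

    public static void main(String[] args) {
        LoggingComponent component = new LoggingComponent();
        run(component, List.of("doc1", "doc2"));
        System.out.println(component.log);
    }
}
```

Expensive setup such as loading a dictionary belongs in initialize, so that process stays cheap for each document.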
Apache UIMA -
       Annotators
Analysis Engine algorithm

  Annotator : A software component
  implemented to produce and record
  annotations over regions of an artifact
  (e.g., text document, audio, and video)

  Annotators implement AnalysisComponent
  interface
Apache UIMA - Roles
AnalysisEngine : High level block responsible
for analysis - contains at least one
AnalysisComponent

AnalysisComponent : interface for any
component responsible for analyzing artifacts

Annotator : implementation of
AnalysisComponent responsible for creating
Annotations
Apache UIMA - AEs
Analysis Engines in a
      Pipeline
Apache UIMA - Analysis Results


  Where do analysis results end up?

  How do annotators represent and share their
  results?

  CAS - Common Analysis Structure

  It maintains typed indexes of extracted results
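To make the idea of typed indexes concrete, here is a toy model in plain Java. ToyCas is a hypothetical class written for this sketch, not the UIMA CAS API: it only shows one shared structure holding the document text plus results that can be looked up by type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a CAS: document text plus results indexed by type name.
class ToyCas {
    private final String text;
    private final Map<String, List<int[]>> index = new HashMap<>();

    ToyCas(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }

    // Records a result of the given type over the span [begin, end).
    void add(String type, int begin, int end) {
        index.computeIfAbsent(type, k -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // Returns the covered text of every result of the given type, in insertion order.
    List<String> select(String type) {
        List<String> covered = new ArrayList<>();
        for (int[] span : index.getOrDefault(type, Collections.emptyList())) {
            covered.add(text.substring(span[0], span[1]));
        }
        return covered;
    }
}

class ToyCasDemo {
    static ToyCas sample() {
        ToyCas cas = new ToyCas("Michael J. Fox stars as Marty McFly");
        cas.add("Token", 0, 7);    // written by a tokenizer
        cas.add("Person", 0, 14);  // written by a person annotator
        cas.add("Person", 24, 35);
        return cas;
    }

    public static void main(String[] args) {
        System.out.println(sample().select("Person"));
    }
}
```

A later annotator can ask for all Person results without knowing which earlier component produced them.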
Common Analysis Structure
Which algorithms lie
    beneath AEs?
Apache UIMA & NLP
NLP (Natural Language Processing) is a
theoretically motivated range of
computational techniques for analyzing and
representing naturally occurring texts at one
or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or
applications

It’s an AI discipline
Apache UIMA & NLP

“accomplish human-like language processing”

  Paraphrase an input text

  Translate the text into another language

  Answer questions about the contents of
  the text

  Draw inferences from the text
Apache UIMA & NLP


“an NLP-based IR system has the goal of
providing more precise, complete information
in response to a user’s real information
need”

various levels of processing
Apache UIMA -
       Approaches

Simplest : Write RegEx and Dictionaries and
mix them together

NLP-like : Tokenize -> Sentence identification
-> PoS Tagging -> Anaphora resolution ->
Named Entities Recognition -> Coreference
Identification ...
Analysis Engines in a
      Pipeline
NLP - Language Identification

  NLP takes advantage of language specific
  syntax, forms, rules and meanings

  Not easy to write language independent
  extraction algorithms

  Often this is the first block of NLP pipelines

  Techniques: Stopwords dictionaries, statistical
  models, etc.
NLP - Tokens and Sentences

  Humans learn words’ meaning in order to
  understand whole context semantics

  Split the target text into words to be able to
  analyze their meaning and role

  Discover sentences to later assign roles to
  each token

  Easier for English, Italian & co., but what
  about Chinese?
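As a rough sketch of these two steps, the JDK's java.text.BreakIterator can split English text into sentences and words. This is a swapped-in technique chosen for illustration, not the tokenizer used in the talk; SegmentationDemo and the sample sentence are assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SegmentationDemo {
    // Splits text with the given iterator and returns the trimmed, non-empty pieces.
    private static List<String> pieces(BreakIterator iterator, String text) {
        iterator.setText(text);
        List<String> out = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    static List<String> sentences(String text) {
        return pieces(BreakIterator.getSentenceInstance(Locale.ENGLISH), text);
    }

    // Keeps only pieces that start with a letter or digit, dropping punctuation.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : pieces(BreakIterator.getWordInstance(Locale.ENGLISH), text)) {
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "UIMA is a framework. It analyzes text.";
        System.out.println(sentences(text));
        System.out.println(tokens(text));
    }
}
```

In a real pipeline each sentence and token would be recorded as an annotation with offsets, so that later steps such as PoS tagging can refer back to them.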
NLP - PoS Tagging
Assign a “Part of Speech” (noun, adjective,
verb, etc.) to each token generated in the
previous step

Many language/domain specific patterns can
be discovered and exploited using just
PoS-tagged tokens and sentences
NLP - Chunking & Parsing
 Parse sentences into a meaningful set or
 tree of relationships

 Chunks are the sentence building blocks (e.g.,
 verbal forms)

 Parse tree highlights the structure of a
 sentence

 Can leverage logic analysis



 [Figure panels: chunking, parsing]
NLP - Named Entities
       Recognition
Answer the questions: where? when? who? how often? how much?

Identify key entities in the text

Common techniques: dictionaries, rules, statistical models
Debugging NER in UIMA
Using UIMA


Define TypeSystem

Define AnalysisEngine descriptor(s)

Implement Annotator(s)

Execute the UIMA pipeline
Sample scenario -
    extract actors
Tokenize article text

Identify sentences

Tag PoS

Identify Persons using regular expressions and PoS

Use Person annotations, Tokens’ PoS and Sentences
to extract relations between terms to identify
Persons who are also Actors
Sample scenario -
     extract persons
I have a dictionary of names (simple to find and/or build)

I use a dictionary based Annotator to extract annotations of
first names (NameAnnotation)

I don’t have a dictionary of surnames

Every time a matching name (a NameAnnotation) is found, we
look for one or more (to cover persons with a double name or
surname) subsequent tokens whose PoS is “undefined” or a
noun (but not a verb) and that start with an uppercase letter

If found, the name + token(s) sequence annotates a
Person (e.g., “Michael J. Fox”)
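The heuristic above can be sketched in plain Java. This is a simplified illustration under stated assumptions: the PoS check is left out (only the uppercase rule is applied), tokens are split on whitespace, and PersonHeuristic with its tiny FIRST_NAMES dictionary is hypothetical, not the DummyPersonAnnotator from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PersonHeuristic {
    // Hypothetical first-name dictionary; a real one would be loaded from a resource.
    static final Set<String> FIRST_NAMES = Set.of("Michael", "Christopher", "Lea");

    // True for middle initials such as "J."
    static boolean isInitial(String token) {
        return token.length() == 2
                && Character.isUpperCase(token.charAt(0))
                && token.charAt(1) == '.';
    }

    // Strips surrounding punctuation, but keeps the dot of an initial.
    static String clean(String token) {
        if (isInitial(token)) {
            return token;
        }
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    // True when trailing punctuation closes the phrase, so the name cannot continue.
    static boolean endsPhrase(String token) {
        return !isInitial(token) && token.matches(".*\\p{Punct}");
    }

    static List<String> findPersons(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> persons = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String first = clean(tokens[i]);
            if (!FIRST_NAMES.contains(first)) {
                continue;
            }
            StringBuilder person = new StringBuilder(first);
            int last = i;
            // Extend the name while the following tokens start with an uppercase letter.
            while (!endsPhrase(tokens[last]) && last + 1 < tokens.length) {
                String next = clean(tokens[last + 1]);
                if (next.isEmpty() || !Character.isUpperCase(next.charAt(0))) {
                    break;
                }
                person.append(' ').append(next);
                last++;
            }
            // A first name alone is not enough: at least one following token is required.
            if (last > i) {
                persons.add(person.toString());
                i = last;
            }
        }
        return persons;
    }

    public static void main(String[] args) {
        System.out.println(findPersons(
                "Michael J. Fox stars as Marty McFly, while Christopher Lloyd plays Doc."));
    }
}
```

Without the PoS check, a capitalized word that happens to follow a first name would be picked up as a surname, which is why the slide combines both conditions.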
From Persons to Actors
 Getting actors can be simple if we know that
 Persons who are also actors perform some well-known
 actions, or if widely used patterns exist

 e.g.: a Person “stars as” CharacterInTheMovie (which
 will possibly be tagged as a Person too) when he or she is
 also an Actor

 e.g.: if the snippet “CharacterInTheMovie (Person)”
 exists, then Person is usually an Actor

 Then we could build an ActorAnnotator
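The two patterns can be checked with simple string matching, as in this sketch. ActorHeuristic is a hypothetical helper, the list of persons is assumed to come from an earlier step, and the sample text is made up; a real ActorAnnotator would work on annotations rather than raw strings.

```java
import java.util.ArrayList;
import java.util.List;

class ActorHeuristic {
    // Returns the persons that match one of the two actor patterns in the text.
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            // Pattern 1: Person "stars as" CharacterInTheMovie
            boolean starsAs = text.contains(person + " stars as ");
            // Pattern 2: CharacterInTheMovie (Person)
            boolean inParentheses = text.contains("(" + person + ")");
            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly. "
                + "Doc Brown (Christopher Lloyd) builds the machine. "
                + "Robert Zemeckis directed it.";
        List<String> persons = List.of(
                "Michael J. Fox", "Marty McFly", "Christopher Lloyd", "Robert Zemeckis");
        System.out.println(findActors(text, persons));
    }
}
```

Both patterns are precision-oriented: they miss actors mentioned in other ways, but a match is rarely wrong.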
1. Define TypeSystem

Define at least one Type in the Type System for each
object in the domain model

It is useful to define more fine-grained Types (for the values of
type properties, called Features)

If we want to extract information about articles, we
create an Article type in the Type System

We’ll also need to create annotations/entities for movies,
actors, directors, etc.
2. Define AnalysisEngine descriptor

  Define which type system it’s going to use

  Define which capabilities the analysis engine
  has: which annotations it needs in order to work and
  which annotations it will (possibly) generate

  Define configuration parameters for the
  underlying algorithm

  Define resources needed by the analysis
  engine
3. Implement Annotator
 create a new class extending JCasAnnotator_ImplBase

 implement the process() method that actually does the
 job

    the algorithm implementation is (called) in the
    process() method

 you can use configuration parameters/resources defined
 in the descriptor

 optionally override the initialize() and destroy() methods
DummyPersonAnnotator
4. Execute the UIMA pipeline

  Instantiate the AnalysisEngine with its
  descriptor as a parameter

  Create a CAS which will contain the text to
  be analyzed and the annotations extracted

  Run the AnalysisEngine on the given CAS

  Browse results
Execute a UIMA pipeline
What’s next


UIMA Use cases

Using UIMA in search engines

Hands on code (assignment)

Weitere ähnliche Inhalte

Was ist angesagt?

[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...
[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...
[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...Product Camp Brasil
 
OSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchOSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchNETWAYS
 
Digital Marketing Portfolio
Digital Marketing PortfolioDigital Marketing Portfolio
Digital Marketing PortfolioBhagyashreekate
 
Technical Webinar: By the (Play) Book: The Agile Practice at OutSystems
Technical Webinar: By the (Play) Book: The Agile Practice at OutSystemsTechnical Webinar: By the (Play) Book: The Agile Practice at OutSystems
Technical Webinar: By the (Play) Book: The Agile Practice at OutSystemsOutSystems
 
Digital Marketing Service Provider
Digital Marketing Service ProviderDigital Marketing Service Provider
Digital Marketing Service ProviderFomaxtechnology
 
Jobs-To-Be-Done by C6 Bank Lead Product Designer
Jobs-To-Be-Done by C6 Bank Lead Product DesignerJobs-To-Be-Done by C6 Bank Lead Product Designer
Jobs-To-Be-Done by C6 Bank Lead Product DesignerProduct School
 
Building a Culture of Experimentation at HP
Building a Culture of Experimentation at HPBuilding a Culture of Experimentation at HP
Building a Culture of Experimentation at HPOptimizely
 
How to Utilize TikTok in Your Content Marketing Strategy
How to Utilize TikTok in Your Content Marketing StrategyHow to Utilize TikTok in Your Content Marketing Strategy
How to Utilize TikTok in Your Content Marketing Strategyintrotodigital
 
Social Media Consulting Services Contract
Social Media Consulting Services ContractSocial Media Consulting Services Contract
Social Media Consulting Services ContractPaul Bain
 
Data Driven Design Research Personas
Data Driven Design Research PersonasData Driven Design Research Personas
Data Driven Design Research PersonasTodd Zaki Warfel
 
Why everything is an A/B Test at Pinterest
Why everything is an A/B Test at PinterestWhy everything is an A/B Test at Pinterest
Why everything is an A/B Test at PinterestKrishna Gade
 
Complete Digital Marketing Proposal Format (1).pdf
Complete Digital Marketing Proposal Format (1).pdfComplete Digital Marketing Proposal Format (1).pdf
Complete Digital Marketing Proposal Format (1).pdfKen Khan
 
2.5.3 interfaz videojuego
2.5.3 interfaz videojuego2.5.3 interfaz videojuego
2.5.3 interfaz videojuegoDiana Hernandez
 
Webloft agency credentials
Webloft agency credentialsWebloft agency credentials
Webloft agency credentialsWebloft Concepts
 
Branding Design Proposal Template PowerPoint Presentation Slides
Branding Design Proposal Template PowerPoint Presentation SlidesBranding Design Proposal Template PowerPoint Presentation Slides
Branding Design Proposal Template PowerPoint Presentation SlidesSlideTeam
 
Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...
Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...
Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...Rosenfeld Media
 

Was ist angesagt? (20)

[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...
[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...
[Product Camp 2020] - Dados e métricas para líderes de produto - Pedro Galopp...
 
OSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchOSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearch
 
Digital Marketing Portfolio
Digital Marketing PortfolioDigital Marketing Portfolio
Digital Marketing Portfolio
 
Technical Webinar: By the (Play) Book: The Agile Practice at OutSystems
Technical Webinar: By the (Play) Book: The Agile Practice at OutSystemsTechnical Webinar: By the (Play) Book: The Agile Practice at OutSystems
Technical Webinar: By the (Play) Book: The Agile Practice at OutSystems
 
Marketo ppt (1)
Marketo   ppt (1)Marketo   ppt (1)
Marketo ppt (1)
 
Brand24
Brand24Brand24
Brand24
 
Digital Marketing Service Provider
Digital Marketing Service ProviderDigital Marketing Service Provider
Digital Marketing Service Provider
 
Jobs-To-Be-Done by C6 Bank Lead Product Designer
Jobs-To-Be-Done by C6 Bank Lead Product DesignerJobs-To-Be-Done by C6 Bank Lead Product Designer
Jobs-To-Be-Done by C6 Bank Lead Product Designer
 
Building a Culture of Experimentation at HP
Building a Culture of Experimentation at HPBuilding a Culture of Experimentation at HP
Building a Culture of Experimentation at HP
 
How to Utilize TikTok in Your Content Marketing Strategy
How to Utilize TikTok in Your Content Marketing StrategyHow to Utilize TikTok in Your Content Marketing Strategy
How to Utilize TikTok in Your Content Marketing Strategy
 
Social Media Consulting Services Contract
Social Media Consulting Services ContractSocial Media Consulting Services Contract
Social Media Consulting Services Contract
 
Data Driven Design Research Personas
Data Driven Design Research PersonasData Driven Design Research Personas
Data Driven Design Research Personas
 
UX 101: Personas
UX 101: PersonasUX 101: Personas
UX 101: Personas
 
Why everything is an A/B Test at Pinterest
Why everything is an A/B Test at PinterestWhy everything is an A/B Test at Pinterest
Why everything is an A/B Test at Pinterest
 
Complete Digital Marketing Proposal Format (1).pdf
Complete Digital Marketing Proposal Format (1).pdfComplete Digital Marketing Proposal Format (1).pdf
Complete Digital Marketing Proposal Format (1).pdf
 
2.5.3 interfaz videojuego
2.5.3 interfaz videojuego2.5.3 interfaz videojuego
2.5.3 interfaz videojuego
 
Project Manhattan Columbia
Project Manhattan ColumbiaProject Manhattan Columbia
Project Manhattan Columbia
 
Webloft agency credentials
Webloft agency credentialsWebloft agency credentials
Webloft agency credentials
 
Branding Design Proposal Template PowerPoint Presentation Slides
Branding Design Proposal Template PowerPoint Presentation SlidesBranding Design Proposal Template PowerPoint Presentation Slides
Branding Design Proposal Template PowerPoint Presentation Slides
 
Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...
Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...
Communicating and Establishing DesignOps as a New Function (Brennan Hartich a...
 

Andere mochten auch

UIMA
UIMAUIMA
UIMAotisg
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 
Shrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPShrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPlucenerevolution
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesTommaso Teofili
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on codeTommaso Teofili
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Optimizing Multilingual Search: Presented by David Troiano, Basis Technology
Optimizing Multilingual Search: Presented by David Troiano, Basis TechnologyOptimizing Multilingual Search: Presented by David Troiano, Basis Technology
Optimizing Multilingual Search: Presented by David Troiano, Basis TechnologyLucidworks
 
SENTIment POLarity Classification Task - Sentipolc@Evalita 2014
SENTIment POLarity Classification Task - Sentipolc@Evalita 2014 SENTIment POLarity Classification Task - Sentipolc@Evalita 2014
SENTIment POLarity Classification Task - Sentipolc@Evalita 2014 University of Torino
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...Nicolas Kourtellis
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaDiana Maynard
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Spark Summit
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingrohitnayak
 
Overview - IBM Big Data Platform
Overview - IBM Big Data PlatformOverview - IBM Big Data Platform
Overview - IBM Big Data PlatformVikas Manoria
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics ArchitectureArvind Sathi
 

Andere mochten auch (20)

UIMA
UIMAUIMA
UIMA
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 
Shrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPShrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLP
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Optimizing Multilingual Search: Presented by David Troiano, Basis Technology
Optimizing Multilingual Search: Presented by David Troiano, Basis TechnologyOptimizing Multilingual Search: Presented by David Troiano, Basis Technology
Optimizing Multilingual Search: Presented by David Troiano, Basis Technology
 
Pablo Duboue
Pablo DubouePablo Duboue
Pablo Duboue
 
SENTIment POLarity Classification Task - Sentipolc@Evalita 2014
SENTIment POLarity Classification Task - Sentipolc@Evalita 2014 SENTIment POLarity Classification Task - Sentipolc@Evalita 2014
SENTIment POLarity Classification Task - Sentipolc@Evalita 2014
 
Pycon16 draft
Pycon16 draftPycon16 draft
Pycon16 draft
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
 
NLP from scratch
NLP from scratch NLP from scratch
NLP from scratch
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social media
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
Rule engine
Rule engineRule engine
Rule engine
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Overview - IBM Big Data Platform
Overview - IBM Big Data PlatformOverview - IBM Big Data Platform
Overview - IBM Big Data Platform
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 

Ähnlich wie Apache UIMA Introduction

Analysis Mechanical system using Artificial intelligence
Analysis Mechanical system using Artificial intelligenceAnalysis Mechanical system using Artificial intelligence
Analysis Mechanical system using Artificial intelligenceanishahmadgrd222
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!Alex Kursov
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and SparkAudible, Inc.
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to pythonMohammed Rafi
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
4.language expert rendering unicode text on ascii editor for indian languages...
4.language expert rendering unicode text on ascii editor for indian languages...4.language expert rendering unicode text on ascii editor for indian languages...
4.language expert rendering unicode text on ascii editor for indian languages...EditorJST
 
Evaluation of Research Tools
Evaluation of Research ToolsEvaluation of Research Tools
Evaluation of Research ToolsHATS
 
NLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inNLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inKumari Naveen
 
lect36-tasks.ppt
lect36-tasks.pptlect36-tasks.ppt
lect36-tasks.pptHaHa501620
 
Elasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfElasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfInexture Solutions
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaSrikanth Vanama
 
Deep Dive into Apache MXNet on AWS
Deep Dive into Apache MXNet on AWSDeep Dive into Apache MXNet on AWS
Deep Dive into Apache MXNet on AWSKristana Kane
 
Text based search engine on a fixed corpus and utilizing indexation and ranki...
Text based search engine on a fixed corpus and utilizing indexation and ranki...Text based search engine on a fixed corpus and utilizing indexation and ranki...
Text based search engine on a fixed corpus and utilizing indexation and ranki...Soham Mondal
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch BasicsShifa Khan
 
role of lexical anaysis
role of lexical anaysisrole of lexical anaysis
role of lexical anaysisSudhaa Ravi
 
Osonto documentatie
Osonto documentatieOsonto documentatie
Osonto documentatiewondernet
 
Day2-Slides.ppt pppppppppppppppppppppppppp
Day2-Slides.ppt ppppppppppppppppppppppppppDay2-Slides.ppt pppppppppppppppppppppppppp
Day2-Slides.ppt ppppppppppppppppppppppppppratnapatil14
 

Ähnlich wie Apache UIMA Introduction (20)

Analysis Mechanical system using Artificial intelligence
Analysis Mechanical system using Artificial intelligenceAnalysis Mechanical system using Artificial intelligence
Analysis Mechanical system using Artificial intelligence
 
AI & ML
AI & MLAI & ML
AI & ML
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
4.language expert rendering unicode text on ascii editor for indian languages...
4.language expert rendering unicode text on ascii editor for indian languages...4.language expert rendering unicode text on ascii editor for indian languages...
4.language expert rendering unicode text on ascii editor for indian languages...
 
Evaluation of Research Tools
Evaluation of Research ToolsEvaluation of Research Tools
Evaluation of Research Tools
 
NLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inNLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful in
 
lect36-tasks.ppt
lect36-tasks.pptlect36-tasks.ppt
lect36-tasks.ppt
 
Plc part 2
Plc  part 2Plc  part 2
Plc part 2
 
Elasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfElasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdf
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_Vanama
 
Parser
ParserParser
Parser
 
Deep Dive into Apache MXNet on AWS
Deep Dive into Apache MXNet on AWSDeep Dive into Apache MXNet on AWS
Deep Dive into Apache MXNet on AWS
 
Text based search engine on a fixed corpus and utilizing indexation and ranki...
Text based search engine on a fixed corpus and utilizing indexation and ranki...Text based search engine on a fixed corpus and utilizing indexation and ranki...
Text based search engine on a fixed corpus and utilizing indexation and ranki...
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
role of lexical anaysis
role of lexical anaysisrole of lexical anaysis
role of lexical anaysis
 
Osonto documentatie
Osonto documentatieOsonto documentatie
Osonto documentatie
 
Day2-Slides.ppt pppppppppppppppppppppppppp
Day2-Slides.ppt ppppppppppppppppppppppppppDay2-Slides.ppt pppppppppppppppppppppppppp
Day2-Slides.ppt pppppppppppppppppppppppppp
 

Mehr von Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr Tommaso Teofili
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache HamaTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 

Apache UIMA Introduction

  • 1. Apache UIMA Introduction Gestione delle Informazioni su Web - 2010/2011 Tommaso Teofili tommaso [at] apache [dot] org
  • 2. UIM ? Unstructured Information Management A wide topic: text, audio, video Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources) Apache UIMA
  • 3. Apache Software Foundation Non-profit corporation “...provides organizational, legal, and financial support for a broad range of open source software projects...” “...collaborative and meritocratic development process...” “...pragmatic Apache License...”
  • 4. Apache UIMA Architectural framework to manage unstructured data (Java, C++, ...) Former IBM research project donated to ASF OASIS Standard for unstructured information management
  • 5. Apache UIMA - Goals “Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
  • 6. Apache UIMA - bridging worlds
  • 7. Apache UIMA - Overview UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies
  • 8. Apache UIMA - Multimodal Analysis Multimodal Analysis means the ability to process a resource from various “points of view”. Sample: a video stream from which we want to extract subtitles and also automatically recognize the actors involved. Here, though, we are mainly interested in text...
  • 9. Sample scenario Content Management System containing free text articles about movies We want such articles to be automatically enriched with metadata contained inside the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e.: dealing with same movies or actors) So that we can search for “similar” articles
  • 10. Sample scenario - articles about movies
  • 11. Sample scenario UIMA can help enrich articles with metadata. Think of filling the instance variables of an Article.java class with proper values, then persisting it to a database so that we can query for articles dealing with the same actors
  • 13. Sample scenario - metadata
  • 15. Apache UIMA - Annotation The association of metadata, such as a label, with a region of text (or other type of artifact). For example, the label “Person” associated with a region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
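To make the definition concrete, here is a minimal plain-Java sketch of an annotation as a labeled span. `SpanAnnotation` and the sample sentence are hypothetical names made up for illustration; they are not UIMA classes. The sketch assumes the common offset convention of an inclusive begin (X) and an exclusive end (Y).

```java
// Hypothetical stand-in for an annotation: a label attached to a span of text.
final class SpanAnnotation {
    final String type;  // the label, e.g. "Person"
    final int begin;    // offset of the first covered character (inclusive)
    final int end;      // offset just past the last covered character (exclusive)

    SpanAnnotation(String type, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

final class SpanAnnotationDemo {
    public static void main(String[] args) {
        // Made-up sample document.
        String document = "Fred Center attended the premiere.";
        int begin = document.indexOf("Fred Center");
        SpanAnnotation person =
                new SpanAnnotation("Person", begin, begin + "Fred Center".length());
        System.out.println(person.type + " annotates [" + person.begin + ", "
                + person.end + "): " + person.coveredText(document));
    }
}
```

Storing offsets rather than a copy of the text lets several annotations share the same document and overlap freely.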
  • 16. Apache UIMA - Basic Steps Domain model definition Analysis pipeline definition Arrange components: Define components draining data from sources Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc... Define components outputting information on target storages Analysis pipeline(s) execution
  • 17. Defining domain model within UIMA using Type Systems Type System is the place where we describe which metadata we would like to extract Low representational gap Like almost everything in UIMA: described (and generated!) using XML Possible to define multiple Type Systems for different purposes
  • 18. How does UIMA extract metadata?
  • 19. Apache UIMA - Analysis Engines Basic UIMA building blocks Analyze a document Infer and record descriptive attributes (about documents/regions) Generating analysis results
  • 20. Apache UIMA - AEs Analysis Engines are described by a descriptor (XML) Can be Primitive (a single AE) or Aggregate (a pipeline of AEs) Analysis algorithms can be switched by changing the descriptor instead of the code Contain TypeSystem definitions Define Capabilities
  • 21. Apache UIMA - AnalysisComponent API initialize : Performs (once) any startup tasks required by this component process : Process the resource to analyze generating analysis results (metadata) destroy : Frees all resources held, called only once when it is finished using this component
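The three lifecycle calls can be sketched in plain Java as follows. This is only an illustration of the initialize / process / destroy contract: `SketchComponent` and `YearAnnotatorSketch` are hypothetical names, the signatures are simplified and are not the real UIMA interface, and the year-matching regular expression is an arbitrary example algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plain-Java mirror of the three lifecycle methods (not the UIMA interface).
interface SketchComponent {
    void initialize();                                // called once, before any document
    void process(String text, List<String> results);  // called once per document
    void destroy();                                   // called once, at shutdown
}

// Example component: records four-digit years found in the text.
final class YearAnnotatorSketch implements SketchComponent {
    private Pattern yearPattern;

    @Override
    public void initialize() {
        // Startup work happens here so that process() stays cheap.
        yearPattern = Pattern.compile("\\b(19|20)\\d{2}\\b");
    }

    @Override
    public void process(String text, List<String> results) {
        Matcher m = yearPattern.matcher(text);
        while (m.find()) {
            results.add("Year[" + m.start() + "," + m.end() + "]=" + m.group());
        }
    }

    @Override
    public void destroy() {
        // Release whatever initialize() acquired.
        yearPattern = null;
    }
}

final class LifecycleDemo {
    public static void main(String[] args) {
        SketchComponent annotator = new YearAnnotatorSketch();
        List<String> results = new ArrayList<>();
        annotator.initialize();
        annotator.process("Back to the Future was released in 1985.", results);
        annotator.process("A sequel followed in 1989.", results);
        annotator.destroy();
        results.forEach(System.out::println);
    }
}
```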
  • 22. Apache UIMA - Annotators Analysis Engine algorithm Annotator : A software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video) Annotators implement AnalysisComponent interface
  • 23. Apache UIMA - Roles AnalysisEngine : High level block responsible for analysis - contains at least one AnalysisComponent AnalysisComponent : interface for any component responsible for analyzing artifacts Annotator : implementation of AnalysisComponent responsible for creating Annotations
  • 25. Analysis Engines in a Pipeline
  • 26. Apache UIMA - Analysis Results Where do analysis results end up? How do annotators represent and share their results? CAS - Common Analysis Structure Maintains typed indexes of extracted results
  • 28. Which algorithms lie under AEs?
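The idea of a shared structure holding the text plus typed indexes of results can be sketched like this. `MiniCas` is a hypothetical, heavily simplified stand-in written for this example; the real CAS offers far more (type system, feature structures, multiple views).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, much-simplified stand-in for a CAS:
// the document text plus analysis results indexed by type.
final class MiniCas {
    private final String documentText;
    private final Map<String, List<int[]>> index = new LinkedHashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    // Records a result of the given type covering [begin, end).
    void add(String type, int begin, int end) {
        index.computeIfAbsent(type, t -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // Returns the covered text of every result of the given type, in insertion order.
    List<String> coveredTexts(String type) {
        List<String> texts = new ArrayList<>();
        for (int[] span : index.getOrDefault(type, Collections.emptyList())) {
            texts.add(documentText.substring(span[0], span[1]));
        }
        return texts;
    }
}

final class MiniCasDemo {
    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly.";
        MiniCas cas = new MiniCas(text);
        int fox = text.indexOf("Michael J. Fox");
        int marty = text.indexOf("Marty McFly");
        cas.add("Person", fox, fox + "Michael J. Fox".length());
        cas.add("Person", marty, marty + "Marty McFly".length());
        System.out.println(cas.coveredTexts("Person"));
    }
}
```

Because every component reads from and writes to the same structure, a later annotator can query results by type without knowing which earlier component produced them.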
  • 29. Apache UIMA & NLP NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications It’s an AI discipline
  • 30. Apache UIMA & NLP “accomplish human-like language processing” Paraphrase an input text Translate the text into another language Answer questions about the contents of the text Draw inferences from the text
  • 31. Apache UIMA & NLP “an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need” various levels of processing
  • 32. Apache UIMA - Approaches Simplest : Write RegEx and Dictionaries and mix them together NLP-like : Tokenize -> Sentence identification -> PoS Tagging -> Anaphora resolution -> Named Entities Recognition -> Coreference Identification ...
  • 33. Analysis Engines in a Pipeline
  • 34. NLP - Language Identifying NLP takes advantage of language specific syntax, forms, rules and meanings Not easy to write language independent extraction algorithms Often this is the first block of NLP pipelines Techniques: Stopwords dictionaries, statistical models, etc.
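As a rough illustration of the stopword-dictionary technique, the sketch below counts how many stopwords of each language occur in a text and picks the language with the most hits. The class name and the tiny stopword lists are assumptions made for this example; real lists are much longer, and statistical models are generally more robust on short texts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StopwordLanguageGuesser {
    // Tiny illustrative stopword lists (assumed for this sketch).
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "is", "of", "and", "a", "in")));
        STOPWORDS.put("it", new HashSet<>(Arrays.asList("il", "la", "di", "e", "un", "che")));
    }

    // Returns the language whose stopwords occur most often, or "unknown" if none occur.
    static String guess(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The director of the movie is a legend"));
        System.out.println(guess("Il regista di un film che vince"));
    }
}
```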
  • 35. NLP - Tokens and Sentences Humans learn words’ meaning in order to understand whole context semantics Split the target text in words to be able to analyze their meaning and role Discover sentences to later assign roles to each token Easiest for English, Italian & co. but what about Chinese?
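For languages that separate words with whitespace, sentence and token splitting can be sketched with the JDK's `java.text.BreakIterator`. This is a stand-in chosen for the example, not the tokenizer the slides use, and `TokenSentenceSketch` is a hypothetical name. It does not address the Chinese question raised above, where word boundaries are not marked by spaces.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenSentenceSketch {
    // Splits text into sentences using the JDK's locale-aware boundary rules.
    static List<String> sentences(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Splits text into word tokens, dropping whitespace and punctuation segments.
    static List<String> tokens(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.charAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The movie was a hit. Critics loved it.";
        for (String sentence : sentences(text, Locale.ENGLISH)) {
            System.out.println(sentence + " -> " + tokens(sentence, Locale.ENGLISH));
        }
    }
}
```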
  • 36. NLP - PoS Tagging Assign a “Part of Speech” (noun, adjective, verb, etc.) to each token generated in the previous step Many language/domain specific patterns can be discovered and exploited just with PoS-tagged tokens and sentences
  • 37. NLP - Chunking & Parsing Parse sentences into a meaningful set or tree of relationships Chunks are the sentence building blocks (i.e. verbal forms) Parse tree highlights the structure of a sentence Can leverage logic analysis [figure: chunking, parsing]
  • 38. NLP - Named Entities Recognition Answer the questions: where? when? who? how often? how much? Identify key entities in the text Common techniques: dictionaries, rules, statistical models
  • 40. Using UIMA Define TypeSystem Define AnalysisEngine descriptor(s) Implement Annotator(s) Execute the UIMA pipeline
  • 41. Sample scenario - extract actors Tokenize article text Identify sentences Tag PoS Identify Persons using regular expressions and PoS Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms to identify Persons who are also Actors
  • 42. Sample scenario - extract persons I have a dictionary of first names (simple to find and/or build). I use a dictionary-based Annotator to extract annotations of first names (NameAnnotation). I don’t have a dictionary of surnames. Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to account for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter. If found, the name + token(s) sequence annotates a Person (e.g. “Michael J. Fox”)
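The heuristic above can be approximated in plain Java as follows. This is a simplified sketch under stated assumptions: there is no PoS tagger here, so the PoS check is replaced by the uppercase-initial test alone; tokens are split on whitespace; and `PersonSketch` with its three-entry first-name dictionary is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PersonSketch {
    // Hypothetical first-name dictionary; a real one would be loaded from a resource.
    private static final Set<String> FIRST_NAMES =
            new HashSet<>(Arrays.asList("Michael", "Christopher", "Lea"));

    // Returns every "first name + capitalized token(s)" sequence found in the text.
    static List<String> findPersons(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!FIRST_NAMES.contains(tokens[i])) {
                i++;
                continue;
            }
            StringBuilder person = new StringBuilder(tokens[i]);
            boolean extended = false;
            int j = i + 1;
            while (j < tokens.length && Character.isUpperCase(tokens[j].charAt(0))) {
                String clean = stripPunctuation(tokens[j]);
                person.append(' ').append(clean);
                extended = true;
                j++;
                if (!clean.equals(tokens[j - 1])) {
                    break; // trailing punctuation closed the name
                }
            }
            if (extended) {
                persons.add(person.toString());
            }
            i = j;
        }
        return persons;
    }

    // Drops trailing punctuation, but keeps the dot of an initial such as "J.".
    static String stripPunctuation(String token) {
        String clean = token.replaceAll("[,;:!?]+$", "");
        if (clean.endsWith(".") && clean.length() > 2) {
            clean = clean.substring(0, clean.length() - 1);
        }
        return clean;
    }

    public static void main(String[] args) {
        System.out.println(findPersons("Michael J. Fox stars as Marty McFly."));
    }
}
```

Without the PoS filter this version over-generates, for example when a first name is followed by a capitalized word that is not part of the name; that is the gap the PoS check in the slide is meant to close.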
  • 43. From Persons to Actors Getting actors can be simple if we know that Persons who are also actors perform some well-known actions, or if there are widely used patterns. E.g.: a Person who “stars as” CharacterInTheMovie (which may end up tagged as a Person too) is also an Actor. E.g.: if the snippet “CharacterInTheMovie (Person)” exists, then that Person is usually an Actor. We could then build an ActorAnnotator
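The “stars as” pattern can be sketched with a regular expression. Note the substitution: instead of consuming Person annotations produced by an earlier step, this example approximates a Person as a run of capitalized tokens (allowing initials such as “J.”). `ActorSketch` and the sample sentence are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ActorSketch {
    // One name token: a capitalized word ("Fox") or an initial ("J.").
    private static final String TOKEN = "(?:\\p{Lu}\\p{L}+|\\p{Lu}\\.)";
    // A name: one or more name tokens separated by single spaces.
    private static final String NAME = TOKEN + "(?: " + TOKEN + ")*";
    private static final Pattern STARS_AS =
            Pattern.compile("(" + NAME + ") stars as (" + NAME + ")");

    // Maps each actor to the character they play, for every "X stars as Y" match.
    static Map<String, String> actorsToCharacters(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = STARS_AS.matcher(text);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "In the movie, Michael J. Fox stars as Marty McFly.";
        actorsToCharacters(text).forEach((actor, character) ->
                System.out.println(actor + " is an Actor playing " + character));
    }
}
```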
  • 44. 1. Define TypeSystem Define at least one Type inside the Type System for each object inside the domain model Useful to define more fine-grained Types (for values of type properties, called Features) If we want to extract information about articles we create an Article type inside the Type System We’ll also need to create annotations/entities for movies, actors, directors, etc...
  • 45. 2. Define AnalysisEngine descriptor Define which type system it’s going to use Define which capabilities the analysis engine has: which annotations it needs in order to work and which annotations it will (possibly) generate Define configuration parameters for the underlying algorithm Define resources needed by the analysis engine
  • 46. 3. Implement Annotator create a new class extending JCasAnnotator_ImplBase implement the process() method that actually does the job the algorithm implementation is (called) in the process() method you can use configuration parameters/resources defined in the descriptor optionally override initialize() and destroy() methods
  • 48. 4. Execute the UIMA pipeline Instantiate the AnalysisEngine with its descriptor as a parameter Create a CAS which will contain the text to be analyzed and the annotations extracted Run the AnalysisEngine on the given CAS Browse results
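The execution steps above (set up the engine, create the shared result container for a text, run, browse results) can be mimicked in plain Java. This sketch deliberately avoids the UIMA API: `PipelineSketch`, `Step`, and the two toy steps are hypothetical stand-ins, and a plain map plays the role of the CAS.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PipelineSketch {
    // Hypothetical stand-in for one analysis engine in the pipeline.
    interface Step {
        void apply(String text, Map<String, List<String>> results);
    }

    // Step 1: records every word of the text under the "Token" type.
    static void tokenize(String text, Map<String, List<String>> results) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        results.put("Token", tokens);
    }

    // Step 2: reads the tokens produced by step 1 and records the capitalized ones.
    static void markCapitalized(String text, Map<String, List<String>> results) {
        List<String> capitalized = new ArrayList<>();
        for (String t : results.getOrDefault("Token", Collections.emptyList())) {
            if (Character.isUpperCase(t.charAt(0))) {
                capitalized.add(t);
            }
        }
        results.put("Capitalized", capitalized);
    }

    // The pipeline definition: steps in the order they must run.
    static List<Step> defaultSteps() {
        List<Step> steps = new ArrayList<>();
        steps.add(PipelineSketch::tokenize);
        steps.add(PipelineSketch::markCapitalized);
        return steps;
    }

    // Creates the shared result store, runs every step on it, and returns it for browsing.
    static Map<String, List<String>> run(String text, List<Step> steps) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (Step step : steps) {
            step.apply(text, results);
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, List<String>> results = run("Fox plays Marty", defaultSteps());
        results.forEach((type, values) -> System.out.println(type + ": " + values));
    }
}
```

Order matters: the second step depends on what the first one wrote, which is the kind of dependency the capabilities section of a descriptor is meant to declare.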
  • 49. Execute a UIMA pipeline
  • 50. What’s next UIMA Use cases Using UIMA in search engines Hands on code (assignment)