Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
◯Atsushi Keyaki†, Jun Miyazaki†
†: Tokyo Institute of Technology, Japan
SAC2017 IAR
Objective
•  Accurate part-of-speech (POS) tagging of Web queries
  o  POS tags are beneficial for accurate IR
    •  Different search strategy per POS tag [1]
    •  Identifying unnecessary data with POS tags [2]
  o  Example
    •  Query: “discovery channel” (TV program: proper nouns)
    •  Doc: “Victim’s discovery is broadcasted by the channel” (“discovery” and “channel” are common nouns)
    •  A POS tag mismatch may cause a false positive
[1] Crestani et al.: “Short Queries, Natural Language and Spoken Document Retrieval: Experiments at Glasgow University”, TREC-6, 1998.
[2] Chowdhury and McCabe: “Improving Information Retrieval Systems using Part of Speech Tagging”, Univ. of Maryland, 1993.
Difficulty in query POS tagging
•  Characteristics of Web queries
  o  Length is short (composed of a few words)
  o  Capitalization is missing
  o  Word order is fairly free
  o  As a result, it is difficult to correctly identify POS tags with existing morphological analysis tools, which are developed for natural language
•  Solution of related work [3][4]
  o  Utilizing the results of sentence-level morphological analysis
    •  Sentences are based on natural language grammar
    •  Results of sentence-level morphological analysis are accurate
  o  The POS tag frequently assigned in sentences is employed for the query
    •  Sentence: “We stayed at Ritz Carlton.” is tagged We/pronoun stayed/verb at/particle Ritz Carlton/proper noun
    •  Query: “ritz carlton” is therefore tagged as a proper noun
[3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
Our approach
•  Related study
  o  Using sentence-level morphological analysis of
    •  Search results [3]
    •  Snippets from search logs [4]
  o  Considering just the frequency of assigned POS tags
  o  Relies on a small amount of highly relevant information
  o  User feedback/search logs are not always available
•  Our approach
  o  Taking account of global statistics from a large corpus
    •  Easily available, considering the long tail
  o  Considering co-occurrence of query terms
April 5, 2017, SAC2017 IAR
[3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
Preliminary investigation
•  Morphological analysis of Web queries
  o  Queries
    •  TREC Web track topics (200 queries from 2009-2012)
      o  Oracle POS tags are annotated by three assessors (high agreement, Kappa: 0.98)
      o  Assessors refer to the description (information need)
  o  Morphological analysis tool
    •  Stanford Log-linear Part-Of-Speech Tagger [5]
  o  Models
    •  Default model
    •  Caseless model
      o  Does not consider capitalization information during training
      o  Tries to solve the “Capitalization is missing” problem
[5] Toutanova et al.: "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", NAACL 2003.
Summary of error analysis
•  Default model
  o  Only about half of the query terms were assigned correct POS tags
  o  Almost all proper nouns were NOT identified
    •  72% of proper nouns were mistakenly tagged as common nouns
    •  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
•  Caseless model
  o  Around 75% of query terms were assigned correct POS tags
  o  Many proper nouns were identified
    •  Some common nouns were mistakenly identified as proper nouns
•  Errors caused by partially applied grammatical rules
  o  “lower heart rate”: the verb “lower” is tagged as an adjective (rule: adjectives come before common nouns)
  o  “gs pay rate”: the common noun “pay” is tagged as a verb (rule: verbs come after a subject)
Proposed POS tagging
•  Summary of the error analysis
  o  Proper nouns/common nouns cannot be distinguished
    •  Problem 1: Capitalization is missing
  o  Grammatical rules are mistakenly applied
    •  Problem 2: Word order is fairly free
•  Related study
  o  Relies on a small amount of highly relevant information
    •  Problem 3: User feedback and user logs are not always available
•  Approach
  o  Sol-P1: Sentence-level morphological analysis
  o  Sol-P2: A POS tagging method not based on word order
  o  Sol-P3: Large-scale Web corpus (easily available)
  o  Building the term-POS database (TPDB)
    •  Morphological analysis is applied offline
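Processing flow
[Figure: processing flow]
•  Offline: sentences from the large-scale Web corpus go through morphological analysis, and the resulting term/POS sequences are inserted into the TPDB
  o  S1: tA tB tC is tagged tA/P1 tB/P2 tC/P3
  o  S2: tA tC tD is tagged tA/P1 tC/P4 tD/P5
  o  S3: tC tE tA tF is tagged tC/P3 tE/P1 tA/P2 tF/P1
  o  S4: tB tD is tagged tB/P2 tD/P3
•  Online: for the query “tA tC”, the entries containing query terms (S1, S2, S3) are retrieved from the TPDB and passed to the scoring method
  o  Term-POS pairs shown for the query: tA/P1 tC/P3 and tA/P1 tC/P4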
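The offline/online split in the processing flow can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: the names `build_tpdb` and `retrieve` are hypothetical, the tagged sentences are the S1 to S4 placeholders from the figure, and the sketch assumes each term carries one POS tag per sentence.

```python
from collections import defaultdict

# Tagged sentences from the processing-flow figure
# (stand-ins for the output of offline morphological analysis).
TAGGED = {
    "S1": [("tA", "P1"), ("tB", "P2"), ("tC", "P3")],
    "S2": [("tA", "P1"), ("tC", "P4"), ("tD", "P5")],
    "S3": [("tC", "P3"), ("tE", "P1"), ("tA", "P2"), ("tF", "P1")],
    "S4": [("tB", "P2"), ("tD", "P3")],
}

def build_tpdb(tagged_sentences):
    """Offline step (hypothetical helper).

    Returns (entries, index): entries maps sentence id -> {term: pos},
    index maps term -> set of sentence ids containing that term.
    """
    entries = {}
    index = defaultdict(set)
    for sid, pairs in tagged_sentences.items():
        entries[sid] = dict(pairs)  # assumption: one POS tag per term per sentence
        for term, _pos in pairs:
            index[term].add(sid)
    return entries, index

def retrieve(entries, index, query_terms):
    """Online step: return entries containing at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return {sid: entries[sid] for sid in sorted(hits)}

entries, index = build_tpdb(TAGGED)
hits = retrieve(entries, index, ["tA", "tC"])
print(list(hits))  # ['S1', 'S2', 'S3']
```

For the query “tA tC”, S4 is not retrieved because it contains neither query term, which matches the three entries shown in the figure.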
Scoring for POS tagging
•  Design principle
  o  POS tags that appear frequently in the corpus are assigned to queries
  o  POS tags of a sentence are emphasized when the sentence contains more kinds of query terms
    •  Co-occurrence of query terms is a useful clue
•  Steps of scoring
  o  Retrieving entries that contain query terms from the TPDB
  o  Breaking the query down into pairs of query terms
    •  Query: “tA tB tC” gives {tA tB}, {tA tC}, {tB tC}
  o  Counting entries per term-POS pair for each query term pair
    •  Query term pair: {tA tB}

       term-POS pair   freq.   normalized freq.
       tA/P1 tB/P2     5       0.33 (5/15)
       tA/P1 tB/P3     3       0.20 (3/15)
       tA/P2 tB/P4     7       0.47 (7/15)

    •  freq. is the number of entries containing both term-POS pairs (e.g., tA/P1 and tB/P2)
  o  Scoring with three proposed methods
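The counting and normalization steps above can be sketched like this. The function name `pair_tables` and the entry representation (one `{term: pos}` dict per TPDB entry) are assumptions made for illustration; the synthetic entries are constructed only to reproduce the 5 / 3 / 7 counts from the {tA tB} table.

```python
from collections import Counter
from itertools import combinations

def pair_tables(entries, query_terms):
    """For each pair of query terms, count entries per POS combination.

    Returns {(term_a, term_b): {(pos_a, pos_b): (freq, normalized_freq)}}.
    """
    tables = {}
    for a, b in combinations(query_terms, 2):
        counts = Counter()
        for entry in entries:
            if a in entry and b in entry:  # entry contains both query terms
                counts[(entry[a], entry[b])] += 1
        total = sum(counts.values())
        tables[(a, b)] = {
            tags: (freq, freq / total) for tags, freq in counts.items()
        }
    return tables

# Synthetic entries mirroring the slide's {tA tB} example.
entries = (
    [{"tA": "P1", "tB": "P2"}] * 5
    + [{"tA": "P1", "tB": "P3"}] * 3
    + [{"tA": "P2", "tB": "P4"}] * 7
)
tables = pair_tables(entries, ["tA", "tB", "tC"])
for tags, (freq, norm) in tables[("tA", "tB")].items():
    print(tags, freq, round(norm, 2))
```

Pairs with no co-occurring entries, such as {tA tC} in this synthetic data, simply get an empty table.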
Three proposed methods
•  MaxFreq
  o  The most frequently appearing POS tag (highest freq.) is assigned
•  MostLikelihood
  o  The POS tag with the highest normalized freq. is assigned
  o  MaxFreq may be affected by frequently appearing terms
•  AllCombi
  o  The POS tag with the highest sum of term-POS frequencies is assigned
  o  MaxFreq and MostLikelihood only focus on the POS tag with the highest frequency/normalized frequency
  o  More diversified context, including the long tail, can be considered
•  Example for the query “tA tB tC”

   pair    term-POS pair   freq.   normalized freq.
   tA:tB   tA/P1 tB/P2     5       0.33
   tA:tB   tA/P1 tB/P3     3       0.20
   tA:tB   tA/P2 tB/P4     7       0.47
   tA:tC   tA/P1 tC/P2     3       0.43
   tA:tC   tA/P3 tC/P3     4       0.57
   tB:tC   tB/P1 tC/P2     5       0.5
   tB:tC   tB/P2 tC/P2     5       0.5

•  POS tag assigned to tA in the example
  o  MaxFreq: tA/P2 (highest freq., 7)
  o  MostLikelihood: tA/P3 (highest normalized freq., 0.57)
  o  AllCombi: tA/P1 (sum of freq. 5 + 3 + 3 = 11, versus 7 for P2 and 4 for P3)
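The three scoring methods can be checked against the example tables with a short sketch. The function names are hypothetical, AllCombi is assumed to sum raw frequencies (the slide says "sum of the term-POS frequency"), and ties are broken by table order, which the slides do not specify.

```python
from collections import defaultdict

# Pair tables from the example: rows are ((term, pos), (term, pos), freq).
TABLES = {
    ("tA", "tB"): [(("tA", "P1"), ("tB", "P2"), 5),
                   (("tA", "P1"), ("tB", "P3"), 3),
                   (("tA", "P2"), ("tB", "P4"), 7)],
    ("tA", "tC"): [(("tA", "P1"), ("tC", "P2"), 3),
                   (("tA", "P3"), ("tC", "P3"), 4)],
    ("tB", "tC"): [(("tB", "P1"), ("tC", "P2"), 5),
                   (("tB", "P2"), ("tC", "P2"), 5)],
}

def rows_for(term):
    """Yield (pos, freq, normalized_freq) for every table row mentioning term."""
    for rows in TABLES.values():
        total = sum(freq for _, _, freq in rows)
        for left, right, freq in rows:
            for t, pos in (left, right):
                if t == term:
                    yield pos, freq, freq / total

def max_freq(term):
    """MaxFreq: POS tag of the row with the highest raw frequency."""
    return max(rows_for(term), key=lambda r: r[1])[0]

def most_likelihood(term):
    """MostLikelihood: POS tag of the row with the highest normalized frequency."""
    return max(rows_for(term), key=lambda r: r[2])[0]

def all_combi(term):
    """AllCombi: POS tag with the highest summed frequency over all rows."""
    sums = defaultdict(int)
    for pos, freq, _ in rows_for(term):
        sums[pos] += freq
    return max(sums, key=sums.get)

print(max_freq("tA"), most_likelihood("tA"), all_combi("tA"))  # P2 P3 P1
```

The three methods disagree on tA in this example, which is the point of the slide: AllCombi picks P1 because it accumulates evidence from several lower-frequency rows.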
Experiment
•  Datasets
  o  TREC Web track topics
    •  200 queries from 2009-2012
  o  MS-251
    •  Microsoft search log used in related studies [3][4]
    •  Skipped here because the trend is the same
•  Large-scale Web corpus
  o  ClueWeb09 Category B
    •  50 million Web documents
•  Evaluation methods
  o  Proposed methods: MaxFreq, MostLikelihood, AllCombi
  o  Existing methods: Stanford, Caseless, SingleFreq
    •  SingleFreq: the most frequently appearing POS tag is assigned
[3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
POS-tagged Web track topics
•  AllCombi: the highest for all terms, common noun, proper noun
  o  Good at judging nouns
  o  Considering more diversified context is useful
    •  Global statistics from a large-scale Web corpus are useful
•  MaxFreq and MostLikelihood: the highest for common noun, verb, adjective
•  Every proposed method significantly outperformed Caseless

Precision       All query terms  Common noun  Proper noun  Verb  Adjective  Sign test with Caseless
MaxFreq         .814             .825         .833         .769  .647       p < 0.05
MostLikelihood  .814             .825         .833         .769  .647       p < 0.05
AllCombi        .821             .825         .860         .714  .629       p < 0.01
Caseless        .763             .789         .751         .733  .690
SingleFreq      .702             .775         .670         .533  .581
Stanford        .547             .550         1.0          .722  .451
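The significance column refers to a sign test against Caseless. A minimal sketch of a two-sided exact sign test is given below; the slides do not state the unit of comparison or the test variant, so the function name `sign_test_p`, the win/loss counts, and the decision to drop ties are all illustrative assumptions.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value.

    wins/losses: number of paired comparisons where the proposed method is
    better/worse than the baseline. Ties are assumed to be dropped beforehand.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least as extreme in one tail under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts, not taken from the experiments.
p = sign_test_p(wins=9, losses=1)
print(f"p = {p:.4f}")  # p = 0.0215
```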
Effect of the proposed method
•  AllCombi correctly identified many query terms
•  Some errors caused by partially applied grammatical rules still remain
•  Negative effects of the proposed method
  o  “president” in the corpus is often identified as a proper noun
    •  Need to normalize term weights
[Table: Stanford vs. AllCombi on the example queries “obama”, “india”, “ritz carlton”, “lower heart rate”, “gs pay rate”, “president united states”]
Conclusion
•  POS tagging of Web queries
  o  Results of sentence-level morphological analysis
  o  Large-scale Web corpus
  o  Three proposed scoring methods
•  Experiments
  o  Considering more diversified context is useful
  o  The best proposed method differs by POS tag
  o  Outperformed existing tools and existing studies
•  Future work
  o  A combination of the proposed methods may improve accuracy
  o  Database schema design for fast POS tagging
Default model

POS tags         Precision  Recall
Common noun      .550       .985
Proper noun      1.0        .010
Verb             .722       .867
Adjective        .451       .958
All query terms  .547       .547

•  Nearly half of the query terms were assigned correct POS tags
•  Almost all proper nouns were not identified
  o  72% of proper nouns were mistakenly tagged as common nouns
  o  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
•  Errors caused by partially applied grammatical rules
  o  “lower heart rate”: the verb “lower” is tagged as an adjective (rule: adjectives come before common nouns)
  o  “gs pay rate”: the common noun “pay” is tagged as a verb (rule: verbs come after a subject)
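The pattern in the table (proper-noun precision of 1.0 with recall near zero) follows directly from how per-tag precision and recall are computed. The sketch below uses a hypothetical helper `precision_recall` and a four-term toy example with Penn Treebank style tags; the data is invented for illustration and is not from the experiments.

```python
def precision_recall(gold, predicted, tag):
    """Per-tag precision and recall over aligned gold/predicted tag lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == tag and p == tag)
    n_pred = sum(1 for p in predicted if p == tag)  # terms the tagger labeled `tag`
    n_gold = sum(1 for g in gold if g == tag)       # terms whose oracle label is `tag`
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall

# Toy example: one of two proper nouns (NNP) is mis-tagged as a common noun (NN).
gold      = ["NNP", "NNP", "NN", "VB"]
predicted = ["NN",  "NNP", "NN", "VB"]

print(precision_recall(gold, predicted, "NNP"))  # (1.0, 0.5)
print(precision_recall(gold, predicted, "NN"))   # (0.5, 1.0)
```

A tagger that rarely predicts proper nouns keeps proper-noun precision high while recall collapses, and the missed proper nouns lower the precision of the common-noun class they are assigned to.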
Caseless model

POS tags         Precision  Recall
Common noun      .789       .769
Proper noun      .751       .640
Verb             .733       .733
Adjective        .690       .833
All query terms  .763       .763

•  Precision and recall improved overall
•  Many proper nouns were identified
  o  31% of proper nouns are mistakenly tagged as common nouns
  o  Precision is decreased
•  Harm from partially applied grammatical rules still exists
  o  “discovery channel store” (slide labels: common noun, proper noun)
MS-251
•  The trend of the proposed methods is the same
  o  The ratio of POS tags affected the order
    •  AllCombi: good at judging nouns
    •  MaxFreq, MostLikelihood: good at judging verbs and adjectives
  o  The proposed methods are better than [4]

Method                  Precision
MaxFreq                 .890
MostLikelihood          .895
AllCombi                .893
The best method in [4]  .858

[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.

More Related Content

What's hot

Info 2402 irt-chapter_4
Info 2402 irt-chapter_4Info 2402 irt-chapter_4
Info 2402 irt-chapter_4Shahriar Rafee
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsBenjamin Habegger
 
Linguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documentsLinguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documentsSimon Dew
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolutionivan provalov
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologiesProf. Wim Van Criekinge
 
Deep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender SystemsDeep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender SystemsHuiji Gao
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologiesProf. Wim Van Criekinge
 
Deep natural language processing in search systems
Deep natural language processing in search systemsDeep natural language processing in search systems
Deep natural language processing in search systemsBill Liu
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionRrubaa Panchendrarajan
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchNoemi Derzsy
 
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...Association for Computational Linguistics
 
PhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud RahmanPhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud RahmanMasud Rahman
 

What's hot (17)

Info 2402 irt-chapter_4
Info 2402 irt-chapter_4Info 2402 irt-chapter_4
Info 2402 irt-chapter_4
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
Linguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documentsLinguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documents
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolution
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Deep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender SystemsDeep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender Systems
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Deep natural language processing in search systems
Deep natural language processing in search systemsDeep natural language processing in search systems
Deep natural language processing in search systems
 
2017 biological databases_part1_vupload
2017 biological databases_part1_vupload2017 biological databases_part1_vupload
2017 biological databases_part1_vupload
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
KIT Graduiertenkolloquium 11.05.2016
KIT Graduiertenkolloquium 11.05.2016KIT Graduiertenkolloquium 11.05.2016
KIT Graduiertenkolloquium 11.05.2016
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
 
Phd tesis olga giraldo 10mayo
Phd tesis olga giraldo 10mayoPhd tesis olga giraldo 10mayo
Phd tesis olga giraldo 10mayo
 
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
 
master_thesis_greciano_v2
master_thesis_greciano_v2master_thesis_greciano_v2
master_thesis_greciano_v2
 
PhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud RahmanPhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud Rahman
 

Similar to Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus

Applications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignAnubhav Jain
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...NextMove Software
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsDimitris Kontokostas
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Introduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersIntroduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersKevin Lee
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGEUNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGEPrasadu Peddi
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
C:\Fakepath\Learning Through Conversation
C:\Fakepath\Learning Through ConversationC:\Fakepath\Learning Through Conversation
C:\Fakepath\Learning Through Conversationstacycj
 
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsEKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsFrancesco Osborne
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang SpecJing Kang
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Lucidworks
 
Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data Steffen Staab
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of InformationAdrian Paschke
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4LKoji Sekiguchi
 

Similar to Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus (20)

Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Applications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
SIGIR 2011
SIGIR 2011SIGIR 2011
SIGIR 2011
 
Filling the gaps
Filling the gapsFilling the gaps
Filling the gaps
 
Introduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersIntroduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmers
 
Spoken Content Retrieval
Spoken Content RetrievalSpoken Content Retrieval
Spoken Content Retrieval
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGEUNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
C:\Fakepath\Learning Through Conversation
C:\Fakepath\Learning Through ConversationC:\Fakepath\Learning Through Conversation
C:\Fakepath\Learning Through Conversation
 
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsEKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang Spec
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 
Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of Information
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 

Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus

  • 1. ◯Atsushi Keyaki†, Jun Miyazaki† (†: Tokyo Institute of Technology, Japan). Part-of-speech Tagging for Web Search Queries using a Large-scale Web Corpus. SAC2017 IAR
  • 2. Objective
      •  Accurate part-of-speech (POS) tagging of Web queries
          o  POS tags are beneficial for accurate IR
              •  Different search strategy per POS tag [1]
              •  Identifying unnecessary data with POS tags [2]
          o  Example: a POS tag mismatch may cause a false positive
              •  Query: “discovery channel” (TV program: proper nouns)
              •  Doc: “Victim’s discovery is broadcasted by the channel” (“discovery” and “channel” are common nouns)
      [1] Crestani et al.: “Short Queries, Natural Language and Spoken Document Retrieval: Experiments at Glasgow University”, TREC-6, 1998.
      [2] Chowdhury and Mccabe: “Improving Information Retrieval Systems using Part of Speech Tagging”, Univ. of Maryland, 1993.
  • 3-6. Difficulty in query POS tagging
      •  Characteristics of Web queries
          o  Length is short (composed of a few words)
          o  Capitalization is missing
          o  Word order is fairly free
          o  Consequence: it is difficult to correctly identify POS tags with existing morphological analysis tools, which were developed for natural language
      •  Solution of related work [3][4]
          o  Utilizing the results of sentence-level morphological analysis
              •  Sentences are based on natural language grammar
              •  Results of sentence-level morphological analysis are accurate
          o  The frequently assigned POS tag is employed
      Sentence: “We stayed at Rif Carlton.” → We/pronoun stayed/verb at/particle Rif Carlton/proper noun
      Query: “rif carlton” → proper noun
      [3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
      [4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
  • 7. Our approach
      •  Related study
          o  Using sentence-level morphological analysis of
              •  Search results [3]
              •  Snippets from search logs [4]
          o  Considering just the frequency of assigned POS tags
          o  Limitations: only a small amount of highly relevant information is used; user feedback/search logs are not always available
      •  Our approach
          o  Taking account of global statistics from a large corpus
              •  Easily available, considers the long tail
          o  Considering co-occurrence of query terms
      April 5, 2017, SAC2017 IAR
  • 8. Preliminary investigation
      •  Morphological analysis of Web queries
          o  Queries
              •  TREC Web track topics (200 queries from 2009-2012)
                  o  Oracle POS tags were annotated by three assessors (high agreement, kappa: 0.98)
                  o  Assessors referred to the description (information need)
          o  Morphological analysis tool
              •  Stanford Log-linear Part-Of-Speech Tagger [5]
          o  Model
              •  Default model
              •  Caseless model
                  o  Does not consider capitalization information during training
                  o  Tries to solve the “Capitalization is missing” problem
      [5] Toutanova et al.: "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", NAACL 2003.
  • 9. Summary of error analysis
      •  Default model
          o  Only half of the query terms were assigned correct POS tags
          o  Almost none of the proper nouns were identified
              •  72% of proper nouns were mistakenly tagged as common nouns
              •  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
      •  Caseless model
          o  Around 75% of the query terms were assigned correct POS tags
          o  Many proper nouns were identified
              •  Some common nouns were mistakenly identified as proper nouns
              •  Errors caused by a partial grammatical rule
                  o  “lower heart rate”: a verb tagged as an adjective (rule: adjectives come before common nouns)
                  o  “gs pay rate”: a common noun tagged as a verb (rule: verbs come after a subject)
  • 10. Proposed POS tagging
      •  Summary of the error analysis
          o  Proper nouns and common nouns cannot be distinguished
              •  Problem1: Capitalization is missing
          o  Grammatical rules are mistakenly applied
              •  Problem2: Word order is fairly free
      •  Related study
          o  Only a small amount of highly relevant information is used
              •  Problem3: User feedback and user logs are not always available
      •  Approach
          o  Sol-P1: Sentence-level morphological analysis
          o  Sol-P2: A POS tagging method that is not based on word order
          o  Sol-P3: Large-scale Web corpus (easily available)
          o  Building the term-POS database (TPDB)
              •  Morphological analysis is applied offline
  • 11. Processing flow (figure)
      •  Offline: sentences from the large-scale Web corpus go through morphological analysis and are inserted into the TPDB
          o  S1: tA tB tC → tA/P1 tB/P2 tC/P3
          o  S2: tA tC tD → tA/P1 tC/P4 tD/P5
          o  S3: tC tE tA → tC/P3 tE/P1 tA/P2
          o  S4: tF tB tD → tF/P1 tB/P2 tD/P3
      •  Online: for the query “tA tC”, the matching entries (S1, S2, S3) are retrieved from the TPDB, and their term-POS pairs (e.g. tA/P1 tC/P3, tA/P1 tC/P4) are passed to the scoring method
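The offline/online split in the processing flow can be sketched as a small in-memory index. This is only an illustration under stated assumptions: the slides do not describe the actual TPDB schema, so `build_tpdb`, `retrieve`, and the dictionary layout are hypothetical, and the tagged sentences are toy data modeled on S1 to S4 in the figure.

```python
from collections import defaultdict

# Toy tagged sentences modeled on S1-S4 in the figure (illustrative data only).
TAGGED_SENTENCES = {
    "S1": [("tA", "P1"), ("tB", "P2"), ("tC", "P3")],
    "S2": [("tA", "P1"), ("tC", "P4"), ("tD", "P5")],
    "S3": [("tC", "P3"), ("tE", "P1"), ("tA", "P2")],
    "S4": [("tF", "P1"), ("tB", "P2"), ("tD", "P3")],
}


def build_tpdb(tagged_sentences):
    """Offline step (hypothetical schema): map each term to the ids of
    the sentences that contain it."""
    index = defaultdict(set)
    for sentence_id, pairs in tagged_sentences.items():
        for term, _pos in pairs:
            index[term].add(sentence_id)
    return index


def retrieve(index, tagged_sentences, query_terms):
    """Online step: return the entries that contain at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return {sid: tagged_sentences[sid] for sid in sorted(hits)}


tpdb_index = build_tpdb(TAGGED_SENTENCES)
entries = retrieve(tpdb_index, TAGGED_SENTENCES, ["tA", "tC"])
print(sorted(entries))
```

Because tagging happens once at indexing time, the online step only performs lookups; a production system would presumably use a real database rather than Python dictionaries.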
  • 12. Scoring for POS tagging
      •  Design principle
          o  POS tags that appear frequently in the corpus are assigned to queries
          o  POS tags of a sentence are emphasized when the sentence contains more kinds of query terms
              •  Co-occurrence of query terms is a useful clue
      •  Steps of scoring
          o  Retrieving entries which contain query terms from the TPDB
          o  Breaking the query down into pairs of query terms
              •  Query: “tA tB tC” → {tA tB} {tA tC} {tB tC}
          o  Counting entries per term-POS pair for each query term pair
              •  Query term pair: {tA tB}
          o  Scoring with the three proposed methods
      Example for {tA tB} (freq. = num. of entries containing both term-POS pairs, e.g. tA/P1 and tB/P2):
          term-POS pair    freq.   normalized freq.
          tA/P1 tB/P2      5       0.33 (5/15)
          tA/P1 tB/P3      3       0.20 (3/15)
          tA/P2 tB/P4      7       0.47 (7/15)
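The pair decomposition and the normalized frequency can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, each term is assumed to occur at most once per entry, and the toy entries are constructed only so that the counts match the {tA tB} example (5, 3, 7).

```python
from collections import Counter
from itertools import combinations


def query_term_pairs(query):
    """Break a query into all unordered pairs of its terms."""
    return list(combinations(query.split(), 2))


def count_pair_tags(entries, term_a, term_b):
    """Count entries per (POS of term_a, POS of term_b) combination."""
    counts = Counter()
    for pairs in entries:
        tags = dict(pairs)  # assumption: each term occurs once per entry
        if term_a in tags and term_b in tags:
            counts[(tags[term_a], tags[term_b])] += 1
    return counts


def normalize(counts):
    """Divide each count by the total for the query term pair."""
    total = sum(counts.values())
    return {tags: freq / total for tags, freq in counts.items()}


# Toy entries built to match the slide's frequencies for {tA tB}.
toy_entries = (
    [[("tA", "P1"), ("tB", "P2")]] * 5
    + [[("tA", "P1"), ("tB", "P3")]] * 3
    + [[("tA", "P2"), ("tB", "P4")]] * 7
)

pairs = query_term_pairs("tA tB tC")
freq = count_pair_tags(toy_entries, "tA", "tB")
norm = normalize(freq)
print(pairs)
print({tags: round(value, 2) for tags, value in norm.items()})
```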
  • 13-18. Three proposed methods
      •  MaxFreq
          o  The most frequently appearing POS tag (highest freq.) is assigned
      •  MostLikelihood
          o  The POS tag with the highest normalized freq. is assigned
          o  Motivation: MaxFreq may be affected by frequently appearing terms
      •  AllCombi
          o  The POS tag with the highest sum of term-POS frequencies is assigned
          o  Motivation: MaxFreq and MostLikelihood focus only on the POS tag with the highest frequency/normalized frequency
          o  More diversified context, including the long tail, can be considered
      Example, query: “tA tB tC”
          pair     term-POS pair    freq.   normalized freq.
          tA:tB    tA/P1 tB/P2      5       0.33
          tA:tB    tA/P1 tB/P3      3       0.20
          tA:tB    tA/P2 tB/P4      7       0.47
          tA:tC    tA/P1 tC/P2      3       0.43
          tA:tC    tA/P3 tC/P3      4       0.57
          tB:tC    tB/P1 tC/P2      5       0.5
          tB:tC    tB/P2 tC/P2      5       0.5
      Result for tA: MaxFreq assigns tA/P2 (freq. 7), MostLikelihood assigns tA/P3 (normalized freq. 0.57), AllCombi assigns tA/P1 (sum of freq. 5 + 3 + 3 = 11)
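The three scoring rules can be checked against the slide's example table. The sketch below is one reading of the slides rather than the authors' code: the function names are hypothetical, ties are broken by whichever candidate is seen first (an assumption, since the slides do not state a tie-breaking rule), and the frequencies are copied from the example for the query “tA tB tC”.

```python
from collections import defaultdict

# Frequencies copied from the example table for the query "tA tB tC".
PAIR_TABLE = {
    ("tA", "tB"): {("P1", "P2"): 5, ("P1", "P3"): 3, ("P2", "P4"): 7},
    ("tA", "tC"): {("P1", "P2"): 3, ("P3", "P3"): 4},
    ("tB", "tC"): {("P1", "P2"): 5, ("P2", "P2"): 5},
}


def candidates(term, table):
    """Yield (pos, freq, normalized_freq) for every row that involves term."""
    for (term_a, term_b), rows in table.items():
        if term not in (term_a, term_b):
            continue
        total = sum(rows.values())
        side = 0 if term == term_a else 1
        for tags, freq in rows.items():
            yield tags[side], freq, freq / total


def max_freq(term, table):
    """POS tag of the single row with the highest raw frequency."""
    return max(candidates(term, table), key=lambda c: c[1])[0]


def most_likelihood(term, table):
    """POS tag of the single row with the highest normalized frequency."""
    return max(candidates(term, table), key=lambda c: c[2])[0]


def all_combi(term, table):
    """POS tag with the highest frequency summed over all rows."""
    sums = defaultdict(int)
    for pos, freq, _norm in candidates(term, table):
        sums[pos] += freq
    best = max(sums, key=sums.get)
    return best, sums[best]


print(max_freq("tA", PAIR_TABLE))
print(most_likelihood("tA", PAIR_TABLE))
print(all_combi("tA", PAIR_TABLE))
```

The example shows why the methods can disagree: P2 wins on a single large count, P3 on a single large share, and P1 only when evidence from several rows is pooled.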
  • 19. Experiment
      •  Datasets
          o  TREC Web track topics
              •  200 queries from 2009-2012
          o  MS-251 (skipped because the trend is the same)
              •  Microsoft search log used in related studies [3][4]
      •  Large-scale Web corpus
          o  ClueWeb09 Category B
              •  50 million Web documents
      •  Evaluated methods
          o  Proposed methods: MaxFreq, MostLikelihood, AllCombi
          o  Existing methods: Stanford, Caseless, SingleFreq (the most frequently appearing POS tag is assigned)
  • 20. POS-tagged Web track topics
      •  AllCombi: the highest for all terms, common noun, proper noun
          o  Good at judging nouns
          o  Considering more diversified context is useful
              •  Global statistics from a large-scale Web corpus are useful
      •  MaxFreq and MostLikelihood: the highest for common noun, verb, adjective
      •  Every proposed method significantly outperformed Caseless
      Precision:
          Method            All query terms   Common noun   Proper noun   Verb    Adjective   Sign test vs. Caseless
          MaxFreq           .814              .825          .833          .769    .647        p < 0.05
          MostLikelihood    .814              .825          .833          .769    .647        p < 0.05
          AllCombi          .821              .825          .860          .714    .629        p < 0.01
          Caseless          .763              .789          .751          .733    .690
          SingleFreq        .702              .775          .670          .533    .581
          Stanford          .547              .550          1.0           .722    .451
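The significance column refers to a sign test against Caseless. As a rough illustration of how such a p-value is obtained, the sketch below implements a two-sided exact sign test. The slides do not say what the paired unit is or report the underlying counts, so the function name and the win/loss numbers are purely hypothetical.

```python
from math import comb


def sign_test_p(wins, losses):
    """Two-sided exact sign test; tied pairs are assumed to be dropped."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of k or fewer successes under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: the proposed method is better on 15 items, worse on 5.
p_value = sign_test_p(15, 5)
print(p_value < 0.05)
```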
  • 21. Effect of the proposed method
      •  AllCombi correctly identified many query terms
      •  Some errors caused by partial grammatical rules still remain
      •  Negative effects of the proposed method
          o  “president” is often identified as a proper noun in the corpus
              •  Term weights need to be normalized
      Comparison table (columns: Query, Stanford, AllCombi; cell contents not recoverable) for the queries “obama”, “india”, “rif carlton”, “lower heart rate”, “gs pay rate”, “president united states”
  • 22. Conclusion
      •  POS tagging of Web queries
          o  Results of sentence-level morphological analysis
          o  Large-scale Web corpus
          o  Three proposed scoring methods
      •  Experiments
          o  Considering more diversified context is useful
          o  The best proposed method differs by POS tag
          o  The proposed methods outperformed existing tools and existing studies
      •  Future work
          o  A combination of the proposed methods may improve accuracy
          o  Database schema design for fast POS tagging
  • 23. Default model
          POS tags          Precision   Recall
          Common noun       .550        .985
          Proper noun       1.0         .010
          Verb              .722        .867
          Adjective         .451        .958
          All query terms   .547        .547
      •  About half of the query terms were assigned correct POS tags
      •  Almost none of the proper nouns were identified
          o  72% of proper nouns were mistakenly tagged as common nouns
          o  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
      •  Errors caused by a partial grammatical rule
          o  “lower heart rate”: a verb tagged as an adjective (rule: adjectives come before common nouns)
          o  “gs pay rate”: a common noun tagged as a verb (rule: verbs come after a subject)
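The per-tag precision and recall reported here follow the usual definitions. The sketch below uses invented toy labels, not the TREC annotations, to show how a tagger that almost never predicts proper nouns ends up with perfect precision but very low recall for that tag; `precision_recall` is a hypothetical helper.

```python
def precision_recall(gold, predicted, tag):
    """Precision and recall of one POS tag over aligned label sequences."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == tag and p == tag)
    n_predicted = sum(1 for p in predicted if p == tag)
    n_gold = sum(1 for g in gold if g == tag)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Toy labels (illustrative only): two of three proper nouns are missed.
gold = ["proper noun", "proper noun", "proper noun", "common noun", "common noun"]
pred = ["proper noun", "common noun", "common noun", "common noun", "common noun"]

proper = precision_recall(gold, pred, "proper noun")
common = precision_recall(gold, pred, "common noun")
print(proper, common)
```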
  • 24. Caseless model
      •  Precision and recall improved overall
      •  Many proper nouns were identified
          o  31% of proper nouns were mistakenly tagged as common nouns
          o  Precision decreased
      •  Harm from partial grammatical rules still exists
          o  “discovery channel store” (confusion between common noun and proper noun)
          POS tags          Precision   Recall
          Common noun       .789        .769
          Proper noun       .751        .640
          Verb              .733        .733
          Adjective         .690        .833
          All query terms   .763        .763
  • 25. MS-251
      •  The trend of the proposed methods is the same
          o  The ratio of POS tags affected the order
              •  AllCombi: good at judging nouns
              •  MaxFreq, MostLikelihood: good at judging verbs and adjectives
          o  The proposed methods are better than [4]
          Method                    Precision
          MaxFreq                   .890
          MostLikelihood            .895
          AllCombi                  .893
          The best method in [4]    .858