DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task   1 / 52




DCU at the NTCIR-9 SpokenDoc
   Passage Retrieval Task

   Maria Eskevich, Gareth J.F. Jones

       Centre for Digital Video Processing
      Centre for Next Generation Localisation
               School of Computing
               Dublin City University
                      Ireland


              December 8, 2011


  Outline

         Retrieval Methodology
            Transcript Preprocessing
            Text Segmentation
            Retrieval Setup

         Results
           Official Metrics
           uMAP
           pwMAP
           fMAP

         Conclusions
  Retrieval Methodology

             Transcript --(Segmentation)--> Topically Coherent Segments
             Topically Coherent Segments --(Indexing)--> Indexed Segments
             Queries --> Information Request
             Indexed Segments + Information Request --(Retrieval)--> Retrieval Results


  6 Retrieval Runs

         Transcripts: 1-best ASR | 1-best ASR with stop words removed | Manual Transcript
         Segmentation: C99 and TextTiling (TT), applied to each transcript

         Runs: ASR C99, ASR TT, ASR NSW C99, ASR NSW TT, Manual C99, Manual TT


  Transcript Preprocessing

                 Recognize individual morphemes of the sentences:
                 ChaSen 2.4.0, based on the Japanese morphological analyzer
                 JUMAN 2.0 with the ipadic grammar 2.7.0
                 Form the text out of the base forms of the words in order
                 to avoid stemming
                 Remove the stop words (SpeedBlog Japanese stop-word list)
                 for one of the runs
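The preprocessing steps above can be sketched with toy analyzer output; the (surface, base form) pairs and the stop-word set below are hypothetical stand-ins for the real ChaSen/JUMAN output and the SpeedBlog list:

```python
# Toy (surface, base_form) pairs, standing in for morphological-analyzer
# output; the real ChaSen/JUMAN tools emit richer fields (POS, readings).
tokens = [("食べた", "食べる"), ("は", "は"), ("本", "本")]
stop_words = {"は"}  # stand-in for the SpeedBlog Japanese stop-word list

# Index the base forms instead of the surface forms, avoiding a stemmer.
base_text = [base for _, base in tokens]

# For the "nsw" runs, additionally drop stop words before segmentation.
nsw_text = [base for base in base_text if base not in stop_words]

print(base_text)  # ['食べる', 'は', '本']
print(nsw_text)   # ['食べる', '本']
```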


  Text Segmentation


         Segmentation algorithms originally developed for written text
         are used; individual IPUs (inter-pausal units) are treated as
         sentences
                 TextTiling:
                        Cosine similarities between adjacent blocks of sentences
                 C99:
                        Compute similarity between sentences using a cosine
                        similarity measure to form a similarity matrix
                        Cosine scores are replaced by the rank of the score in the
                        local region
                        Segmentation points are assigned using a clustering
                        procedure
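The TextTiling step above (cosine similarity between adjacent blocks of sentences, with low-similarity gaps suggesting topic boundaries) can be sketched as follows; the toy sentences and block size are illustrative, not the configuration used in the runs:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_scores(sentences, block=2):
    # TextTiling-style gap scores: similarity between the `block`
    # sentences before and after each gap; low scores mark candidate
    # topic boundaries.
    scores = []
    for i in range(block, len(sentences) - block + 1):
        left = Counter(w for s in sentences[i - block:i] for w in s)
        right = Counter(w for s in sentences[i:i + block] for w in s)
        scores.append(cosine(left, right))
    return scores

sents = [["cat", "pet"], ["cat", "fur"], ["stock", "market"], ["market", "trade"]]
print(gap_scores(sents, block=1))  # [0.5, 0.0, 0.5] — dip at the topic shift
```

C99 starts from the same pairwise cosine similarities but ranks them within a local region of the similarity matrix before clustering, as described above.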


  Retrieval Setup
         SMART information retrieval system extended to use language
         modelling with a uniform document prior probability.
         A query q is scored against a document d within the SMART
         framework in the following way:

             P(q|d) = \prod_{i=1}^{n} \left( \lambda_i P(q_i|d) + (1 - \lambda_i) P(q_i) \right)

         where
                 q = (q_1, ..., q_n) is a query comprising n query terms,
                 P(q_i|d) is the probability of generating the i-th query
                 term from a given document d, estimated by maximum
                 likelihood,
                 P(q_i) is the probability of generating it from the
                 collection, estimated by document frequency
         The retrieval model used \lambda_i = 0.3 for all q_i, this value
         being optimized on the TREC-8 ad hoc dataset.
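The scoring formula is a Jelinek-Mercer-style interpolation of document and collection language models; a minimal sketch, assuming a toy collection and a document-frequency estimate of P(q_i) (this is not the SMART implementation itself):

```python
from collections import Counter

def lm_score(query, doc, coll_df, n_docs, lam=0.3):
    # Query-likelihood score: prod_i  lam*P(qi|d) + (1-lam)*P(qi).
    # P(qi|d): maximum-likelihood estimate from the document's terms;
    # P(qi):  document-frequency estimate over the collection.
    tf = Counter(doc)
    score = 1.0
    for q in query:
        p_doc = tf[q] / len(doc)
        p_coll = coll_df.get(q, 0) / n_docs
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

docs = [["speech", "retrieval", "passage"], ["weather", "news"]]
df = Counter(w for d in docs for w in set(d))  # document frequencies
query = ["passage", "retrieval"]
scores = [lm_score(query, d, df, len(docs)) for d in docs]
# The document containing the query terms scores higher.
```

The collection term keeps the product nonzero when a query term is missing from a segment, which matters for short, automatically segmented passages.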


  Results: Official Metrics

          Transcript type   Segmentation type    uMAP     pwMAP     fMAP
               BASELINE                         0.0670    0.0520    0.0536
           manual                 tt            0.0859    0.0429    0.0500
           manual                C99            0.0713    0.0209    0.0168
             ASR                  tt            0.0490    0.0329    0.0308
             ASR                 C99            0.0469    0.0166    0.0123
          ASR nsw                 tt            0.0312    0.0141    0.0174
          ASR nsw                C99            0.0316    0.0138    0.0120

          Only the runs on the manual transcript scored higher than the
          baseline, and only on the uMAP metric
          TextTiling results are consistently higher than C99 on all
          metrics for the manual and ASR runs
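All three official scores are MAP variants; a generic average-precision computation over a binary-relevance ranking (not the task's exact uMAP, pwMAP, or fMAP definitions, which differ in how relevance is granted) looks like this:

```python
def average_precision(ranked_rel):
    # Average precision over a ranked list of binary relevance
    # judgements: mean of precision@k at each relevant rank k.
    # uMAP, pwMAP and fMAP apply the same averaging to utterance-,
    # point- and fraction-based relevance, respectively.
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
```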


  Time-based Results Assessment Approach
          For each run and each query, each retrieved passage that
          contains relevant content is scored:

                            Length of the Relevant Part
             Precision  =  -----------------------------
                            Length of the Whole Passage

          These precision values are averaged per query over the
          ranks whose passages contain relevant content
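The precision formula can be computed directly from time spans; `time_precision` and its (start, end) span representation are hypothetical helpers for illustration, not the official evaluation code:

```python
def time_precision(passage, relevant_spans):
    # Fraction of a retrieved passage's duration that overlaps the
    # ground-truth relevant time spans (all times in seconds).
    start, end = passage
    overlap = sum(max(0.0, min(end, r_end) - max(start, r_start))
                  for r_start, r_end in relevant_spans)
    return overlap / (end - start)

# Passage from t=10s to t=30s; relevant content occupies 12s-18s.
print(time_precision((10.0, 30.0), [(12.0, 18.0)]))  # 0.3
```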


  Average of Precision for all passages with relevant content

          TextTiling has a higher average precision for all types of
          transcript, i.e. topically coherent segments are better
          located


  Results: utterance-based MAP (uMAP)

              Transcript type   Segmentation type    uMAP
                           BASELINE                 0.0670
                  manual             tt             0.0859
                  manual            C99             0.0713
                   ASR               tt             0.0490
                   ASR              C99             0.0469
                 ASR nsw             tt             0.0312
                 ASR nsw            C99             0.0316

          The trend 'manual > ASR > ASR nsw' for both C99 and
          TextTiling is not borne out by the averages of precision
          The higher average values of the TextTiling segmentation
          over C99 are not reflected in the uMAP scores
          For some queries, the runs on C99 segmentation rank the
          segments with relevant content better


  Relevance of the Central IPU Assessment

          [figures: number of ranks taken or not taken into account
          by pwMAP; average precision for the passages at those ranks]

          TextTiling has a higher number of segments whose central
          IPU is relevant to the query
          Overall, the number of ranks at which a segment with
          relevant content is retrieved is approximately the same
          for both segmentation techniques


  Results: pointwise MAP (pwMAP)

              Transcript type   Segmentation type    pwMAP
                           BASELINE                  0.0520
                  manual             tt              0.0429
                  manual            C99              0.0209
                   ASR               tt              0.0329
                   ASR              C99              0.0166
                 ASR nsw             tt              0.0141
                 ASR nsw            C99              0.0138

          TextTiling segmentation places better topic boundaries
          around relevant content and yields higher precision scores
          for the retrieved relevant passages


  Average Length of Relevant Part and Segments (in seconds)

          [figures: center IPU relevant vs. center IPU not relevant]

          Center IPU is relevant: the average length of the relevant
          content is of the same order for both segmentation schemes,
          slightly higher for TextTiling
          Center IPU is not relevant: the average length of the
          relevant content is higher for C99 segmentation; due to
          poor segmentation it correlates with much longer segments


  Results: fraction MAP (fMAP)

              Transcript type   Segmentation type    fMAP
                           BASELINE                 0.0536
                  manual             tt             0.0500
                  manual            C99             0.0168
                   ASR               tt             0.0308
                   ASR              C99             0.0123
                 ASR nsw             tt             0.0174
                 ASR nsw            C99             0.0120

          The average number of ranks whose segments have a
          non-relevant center IPU is more than 5 times higher
          The segmentation technique with longer, poorly segmented
          passages (C99) has much lower precision-based scores


  Conclusions

              TextTiling segmentation shows better overall retrieval
              performance than C99:
                  Higher numbers of segments with higher precision
                  Higher precision even for the segments with a
                  non-relevant center IPU
                  The high level of poor segmentation makes it harder
                  to retrieve relevant content for the C99 runs
              Removal of stop words before segmentation did not have
              any positive effect on the results




         Thank you for your attention!

         Questions?

Simple regenerating codes: Network Coding for Cloud StorageSimple regenerating codes: Network Coding for Cloud Storage
Simple regenerating codes: Network Coding for Cloud Storage
 
Pauls klein 2011-lm_paper(3)
Pauls klein 2011-lm_paper(3)Pauls klein 2011-lm_paper(3)
Pauls klein 2011-lm_paper(3)
 
Novel Tree Structure Based Conservative Reversible Binary Coded Decimal Adder...
Novel Tree Structure Based Conservative Reversible Binary Coded Decimal Adder...Novel Tree Structure Based Conservative Reversible Binary Coded Decimal Adder...
Novel Tree Structure Based Conservative Reversible Binary Coded Decimal Adder...
 
Siddhi kadam, MTech dissertation
Siddhi kadam, MTech dissertationSiddhi kadam, MTech dissertation
Siddhi kadam, MTech dissertation
 
Hamming net based Low Complexity Successive Cancellation Polar Decoder
Hamming net based Low Complexity Successive Cancellation Polar DecoderHamming net based Low Complexity Successive Cancellation Polar Decoder
Hamming net based Low Complexity Successive Cancellation Polar Decoder
 

Andere mochten auch

Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...Maria Eskevich
 
Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...
Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...
Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...Maria Eskevich
 
Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)
Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)
Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)Maria Eskevich
 
Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012Maria Eskevich
 
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...Maria Eskevich
 
Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?Maria Eskevich
 
Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...
Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...
Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...Hitesh Gossain
 
Photostory slideshow
Photostory slideshow Photostory slideshow
Photostory slideshow milo197
 
Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Maria Eskevich
 
Levima - Company profile
Levima - Company profileLevima - Company profile
Levima - Company profileLevima
 
Edinburgh University H&S presentation
Edinburgh University H&S presentationEdinburgh University H&S presentation
Edinburgh University H&S presentationLevima
 
Focus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrievalFocus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrievalMaria Eskevich
 
Projects list for ece & eee
Projects list for ece & eeeProjects list for ece & eee
Projects list for ece & eeevarun29071
 
designs codes - Company Presentation
designs codes - Company Presentationdesigns codes - Company Presentation
designs codes - Company PresentationSubhomay Banerjee
 
Innehållet är navet i din digitala närvaro
Innehållet är navet i din digitala närvaroInnehållet är navet i din digitala närvaro
Innehållet är navet i din digitala närvaroKristofer Lundqvist
 
презентація абрамов
презентація абрамовпрезентація абрамов
презентація абрамовghjntpf
 

Andere mochten auch (20)

Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
 
Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...
Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...
Comparing Retrieval Effectiveness of Alternative Content Segmentation Methods...
 
Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)
Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)
Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)
 
Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012
 
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
 
Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?
 
Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...
Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...
Getting Sponsorship - Standard Presentation (Created by Onspon.com - India's ...
 
Photostory slideshow
Photostory slideshow Photostory slideshow
Photostory slideshow
 
Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014
 
Levima - Company profile
Levima - Company profileLevima - Company profile
Levima - Company profile
 
Edinburgh University H&S presentation
Edinburgh University H&S presentationEdinburgh University H&S presentation
Edinburgh University H&S presentation
 
Focus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrievalFocus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrieval
 
Projects list for ece & eee
Projects list for ece & eeeProjects list for ece & eee
Projects list for ece & eee
 
designs codes - Company Presentation
designs codes - Company Presentationdesigns codes - Company Presentation
designs codes - Company Presentation
 
Innehållet är navet i din digitala närvaro
Innehållet är navet i din digitala närvaroInnehållet är navet i din digitala närvaro
Innehållet är navet i din digitala närvaro
 
презентація абрамов
презентація абрамовпрезентація абрамов
презентація абрамов
 
CV New (2016)
CV New (2016)CV New (2016)
CV New (2016)
 
Forskardagarna2010 webb
Forskardagarna2010 webbForskardagarna2010 webb
Forskardagarna2010 webb
 
Te dare lo Mejor / Bass line
Te dare lo Mejor /  Bass lineTe dare lo Mejor /  Bass line
Te dare lo Mejor / Bass line
 
Software e hardware
Software e hardwareSoftware e hardware
Software e hardware
 

Ähnlich wie DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task

3rd 3DDRESD: RC historical contextualization
3rd 3DDRESD: RC historical contextualization3rd 3DDRESD: RC historical contextualization
3rd 3DDRESD: RC historical contextualizationMarco Santambrogio
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLinaro
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Tokyo Institute of Technology
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005dflexer
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...Shinya Takamaeda-Y
 
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...NECST Lab @ Politecnico di Milano
 

Ähnlich wie DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task (9)

3rd 3DDRESD: RC historical contextualization
3rd 3DDRESD: RC historical contextualization3rd 3DDRESD: RC historical contextualization
3rd 3DDRESD: RC historical contextualization
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS Performance
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005
 
DPDK KNI interface
DPDK KNI interfaceDPDK KNI interface
DPDK KNI interface
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
 
Sucet os module_5_notes
Sucet os module_5_notesSucet os module_5_notes
Sucet os module_5_notes
 
CAPS_Presentation.pdf
CAPS_Presentation.pdfCAPS_Presentation.pdf
CAPS_Presentation.pdf
 
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
DEEP-mon: Dynamic and Energy Efficient Power monitoring for container-based i...
 

DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task

  • 1. DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task. Maria Eskevich, Gareth J.F. Jones. Centre for Digital Video Processing, Centre for Next Generation Localisation, School of Computing, Dublin City University, Ireland. December 8, 2011
  • 2. Outline: Retrieval Methodology (Transcript Preprocessing, Text Segmentation, Retrieval Setup); Results (Official Metrics, uMAP, pwMAP, fMAP); Conclusions
  • 3. Retrieval Methodology: Transcript
  • 4. Retrieval Methodology: Transcript → Segmentation → Topically Coherent Segments
  • 5. Retrieval Methodology: Transcript → Segmentation → Topically Coherent Segments → Indexing → Indexed Segments
  • 6. Retrieval Methodology: the query side is added: Queries → Information Request
  • 7. Retrieval Methodology (full pipeline): Transcript → Segmentation → Topically Coherent Segments → Indexing → Indexed Segments; Queries → Information Request; the Indexed Segments and the Information Request feed Retrieval → Retrieval Results
  • 8. 6 Retrieval Runs: three transcript types: 1-best ASR, 1-best ASR with stop words removed, Manual Transcript
  • 9. 6 Retrieval Runs: C99 and TextTiling applied to the 1-best ASR transcript: ASR C99, ASR TT
  • 10. 6 Retrieval Runs: adding the stop-word-removed ASR transcript: ASR NSW C99, ASR NSW TT
  • 11. 6 Retrieval Runs: adding the manual transcript: Manual C99, Manual TT
  • 12. 6 Retrieval Runs: ASR C99, ASR NSW C99, Manual C99, ASR TT, ASR NSW TT, Manual TT
  • 13. Transcript Preprocessing: recognize the individual morphemes of the sentences: ChaSen 2.4.0, based on the Japanese morphological analyzer JUMAN 2.0 with ipadic grammar 2.7.0
  • 14. Transcript Preprocessing: form the text out of the base forms of the words in order to avoid stemming
  • 15. Transcript Preprocessing: remove the stop words (SpeedBlog Japanese stop-word list) for one of the runs
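The preprocessing on slides 13 to 15 can be sketched in a few lines. This is a hypothetical illustration only: it assumes the ChaSen/JUMAN output is already available as (surface form, base form) pairs, and it uses a tiny placeholder stop-word set rather than the actual SpeedBlog list.

```python
# Placeholder subset of Japanese function words; the actual runs used the
# SpeedBlog Japanese stop-word list.
STOP_WORDS = {"の", "は", "を"}

def preprocess(morphemes, remove_stop_words=False):
    """Rebuild the text from base forms (avoiding stemming), optionally
    dropping stop words as in the 'nsw' runs.

    `morphemes` is assumed to be a list of (surface, base) pairs, i.e. the
    morphological analyzer's output in a simplified form."""
    bases = [base for _surface, base in morphemes]
    if remove_stop_words:
        bases = [b for b in bases if b not in STOP_WORDS]
    return " ".join(bases)
```

Indexing the base forms rather than the surface forms plays the role that stemming would play for English text.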
  • 16. Text Segmentation: algorithms originally developed for text are applied, with individual IPUs treated as sentences. TextTiling: cosine similarities between adjacent blocks of sentences. C99: similarity between sentences is computed using a cosine similarity measure to form a similarity matrix; cosine scores are replaced by the rank of the score in the local region; segmentation points are assigned using a clustering procedure.
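The core of the TextTiling step named on slide 16, cosine similarity between adjacent blocks of sentences, can be sketched as follows. This is a simplified illustration, not the full algorithm: real TextTiling adds depth scoring and boundary smoothing around the low-similarity valleys, and C99 instead ranks the scores locally and clusters.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_scores(sentences, block=2):
    """Similarity at each gap between adjacent blocks of `block` sentences;
    low-similarity valleys are candidate topic boundaries."""
    scores = []
    for gap in range(block, len(sentences) - block + 1):
        left = Counter(w for s in sentences[gap - block:gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + block] for w in s)
        scores.append(cosine(left, right))
    return scores
```

In the setting of these slides, each "sentence" would be the token list of one IPU.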
  • 17. Retrieval Setup: SMART information retrieval system extended to use language modelling with a uniform document prior probability.
  • 18. Retrieval Setup: a query q is scored against a document d within the SMART framework as: P(q|d) = ∏_{i=1}^{n} (λ_i · P(q_i|d) + (1 − λ_i) · P(q_i))
  • 19. Retrieval Setup: (same formula) where q = (q_1, ..., q_n) is a query comprising n query terms; P(q_i|d) is the probability of generating the i-th query term from a given document d, estimated by maximum likelihood; P(q_i) is the probability of generating it from the collection, estimated by document frequency
  • 20. Retrieval Setup: the retrieval model used λ_i = 0.3 for all q_i, this value having been optimized on the TREC-8 ad hoc dataset.
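The scoring formula on slides 18 to 20 can be sketched directly; this is a minimal illustration in log space, assuming pre-tokenized documents and a precomputed document-frequency table, not the actual SMART extension used in the runs.

```python
import math
from collections import Counter

def lm_score(query_terms, doc_terms, doc_freq, n_docs, lam=0.3):
    """Query-likelihood score with linear smoothing, as on the slide:
    P(q|d) = prod_i (lam * P(q_i|d) + (1 - lam) * P(q_i)), computed in
    log space.  P(q_i|d) is the maximum-likelihood estimate from the
    document; P(q_i) is estimated from document frequency over the
    collection.  lam = 0.3 for all terms, as in the runs."""
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        p_doc = tf[q] / len(doc_terms) if doc_terms else 0.0
        p_coll = doc_freq.get(q, 0) / n_docs
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen everywhere: zero likelihood
        score += math.log(p)
    return score
```

Because the document prior is uniform, ranking by P(q|d) is equivalent to ranking by P(d|q), so this log-likelihood can be used directly as the retrieval score.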
  • 21. Outline: Retrieval Methodology (Transcript Preprocessing, Text Segmentation, Retrieval Setup); Results (Official Metrics, uMAP, pwMAP, fMAP); Conclusions
  • 22. Results: Official Metrics
        Transcript type / Segmentation type: uMAP / pwMAP / fMAP
        BASELINE: 0.0670 / 0.0520 / 0.0536
        manual / tt: 0.0859 / 0.0429 / 0.0500
        manual / C99: 0.0713 / 0.0209 / 0.0168
        ASR / tt: 0.0490 / 0.0329 / 0.0308
        ASR / C99: 0.0469 / 0.0166 / 0.0123
        ASR nsw / tt: 0.0312 / 0.0141 / 0.0174
        ASR nsw / C99: 0.0316 / 0.0138 / 0.0120
  • 23. Results: Official Metrics (same table): only the runs on the manual transcript scored higher than the baseline, and only on the uMAP metric
  • 24. Results: Official Metrics (same table): TextTiling results are consistently higher than C99 on all metrics for both the manual and ASR runs
  • 25. Time-based Results Assessment Approach: for each run and each query, the ranked list mixes relevant and non-relevant passages (Relevant Passage 1, Relevant Passage 2, Passage 3, Relevant Passage 4, ...), where Precision = Length of the Relevant Part / Length of the Whole Passage
  • 26. Time-based Results Assessment Approach: a Precision value is computed for each relevant passage in the ranked list
  • 27. Time-based Results Assessment Approach: the Average of Precision per Query is taken over the ranks whose passages contain relevant content
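The time-based precision and its per-query average from slides 25 to 27 can be sketched as follows, assuming passages and relevant regions are given as (start, end) times in seconds. The representation is an assumption for illustration; the task's official evaluation tooling is not reproduced here.

```python
def passage_precision(passage, relevant_spans):
    """Time-based precision of one retrieved passage: length of its overlap
    with relevant content divided by the passage's own length (in seconds)."""
    start, end = passage
    overlap = sum(max(0.0, min(end, r_end) - max(start, r_start))
                  for r_start, r_end in relevant_spans)
    return overlap / (end - start) if end > start else 0.0

def average_precision_per_query(ranked_passages, relevant_spans):
    """Average the precision over the ranks whose passage contains relevant
    content, as in the slides' per-query assessment."""
    precisions = [p for p in (passage_precision(ps, relevant_spans)
                              for ps in ranked_passages) if p > 0]
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Under this measure a long, poorly bounded segment that merely touches a relevant region gets a low precision, which is exactly the effect discussed for C99 on the later slides.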
  • 28. Average of Precision for all passages with relevant content
  • 29. Average of Precision for all passages with relevant content: the TextTiling algorithm has a higher average of precision for all transcript types, i.e. topically coherent segments are better located
  • 30. Results: utterance-based MAP (uMAP)
        Transcript type / Segmentation type: uMAP
        BASELINE: 0.0670
        manual / tt: 0.0859
        manual / C99: 0.0713
        ASR / tt: 0.0490
        ASR / C99: 0.0469
        ASR nsw / tt: 0.0312
        ASR nsw / C99: 0.0316
  • 31. Results: uMAP (same table): the trend 'manual > ASR > ASR nsw' for both C99 and TextTiling is not supported by the averages of precision
  • 32. Results: uMAP (same table): the higher average precision of the TextTiling segmentation over C99 is not reflected in the uMAP scores
  • 33. Results: uMAP (same table): for some queries, the runs on C99 segmentation rank the segments with relevant content better
  • 34. Relevance of the Central IPU Assessment: the number of ranks taken or not taken into account by pwMAP, and the average of precision for the passages at the ranks that are taken or not taken into account by pwMAP
  • 35. Relevance of the Central IPU Assessment: TextTiling has a higher number of segments whose central IPU is relevant to the query
  • 36. Relevance of the Central IPU Assessment: overall, the number of ranks at which a segment with relevant content is retrieved is approximately the same for both segmentation techniques
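The central-IPU judgment discussed on slides 34 to 36 can be sketched as follows. This is a simplified illustration assuming each passage is a list of (start, end) IPU times and relevance is given as time spans; the exact pwMAP definition in the task's evaluation tooling is not reproduced.

```python
def center_ipu(passage_ipus):
    """Middle IPU of a retrieved passage, given as a list of
    (start, end) IPU times in rank order within the passage."""
    return passage_ipus[len(passage_ipus) // 2]

def center_is_relevant(passage_ipus, relevant_spans):
    """Pointwise-style judgment: the passage counts for pwMAP only if
    its center IPU overlaps a relevant region."""
    c_start, c_end = center_ipu(passage_ipus)
    return any(c_start < r_end and r_start < c_end
               for r_start, r_end in relevant_spans)
```

This makes explicit why a long, badly bounded segment can contain relevant content and still be ignored by pwMAP: the relevant material may sit far from the segment's center.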
  • 37. Results: pointwise MAP (pwMAP)
        Transcript type / Segmentation type: pwMAP
        BASELINE: 0.0520
        manual / tt: 0.0429
        manual / C99: 0.0209
        ASR / tt: 0.0329
        ASR / C99: 0.0166
        ASR nsw / tt: 0.0141
        ASR nsw / C99: 0.0138
  • 38. Results: pwMAP (same table): TextTiling segmentation places better topic boundaries around relevant content and has higher precision scores for the retrieved relevant passages
  • 39. Average Length of Relevant Part and Segments (in seconds): case where the center IPU is relevant
  • 40. Center IPU is relevant: the average length of the relevant content is of the same order for both segmentation schemes, slightly higher for TextTiling
  • 41. Average Length of Relevant Part and Segments (in seconds): case where the center IPU is not relevant
  • 42. Center IPU is not relevant: the average length of the relevant content is higher for C99 segmentation; due to the poor segmentation, it correlates with much longer segments
  • 43. Average Length of Relevant Part and Segments (in seconds): summary of both cases, restating the observations from slides 40 and 42
  • 44. Results: fraction MAP (fMAP)
        Transcript type / Segmentation type: fMAP
        BASELINE: 0.0536
        manual / tt: 0.0500
        manual / C99: 0.0168
        ASR / tt: 0.0308
        ASR / C99: 0.0123
        ASR nsw / tt: 0.0174
        ASR nsw / C99: 0.0120
  • 45. Results: fMAP (same table): the average number of ranks with segments having a non-relevant center IPU is more than 5 times higher; the segmentation technique with longer, poorly segmented passages (C99) has much lower precision-based scores
  • 46. Outline: Retrieval Methodology (Transcript Preprocessing, Text Segmentation, Retrieval Setup); Results (Official Metrics, uMAP, pwMAP, fMAP); Conclusions
  • 47. Conclusions: TextTiling segmentation shows better overall retrieval performance than C99:
  • 48. Conclusions: higher numbers of segments with higher precision
  • 49. Conclusions: higher precision even for the segments with a non-relevant center IPU
  • 50. Conclusions: the high level of poor segmentation makes it harder to retrieve relevant content for the C99 runs
  • 51. Conclusions: removal of stop words before segmentation did not have any positive effect on the results
  • 52. Thank you for your attention! Questions?