Tip: Data scoring: Convert data with XQuery
      Do quality analysis on conversion results

      Skill Level: Intermediate


      Geert Josten
      Consultant/Content Engineer
      Daidalos BV



      29 Sep 2009


      The process of converting data is one of migrating information from an unsuitable
      source or format to a suitable one—often not an exact science. Data scoring is a way
      to measure the accuracy of your conversion. Discover a simple scoring technique in
      XQuery that you can apply to the result of a small text-to-XML conversion.

                            Frequently used acronyms
                                   •   HTML: Hypertext Markup Language

                                   •   W3C: World Wide Web Consortium

                                   •   URL: Uniform Resource Locator

                                   •   XML: Extensible Markup Language

                                   •   XSLT: Extensible Stylesheet Language Transformations


      Scoring converted data is all about analyzing the quality of the conversion. Quality
      can mean different things, and converting data from a database carries with it
      different problems than converting data from documents with more natural language.
      The technique that this tip presents makes no assumptions: You can apply it to any
      XML code of interest. To see the technique in practice, you will convert plain
      text—not comma-separated files, but plain text from news items grabbed from the
      Internet.


      Plain text input

Data scoring: Convert data with XQuery
© Copyright IBM Corporation 2009. All rights reserved.                                 Page 1 of 15
developerWorks®                                                                       ibm.com/developerWorks



     For the source, I grabbed text from the Google News Web site using the URL
     http://news.google.com/news/section?pz=1&topic=w&ict=ln. Figure 1 shows the
     resulting page.

     Figure 1. Example news item on Google News




     These news items have a basic structure: They start with a heading followed by
     source details, the news message itself, and some additional information from
     different sources. In this tip, you will extract the headline, location, text, related
     headline, and the source of the related headline. Figure 2 shows these elements.

     Figure 2. Analysis of the text of an example Google news item







      For this example, I selected the top three news items for that moment, copied the
      text, and removed the lines this tip doesn't discuss, just to save space. I also
      separated the text into the individual news items to give you a head start.

                            Line ends
                            Line ends are always represented by the numeric character 10 in
                            memory when working with XQuery, regardless of the operating
                            system.



      You will use the input data in Listing 1. (I added paragraph characters [¶] to visualize
      line ends. The entity &quot; represents the double quotation mark character.)





     Listing 1. The plain text

       let $news-items := (
        "Afghans go to polls under threat of Taliban violence¶
       KABUL, Afghanistan (CNN) - Under the menacing threat of violence from the
       Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's
       second-ever national election.¶
       Afghan people cast votes in hope of better future RT",
         "Security stepped up after Baghdad bombings¶
       BAGHDAD, Iraq (CNN) - The Iraqi government implemented new security measures a
       day after a string of bombings in Baghdad killed at least 100 people and wounded
       hundreds more, an interior ministry official told CNN on Thursday.¶
       Questioning security in Baghdad's Green Zone - 19 Aug 09 Al Jazeera",
         "Will Democrats Go At It Alone on the Health Care Bill?¶
       This is a rush transcript from &quot;On the Record,&quot; August 19, 2009. This copy may not
       be in its final form and may be updated.¶
       New Health Care Strategy CBS"
       )


     To see the scoring technique in practice, you will define patterns to extract
     information, convert to XML, and apply the technique on the result. The big issue is
     not finding patterns but converting those patterns to reliable extraction rules. This
     technique is a useful tool to help you analyze the reliability of your own rules.

     Before you start to apply patterns, however, first break down the structure of each
     news item. Each news item has three lines of data:

                • Headline
                • Message
                • Headline of a related news item
     You iterate over all news items and separate the lines by tokenizing on line ends:


       let $result :=
         for $news-item in $news-items
          let $lines := tokenize($news-item, '&#10;')


     The result variable captures the XML result.
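
As a quick check of the tokenizing step, the following standalone query splits a three-line string on the line-end character. The sample string is a made-up stand-in for a news item and is not part of the original input.

```xquery
(: Minimal sketch: split a string on line ends (character 10).
   The sample text is hypothetical. :)
let $news-item := "First line&#10;Second line&#10;Third line"
let $lines := tokenize($news-item, '&#10;')
return (count($lines), $lines[2])
(: returns 3 and "Second line" :)
```

Because tokenize treats its second argument as a regular expression, a separator that contains special regex characters would need escaping; the line-end character does not.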


     Applying patterns
     Analyzing conversion quality is only interesting when information is missing or joined
     together and must be separated—which this tip covers. Look at each line to search
     for patterns to separate combined information.

     The first line





      The first line contains only the headline. Apply normalize-space to trim redundant
      spaces:


       let $headline := normalize-space($lines[1])


      The second line

      The second line—the one with the message—obviously contains the most
      information, but has no reliable pattern except for the location of the event. You can
      find this location at the start of the line, followed by a dash. Use the dash to separate
      the location from the text:


       let $location := normalize-space(substring-before($lines[2], '-'))
       let $text := normalize-space(substring-after($lines[2], '-'))
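
To see how this rule behaves, it may help to run it on one line that has a dash and one that does not. The sketch below uses two shortened lines taken from Listing 1; the square brackets are only there to make empty results visible.

```xquery
(: Sketch: apply the dash rule to a line with a dash and a line without one. :)
for $line in (
  "KABUL, Afghanistan (CNN) - Under the menacing threat of violence",
  "This copy may not be in its final form"
)
return
  concat(
    '[', normalize-space(substring-before($line, '-')), '] ',
    '[', normalize-space(substring-after($line, '-')), ']'
  )
(: returns "[KABUL, Afghanistan (CNN)] [Under the menacing threat of violence]"
   and "[] []" :)
```

Both substring-before and substring-after return the zero-length string when the separator is absent, so a line without a dash yields neither a location nor a text.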


      The third line

      The third line is the most challenging: It contains both the headline and the name of
      the source without a marker to separate them visually. You can't know all the names,
      so you can't match them literally. Names typically start with a capital letter, which
      you can accommodate using regular expressions:


       let $related-headline :=
         normalize-space(replace($lines[3], '^(.*?) [A-Z].*$', '$1'))
       let $related-headline-source :=
         normalize-space(replace($lines[3], '^.*? ([A-Z].*)$', '$1'))


      Basically, you match the string from start (^) to end ($) in these regular expressions.
      The .*? matches up to the first space-capital combination and should capture the
      headline text. The [A-Z].* should capture the source of that headline. By putting
      opening and closing parentheses [( and )] around the part you're interested in and
      using $1 as the replacement, you should end up with either the headline or the
      source, depending on where you put the parentheses.
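
As an illustration, the two expressions can be tried on the third line of the first news item in Listing 1. This is a standalone sketch, not part of the conversion script.

```xquery
(: Sketch: split a related-headline line into headline and source. :)
let $line := "Afghan people cast votes in hope of better future RT"
return (
  normalize-space(replace($line, '^(.*?) [A-Z].*$', '$1')),
  normalize-space(replace($line, '^.*? ([A-Z].*)$', '$1'))
)
(: returns "Afghan people cast votes in hope of better future" and "RT" :)
```

Note that replace returns its input unchanged when the pattern does not match, so a line without any space-capital combination would end up whole in both variables.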


      Conversion result
      Add the lines in Listing 2 to tag the extracted information.

      Listing 2. Converting the extracted information to XML

       return
       <news-item>{
         if ($headline) then
           <headline>{$headline}</headline>
          else (),
          if ($location) then
            <location>{$location}</location>
          else (),
          if ($text) then
            <text>{$text}</text>
          else (),
          if ($related-headline) then
            <related-headline>{$related-headline}</related-headline>
          else (),
         if ($related-headline-source) then
           <related-headline-source>{
             $related-headline-source
           }</related-headline-source>
         else ()
       }</news-item>
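
The conditional construction in Listing 2 relies on a zero-length string having the effective boolean value false. The following minimal sketch, with hypothetical values, shows the effect in isolation.

```xquery
(: Sketch: an empty string is false in a condition, so no element is written.
   The values are hypothetical. :)
let $location := ""
let $text := "Some text"
return
  <news-item>{
    if ($location) then <location>{ $location }</location> else (),
    if ($text) then <text>{ $text }</text> else ()
  }</news-item>
(: returns <news-item><text>Some text</text></news-item> :)
```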


     The if expressions around the element constructors ensure that an element is written
     only when it has a non-empty value. You can gather all expressions so far and append
     the following lines to output the first news item:


       return
         $result[1]


     The lines in Listing 3 show the expected output.

     Listing 3. XML output of the first news item

       <news-item>
         <headline>Afghans go to polls under threat of Taliban violence</headline>
         <location>KABUL, Afghanistan (CNN)</location>
         <text>Under the menacing threat of violence from the
           Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's
           second-ever national election.</text>
         <related-headline>Afghan people cast votes in hope of better future</related-headline>
         <related-headline-source>RT</related-headline-source>
       </news-item>


     Now, convert all three items. Then, you can start to score how well you did.
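
For convenience, here is one way the fragments shown so far might be assembled into a single query. It is a sketch rather than the downloadable source file: the use of string-join to place the line-end character where Listing 1 shows a paragraph mark is an assumption made for readability.

```xquery
(: Sketch of the complete conversion, assembled from the fragments in this tip.
   Assumption: string-join puts the line-end character (10) where Listing 1
   shows a paragraph mark. :)
let $news-items := (
  string-join((
    "Afghans go to polls under threat of Taliban violence",
    "KABUL, Afghanistan (CNN) - Under the menacing threat of violence from the Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's second-ever national election.",
    "Afghan people cast votes in hope of better future RT"
  ), '&#10;'),
  string-join((
    "Security stepped up after Baghdad bombings",
    "BAGHDAD, Iraq (CNN) - The Iraqi government implemented new security measures a day after a string of bombings in Baghdad killed at least 100 people and wounded hundreds more, an interior ministry official told CNN on Thursday.",
    "Questioning security in Baghdad's Green Zone - 19 Aug 09 Al Jazeera"
  ), '&#10;'),
  string-join((
    "Will Democrats Go At It Alone on the Health Care Bill?",
    "This is a rush transcript from &quot;On the Record,&quot; August 19, 2009. This copy may not be in its final form and may be updated.",
    "New Health Care Strategy CBS"
  ), '&#10;')
)
let $result :=
  for $news-item in $news-items
  let $lines := tokenize($news-item, '&#10;')
  let $headline := normalize-space($lines[1])
  let $location := normalize-space(substring-before($lines[2], '-'))
  let $text := normalize-space(substring-after($lines[2], '-'))
  let $related-headline :=
    normalize-space(replace($lines[3], '^(.*?) [A-Z].*$', '$1'))
  let $related-headline-source :=
    normalize-space(replace($lines[3], '^.*? ([A-Z].*)$', '$1'))
  return
    <news-item>{
      (: write an element only when its value is non-empty :)
      if ($headline) then <headline>{ $headline }</headline> else (),
      if ($location) then <location>{ $location }</location> else (),
      if ($text) then <text>{ $text }</text> else (),
      if ($related-headline) then
        <related-headline>{ $related-headline }</related-headline>
      else (),
      if ($related-headline-source) then
        <related-headline-source>{
          $related-headline-source
        }</related-headline-source>
      else ()
    }</news-item>
return
  $result
```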


     Element scoring of the result
     To analyze the quality of conversion results, you can choose from several methods
     based on your needs. The technique presented here is basic and you can use it in
     various ways. It consists of showing statistics at two detail levels:

                • Scoring of each element of interest
                • Scoring of each distinct element value for a particular element of interest
      The elements of interest in this conversion result are all elements containing text:

                • headline
                • location
                • text
                • related-headline
                • related-headline-source
      The first detail level is merely a matter of counting all elements and calculating the
      ratio of elements to the total number of items. You can calculate the total using
      XQuery. (Note that the XML output of the news items was captured in a variable
      named $result.)


       let $element-name := 'headline'
       let $element-score := count($result//*[local-name() = $element-name])
       let $element-ratio := round(100 * $element-score div count($news-items))
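
To make the arithmetic concrete, the following standalone sketch applies the same calculation to a tiny, hypothetical result of three items, one of which lacks a location. For simplicity it counts the result items instead of the input strings.

```xquery
(: Sketch: element score and ratio on a hypothetical three-item result. :)
let $result := (
  <news-item><headline>A</headline><location>X</location></news-item>,
  <news-item><headline>B</headline></news-item>,
  <news-item><headline>C</headline><location>Y</location></news-item>
)
let $total := count($result)
for $element-name in ('headline', 'location')
let $element-score := count($result//*[local-name() = $element-name])
let $element-ratio := round(100 * $element-score div $total)
return
  concat($element-name, ': ', $element-score, ' (', $element-ratio, '%)')
(: returns "headline: 3 (100%)" and "location: 2 (67%)" :)
```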


      Matching elements on their local name is rather slow but saves code and makes
      your script more dynamic. In Listing 4, you wrap the calculation results in a small
      HTML report.

      Listing 4. Creating an element scoring result

       let $total-number-of-items := count($news-items)
       let $elements-of-interest :=
         ('headline', 'location', 'text', 'related-headline',
          'related-headline-source')
       return
       <html>
         <body>
           <p><b>Total number of items: </b>{$total-number-of-items}</p>
             <table border="1">
               <tr>
                 <th>element name</th>
                 <th>score</th>
                 <th>ratio</th>
               </tr>
               {
               for $element-name in $elements-of-interest
               let $elements :=
                 $result//*[local-name() = $element-name]
               let $element-score :=
                  count($elements)
               let $element-ratio :=
                 round(100 * $element-score div $total-number-of-items)
               return
             <tr>
               <td>{ $element-name }</td>
               <td>{ $element-score }</td>
               <td>{ concat($element-ratio,'%') }</td>
             </tr>
           }</table>
         </body>
       </html>


     To start, the code in Listing 4 displays the number of items being analyzed to give
     meaning to the ratios. Then, it loops over all elements of interest, calculates score
     and ratio, and writes a table row for each. Figure 3 shows the report of the
     conversion result. (View a text-only version of Figure 3.)

     Figure 3. Element scoring of the result




     The scores look high, but there are some drop-outs on location and text.
     Continue with the scoring of element values and investigate.


     Value scoring of the result
                           Value scoring on large datasets
                           When you apply this scoring technique on larger datasets, it can
                           result in long calculation times on the reports. Consider testing this
                           code on a small set first, then optimize the code to use the full
                           capabilities of your XQuery processor as soon as the calculation
                           time exceeds one second. If you use an XQuery-capable XML
                           database, you should be able to use indices to make things even
                           quicker.



      Scoring the values of elements is almost as straightforward as scoring elements.
      Just determine the distinct values of each element and calculate a score and ratio
      for each value.

      For additional information, you can calculate the score and ratio of missing
      elements. Replace the code in Listing 4 with the code in Listing 5.

      Listing 5. Creating an element and value scoring report

       <html>
         <body>
           <p><b>Total number of items: </b>{$total-number-of-items}</p>
           <table border="1" width="50%">
             <tr>
               <th colspan="2">element name</th>
               <th colspan="2">score</th>
               <th colspan="2">ratio</th>
             </tr>
             {
             for $element-name in $elements-of-interest
               let $elements :=
                 $result//*[local-name() = $element-name]
               let $element-score :=
                 count($elements)
               let $element-ratio :=
                 round(100 * $element-score div $total-number-of-items)
               return (
                 <tr>
                   <td colspan="2">{ $element-name }</td>
                   <td colspan="2">{ $element-score }</td>
                   <td colspan="2">{ concat($element-ratio,'%') }</td>
                 </tr>,
                  let $distinct-values :=
                    distinct-values($elements)
                  for $distinct-value in $distinct-values
                  let $value-score :=
                    count($elements[. = $distinct-value])
                  let $value-ratio :=
                    round(100 * $value-score div $total-number-of-items)
                  return
                  <tr>
                    <td>&#160;</td>
                    <td><i>{ $distinct-value }</i></td>
                    <td>&#160;</td>
                    <td><i>{ $value-score }</i></td>
                    <td>&#160;</td>
                    <td><i>{ concat($value-ratio,'%') }</i></td>
                  </tr>,
                  let $missing-score :=
                    $total-number-of-items - $element-score
                  let $missing-ratio :=
                    round(100 * $missing-score div $total-number-of-items)
                  where $missing-score > 0
                  return
               <tr>
                 <td>&#160;</td>
                 <td><i><b>(not found)</b></i></td>
                 <td>&#160;</td>
                 <td><i>{ $missing-score }</i></td>
                 <td>&#160;</td>
                 <td><i>{ concat($missing-ratio,'%') }</i></td>
               </tr>
             )}
           </table>
         </body>
       </html>


     The HTML table has six columns in this version of the report, but otherwise, the
     previous code hasn't changed. Listing 5 only extends the code with a sub-loop
     for each element of interest that iterates over all its distinct values, outputting
     additional table rows with a score and ratio for each value. It also adds a
     "(not found)" row when the element is missing in at least one of the news items.
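
The value-scoring logic can also be tried outside the HTML report. The sketch below uses hypothetical location elements and a hypothetical total of four items. The order by clause is added only because distinct-values returns its results in an implementation-dependent order.

```xquery
(: Sketch: value scores plus a "not found" score.
   The elements and the total of four items are hypothetical. :)
let $elements := (
  <location>KABUL</location>,
  <location>BAGHDAD</location>,
  <location>KABUL</location>
)
let $total := 4
let $missing-score := $total - count($elements)
return (
  for $value in distinct-values($elements)
  let $value-score := count($elements[. = $value])
  order by $value
  return
    concat($value, ': ', $value-score,
           ' (', round(100 * $value-score div $total), '%)'),
  concat('(not found): ', $missing-score,
         ' (', round(100 * $missing-score div $total), '%)')
)
(: returns "BAGHDAD: 1 (25%)", "KABUL: 2 (50%)", "(not found): 1 (25%)" :)
```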


     Analyzing the scores
     The scoring report is helpful for at least three use cases:

                • Find drop-outs
                • Inspect value distributions
                • Search for anomalies in values
     Drop-outs

     In the element score report (Figure 3 or the alternate text version of Figure 3), you
     saw a low score for location and text. Figure 4 shows the part of the value
     scoring report for these elements. (View a text-only version of Figure 4.)

     Figure 4. Value scoring of location and text








      The report shows one news item in which the location could not be extracted and
      one in which the text could not be extracted. Both gaps most likely concern the
      same news item. Unfortunately, this report doesn't provide the information necessary
      to determine why the conversion failed for those items, but it does at least show that
      it failed. Now, check the content and take a second look at the patterns behind these
      elements.

      Remember that location and text were separated from each other based on the
      presence of a dash. Return to Listing 1 and notice one news item in which the
      second line does not begin with a location. There is no dash, so the pattern to
      separate the items fails.
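
If you want to know which items are affected, one possible follow-up is to select the news items that lack an element and list their headlines. The sketch below is an addition to the technique in this tip and runs on a small, hypothetical result.

```xquery
(: Sketch: list the headlines of items that lack a location or a text.
   The result items are hypothetical. :)
let $result := (
  <news-item>
    <headline>First headline</headline>
    <location>SOMEWHERE</location>
    <text>Some text</text>
  </news-item>,
  <news-item>
    <headline>Second headline</headline>
  </news-item>
)
for $item in $result[not(location) or not(text)]
return string($item/headline)
(: returns "Second headline" :)
```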

      Value distributions

      Apart from spotting drop-outs, you can also use these reports to analyze value
      distributions. Having converted only the three items, each value occurs only once, as
      you will see if you run the whole script and look at the report. Use the report on a
      larger dataset to experiment with this use case.

      Anomalies

      A third use case is to inspect the element values to spot anomalies. Figure 5 shows
      the part of the value scoring report that reveals the related-headline and
      related-headline-source elements. (View a text-only version of Figure 5.)

      Figure 5. Value scoring of related headline and source








     Notice that your extraction rule did succeed but produced incorrect values. The
     values of the related-headline-source should have been:

                • CBS
                • RT
                • Al Jazeera
     Producing correct values here is difficult, but at least this report can help you spot
     such mistakes. If a lexicon of valid source values is available, you can use it to
     validate the captured values, then signal for incorrect values or rule them out as
     false positives.
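
A lexicon check of this kind might look like the sketch below. The lexicon holds the three expected source names listed above. The captured values are what the regular expressions from this tip should produce for the input in Listing 1, so treat them as an assumption if your input differs.

```xquery
(: Sketch: flag captured source values that are not in a lexicon.
   $valid-sources is a hypothetical lexicon; $captured holds the values
   assumed to come out of the extraction rule. :)
let $valid-sources := ('CBS', 'RT', 'Al Jazeera')
let $captured := (
  'RT',
  "Baghdad's Green Zone - 19 Aug 09 Al Jazeera",
  'Health Care Strategy CBS'
)
for $source in $captured
return
  <source valid="{ $source = $valid-sources }">{ $source }</source>
```

Only the first value is flagged as valid here; the other two contain a valid name at the end but do not match it exactly, which is the kind of anomaly the value scoring report brings to light.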


     Conclusion
     With simple calculations, you produced small HTML reports of element and value
     scores on the result of a small text-to-XML conversion. You successfully used these
     reports to analyze the quality of this conversion, spot drop-outs, look at value
     distributions, and find anomalies. This scoring technique can be a useful tool for
     quality analysis of conversion results.








      Downloads
      Description                Name                   Size   Download method
      Source file for this tip   data-scoring.xqy.zip   2KB    HTTP











     About the author
     Geert Josten
     Geert Josten has been a content engineer at Daidalos for nearly 10 years, applying
     his knowledge of XSLT and XQuery as well as other, proprietary transformation
     languages. He also works as a Web and Java developer at Daidalos and has
     consulted for dozens of customers in a wide range of areas.







      Trademarks
      IBM, the IBM logo, ibm.com, DB2, developerWorks, Lotus, Rational, Tivoli, and
      WebSphere are trademarks or registered trademarks of International Business
      Machines Corporation in the United States, other countries, or both. These and other
      IBM trademarked terms are marked on their first occurrence in this information with
      the appropriate symbol (® or ™), indicating US registered or common law
      trademarks owned by IBM at the time this information was published. Such
      trademarks may also be registered or common law trademarks in other countries.
      See the current list of IBM trademarks.
      Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered
      trademarks or trademarks of Adobe Systems Incorporated in the United States,
      and/or other countries.
      Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the
      United States, other countries, or both.
      Other company, product, or service names may be trademarks or service marks of
      others.




Data scoring: Convert data with XQuery
© Copyright IBM Corporation 2009. All rights reserved.                                Page 15 of 15

Weitere ähnliche Inhalte

Was ist angesagt?

It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.Alex Powers
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbaiUnmesh Baile
 
5. Other Relational Languages in DBMS
5. Other Relational Languages in DBMS5. Other Relational Languages in DBMS
5. Other Relational Languages in DBMSkoolkampus
 
A "M"ind Bending Experience. Power Query for Excel and Beyond.
A "M"ind Bending Experience. Power Query for Excel and Beyond.A "M"ind Bending Experience. Power Query for Excel and Beyond.
A "M"ind Bending Experience. Power Query for Excel and Beyond.Alex Powers
 
Intake 38 data access 4
Intake 38 data access 4Intake 38 data access 4
Intake 38 data access 4Mahmoud Ouf
 
Database mapping of XBRL instance documents from the WIP taxonomy
Database mapping of XBRL instance documents from the WIP taxonomyDatabase mapping of XBRL instance documents from the WIP taxonomy
Database mapping of XBRL instance documents from the WIP taxonomyAlexander Falk
 
Physical elements of data
Physical elements of dataPhysical elements of data
Physical elements of dataDimara Hakim
 
What to do when one size does not fit all?!
What to do when one size does not fit all?!What to do when one size does not fit all?!
What to do when one size does not fit all?!Arjen de Vries
 
Bc0038– data structure using c
Bc0038– data structure using cBc0038– data structure using c
Bc0038– data structure using chayerpa
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query OptimizationBrian Gallagher
 

Was ist angesagt? (18)

It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbai
 
5. Other Relational Languages in DBMS
5. Other Relational Languages in DBMS5. Other Relational Languages in DBMS
5. Other Relational Languages in DBMS
 
A "M"ind Bending Experience. Power Query for Excel and Beyond.
A "M"ind Bending Experience. Power Query for Excel and Beyond.A "M"ind Bending Experience. Power Query for Excel and Beyond.
A "M"ind Bending Experience. Power Query for Excel and Beyond.
 
Programming in C
Programming in CProgramming in C
Programming in C
 
Intake 38 data access 4
Intake 38 data access 4Intake 38 data access 4
Intake 38 data access 4
 
Lab3cth
Lab3cthLab3cth
Lab3cth
 
Typed data in drupal 8
Typed data in drupal 8Typed data in drupal 8
Typed data in drupal 8
 
Database mapping of XBRL instance documents from the WIP taxonomy
Database mapping of XBRL instance documents from the WIP taxonomyDatabase mapping of XBRL instance documents from the WIP taxonomy
Database mapping of XBRL instance documents from the WIP taxonomy
 
Jit abhishek sarkar
Jit abhishek sarkarJit abhishek sarkar
Jit abhishek sarkar
 
Physical elements of data
Physical elements of dataPhysical elements of data
Physical elements of data
 
What to do when one size does not fit all?!
What to do when one size does not fit all?!What to do when one size does not fit all?!
What to do when one size does not fit all?!
 
Bc0038– data structure using c
Bc0038– data structure using cBc0038– data structure using c
Bc0038– data structure using c
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query Optimization
 
Chap08
Chap08Chap08
Chap08
 
Structure and union
Structure and unionStructure and union
Structure and union
 
Structures in c++
Structures in c++Structures in c++
Structures in c++
 
Ch3
Ch3Ch3
Ch3
 

Andere mochten auch

MarkLogicWorld 2013 - Automate your deployments
MarkLogicWorld 2013 - Automate your deploymentsMarkLogicWorld 2013 - Automate your deployments
MarkLogicWorld 2013 - Automate your deploymentsGeert Josten
 
RESTful Services
RESTful ServicesRESTful Services
RESTful ServicesKurt Cagle
 
Word als XML-auteursomgeving (&lt;!Element j10 nr2 2004)
Word als XML-auteursomgeving (&lt;!Element j10 nr2 2004)Word als XML-auteursomgeving (&lt;!Element j10 nr2 2004)
Word als XML-auteursomgeving (&lt;!Element j10 nr2 2004)Geert Josten
 
XQuery Novelties (XML Holland 2010 - hardcore xml)
XQuery Novelties (XML Holland 2010 - hardcore xml)XQuery Novelties (XML Holland 2010 - hardcore xml)
XQuery Novelties (XML Holland 2010 - hardcore xml)Geert Josten
 
XRX Presentation to Minnesota OTUG
XRX Presentation to Minnesota OTUGXRX Presentation to Minnesota OTUG
XRX Presentation to Minnesota OTUGOptum
 
XQuery Novelties (XML Holland 2010)
XQuery Novelties (XML Holland 2010)XQuery Novelties (XML Holland 2010)
XQuery Novelties (XML Holland 2010)Geert Josten
 
Xml holland - XQuery novelties - Geert Josten
Xml holland - XQuery novelties - Geert JostenXml holland - XQuery novelties - Geert Josten
Xml holland - XQuery novelties - Geert JostenDaidalos
 
The Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsThe Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsBarry Feldman
 

Tip: Data Scoring: Convert data with XQuery

Do quality analysis on conversion results

Skill Level: Intermediate

Geert Josten
Consultant/Content Engineer
Daidalos BV

29 Sep 2009
© Copyright IBM Corporation 2009. All rights reserved.

The process of converting data is one of migrating information from an unsuitable source or format to a suitable one, and it is often not an exact science. Data scoring is a way to measure the accuracy of your conversion. Discover a simple scoring technique in XQuery that you can apply to the result of a small text-to-XML conversion.

Frequently used acronyms
• HTML: Hypertext Markup Language
• W3C: World Wide Web Consortium
• URL: Uniform Resource Locator
• XML: Extensible Markup Language
• XSLT: Extensible Stylesheet Language Transformations

Scoring converted data is all about analyzing the quality of the conversion. Quality can mean different things, and converting data from a database carries with it different problems than converting data from documents with more natural language. The technique that this tip presents makes no assumptions: You can apply it to any XML code of interest. To see the technique in practice, you will convert plain text (not comma-separated files, but plain text from news items grabbed from the Internet).

Plain text input
For the source, I grabbed text from the Google news Web site using the URL http://news.google.com/news/section?pz=1&topic=w&ict=ln. Figure 1 shows the resulting page.

Figure 1. Example news item on Google news

These news items have a basic structure: They start with a heading followed by source details, the news message itself, and some additional information from different sources. In this tip, you will extract the headline, location, text, related headline, and the source of the related headline. Figure 2 shows these elements.

Figure 2. Analysis of the text of an example Google news item
For this example, I selected the top three news items for that moment, copied the text, and removed the lines this tip doesn't discuss, just to save space. I also separated the text into the individual news items to give you a head start.

Line ends
Line ends are always represented by the numeric character 10 in memory when working with XQuery, regardless of the operating system.

You will use the input data in Listing 1. (I added paragraph characters [¶] to visualize line ends. The entity &#34; represents the double quotation mark character.)
Listing 1. The plain text

    let $news-items := (
    "Afghans go to polls under threat of Taliban violence¶
    KABUL, Afghanistan (CNN) - Under the menacing threat of violence from the Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's second-ever national election.¶
    Afghan people cast votes in hope of better future RT",

    "Security stepped up after Baghdad bombings¶
    BAGHDAD, Iraq (CNN) - The Iraqi government implemented new security measures a day after a string of bombings in Baghdad killed at least 100 people and wounded hundreds more, an interior ministry official told CNN on Thursday.¶
    Questioning security in Baghdad's Green Zone - 19 Aug 09 Al Jazeera",

    "Will Democrats Go At It Alone on the Health Care Bill?¶
    This is a rush transcript from &#34;On the Record,&#34; August 19, 2009. This copy may not be in its final form and may be updated.¶
    New Health Care Strategy CBS"
    )

To see the scoring technique in practice, you will define patterns to extract information, convert to XML, and apply the technique on the result. The big issue is not finding patterns but converting those patterns to reliable extraction rules. This technique is a useful tool to help you analyze the reliability of your own rules.

Before you start to apply patterns, however, first break down the structure of each news item. Each news item has three lines of data:
• Headline
• Message
• Headline of a related news item

You iterate over all news items and separate the lines by tokenizing on line ends:

    let $result :=
      for $news-item in $news-items
      let $lines := tokenize($news-item, '&#10;')

The result variable captures the XML result.

Applying patterns

Analyzing conversion quality is only interesting when information is missing or joined together and must be separated, which is the case this tip covers. Look at each line to search for patterns to separate combined information.
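The line-splitting step described above can be tried in isolation. The following is a minimal sketch, not part of the original tip: the sample string and the lines and line wrapper elements are made up for illustration, and only tokenize, normalize-space, and count are standard XQuery functions.

```xquery
xquery version "1.0";

(: Hypothetical one-item sample; "&#10;" is the character reference for
   the line-feed character (code point 10). :)
let $news-item :=
  "Sample headline&#10;CITY, Country (SRC) - Sample message text.&#10;Sample related headline Source"

(: Split the item into its lines on the line-feed character. :)
let $lines := tokenize($news-item, '&#10;')

return
  <lines count="{count($lines)}">{
    for $line at $pos in $lines
    return <line n="{$pos}">{normalize-space($line)}</line>
  }</lines>
```

Under these assumptions the sample yields three line elements, one for each of the headline, the message, and the related headline.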
The first line
The first line contains only the headline. Apply normalize-space to trim redundant spaces:

    let $headline := normalize-space($lines[1])

The second line

The second line (the one with the message) obviously contains the most information, but has no reliable pattern except for the location of the event. You can find this location at the start of the line, followed by a dash. Use the dash to separate the location from the text:

    let $location := normalize-space(substring-before($lines[2], '-'))
    let $text := normalize-space(substring-after($lines[2], '-'))

The third line

The third line is the most challenging: It contains both the headline and the name of the source without a marker to separate them visually. You can't know all the names, so you can't match them literally. Names typically start with a capital letter, which you can accommodate using regular expressions:

    let $related-headline :=
      normalize-space(replace($lines[3], '^(.*?) [A-Z].*$', '$1'))
    let $related-headline-source :=
      normalize-space(replace($lines[3], '^.*? ([A-Z].*)$', '$1'))

Basically, you match the string from start (^) to end ($) in these regular expressions. The .*? matches up to the first space-capital combination and should capture the headline text. The [A-Z].* should capture the source of that headline. By putting opening and closing parentheses [( and )] around the part you're interested in and using $1 as the replacement, you should end up with either the headline or the source, depending on where you put the parentheses.

Conversion result

Add the lines in Listing 2 to tag the extracted information.

Listing 2. Converting the extracted information to XML

    return
      <news-item>{
        if ($headline) then
          <headline>{$headline}</headline>
        else (),
        if ($location) then
          <location>{$location}</location>
        else (),
        if ($text) then
          <text>{$text}</text>
        else (),
        if ($related-headline) then
          <related-headline>{$related-headline}</related-headline>
        else (),
        if ($related-headline-source) then
          <related-headline-source>{
            $related-headline-source
          }</related-headline-source>
        else ()
      }</news-item>

The if expressions around the tags ensure that only tags with a non-empty value are written. You can gather all expressions so far and append the following line to output the first news item:

    return $result[1]

The lines in Listing 3 show the expected output.

Listing 3. XML output of the first news item

    <news-item>
      <headline>Afghans go to polls under threat of Taliban violence</headline>
      <location>KABUL, Afghanistan (CNN)</location>
      <text>Under the menacing threat of violence from the Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's second-ever national election.</text>
      <related-headline>Afghan people cast votes in hope of better future</related-headline>
      <related-headline-source>RT</related-headline-source>
    </news-item>

Now, convert all three items. Then, you can start to score how well you did.

Element scoring of the result

To analyze the quality of conversion results, you can choose from several methods based on your needs. The technique presented here is basic and you can use it in various ways. It consists of showing statistics at two detail levels:
• Scoring of each element of interest
• Scoring of each distinct element value for a particular element of interest

The elements of interest in this conversion result are all elements containing text:
• headline
• location
• text
• related-headline
• related-headline-source

The first detail level is merely a matter of counting all elements and calculating the ratio of elements to the total number of items. You can calculate the total using XQuery. (Note that the XML output of the news items was captured in a variable named $result.)

    let $element-name := 'headline'
    let $element-score := count($result//*[local-name() = $element-name])
    let $element-ratio := round(100 * $element-score div count($news-items))

Matching elements on their local name is rather slow but saves code and makes your script more dynamic. In Listing 4, you wrap the calculation results in a small HTML report.

Listing 4. Creating an element scoring result

    let $total-number-of-items := count($news-items)
    let $elements-of-interest :=
      ('headline', 'location', 'text', 'related-headline', 'related-headline-source')
    return
      <html>
        <body>
          <p><b>Total number of items: </b>{$total-number-of-items}</p>
          <table border="1">
            <tr>
              <th>element name</th>
              <th>score</th>
              <th>ratio</th>
            </tr>
            {
              for $element-name in $elements-of-interest
              let $elements := $result//*[local-name() = $element-name]
              let $element-score := count($elements)
              let $element-ratio :=
                round(100 * $element-score div $total-number-of-items)
              return
                <tr>
                  <td>{ $element-name }</td>
                  <td>{ $element-score }</td>
                  <td>{ concat($element-ratio,'%') }</td>
                </tr>
            }
          </table>
        </body>
      </html>

To start, the code in Listing 4 displays the number of items being analyzed to give meaning to the ratios. Then, it loops over all elements of interest, calculates score and ratio, and writes a table row for each. Figure 3 shows the report of the conversion result.

Figure 3. Element scoring of the result

The scores look high, but there are some drop-outs on location and text. Continue with the scoring of element values and investigate.

Value scoring of the result

Value scoring on large datasets
When you apply this scoring technique on larger datasets, it can result in long calculation times on the reports. Consider testing this code on a small set first, then optimize the code to use the full capabilities of your XQuery processor as soon as the calculation time exceeds one second. If you use an XQuery-capable XML database, you should be able to use indices to make things even
quicker.

Scoring the values of elements is almost as straightforward as scoring elements. Just determine the distinct values of each element and calculate a score and ratio for each value. For additional information, you can calculate the score and ratio of missing elements. Replace the code in Listing 4 with the code in Listing 5.

Listing 5. Creating an element and value scoring report

    <html>
      <body>
        <p><b>Total number of items: </b>{$total-number-of-items}</p>
        <table border="1" width="50%">
          <tr>
            <th colspan="2">element name</th>
            <th colspan="2">score</th>
            <th colspan="2">ratio</th>
          </tr>
          {
            for $element-name in $elements-of-interest
            let $elements := $result//*[local-name() = $element-name]
            let $element-score := count($elements)
            let $element-ratio :=
              round(100 * $element-score div $total-number-of-items)
            return (
              <tr>
                <td colspan="2">{ $element-name }</td>
                <td colspan="2">{ $element-score }</td>
                <td colspan="2">{ concat($element-ratio,'%') }</td>
              </tr>,

              let $distinct-values := distinct-values($elements)
              for $distinct-value in $distinct-values
              let $value-score := count($elements[. = $distinct-value])
              let $value-ratio :=
                round(100 * $value-score div $total-number-of-items)
              return
                <tr>
                  <td>&#160;</td>
                  <td><i>{ $distinct-value }</i></td>
                  <td>&#160;</td>
                  <td><i>{ $value-score }</i></td>
                  <td>&#160;</td>
                  <td><i>{ concat($value-ratio,'%') }</i></td>
                </tr>,

              let $missing-score := $total-number-of-items - $element-score
              let $missing-ratio :=
                round(100 * $missing-score div $total-number-of-items)
              where $missing-score > 0
              return
                <tr>
                  <td>&#160;</td>
                  <td><i><b>(not found)</b></i></td>
                  <td>&#160;</td>
                  <td><i>{ $missing-score }</i></td>
                  <td>&#160;</td>
                  <td><i>{ concat($missing-ratio,'%') }</i></td>
                </tr>
            )
          }
        </table>
      </body>
    </html>

The HTML table has six columns in this version of the report, but otherwise, the previous code hasn't changed. Listing 5 only extends the code with a sub-loop for each element of interest that iterates over all its distinct values, outputting additional table rows with a score and ratio for each value. It also adds a table row when the element is missing in at least one of the results.

Analyzing the scores

The scoring report is helpful for at least three use cases:
• Find drop-outs
• Inspect value distributions
• Search for anomalies in values

Drop-outs

In the element score report (Figure 3), you saw a low score for location and text. Figure 4 shows the part of the value scoring report for these elements.

Figure 4. Value scoring of location and text
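Drop-outs like the ones reported for location and text can be reproduced on a small scale. The sketch below is an illustration rather than the tip's own code: it applies the dash-based rule for the second line to two sample lines modeled on Listing 1 (the first is shortened here, the second has no dash) and wraps the result the same way Listing 2 does.

```xquery
xquery version "1.0";

(: Two second lines modeled on Listing 1. The first is shortened for
   this sketch; the second contains no dash at all. :)
let $second-lines := (
  "KABUL, Afghanistan (CNN) - Under the menacing threat of violence from the Taliban.",
  "This is a rush transcript from &#34;On the Record,&#34; August 19, 2009."
)
for $line in $second-lines
(: substring-before and substring-after return a zero-length string
   when the separator does not occur in the input. :)
let $location := normalize-space(substring-before($line, '-'))
let $text := normalize-space(substring-after($line, '-'))
return
  <news-item>{
    (: A zero-length string is false in a condition, so the element is skipped. :)
    if ($location) then <location>{$location}</location> else (),
    if ($text) then <text>{$text}</text> else ()
  }</news-item>
```

For the line without a dash, both functions return a zero-length string, so neither element is written and the item counts as a drop-out for location and for text.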
The report shows one news item in which the location could not be extracted and one in which the text could not be extracted. Both absences are likely to concern the same news item. Unfortunately, this report doesn't provide the information necessary to determine why the conversion failed for those items, but it does at least show that it failed.

Now, check the content and take a second look at the patterns behind these elements. Remember that location and text were separated from each other based on the presence of a dash. Return to Listing 1 and notice one news item in which the second line does not begin with a location. There is no dash, so the pattern to separate the items fails.

Value distributions

Apart from spotting drop-outs, you can also use these reports to analyze value distributions. Because you converted only three items, each value occurs only once, as you will see if you run the whole script and look at the report. Use the report on a larger dataset to experiment with this use case.

Anomalies

A third use case is to inspect the element values to spot anomalies. Figure 5 shows the part of the value scoring report that covers the related-headline and related-headline-source elements.

Figure 5. Value scoring of related headline and source
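The values behind Figure 5 can be reproduced with a minimal sketch that applies the two regular expressions from the third-line step to the third lines of Listing 1. The split wrapper element is made up for this illustration; the patterns and the input lines are taken from the tip.

```xquery
xquery version "1.0";

(: Third lines copied from Listing 1. :)
let $third-lines := (
  "Afghan people cast votes in hope of better future RT",
  "Questioning security in Baghdad's Green Zone - 19 Aug 09 Al Jazeera",
  "New Health Care Strategy CBS"
)
for $line in $third-lines
return
  <split>
    <related-headline>{
      (: Keep everything before the first space followed by a capital letter. :)
      normalize-space(replace($line, '^(.*?) [A-Z].*$', '$1'))
    }</related-headline>
    <related-headline-source>{
      (: Keep everything from that first capital letter to the end of the line. :)
      normalize-space(replace($line, '^.*? ([A-Z].*)$', '$1'))
    }</related-headline-source>
  </split>
```

Because the patterns split at the first space followed by a capital letter, only the first line is separated just before its source (RT). The second line is cut before "Baghdad's" and the third before "Health", so most of each headline ends up in the source value.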
      Notice in Figure 5 that your extraction rule did succeed but produced incorrect values. The values of related-headline-source should have been:

            •   CBS

            •   RT

            •   Al Jazeera

      Producing correct values here is difficult, but at least this report can help you spot such mistakes. If a lexicon of valid source values is available, you can use it to validate the captured values, then flag incorrect values or rule them out as false positives.


      Conclusion

      With simple calculations, you produced small HTML reports of element and value scores on the result of a small text-to-XML conversion. You successfully used these reports to analyze the quality of this conversion, spot drop-outs, look at value distributions, and find anomalies. This scoring technique can be a useful tool for quality analysis of conversion results.
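      The lexicon check suggested in the Anomalies section could look roughly like the sketch below. It assumes a lexicon held as a plain sequence of strings, here the three source names listed above as the correct values. The captured values and the output element (source, with valid and score attributes) are hypothetical.

```xquery
xquery version "1.0";

(: Assumed lexicon: the three source names the text lists as correct. :)
let $lexicon := ("CBS", "RT", "Al Jazeera")

(: Hypothetical captured values; the second one is deliberately invalid. :)
let $captured :=
  <items>
    <related-headline-source>CBS</related-headline-source>
    <related-headline-source>Not A Source</related-headline-source>
  </items>/related-headline-source

for $value in distinct-values($captured)
(: A value is valid when it equals at least one lexicon entry. :)
let $valid := $value = $lexicon
order by $value
return
  <source valid="{ $valid }" score="{ count($captured[. eq $value]) }">{
    $value
  }</source>
```

      In this sample, CBS is marked valid="true" and the made-up value is marked valid="false". An exact string comparison is the simplest choice. A real lexicon would probably need whitespace and case normalization before comparing.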
      Downloads

      Description                 Name                   Size   Download method
      Source file for this tip    data-scoring.xqy.zip   2KB    HTTP
      Resources

      Learn

            •   XQuery 1.0: An XML Query Language: Read the W3C's XQuery specification.

            •   XQuery 1.0 and XPath 2.0 Functions and Operators: Learn about the various functions and operators available in XQuery.

            •   HTML 4.01: Read the W3C's HTML specification.

            •   IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

            •   XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.

            •   developerWorks technical events and webcasts: Stay current with technology in these sessions.

            •   The technology bookstore: Browse for books on these and other technical topics.

            •   developerWorks podcasts: Listen to interesting interviews and discussions for software developers.

      Get products and technologies

            •   IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

      Discuss

            •   XML zone discussion forums: Participate in any of several XML-related discussions.

            •   developerWorks blogs: Check out these blogs and get involved in the developerWorks community.


      About the author

      Geert Josten
      Geert Josten has been a content engineer at Daidalos for nearly 10 years, applying his knowledge of XSLT and XQuery as well as other, proprietary transformation languages. He also works as a Web and Java developer at Daidalos and has consulted for dozens of customers in a wide range of areas.