No tempest in my teapot:
                Analysis of Crowdsourced Data
                 and User Experiences at the
                 California Digital Newspaper
                           Collection


                                               Brian Geiger
                         Director, Center for Bibliographical Studies and Research
                                  California Digital Newspaper Collection

                                           Frederick Zarndt
                                     Chair, IFLA Newspapers Section

                                                             Photo held by John Oxley Library, State Library of Queensland. Original from
                                                             Courier-mail, Brisbane, Queensland, Australia.

Friday, January 25, 13                                                                                                                      1
Try crowdsourcing!
                               Correct California newspapers text
                                      http://cdnc.ucr.edu

                               Correct Australian newspapers text
                                    http://trove.nla.gov.au

                               Correct Cambridge MA newspapers text
                                   http://bit.ly/cambridgepublic

                               Correct Russian language periodicals
                                 http://bit.ly/russianperiodicals




Crowds
The Wisdom of Crowds

             In 2004 James Surowiecki published “The Wisdom
             of Crowds: Why the Many Are Smarter Than the
             Few and How Collective Wisdom Shapes Business,
             Economies, Societies and Nations”. In it he asserts that

                         a crowd of persons that is diverse,
                         independent, and decentralized usually makes
                         better judgements or decisions than single
                         persons



“crowdsourcing”

              was coined by Jeff Howe in “The rise of
              crowdsourcing” published in Wired magazine June
              2006.




A Google advanced search for
                         “crowdsourcing” from 1-Jun-2006, the date
                          of publication of Jeff Howe’s Wired magazine
                            article, to 1-Jun-2007 gives 44,600 hits.
                         A date range of 1-Jun-2011 to 1-Jun-2012 gives
                                        2,680,000 hits.




                            Searches used the Internet Archive’s Wayback Machine
Crowdsourcing is a process that
                involves outsourcing tasks to a distributed
                 group of people. ... the difference between
                crowdsourcing and ordinary outsourcing is
                 that a task or problem is outsourced to an
                   undefined public rather than a specific
                       body, such as paid employees.



Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing
(accessed June 1, 2012)
Crowdsourcing is a type of participative online activity in
            which an individual, an institution, a non-profit
            organization, or company proposes to a group of individuals
            of varying knowledge, heterogeneity, and number, via a
            flexible open call, the voluntary undertaking of a task. The
            undertaking of the task, of variable complexity and
            modularity, and in which the crowd should participate
            bringing their work, money, knowledge and/or experience,
            always entails mutual benefit. The user will receive the
            satisfaction of a given type of need, be it economic, social
            recognition, self-esteem, or the development of individual
            skills, while the crowdsourcer will obtain and utilize to their
            advantage that what the user has brought to the venture,
            whose form will depend on the type of activity undertaken.



Enrique Estellés-Arolas and Fernando González-Ladrón-de-Guevara. Towards an integrated crowdsourcing definition.
Journal of Information Science XX(X). 2012. pp. 1-14.
[Figure: word cloud of crowd* terms: crowdsourcing, crowdcollaboration, crowdfunding, crowdlearning, crowdcasting, crowdvoting, citizen science]
what is Alexa?
    •   Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is
        the measurement, collection, analysis and reporting of Internet data for the purposes of
        understanding and optimizing web usage. Alexa is now a subsidiary of Amazon.

    •   Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat.

    •   Alexa's operations include archiving webpages as they are crawled. This database served
        as the basis for the creation of the Internet Archive, accessible through the Wayback
        Machine.

    •   Alexa continually crawls all publicly-available websites to create a series of snapshots of
        the web.

    •   Alexa gathers information from a variety of sources to provide key statistics about each
        site on the web, for example Traffic Rank, number of PageViews, site Speed, and
        Bounce Rate. This information is derived from Alexa toolbar users (~6,000,000
        worldwide).




definitions
             •    A PageView is a request for a file whose type is defined as a page.

             •    A Unique Visitor is a uniquely identified client generating requests on the
                  web server or viewing pages within a defined time period (e.g. day, week or
                  month). A Unique Visitor counts once within the time period.

             •    A Visit is a series of page requests from the same uniquely identified client
                  with a time of no more than 30 minutes between each page request.

             •    Bounce Rate is the percentage of visits where the visitor enters and exits at
                  the same page without visiting any other pages on the site in between.

             •    World | Country Rank is a function of the average daily unique visits and
                  the number of unique pages requested.




    definitions adapted from Wikipedia http://en.wikipedia.org/wiki/Web_analytics
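
The Visit and Bounce Rate definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is one client's page-request timestamps, and a bounce is approximated as a visit containing a single page request. It is not how Alexa or any particular analytics package computes these figures.

```python
from datetime import datetime, timedelta

# Threshold taken from the Visit definition: at most 30 minutes between requests.
SESSION_GAP = timedelta(minutes=30)

def split_into_visits(request_times):
    """Group one client's page-request timestamps into visits (hypothetical helper)."""
    visits = []
    for t in sorted(request_times):
        if visits and t - visits[-1][-1] <= SESSION_GAP:
            visits[-1].append(t)      # within 30 minutes: same visit
        else:
            visits.append([t])        # longer gap: a new visit starts
    return visits

def bounce_rate(visits):
    """Percentage of visits with exactly one page request (simplifying assumption)."""
    if not visits:
        return 0.0
    bounces = sum(1 for v in visits if len(v) == 1)
    return 100.0 * bounces / len(visits)

# Made-up sample data: four requests from one client on one day.
sample = [
    datetime(2013, 1, 25, 9, 0),
    datetime(2013, 1, 25, 9, 10),
    datetime(2013, 1, 25, 9, 45),
    datetime(2013, 1, 25, 11, 0),
]
sample_visits = split_into_visits(sample)
print(len(sample_visits), round(bounce_rate(sample_visits), 1))
```

With the sample data the first two requests form one visit and the last two are single-request visits, so two of the three visits count as bounces.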

crowdsourcing




                                  Amazon Mechanical Turk was launched Nov 2005
                         Alexa global rank of Amazon Mechanical Turk (23-Jan-2013): 8,346

crowdsourcing




                   Each day 100,000,000 reCAPTCHAs are served to websites around the world
                               reCAPTCHA is used by more than 200,000 websites

crowdvoting

                          Iowa Electronic Market was 1st
                          launched in 1995

                          Alexa global traffic rank of Iowa
                          Electronic Market (6-Aug-2012):
                          11,290

                          Alexa US traffic rank of Iowa
                          Electronic Market (6-Aug-2012):
                          3,923




citizen science




                                    Galaxy Zoo was 1st launched July 2007
                         Alexa global traffic rank of Galaxy Zoo (13-Jun-2012): 557,766

crowdfunding




                                          Kickstarter was 1st launched in 2008
                               Alexa global traffic rank of Kickstarter (6-Aug-2012): 752
                         27,528 projects successfully funded with more than USD $254,000,000


crowdlearning




                     duolingo was 1st launched Nov 2011 (private beta) / Jun 2012 (public)
                  Alexa global / country traffic rank of duolingo (23-Jan-2013): 22,052 / 11,761

crowdcollaboration




Wikipedia

            •   Began 2001

            •   Now in 285 languages

            •   3,900,000+ articles in English, 1,400,000+ in German, 1,250,000+ in French,
                1,050,000 in Dutch

            •   40 wikipedia languages with more than 100,000 articles

            •   112 wikipedia languages with more than 10,000 articles

            •   400,000,000 unique visitors per month

            •   85,000 active contributors

            •   Alexa traffic rank: #6 worldwide, #7 in the USA (23-Jan-2013)




Family Search Indexing was 1st launched (beta) 2004
                Alexa global / country traffic rank of FamilySearch (13-Jun-2012): 4,352 / 1,357

• Started (beta) 2004
             • More than 780,000 worldwide registered volunteers
               from ~25 countries index records relevant to family
               history
             • Approximately 100,000 active volunteers each month
             • UI in Chinese, English, German, French, Italian,
               Japanese, Korean, Portuguese, and Russian
             • Blind double-key entry with arbitration / reconciliation
             • More than 1,500,088,741 records indexed (July 2012)
             • Accuracy typically > 99.95%
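
Blind double-key entry with arbitration can be illustrated with a short Python sketch. The slides do not describe FamilySearch's actual workflow, so everything here is an assumption: the function names are hypothetical, each record is a dictionary of field values, and the `arbitrate` callback stands in for a third reviewer who decides when the two keyers disagree.

```python
def reconcile_field(key_a, key_b, arbitrate):
    """Accept a value when two independent keyers agree; otherwise ask the arbitrator."""
    if key_a == key_b:
        return key_a
    return arbitrate(key_a, key_b)

def reconcile_record(record_a, record_b, arbitrate):
    """Reconcile two independently keyed versions of the same record, field by field."""
    return {
        field: reconcile_field(record_a[field], record_b[field], arbitrate)
        for field in record_a
    }

# Made-up example: the keyers disagree on the surname only.
keyer_1 = {"surname": "Zarndt", "year": "1871"}
keyer_2 = {"surname": "Zarnat", "year": "1871"}
disputed = []

def record_and_pick_first(a, b):
    """Placeholder arbitrator: logs the conflict and keeps the first reading."""
    disputed.append((a, b))
    return a

result = reconcile_record(keyer_1, keyer_2, record_and_pick_first)
print(result, disputed)
```

Only the disputed field reaches the arbitrator, which is the point of the design: agreement between two blind keyers is treated as sufficient evidence, and human review is reserved for conflicts.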

Project Gutenberg was 1st launched Dec 1971
                         Alexa global traffic rank of Project Gutenberg (13-Jun-2012): 5,744

• Started Dec 1971
              • Worldwide volunteers transcribe or proofread OCR’d
                public domain books through Distributed Proofreaders
              • 40,000 books completed (July 2012)
              • Partner / affiliated projects for Australia, Canada,
                Europe, Germany, Luxembourg, Philippines, Runeberg
                (Nordic literature), Russia, Taiwan




Alexa global / country traffic rank of National Library of Australia (31-Oct-2012): 15,519 / 406
                           Trove gets ~72% of all National Library web traffic.
National Library of
                              Australia
                 • Online since 2008
                 • 7,200,000+ pages
                 • Top text corrector 1,250,000 lines (June 2012)
                 • 2,450,000+ lines corrected each month (average
                   for 1st 6 months 2012)
                 • 68,908,757 lines corrected as of July 2012, up
                   from 42,411,468 lines corrected July 2011.
                 • 63,613 total registered users (July 2012)
                 • 4,146 active users (June 2012)

Alexa global / country traffic rank of National Library of Finland
                                   2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)
National Library of
                              Finland
          • Digitalkoot is a project to improve OCR text in
            digitized newspapers -- by playing games!
          • Digitalkoot is a collaboration between the National
            Library and Microtask
          • Players correct OCR text by playing Myyräsillassa
            (Mole Bridge) or Myyräjahdissa (Mole Hunt)
          • National Library has 4,000,000+ digitized pages
          • 109,321 registered players (October 2012)
          • Since February 2011 8,024,530 micro-tasks have
            been completed
Alexa global / country traffic rank of UC Riverside (31-Oct-2012): 12,439 / 4,717
                               CDNC gets ~1.84% of all UC Riverside web traffic.
California Digital
                         Newspaper Collection
             • CDNC began digitizing newspapers in 2005 as
               part of NDNP
             • Newspapers digitized to article-level as well as to
               page-level as required by NDNP
             • Hosted on Veridian beginning 2009
             • Collection size 55,970 issues, 495,175 pages,
               5,658,224 articles, 498,000,000+ lines



OCR text correction

             • OCR text correction added August 2011
             • Corrections are done line by line
             • ~578,000+ lines of text corrected (Oct 2012)
             • ~0.12% of the collection corrected (578,000 of 498,000,000+ lines), ~99.88% to go!
             • Top corrector: 243,000 lines, more than 2x the 2nd corrector




                     User     CDNC lines corrected     NLA lines corrected
                       1            242,965                1,456,906
                       2             87,515                1,385,369
                       3             31,318                1,010,360
                       4             24,144                  960,230
                       5             23,184                  847,340
                       6             19,240                  786,147
                       7             18,898                  657,187
                       8             16,875                  600,513
                       9             11,784                  582,276
                      10              9,762                  565,384
 Number of lines corrected as of Oct 2012

uncorrected OCR accuracy by
                           newspaper title
                 Title                                           OCR character accuracy   ~OCR word accuracy*

                 PRP Pacific Rural Press 1871 - 1922                     92.6%                  68.1%
                 SFC San Francisco Call 1890 - 1913                      92.6%                  68.1%
                 LAH Los Angeles Herald 1873 - 1910                      88.7%                  54.9%
                 LH Livermore Herald 1877 - 1899                         88.6%                  54.6%
                 DAC Daily Alta California 1841 - 1891                   88.2%                  53.4%
                 CFJ California Farmer and Journal
                     of Useful Sciences 1855 - 1880                      86.5%                  48.4%
                 SN Sausalito News 1885 - 1922                           70.4%                  17.3%

  *Word accuracy assumes average word length is 5 characters
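
The footnote's estimate can be reproduced with a one-line formula. The sketch below assumes, as the word-accuracy figures in the table imply, that character errors are independent, so a word of n characters is correct with probability equal to the character accuracy raised to the power n. The function name is hypothetical.

```python
def word_accuracy(char_accuracy, avg_word_length=5):
    """Estimate word accuracy from character accuracy.

    Assumes independent character errors and a fixed average word
    length (5 characters, per the slide footnote).
    """
    return char_accuracy ** avg_word_length

# Character accuracies taken from the table above.
for title, char_acc in [("PRP", 0.926), ("LAH", 0.887), ("SN", 0.704)]:
    print(f"{title}: {word_accuracy(char_acc):.1%}")
# prints:
# PRP: 68.1%
# LAH: 54.9%
# SN: 17.3%
```

The steep drop from character to word accuracy is why even modest OCR error rates hurt keyword search: a 7.4% character error rate leaves roughly one word in three wrong.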

OCR accuracy by newspaper title

                 Title                                           OCR character accuracy   Corrected accuracy

                 PRP Pacific Rural Press 1871 - 1922                     92.6%                  99.3%
                 SFC San Francisco Call 1890 - 1913                      92.6%                  99.6%
                 LAH Los Angeles Herald 1873 - 1910                      88.7%                  99.1%
                 LH Livermore Herald 1877 - 1899                         88.6%                  99.9%
                 DAC Daily Alta California 1841 - 1891                   88.2%                  99.9%
                 CFJ California Farmer and Journal
                     of Useful Sciences 1855 - 1880                      86.5%                  99.8%
                 SN Sausalito News 1885 - 1922                           70.4%                 100.0%



corrected accuracy by
                       newspaper title
             Title               OCR character   ~OCR word    Corrected    ~Corrected word
                                   accuracy      accuracy*    accuracy       accuracy*

             PRP 1871 - 1922         92.6%         68.1%        99.3%          96.5%
             SFC 1890 - 1913         92.6%         68.1%        99.6%          98.0%
             LAH 1873 - 1910         88.7%         54.9%        99.1%          95.6%
             LH 1877 - 1899          88.6%         54.6%        99.9%          99.5%
             DAC 1841 - 1891         88.2%         53.4%        99.9%          99.5%
             CFJ 1855 - 1880         86.5%         48.4%        98.3%          91.8%
             SN 1885 - 1922          70.4%         17.3%       100.0%         100.0%


  *Word accuracy assumes average word length is 5 characters

correction accuracy by user

                          User    Average OCR accuracy    Correction accuracy
                           A            70.4%                  100.0%
                           B            87.1%                   99.5%
                           C            95.4%                   99.5%
                           D            86.5%                   98.3%
                           E            95.3%                  100.0%
                           F            91.0%                  100.0%
                           G            91.0%                   99.8%
                           H            90.5%                   99.0%
                           I            96.6%                   99.8%
                           J            94.8%                  100.0%
                           K            86.8%                   99.3%



the long tail* of crowdsourced OCR text correction

                          a probability distribution has a long tail if a larger
                         share of the population rests within its tail than it would
                                     under a normal distribution

                         the most productive users are a small fraction of the
                         total user population yet account for ~50% of total
                         production; said a different way, the much larger group
                         of individually less productive users is as important
                         as the most productive users




The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article The Long Tail
and by Clay Shirky’s February 2003 essay “Power laws, web logs, and inequality”.
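
One way to see how concentrated production is: count how many of the top correctors are needed to reach a given share of all corrected lines. The sketch below uses the CDNC top-ten counts and the approximate CDNC total of 578,000 lines reported in the slides; the function name is hypothetical and the total is an approximation, so the result is indicative only.

```python
# CDNC lines corrected by the ten most productive users (Oct 2012, from the slides).
CDNC_TOP_TEN = [242_965, 87_515, 31_318, 24_144, 23_184,
                19_240, 18_898, 16_875, 11_784, 9_762]
CDNC_TOTAL = 578_000  # approximate total lines corrected (Oct 2012)

def correctors_needed(counts, total, share=0.5):
    """Return how many top correctors it takes to reach `share` of `total`.

    Returns None if the listed counts never reach that share.
    """
    running = 0
    for rank, lines in enumerate(sorted(counts, reverse=True), start=1):
        running += lines
        if running >= share * total:
            return rank
    return None

print(correctors_needed(CDNC_TOP_TEN, CDNC_TOTAL))      # 2
print(round(sum(CDNC_TOP_TEN) / CDNC_TOTAL, 2))         # 0.84
```

Under these figures the two most productive CDNC correctors pass the 50% mark, and the top ten together account for about 84% of corrected lines; everyone else makes up the tail.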
OCR text correction long tails
[Figure: two charts of lines corrected by text corrector, one for CDNC (top corrector 242,965 lines) and one for NLA (top corrector 1,456,906 lines), each with 50% markers]
Motivation
Graphic from Kaufmann et al. “More than fun and money. Worker Motivation
in Crowdsourcing – A Study on Mechanical Turk.”

Wisdom of crowds

       Diversity                 Each person should have private information
                                 even if it's just an eccentric interpretation of the
                                 known facts.

       Independence              People's opinions aren't determined by the
                                 opinions of those around them.

       Decentralization          People are able to specialize and draw on local
                                 knowledge.

       Aggregation               Some mechanism exists for turning private
                                 judgments into a collective decision.

      James Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective
      Wisdom Shapes Business, Economies, Societies and Nations, Anchor Books, New York, 2005.

Cognitive surplus

            ... people are learning to use their free time for creative
            activities rather than consumptive ones [such as watching
            TV] ...

            ... the total human cognitive effort in creating all of
            Wikipedia in every language is about one hundred million
            hours ...

            ... Americans alone watch two hundred billion hours of TV
            every year, or enough time, if it would be devoted to projects
            similar to Wikipedia, to create about 2000 of them ...


       Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.


Motivation
                         Genealogists and family historians
                           • National Library of Australia’s 2012 Trove
                             status report showed that ~50% of Trove users
                             are family historians
                           • National Library of New Zealand survey found
                             that ~50% of PapersPast users are genealogists
                           • California Digital Newspaper Collection spring
                             2012 survey discovered that ~70% of its users
                             are genealogists; 75% are 50 years old or older
                           • A Utah Digital Newspapers survey showed that
                             72% of its users are genealogists

* John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World
Motivation
                                  Trove users’ report


            • “I enjoy the correction - it’s a great way to learn more
            about past history and things of interest whilst doing a
            ‘service to the community’ by correcting text for the benefit
            of others.”
            • “I have recently retired from IT and thought that I could be
            of some assistance to the project. It benefits me and other
            people. It helps with family research.”




From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.
Motivation
                                  CDNC users’ report


             “I am interested in all kinds of history. I have pursued genealogy
              as a hobby for many years. I correct text at CDNC because I see
               it as a constructive way to contribute to a worthwhile project.
                        Because I am interested in history, I enjoy it.”
                                               Wesley, California




 Personal communications with CDNC text correctors.
Motivation
                                  CDNC users’ report

                   “I only correct the text on articles of local interest - nothing at
                    state, national or international level, no advertisements, etc. 
                    The objective is to be able to help researchers to locate local
                     people, places, organizations and events using the on-line
                  search at CDNC.  I correct local news & gossip, personal items,
                 real estate transactions, superior court proceedings, county and
                   local board of supervisors meetings, obituaries, birth notices,
                                    marriages, yachting news, etc.”
                                                      Ann, California




 Personal communications with CDNC text correctors.
Motivation
                                  CDNC users’ report
              “I am correcting text for the Coronado Tent City Program for
               1903.  It is important to correct any problems with personal
              names and other information so that researchers will be able
                to search by keyword and be assured of retrieving desired
                   results. ... type fonts cause a great deal of difficulty in
             digitizing the text and can cause problems for searchers.  Also,
                  many of the guests' names at Tent City and Hotel Del
             Coronado were taken from the registration books and reported
              in the Program.  This led to many problems in spelling of last
             names and the editors were not careful to be consistent in the
                spellings.  This Program is an important resource since it
                provides an excellent picture of daily life in Tent City and
                     captures much of the history of Coronado itself.”
                                                Gene, California
 Personal communications with CDNC text correctors.
Motivation
                                  CDNC users’ report


                 “I have always been interested in history, especially the
             development of the American West, and nothing brings it alive
               better than newspapers of the time. I believe them to be an
             invaluable source of knowledge for us and future generations.”
                                          David, United Kingdom




 Personal communications with CDNC text correctors.
Motivation
                                  CDNC users’ report

                     “CDNC is an excellent source of information matching my
                    personal interest in such topics as sea history, development
                           of shipbuilding, clippers and other ships etc. ...
                       Unfortunately, the quality of text ... is rather poor I’m
                     afraid. This is why I started to do all corrections necessary
                        for myself ... and to leave the corrected text for use of
                     others. .... I am not doing this very regularly as this is just
                                        my hobby and pleasure.”
                                                 Jerzey, Poland




 Personal communications with CDNC text correctors.
Website traffic




Website traffic

            After a crowdsourcing transcription project of diaries from the
            American War Between the States, Nicole Saylor, Head of Digital
            Library Services at the University of Iowa Libraries, reported



                         “On June 9, 2011, we went from about 1000
                         daily hits to our digital library on a really good
                         day to more than 70,000.”


  Nicole Saylor interviewed by Trevor Owens. “Crowdsourcing the Civil War: Insights Interview with Nicole Saylor” blog post
  at http://blogs.loc.gov/digitalpreservation/2011/12/crowdsourcing-the-civil-war-insights-interview-with-nicole-saylor/.
  Dec 6, 2011.

Website traffic
                         Website traffic at CDNC before / after implementing
                                             crowdsourcing

                                     before crowdsourcing          after crowdsourcing
                                  11-Jun-2011 to 12-Jul-2011    11-Jun-2012 to 12-Jul-2012      change

                         visits            17,485                       21,488                  +22.9%
                unique visitors            11,381                       13,376                  +17.5%
                 visit duration            9m 24s                       11m 7s                  +18.3%
                    bounce rate             51.3%                        44.5%                  -6.8 points
                pages per visit              14.9                         11.7                  -21.5%
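
The change column can be recomputed from the before and after figures. The sketch below assumes the change is a relative percentage for every row except bounce rate, where the reported -6.8 matches the difference in percentage points rather than a relative change. The helper names are hypothetical.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

def to_seconds(minutes, seconds):
    """Convert a visit duration given as minutes and seconds to seconds."""
    return 60 * minutes + seconds

print(round(percent_change(17_485, 21_488), 1))                          # 22.9  visits
print(round(percent_change(11_381, 13_376), 1))                          # 17.5  unique visitors
print(round(percent_change(to_seconds(9, 24), to_seconds(11, 7)), 1))    # 18.3  visit duration
print(round(percent_change(14.9, 11.7), 1))                              # -21.5 pages per visit
print(round(44.5 - 51.3, 1))                                             # -6.8  bounce rate, in points
```

Expressed as a relative change, the bounce rate fell by about 13.3%, so the table's -6.8 should be read as percentage points.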




Crowdsourcing
                            benefits




                                 Public domain photo courtesy of US Navy
                              Economics

                  Financial value of outsourced OCR text correction
                  for newspapers?
                  The Assumptions
         • 25 to 50 characters per line in a newspaper column:
           Assume 40 characters per line (CDNC sample average)
         • Outsourced text transcription or correction costs USD
           $0.35 to $1.20 per 1000 characters: Assume $0.50
           per 1000 characters



Friday, January 25, 13                                                54
Economics

• CDNC: 578,000 lines x 40 characters per line x 1/1000 x $0.50 = $11,560
• Australia: 68,908,757 lines x 40 characters per line x 1/1000 x $0.50 = $1,378,175
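The outsourced estimate is a single multiplication, sketched here under the slide's assumptions (40 characters per line, $0.50 per 1,000 characters). The function name outsourced_cost_usd is illustrative only. The last two lines show how far the smaller figure moves across the quoted $0.35 to $1.20 price range.

```python
def outsourced_cost_usd(lines: int, chars_per_line: int = 40,
                        usd_per_1000_chars: float = 0.50) -> float:
    """Cost of paying a vendor to correct `lines` lines of OCR text."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars


# Line counts taken from the slide.
for lines in (578_000, 68_908_757):
    print(f"{lines:,} lines: ${outsourced_cost_usd(lines):,.0f}")

# Sensitivity of the smaller estimate to the quoted price range.
low = outsourced_cost_usd(578_000, usd_per_1000_chars=0.35)
high = outsourced_cost_usd(578_000, usd_per_1000_chars=1.20)
print(f"range for 578,000 lines: ${low:,.0f} to ${high:,.0f}")
```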




Economics

Financial value of in-house OCR text correction?

The Assumptions
• Correction takes 15 seconds per line
• Cost is hourly wage plus benefits of lowest level employee, $10 for CDNC, $41.88* for Australia

* AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.
Economics

• CDNC: 578,000 lines x 15 seconds per line x 1/3600 hrs per second x $10.00 per hr = $24,083
• Australia: 68,908,757 lines x 15 seconds per line x 1/3600 hrs per second x $41.88 per hr = $12,024,578
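The in-house estimate converts lines to staff hours and multiplies by the hourly cost. A minimal sketch under the slide's assumptions (15 seconds per line, $10.00 or $41.88 per hour); the function name in_house_cost_usd is illustrative only.

```python
def in_house_cost_usd(lines: int, usd_per_hour: float,
                      seconds_per_line: float = 15.0) -> float:
    """Staff cost of correcting `lines` lines at a fixed time per line."""
    hours = lines * seconds_per_line / 3600
    return hours * usd_per_hour


# (line count, hourly cost) pairs taken from the slide.
cdnc = in_house_cost_usd(578_000, usd_per_hour=10.00)
australia = in_house_cost_usd(68_908_757, usd_per_hour=41.88)

print(f"CDNC: ${cdnc:,.0f}")
print(f"Australia: ${australia:,.0f}")
```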




Accuracy



                         “His Accuracy Depends on Ours!”
                         Office for Emergency Management. Office of War
                         Information. Domestic Operations Branch. Bureau of
                         Special Services. [Photo held at US National Archives and
                         Records Administration]



Accuracy

• Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers
• Rose Holley (National Library of Australia) reports raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers

 Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.
 Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs.” D-Lib Magazine. March/April 2009.
 Public domain graphic courtesy of Wikimedia Commons.

Accuracy
                         Mapping texts* assesses digitization quality of digital
                           newspapers by comparing the number of words
                          recognized to the total number of words scanned




 *Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting
 with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.

Accuracy

How does low text accuracy affect search recall?

The Facts
• Average uncorrected OCR character accuracy of the CDNC sample data is ~89%
• Average length of an English word is 5 characters
• Average word accuracy (all 5 characters correct, treating character errors as independent) is 89% x 89% x 89% x 89% x 89% = 55.8%; round up to 60%, or 6 out of 10 words correct
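The word-accuracy figure follows from raising the character accuracy to the power of the word length. A minimal sketch, assuming every character in a word is recognized independently (a simplification the slide's arithmetic implies); the function name word_accuracy is illustrative only.

```python
def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Probability that every character of a word is correct,
    assuming character errors are independent."""
    return char_accuracy ** word_length


uncorrected = word_accuracy(0.89)  # ~89% character accuracy from the slide
print(f"word accuracy: {uncorrected:.1%}")
```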


Public domain graphic courtesy of Wikimedia Commons.
Search recall no text correction

[Figure: scattered instances of “ARNDT”, each marked as found or not found]

Accuracy


         The Facts
         • Average corrected character accuracy of the CDNC
           sample data is ~99.4%
         • Average word accuracy of CDNC corrected text is
           99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%




Public domain graphic courtesy of Wikimedia Commons.
Search recall with text correction

[Figure: scattered instances of “ARNDT”, each marked as found or not found]

Accuracy

A search for “Arndt” at Chronicling America gives 10,267 results*
• If Chronicling America word accuracy is 55.8% (same as the uncorrected CDNC sample), then about 8,133 instances of “Arndt” were not found
• If word accuracy is 97.0%, then about 317 instances of “Arndt” were not found

* Search performed 31 Oct 2012
Alexa global / country traffic rank of Library of Congress (31-Oct-2012): 4,056 / 1,317
Chronicling America gets ~7.1% of all Library of Congress web traffic.
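The missed-instance estimates can be reproduced by treating the hits found as the share of true occurrences that survived OCR intact. This is a sketch under that assumption, not a measurement; the function name estimated_missed is illustrative only, and the results agree with the slide's figures to within rounding.

```python
def estimated_missed(found: int, word_accuracy: float) -> float:
    """Estimate occurrences a search missed, assuming
    found = word_accuracy * true_total."""
    true_total = found / word_accuracy
    return true_total - found


found = 10_267  # "Arndt" results reported on the slide

missed_uncorrected = estimated_missed(found, 0.558)
missed_corrected = estimated_missed(found, 0.970)

print(f"at 55.8% word accuracy: ~{missed_uncorrected:,.1f} missed")
print(f"at 97.0% word accuracy: ~{missed_corrected:,.1f} missed")
```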

 Public domain graphic courtesy of Wikimedia Commons.
Hard-to-measure-but-shouldn’t-be-overlooked benefits




                    Public domain photo “A useful instruction for young sailors from the Royal Hospital
                    School, Greenwich” from the National Maritime Museum.

HTMBSBO benefit

“when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages”

  Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/

HTMBSBO benefit

“in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections”

  Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/

Crowdsourcing considerations
    • How to market / advertise crowdsourcing?
    • How to motivate crowdsourcing volunteers?
    • Is authentication / identity of volunteers an issue?
    • How to administer crowdsourced data?


                                        Photo of Aleister Crowley [Public domain] from Wikimedia
                                                                Commons

Conclusions
              • Lots of crowdsourcing in cultural heritage
                organizations and elsewhere
              • Benefits are multi-faceted: Economic, data
                accuracy, patron engagement, increased web
                traffic




Conclusion of the Sonata for piano #32, opus 111 by
Ludwig van Beethoven

Try crowdsourcing!
                                  Correct California newspapers text
                                         http://cdnc.ucr.edu

                                  Correct Australian newspapers text
                                       http://trove.nla.gov.au

                                  Correct Cambridge MA newspapers text
                                      http://bit.ly/cambridgepublic

                                  Correct Russian language periodicals
                                    http://bit.ly/russianperiodicals


         Others soon to follow: Library of Virginia, University of Tennessee,
                          National Library of Singapore, ...
?
                                 Brian Geiger
                               bgeiger@ucr.edu

                                Frederick Zarndt
                         frederick@frederickzarndt.com


                                           Photo held by John Oxley Library, State Library of Queensland. Original from

                                           Courier-mail, Brisbane, Queensland, Australia.


Weitere ähnliche Inhalte

Ähnlich wie 20130123 Crowdsourcing [hamilton library u of hi]

20121105 no tempest in my teapot [dlf forum denver]
20121105 no tempest in my teapot [dlf forum denver]20121105 no tempest in my teapot [dlf forum denver]
20121105 no tempest in my teapot [dlf forum denver]Frederick Zarndt
 
20130321 Putting the world's cultural heritage online with crowdsourcing [roo...
20130321 Putting the world's cultural heritage online with crowdsourcing [roo...20130321 Putting the world's cultural heritage online with crowdsourcing [roo...
20130321 Putting the world's cultural heritage online with crowdsourcing [roo...Frederick Zarndt
 
20120821 putting the world’s cultural heritage online with crowd sourcing sli...
20120821 putting the world’s cultural heritage online with crowd sourcing sli...20120821 putting the world’s cultural heritage online with crowd sourcing sli...
20120821 putting the world’s cultural heritage online with crowd sourcing sli...Frederick Zarndt
 
20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
20130630 What motivates library crowdsourcing volunteers? [ALA LITA]20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
20130630 What motivates library crowdsourcing volunteers? [ALA LITA]Frederick Zarndt
 
2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli Finland
2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli Finland2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli Finland
2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli FinlandFrederick Zarndt
 
20120821 putting the world’s cultural heritage online with crowd sourcing [na...
20120821 putting the world’s cultural heritage online with crowd sourcing [na...20120821 putting the world’s cultural heritage online with crowd sourcing [na...
20120821 putting the world’s cultural heritage online with crowd sourcing [na...Frederick Zarndt
 
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...Frederick Zarndt
 
Let the trumpet sound 2003 version
Let the trumpet sound 2003 versionLet the trumpet sound 2003 version
Let the trumpet sound 2003 versionJohan Koren
 
Solr powered libraries a survey of the world's knowledge bases
Solr powered libraries a survey of the world's knowledge basesSolr powered libraries a survey of the world's knowledge bases
Solr powered libraries a survey of the world's knowledge baseslucenerevolution
 
Solr Powered Libraries
Solr Powered LibrariesSolr Powered Libraries
Solr Powered LibrariesErik Hatcher
 
Google & garbage lsta 2012
Google & garbage lsta 2012Google & garbage lsta 2012
Google & garbage lsta 2012Paige Jaeger
 
User-Generated Experiences
User-Generated ExperiencesUser-Generated Experiences
User-Generated ExperiencesMolly Schwartz
 

Ähnlich wie 20130123 Crowdsourcing [hamilton library u of hi] (20)

20121105 no tempest in my teapot [dlf forum denver]
20121105 no tempest in my teapot [dlf forum denver]20121105 no tempest in my teapot [dlf forum denver]
20121105 no tempest in my teapot [dlf forum denver]
 
20130321 Putting the world's cultural heritage online with crowdsourcing [roo...
20130321 Putting the world's cultural heritage online with crowdsourcing [roo...20130321 Putting the world's cultural heritage online with crowdsourcing [roo...
20130321 Putting the world's cultural heritage online with crowdsourcing [roo...
 
20120821 putting the world’s cultural heritage online with crowd sourcing sli...
20120821 putting the world’s cultural heritage online with crowd sourcing sli...20120821 putting the world’s cultural heritage online with crowd sourcing sli...
20120821 putting the world’s cultural heritage online with crowd sourcing sli...
 
20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
20130630 What motivates library crowdsourcing volunteers? [ALA LITA]20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
 
2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli Finland
2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli Finland2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli Finland
2012-08-14 Crowdsourcing National Digitisation Centre Mikkeli Finland
 
20120821 putting the world’s cultural heritage online with crowd sourcing [na...
20120821 putting the world’s cultural heritage online with crowd sourcing [na...20120821 putting the world’s cultural heritage online with crowd sourcing [na...
20120821 putting the world’s cultural heritage online with crowd sourcing [na...
 
Acrl2005
Acrl2005Acrl2005
Acrl2005
 
Acrl2005(1)
Acrl2005(1)Acrl2005(1)
Acrl2005(1)
 
Acrl2005(1)
Acrl2005(1)Acrl2005(1)
Acrl2005(1)
 
Acrl2005(1)
Acrl2005(1)Acrl2005(1)
Acrl2005(1)
 
Acrl2005(1)
Acrl2005(1)Acrl2005(1)
Acrl2005(1)
 
Acrl2005
Acrl2005Acrl2005
Acrl2005
 
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
2013 ifla satellite zarndt et al [crowdsourcing the world's cultural heritage...
 
Let the trumpet sound 2003 version
Let the trumpet sound 2003 versionLet the trumpet sound 2003 version
Let the trumpet sound 2003 version
 
Keynote: Revolution for Sure: Envisioning a 21st Century Information Organiza...
Keynote: Revolution for Sure: Envisioning a 21st Century Information Organiza...Keynote: Revolution for Sure: Envisioning a 21st Century Information Organiza...
Keynote: Revolution for Sure: Envisioning a 21st Century Information Organiza...
 
Solr powered libraries a survey of the world's knowledge bases
Solr powered libraries a survey of the world's knowledge basesSolr powered libraries a survey of the world's knowledge bases
Solr powered libraries a survey of the world's knowledge bases
 
Solr Powered Libraries
Solr Powered LibrariesSolr Powered Libraries
Solr Powered Libraries
 
Google & garbage lsta 2012
Google & garbage lsta 2012Google & garbage lsta 2012
Google & garbage lsta 2012
 
User-Generated Experiences
User-Generated ExperiencesUser-Generated Experiences
User-Generated Experiences
 
Crowdsourcing
CrowdsourcingCrowdsourcing
Crowdsourcing
 

Mehr von Frederick Zarndt

Digitization of the Tuol Sleng Genocide Museum Archives
Digitization of the Tuol Sleng Genocide Museum ArchivesDigitization of the Tuol Sleng Genocide Museum Archives
Digitization of the Tuol Sleng Genocide Museum ArchivesFrederick Zarndt
 
2017 Born Digital Legal Deposit Policies and Practices
2017 Born Digital Legal Deposit Policies and Practices2017 Born Digital Legal Deposit Policies and Practices
2017 Born Digital Legal Deposit Policies and PracticesFrederick Zarndt
 
e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017Frederick Zarndt
 
Project Management according to Great Pumpkin Principles
Project Management according to Great Pumpkin PrinciplesProject Management according to Great Pumpkin Principles
Project Management according to Great Pumpkin PrinciplesFrederick Zarndt
 
What did you say? interculture communication [20160308 phnom penh]
What did you say? interculture communication [20160308 phnom penh]What did you say? interculture communication [20160308 phnom penh]
What did you say? interculture communication [20160308 phnom penh]Frederick Zarndt
 
Coronado public library digital newspapers workshop local partnerships [oct 2...
Coronado public library digital newspapers workshop local partnerships [oct 2...Coronado public library digital newspapers workshop local partnerships [oct 2...
Coronado public library digital newspapers workshop local partnerships [oct 2...Frederick Zarndt
 
Coronado public library digital newspapers workshop [Oct 2016]
Coronado public library digital newspapers workshop [Oct 2016]Coronado public library digital newspapers workshop [Oct 2016]
Coronado public library digital newspapers workshop [Oct 2016]Frederick Zarndt
 
What did you say? mindful interculture communication [201608 icgse]
What did you say? mindful interculture communication [201608 icgse]What did you say? mindful interculture communication [201608 icgse]
What did you say? mindful interculture communication [201608 icgse]Frederick Zarndt
 
Here Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital NewsHere Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital NewsFrederick Zarndt
 
Here Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital NewsHere Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital NewsFrederick Zarndt
 
An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...Frederick Zarndt
 
An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...Frederick Zarndt
 
Rootstech 2015 finding and using digitized historical newspapers workshop [20...
Rootstech 2015 finding and using digitized historical newspapers workshop [20...Rootstech 2015 finding and using digitized historical newspapers workshop [20...
Rootstech 2015 finding and using digitized historical newspapers workshop [20...Frederick Zarndt
 
20140410 ifla digitization workshop [idlc kuala lumpur]
20140410 ifla digitization workshop [idlc kuala lumpur]20140410 ifla digitization workshop [idlc kuala lumpur]
20140410 ifla digitization workshop [idlc kuala lumpur]Frederick Zarndt
 
What did you say? Intercultural expectations, misunderstandings, and communic...
What did you say? Intercultural expectations, misunderstandings, and communic...What did you say? Intercultural expectations, misunderstandings, and communic...
What did you say? Intercultural expectations, misunderstandings, and communic...Frederick Zarndt
 
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...Frederick Zarndt
 
20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]Frederick Zarndt
 
20131019 digital collections - if you build them will anyone visit [library 2...
20131019 digital collections - if you build them will anyone visit [library 2...20131019 digital collections - if you build them will anyone visit [library 2...
20131019 digital collections - if you build them will anyone visit [library 2...Frederick Zarndt
 
20130903 what did you say? interculture communication [hamburg]
20130903 what did you say? interculture communication [hamburg]20130903 what did you say? interculture communication [hamburg]
20130903 what did you say? interculture communication [hamburg]Frederick Zarndt
 
201308 wlic standards committee zarndt et al the alto editorial board collabo...
201308 wlic standards committee zarndt et al the alto editorial board collabo...201308 wlic standards committee zarndt et al the alto editorial board collabo...
201308 wlic standards committee zarndt et al the alto editorial board collabo...Frederick Zarndt
 

Mehr von Frederick Zarndt (20)

Digitization of the Tuol Sleng Genocide Museum Archives
Digitization of the Tuol Sleng Genocide Museum ArchivesDigitization of the Tuol Sleng Genocide Museum Archives
Digitization of the Tuol Sleng Genocide Museum Archives
 
2017 Born Digital Legal Deposit Policies and Practices
2017 Born Digital Legal Deposit Policies and Practices2017 Born Digital Legal Deposit Policies and Practices
2017 Born Digital Legal Deposit Policies and Practices
 
e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017e-Legal Deposit Survey 2017
e-Legal Deposit Survey 2017
 
Project Management according to Great Pumpkin Principles
Project Management according to Great Pumpkin PrinciplesProject Management according to Great Pumpkin Principles
Project Management according to Great Pumpkin Principles
 
What did you say? interculture communication [20160308 phnom penh]
What did you say? interculture communication [20160308 phnom penh]What did you say? interculture communication [20160308 phnom penh]
What did you say? interculture communication [20160308 phnom penh]
 
Coronado public library digital newspapers workshop local partnerships [oct 2...
Coronado public library digital newspapers workshop local partnerships [oct 2...Coronado public library digital newspapers workshop local partnerships [oct 2...
Coronado public library digital newspapers workshop local partnerships [oct 2...
 
Coronado public library digital newspapers workshop [Oct 2016]
Coronado public library digital newspapers workshop [Oct 2016]Coronado public library digital newspapers workshop [Oct 2016]
Coronado public library digital newspapers workshop [Oct 2016]
 
What did you say? mindful interculture communication [201608 icgse]
What did you say? mindful interculture communication [201608 icgse]What did you say? mindful interculture communication [201608 icgse]
What did you say? mindful interculture communication [201608 icgse]
 
Here Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital NewsHere Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital News
 
Here Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital NewsHere Today, Gone within a Month: The Fleeting Life of Digital News
Here Today, Gone within a Month: The Fleeting Life of Digital News
 
An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...
 
An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...
 
Rootstech 2015 finding and using digitized historical newspapers workshop [20...
Rootstech 2015 finding and using digitized historical newspapers workshop [20...Rootstech 2015 finding and using digitized historical newspapers workshop [20...
Rootstech 2015 finding and using digitized historical newspapers workshop [20...
 
20140410 ifla digitization workshop [idlc kuala lumpur]
20140410 ifla digitization workshop [idlc kuala lumpur]20140410 ifla digitization workshop [idlc kuala lumpur]
20140410 ifla digitization workshop [idlc kuala lumpur]
 
What did you say? Intercultural expectations, misunderstandings, and communic...
What did you say? Intercultural expectations, misunderstandings, and communic...What did you say? Intercultural expectations, misunderstandings, and communic...
What did you say? Intercultural expectations, misunderstandings, and communic...
 
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
 
20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]
 
20131019 digital collections - if you build them will anyone visit [library 2...
20131019 digital collections - if you build them will anyone visit [library 2...20131019 digital collections - if you build them will anyone visit [library 2...
20131019 digital collections - if you build them will anyone visit [library 2...
 
20130903 what did you say? interculture communication [hamburg]
20130903 what did you say? interculture communication [hamburg]20130903 what did you say? interculture communication [hamburg]
20130903 what did you say? interculture communication [hamburg]
 
201308 wlic standards committee zarndt et al the alto editorial board collabo...
201308 wlic standards committee zarndt et al the alto editorial board collabo...201308 wlic standards committee zarndt et al the alto editorial board collabo...
201308 wlic standards committee zarndt et al the alto editorial board collabo...
 

Kürzlich hochgeladen

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
20130123 Crowdsourcing [hamilton library u of hi]

  • 1. No tempest in my teapot: Analysis of Crowdsourced Data and User Experiences at the California Digital Newspaper Collection Brian Geiger Director, Center for Bibliographical Studies and Research California Digital Newspaper Collection Frederick Zarndt Chair, IFLA Newspapers Section Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia. Friday, January 25, 13 1
  • 2. Try crowdsourcing!
      • Correct California newspapers text: http://cdnc.ucr.edu
      • Correct Australian newspapers text: http://trove.nla.gov.au
      • Correct Cambridge MA newspapers text: http://bit.ly/cambridgepublic
      • Correct Russian language periodicals: http://bit.ly/russianperiodicals
  • 4. The Wisdom of Crowds: In 2004 James Surowiecki published “The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations”. In it he asserts that a crowd of persons who are diverse, independent, and decentralized usually makes better judgements or decisions than single persons.
  • 5. “crowdsourcing” was coined by Jeff Howe in “The rise of crowdsourcing” published in Wired magazine June 2006.
  • 6. A Google advanced search for “crowdsourcing” from 1-Jun-2006, the date of publication of Jeff Howe’s Wired magazine article, to 1-Jun-2007 gives 44,600 hits. A date range of 1-Jun-2011 to 1-Jun-2012 gives 2,680,000 hits. Searches used the Internet Archive’s Wayback Machine.
  • 7. Crowdsourcing is a process that involves outsourcing tasks to a distributed group of people. ... the difference between crowdsourcing and ordinary outsourcing is that a task or problem is outsourced to an undefined public rather than a specific body, such as paid employees. Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed June 1, 2012)
  • 8. Crowdsourcing is a type of participative online activity in which an individual, an institution, a non-profit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task. The undertaking of the task, of variable complexity and modularity, and in which the crowd should participate bringing their work, money, knowledge and/or experience, always entails mutual benefit. The user will receive the satisfaction of a given type of need, be it economic, social recognition, self-esteem, or the development of individual skills, while the crowdsourcer will obtain and utilize to their advantage that what the user has brought to the venture, whose form will depend on the type of activity undertaken. Enrique Estellés-Arolas and Fernando González-Ladrón-de-Guevara. Towards an integrated crowdsourcing definition. Journal of Information Science XX(X). 2012. pp. 1-14.
  • 9. crowd*: crowdcollaboration, crowdsourcing, crowdfunding, citizen science, crowdlearning, crowdcasting, crowdvoting
  • 10. what is Alexa?
      • Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing web usage. Alexa is now a subsidiary of Amazon.
      • Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat.
      • Alexa's operations include archiving webpages as they are crawled. This database served as the basis for the creation of the Internet Archive, accessible through the Wayback Machine.
      • Alexa continually crawls all publicly-available websites to create a series of snapshots of the web.
      • Alexa gathers information from a variety of sources to provide key statistics about each site on the web, for example Traffic Rank, the number of PageViews, site Speed, and Bounce Rate. This information is derived from Alexa toolbar users (~6,000,000 worldwide).
  • 11. definitions
      • A PageView is a request for a file whose type is defined as a page.
      • A Unique Visitor is a uniquely identified client generating requests on the web server or viewing pages within a defined time period (i.e. day, week or month). A Unique Visitor counts once within the timescale.
      • A Visit is a series of page requests from the same uniquely identified client with a time of no more than 30 minutes between each page request.
      • Bounce Rate is the percentage of visits where the visitor enters and exits at the same page without visiting any other pages on the site in between.
      • World | Country Rank is a function of the average daily unique visits and the number of unique pages requested.
      definitions adapted from Wikipedia http://en.wikipedia.org/wiki/Web_analytics
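The Visit and Bounce Rate definitions above can be sketched in a few lines of Python. This is a minimal illustration only, not how Alexa or any analytics product computes its figures: the request log, the client ids, and the function names `split_into_visits` and `bounce_rate` are hypothetical, and a bounce is approximated as a visit with a single page view. Only the 30-minute gap comes from the definitions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

VISIT_GAP = timedelta(minutes=30)  # maximum gap between page requests within one visit

def split_into_visits(requests):
    """Group (client_id, timestamp, page) records into visits, one list of pages per visit."""
    by_client = defaultdict(list)
    for client_id, timestamp, page in requests:
        by_client[client_id].append((timestamp, page))

    visits = []
    for hits in by_client.values():
        hits.sort(key=lambda hit: hit[0])
        current = [hits[0][1]]
        for (prev_time, _), (time, page) in zip(hits, hits[1:]):
            if time - prev_time > VISIT_GAP:
                visits.append(current)  # gap too long: close this visit, start a new one
                current = []
            current.append(page)
        visits.append(current)
    return visits

def bounce_rate(visits):
    """Percentage of visits that consist of a single page view."""
    if not visits:
        return 0.0
    return 100.0 * sum(len(v) == 1 for v in visits) / len(visits)

# Hypothetical request log: two clients, four page views.
start = datetime(2012, 6, 11, 9, 0)
log = [
    ("client-a", start, "/"),
    ("client-a", start + timedelta(minutes=10), "/search"),
    ("client-a", start + timedelta(hours=2), "/"),  # more than 30 minutes later: new visit
    ("client-b", start, "/"),
]
visits = split_into_visits(log)
page_views = len(log)
unique_visitors = len({client for client, _, _ in log})
```

In this made-up log, one client returns after a two-hour pause, so the same Unique Visitor contributes two Visits.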
  • 12. crowdsourcing: Amazon Mechanical Turk was launched Nov 2005. Alexa global rank of Amazon Mechanical Turk (23-Jan-2013): 8,346
  • 13. crowdsourcing: Each day 100,000,000 reCAPTCHAs are served to websites around the world. reCAPTCHA is used by more than 200,000 websites.
  • 14. crowdvoting: Iowa Electronic Market was first launched in 1995. Alexa global traffic rank of Iowa Electronic Market (6-Aug-2012): 11,290. Alexa US traffic rank of Iowa Electronic Market (6-Aug-2012): 3,923
  • 15. citizen science: Galaxy Zoo was first launched July 2007. Alexa global traffic rank of Galaxy Zoo (13-Jun-2012): 557,766
  • 16. crowdfunding: Kickstarter was first launched in 2008. Alexa global traffic rank of Kickstarter (6-Aug-2012): 752. 27,528 projects successfully funded with more than USD $254,000,000.
  • 17. crowdlearning: duolingo was first launched Nov 2011 (private beta) / Jun 2012 (public). Alexa global / country traffic rank of duolingo (23-Jan-2013): 22,052 / 11,761
  • 19. Wikipedia
      • Began 2001
      • Now in 285 languages
      • 3,900,000+ articles in English, 1,400,000+ in German, 1,250,000+ in French, 1,050,000 in Dutch
      • 40 Wikipedia languages with more than 100,000 articles
      • 112 Wikipedia languages with more than 10,000 articles
      • 400,000,000 unique visitors per month
      • 85,000 active contributors
      • Alexa global traffic rank: #6 worldwide, #7 in USA web traffic (23-Jan-2013)
  • 21. Family Search Indexing was first launched (beta) 2004. Alexa global / country traffic rank of FamilySearch (13-Jun-2012): 4,352 / 1,357
  • 22. Family Search Indexing
      • Started (beta) 2004
      • More than 780,000 worldwide registered volunteers from ~25 countries index records relevant to family history
      • Approximately 100,000 active volunteers each month
      • UI in Chinese, English, German, French, Italian, Japanese, Korean, Portuguese, and Russian
      • Blind double-key entry with arbitration / reconciliation
      • More than 1,500,088,741 records indexed (July 2012)
      • Accuracy typically > 99.95%
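“Blind double-key entry with arbitration” generally means that two volunteers key the same record without seeing each other's work, and a third party resolves any disagreement. A minimal sketch of that idea follows. The field names, the `reconcile` function, and the arbitrator callback are hypothetical and are not taken from FamilySearch's software.

```python
def reconcile(key_a, key_b, arbitrate):
    """Merge two independent transcriptions of one record, field by field."""
    merged = {}
    for field in sorted(key_a.keys() | key_b.keys()):
        a, b = key_a.get(field), key_b.get(field)
        # Matching values are accepted as-is; mismatches go to the arbitrator.
        merged[field] = a if a == b else arbitrate(field, a, b)
    return merged

# Hypothetical record keyed by two volunteers who disagree on the surname.
volunteer_a = {"given_name": "Ann", "surname": "Arndt", "year": "1903"}
volunteer_b = {"given_name": "Ann", "surname": "Amdt", "year": "1903"}

disputed = []

def ask_arbitrator(field, a, b):
    """Stand-in for a human decision: record the dispute and keep the first value."""
    disputed.append(field)
    return a

record = reconcile(volunteer_a, volunteer_b, ask_arbitrator)
```

Only the fields on which the two keyers differ reach the arbitrator, which keeps the third person's workload small when both keyers are mostly right.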
  • 23. Project Gutenberg was first launched Dec 1971. Alexa global traffic rank of Project Gutenberg (13-Jun-2012): 5,744
  • 24. Project Gutenberg
      • Started Dec 1971
      • Worldwide volunteers transcribe or proofread OCR’d public domain books through Distributed Proofreaders
      • 40,000 books completed (July 2012)
      • Partner / affiliated projects for Australia, Canada, Europe, Germany, Luxembourg, Philippines, Runeberg (Nordic literature), Russia, Taiwan
  • 25. Alexa global / country traffic rank of National Library of Australia (31-Oct-2012): 15,519 / 406. Trove gets ~72% of all National Library web traffic.
  • 26. National Library of Australia
      • Online since 2008
      • 7,200,000+ pages
      • Top text corrector 1,250,000 lines (June 2012)
      • 2,450,000+ lines corrected each month (average for first 6 months of 2012)
      • 68,908,757 lines corrected as of July 2012, up from 42,411,468 lines corrected July 2011
      • 63,613 total registered users (July 2012)
      • 4,146 active users (June 2012)
  • 27. Alexa global / country traffic rank of National Library of Finland: 2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)
  • 28. National Library of Finland
      • Digitalkoot is a project to improve OCR text in digitized newspapers by playing games!
      • Digitalkoot is a collaboration between the National Library and Microtask
      • Players correct OCR text by playing Myyräsillassa (Mole Bridge) or Myyräjahdissa (Mole Hunt)
      • National Library has 4,000,000+ digitized pages
      • 109,321 registered players (October 2012)
      • Since February 2011 8,024,530 micro-tasks have been completed
  • 29. Alexa global / country traffic rank of UC Riverside (31-Oct-2012): 12,439 / 4,717. CDNC gets ~1.84% of all UC Riverside web traffic.
  • 30. California Digital Newspaper Collection
      • CDNC began digitizing newspapers in 2005 as part of NDNP
      • Newspapers digitized to article-level as well as to page-level as required by NDNP
      • Hosted on Veridian beginning 2009
      • Collection size 55,970 issues, 495,175 pages, 5,658,224 articles, 498,000,000+ lines
  • 31. OCR text correction
      • OCR text correction added August 2011
      • Corrections are done line by line
      • ~578,000+ lines of text corrected (Oct 2012)
      • ~1.1% of the collection corrected, 98.9% to go!
      • Top corrector: ~243,000 lines, more than 2x the 2nd corrector
  • 32. Number of lines corrected as of Oct 2012
      User   CDNC lines corrected   NLA lines corrected
      1      242,965                1,456,906
      2      87,515                 1,385,369
      3      31,318                 1,010,360
      4      24,144                 960,230
      5      23,184                 847,340
      6      19,240                 786,147
      7      18,898                 657,187
      8      16,875                 600,513
      9      11,784                 582,276
      10     9,762                  565,384
  • 33. uncorrected OCR accuracy by newspaper title
      Abbr.  OCR character accuracy  ~OCR word accuracy*  Title
      PRP    92.6%                   68.1%                Pacific Rural Press 1871 - 1922
      SFC    92.6%                   68.1%                San Francisco Call 1890 - 1913
      LAH    88.7%                   54.9%                Los Angeles Herald 1873 - 1910
      LH     88.6%                   54.6%                Livermore Herald 1877 - 1899
      DAC    88.2%                   53.4%                Daily Alta California 1841 - 1891
      CFJ    86.5%                   48.4%                California Farmer and Journal of Useful Sciences 1855 - 1880
      SN     70.4%                   17.3%                Sausalito News 1885 - 1922
      *Word accuracy assumes average word length is 5 characters
  • 34. OCR accuracy by newspaper title
      Abbr.  OCR character accuracy  Corrected accuracy  Title
      PRP    92.6%                   99.3%               Pacific Rural Press 1871 - 1922
      SFC    92.6%                   99.6%               San Francisco Call 1890 - 1913
      LAH    88.7%                   99.1%               Los Angeles Herald 1873 - 1910
      LH     88.6%                   99.9%               Livermore Herald 1877 - 1899
      DAC    88.2%                   99.9%               Daily Alta California 1841 - 1891
      CFJ    86.5%                   99.8%               California Farmer and Journal of Useful Sciences 1855 - 1880
      SN     70.4%                   100.0%              Sausalito News 1885 - 1922
  • 35. corrected accuracy by newspaper title
      Title              OCR character accuracy  ~OCR word accuracy*  Corrected accuracy  ~Corrected word accuracy*
      PRP 1871 - 1922    92.6%                   68.1%                99.3%               96.5%
      SFC 1890 - 1913    92.6%                   68.1%                99.6%               98.0%
      LAH 1873 - 1910    88.7%                   54.9%                99.1%               95.6%
      LH 1877 - 1899     88.6%                   54.6%                99.9%               99.5%
      DAC 1841 - 1891    88.2%                   53.4%                99.9%               99.5%
      CF 1855 - 1880     86.5%                   48.4%                98.3%               91.8%
      SN 1885 - 1922     70.4%                   17.3%                100.0%              100.0%
      *Word accuracy assumes average word length is 5 characters
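The starred word-accuracy columns follow from the character accuracies if one assumes, as the footnote does, a 5-character word and, in addition, that character errors are independent of one another (an assumption implied by the arithmetic rather than stated on the slides). A short sketch, with an illustrative function name:

```python
def word_accuracy(char_accuracy, word_length=5):
    """Probability that every character of a word is correct, assuming independent errors."""
    return char_accuracy ** word_length

# (uncorrected, corrected) character accuracies for three titles from the tables above.
samples = {"PRP": (0.926, 0.993), "LAH": (0.887, 0.991), "SN": (0.704, 1.000)}
for title, (raw, corrected) in samples.items():
    print(f"{title}: {word_accuracy(raw):.1%} -> {word_accuracy(corrected):.1%}")
```

Because the accuracy is raised to the fifth power, a modest gain in character accuracy produces a much larger gain in word accuracy, which is what search depends on.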
  • 36. correction accuracy by user
      User   Average OCR accuracy   Correction accuracy
      A      70.4%                  100.0%
      B      87.1%                  99.5%
      C      95.4%                  99.5%
      D      86.5%                  98.3%
      E      95.3%                  100.0%
      F      91.0%                  100.0%
      G      91.0%                  99.8%
      H      90.5%                  99.0%
      I      96.6%                  99.8%
      J      94.8%                  100.0%
      K      86.8%                  99.3%
  • 37. the long tail* of crowdsourced OCR text correction
      • A probability distribution has a long tail if a larger share of the population rests within its tail than it would under a normal distribution.
      • The most productive users represent a small fraction of the total user population and ~50% of total production. Said a different way, the much larger group of individually less productive users is as important as the most productive users.
      * The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article “The Long Tail” and by Clay Shirky’s February 2003 essay “Power laws, web logs, and inequality”.
  • 38. OCR text correction long tails [Two charts: NLA lines corrected by text corrector (top corrector 1,456,906) and CDNC lines corrected by text corrector (top corrector 242,965), each annotated with 50% markers.]
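The long-tail observation can be checked against the CDNC figures given earlier: the top-ten corrector counts on slide 32 and the approximate total of 578,000 corrected lines on slide 31. The sketch below is illustrative; the function name `users_to_reach` is hypothetical, and because the total is approximate the resulting shares are approximate too.

```python
from itertools import accumulate

# Lines corrected by the ten most active CDNC correctors (Oct 2012, slide 32).
CDNC_TOP10 = [242_965, 87_515, 31_318, 24_144, 23_184,
              19_240, 18_898, 16_875, 11_784, 9_762]
CDNC_TOTAL = 578_000  # approximate total lines corrected (slide 31)

def users_to_reach(share, counts, total):
    """Smallest number of top users whose combined output reaches `share` of `total`."""
    target = share * total
    running_totals = accumulate(sorted(counts, reverse=True))
    for n, running in enumerate(running_totals, start=1):
        if running >= target:
            return n
    return None  # the listed users never reach the target

top_share = CDNC_TOP10[0] / CDNC_TOTAL
top10_share = sum(CDNC_TOP10) / CDNC_TOTAL
half_point = users_to_reach(0.5, CDNC_TOP10, CDNC_TOTAL)
```

On these figures the top corrector alone accounts for roughly 42% of the corrected lines, and the top two together pass the 50% mark.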
  • 39. Motivation Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”
  • 40. Wisdom of crowds
      • Diversity: Each person should have private information even if it's just an eccentric interpretation of the known facts.
      • Independence: People's opinions aren't determined by the opinions of those around them.
      • Decentralization: People are able to specialize and draw on local knowledge.
      • Aggregation: Some mechanism exists for turning private judgments into a collective decision.
      James Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, Anchor Books, New York, 2005.
  • 41. Cognitive surplus ... people are learning to use their free time for creative activities rather than consumptive ones [such as watching TV] ... ... the total human cognitive effort in creating all of Wikipedia in every language is about one hundred million hours ... ... Americans alone watch two hundred billion hours of TV every year, or enough time, if it would be devoted to projects similar to Wikipedia, to create about 2000 of them ... Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.
  • 42. Motivation: Genealogists and family historians
      • National Library of Australia’s 2012 Trove status report showed that ~50% of Trove users are family historians
      • National Library of New Zealand survey found that ~50% of PapersPast users are genealogists
      • California Digital Newspaper Collection spring 2012 survey discovered that ~70% of its users are genealogists; 75% are 50 years old or older
      • A Utah Digital Newspapers survey showed that 72% of its users are genealogists
      * John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World
  • 43. Motivation: Trove users’ report
      • “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.”
      • “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.”
      From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.
  • 44. Motivation: CDNC users’ report. “I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.” Wesley, California. Personal communications with CDNC text correctors.
  • 45. Motivation: CDNC users’ report. “I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc. The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC. I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.” Ann, California. Personal communications with CDNC text correctors.
  • 46. Motivation: CDNC users’ report. “I am correcting text for the Coronado Tent City Program for 1903. It is important to correct any problems with personal names and other information so that researchers will be able to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in digitizing the text and can cause problems for searchers. Also, many of the guests' names at Tent City and Hotel Del Coronado were taken from the registration books and reported in the Program. This led to many problems in spelling of last names and the editors were not careful to be consistent in the spellings. This Program is an important resource since it provides an excellent picture of daily life in Tent City and captures much of the history of Coronado itself.” Gene, California. Personal communications with CDNC text correctors.
  • 47. Motivation: CDNC users’ report. “I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.” David, United Kingdom. Personal communications with CDNC text correctors.
  • 48. Motivation: CDNC users’ report. “CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. .... I am not doing this very regularly as this is just my hobby and pleasure.” Jerzey, Poland. Personal communications with CDNC text correctors.
  • 50. Website traffic: After a crowdsourcing transcription project of diaries from the American War Between the States, Nicole Saylor, Head of Digital Library Services at the University of Iowa Libraries, reported “On June 9, 2011, we went from about 1000 daily hits to our digital library on a really good day to more than 70,000.” Nicole Saylor interviewed by Trevor Owens. “Crowdsourcing the Civil War: Insights Interview with Nicole Saylor” blog post at http://blogs.loc.gov/digitalpreservation/2011/12/crowdsourcing-the-civil-war-insights-interview-with-nicole-saylor/. Dec 6, 2011.
  • 51. Website traffic at CDNC before / after implementing crowdsourcing
                         before crowdsourcing         after crowdsourcing
                         11-Jun-2011 / 12-Jul-2011    11-Jun-2012 / 12-Jul-2012    change
      visits             17,485                       21,488                       +22.9%
      unique visitors    11,381                       13,376                       +17.5%
      visit duration     9m 24s                       11m 7s                       +18.3%
      bounce rate        51.3%                        44.5%                        -6.8 percentage points
      pages per visit    14.9                         11.7                         -21.5%
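The change column can be reproduced from the before and after figures as a relative change, (after - before) / before. A small sketch follows, with an illustrative function name. One detail worth making explicit: the -6.8 for bounce rate matches a difference in percentage points, not a relative change.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

traffic = {
    "visits": (17_485, 21_488),
    "unique visitors": (11_381, 13_376),
    "visit duration (s)": (9 * 60 + 24, 11 * 60 + 7),  # 9m 24s and 11m 7s in seconds
    "pages per visit": (14.9, 11.7),
}
changes = {name: round(percent_change(b, a), 1) for name, (b, a) in traffic.items()}

# Bounce rate is already a percentage, so two kinds of "change" are possible.
bounce_before, bounce_after = 51.3, 44.5
bounce_points = round(bounce_after - bounce_before, 1)                   # percentage points
bounce_relative = round(percent_change(bounce_before, bounce_after), 1)  # relative change
```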
  • 53. Crowdsourcing benefits. Public domain photo courtesy of US Navy
  • 54. $ Economics: Financial value of outsourced OCR text correction for newspapers? The Assumptions
      • 25 to 50 characters per line in a newspaper column: Assume 40 characters per line (CDNC sample average)
      • Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters: Assume $0.50 per 1000 characters
  • 55. $ Economics
      • CDNC: 578,000 lines x 40 characters per line x 1/1000 x $0.50 = $11,560
      • NLA: 68,908,757 lines x 40 characters per line x 1/1000 x $0.50 = $1,378,175
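The outsourcing estimate is a single product, so it is easy to re-run with other assumptions. The sketch below uses the slide's defaults (40 characters per line, $0.50 per 1000 characters); the function name is illustrative, and the low and high figures simply apply the quoted $0.35 to $1.20 price range to the CDNC line count.

```python
def outsourced_value(lines, chars_per_line=40, usd_per_1000_chars=0.50):
    """Cost in USD of paying a vendor to correct `lines` lines of text."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars

cdnc = outsourced_value(578_000)
nla = outsourced_value(68_908_757)

# Low and high ends of the quoted price range, for the CDNC line count.
cdnc_low = outsourced_value(578_000, usd_per_1000_chars=0.35)
cdnc_high = outsourced_value(578_000, usd_per_1000_chars=1.20)
```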
  • 56. $ Economics: Financial value of in-house OCR text correction? The Assumptions
      • Correction takes 15 seconds per line
      • Cost is hourly wage plus benefits of lowest level employee, $10 for CDNC, $41.88* for Australia
      * AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.
  • 57. $ Economics
      • CDNC: 578,000 lines x 15 seconds per line x 1/3600 hrs per second x $10.00 per hr = $24,083
      • NLA: 68,908,757 lines x 15 seconds per line x 1/3600 hrs per second x $41.88 per hr = $12,024,578
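The in-house estimate converts lines to staff hours at 15 seconds per line and multiplies by the hourly labor cost. A sketch under the same assumptions, with an illustrative function name:

```python
def in_house_value(lines, hourly_cost, seconds_per_line=15):
    """Cost in USD of staff correcting `lines` lines at `hourly_cost` per hour."""
    hours = lines * seconds_per_line / 3600
    return hours * hourly_cost

cdnc = in_house_value(578_000, hourly_cost=10.00)
nla = in_house_value(68_908_757, hourly_cost=41.88)

# Staff time implied by the CDNC figure, in hours.
cdnc_hours = 578_000 * 15 / 3600
```

At 15 seconds per line the CDNC corrections correspond to roughly 2,400 hours of work.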
  • 58. Accuracy: “His Accuracy Depends on Ours!” Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]
  • 59. Accuracy
      • Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers
      • Rose Holley (National Library of Australia) reports that raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers
      Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008. Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs.” D-Lib Magazine. March/April 2009. Public domain graphic courtesy of Wikimedia Commons.
  • 60. Accuracy: Mapping texts* assesses digitization quality of digital newspapers by comparing the number of words recognized to the total number of words scanned. *Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.
  • 61. Accuracy: How does low text accuracy affect search recall? The Facts
      • Average uncorrected OCR character accuracy of the CDNC sample data is ~89%
      • Average length of an English word is 5 characters
      • Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8%; round up to 60%, or 6 out of 10 words correct
      Public domain graphic courtesy of Wikimedia Commons.
  • 62. Search recall, no text correction [Graphic: ten instances of “ARNDT”, each marked as found or not found.]
  • 63. Accuracy: The Facts
      • Average corrected character accuracy of the CDNC sample data is ~99.4%
      • Average word accuracy of CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%
      Public domain graphic courtesy of Wikimedia Commons.
  • 64. Search recall, with text correction [Graphic: ten instances of “ARNDT”, each marked as found or not found.]
  • 65. Accuracy: A search for “Arndt” at Chronicling America gives 10,267 results*
      • If Chronicling America text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,133 instances of “Arndt” were not found
      • If text accuracy is 97.0%, then 317 instances of “Arndt” were not found
      * Search performed 31 Oct 2012
      Alexa global / country traffic rank of Library of Congress (31-Oct-2012): 4,056 / 1,317. Chronicling America gets ~7.1% of all Library of Congress web traffic. Public domain graphic courtesy of Wikimedia Commons.
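The missed-instance figures are consistent with treating word accuracy as the probability that any one occurrence of the name is searchable, so that the hits returned are only that fraction of the true total. That reading is an inference from the numbers rather than something the slide states, and the function name below is illustrative.

```python
def estimated_missed(found, word_accuracy):
    """Estimate occurrences a search misses if each is found with probability `word_accuracy`."""
    estimated_total = found / word_accuracy
    return estimated_total - found

found = 10_267  # search results for "Arndt" reported on the slide
missed_uncorrected = estimated_missed(found, 0.558)
missed_corrected = estimated_missed(found, 0.970)
```

Under this reading the sketch lands within one instance of each of the slide's two figures.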
  • 66. Hard-to-measure-but-shouldn’t-be-overlooked benefits. Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.
  • 67. HTMBSBO benefit: “when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages” Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
  • 68. HTMBSBO benefit: “in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections” Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
  • 69. Crowdsourcing considerations
      • How to market / advertise crowdsourcing?
      • How to motivate crowdsourcing volunteers?
      • Is authentication / identity of volunteers an issue?
      • How to administer crowdsourced data?
      Photo of Aleister Crowley [Public domain] from Wikimedia Commons
  • 70. Conclusions
      • Lots of crowdsourcing in cultural heritage organizations and elsewhere
      • Benefits are multi-faceted: Economic, data accuracy, patron engagement, increased web traffic
      Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven
  • 71. Try crowdsourcing!
      • Correct California newspapers text: http://cdnc.ucr.edu
      • Correct Australian newspapers text: http://trove.nla.gov.au
      • Correct Cambridge MA newspapers text: http://bit.ly/cambridgepublic
      • Correct Russian language periodicals: http://bit.ly/russianperiodicals
      Others soon to follow: Library of Virginia, University of Tennessee, National Library of Singapore, ...
  • 72. ? Brian Geiger bgeiger@ucr.edu Frederick Zarndt frederick@frederickzarndt.com Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.