SlideShare ist ein Scribd-Unternehmen logo
1 von 18
Downloaden Sie, um offline zu lesen
Towards a Web Search
     Service for Minority
 Language Communities
                            Baden Hughes
       Department of Computer Science and
                      Software Engineering
               The University of Melbourne
             badenh@csse.unimelb.edu.au


17 January 2006        Hughes @ OpenRoad 2006   1
Diversity in Australia
     Well recognised cultural and linguistic diversity of
     Australia’s population
           SIL Ethnologue
             311 languages (14th edition, 2000)
             318 languages (15th edition, 2005)
             Australia in top 10 countries for linguistic diversity
             ( = languages in a country / languages globally )
           ABS: 364 languages (2005)
     Considerable number of low density languages used
     within immigrant communities

17 January 2006                   Hughes @ OpenRoad 2006              2
Inefficiency of Web Search
     General web search is a low precision activity
     in the best case scenario
           Google: 8 billion web pages
     Web search for materials in lesser-used
     languages is even lower precision than the
     general case
     Web search for minority (“low density”)
     languages is even lower precision again
           Mining the ‘long tail’ of the web is a specialist
           domain of research
17 January 2006              Hughes @ OpenRoad 2006            3
Harvesting vs Enabling
     Previous work in linguistically-oriented data mining
     of web content to create derivative works: corpora,
     dictionaries
           None of these address the low precision issues for
           generalized web search
     Our work is aimed at increasing the likelihood that
     end users searching for resources in minority
     languages on the web will find useful results from
     searching
           Developing use-case specific tools for web search and
           leveraging existing broad coverage web search tools

17 January 2006                Hughes @ OpenRoad 2006              4
Open Language Archives
Community (OLAC)
     OLAC is a consortium of linguistic data archives
           http://www.language-archives.org/
           34 archives, 28K+ objects in catalogue
     OLAC metadata is based on Dublin Core, with
     extensions for specifically linguistically-oriented
     properties eg language, data type, subject
     language, linguistic subject
     OLAC is an Open Archives Initiative (OAI)
     subcommunity
           Uses standard OAI Protocol for Metadata Harvesting to
           promote data access and integration

17 January 2006                Hughes @ OpenRoad 2006              5
In vs About
     OLAC Metadata crucially distinguishes
     between
           The language a resource is in (‘language’)
           The language a resource is about (‘subject
           language’)
     Such differentiation allows for additional
     precision in classifying, indexing and
     searching for low density language resources
           ‘In-ness’ is more interesting than ‘About-ness’
17 January 2006             Hughes @ OpenRoad 2006           6
Service Architecture
     Building on previous work in developing
     robust strategies for identifying web
     resources for lesser used languages on the
     web, the LangGator service architecture
     provides
           Language-centric web resource identification and
           acquisition
           Language-centric resource description
           Language-aware end-user resource discovery
17 January 2006            Hughes @ OpenRoad 2006             7
Crawler Internals
     Crawl seeded by language name variants
     (Ethnologue), place and country names and variants
     (Getty TGN), lexical items (Rosetta)
     Programmatic queries against Google, Yahoo, A9,
     DogPile
           Essentially guided metasearch
     Resulting URIs merged and sorted using rank
     aggregation techniques
     Highly ranked documents from metasearch used for
     focused crawling around URI
           TF/IDF for low frequency items in found documents
17 January 2006               Hughes @ OpenRoad 2006           8
Crawler Status
     Running intermittently since July 2004 on high
     bandwidth research infrastructure
     >1.6 million web resources have been identified in
     over 3000 languages
     Some exposed via standard OLAC search
     Majority exposed to standard search engines via
     DP9 gateway
           Full circle exploitation of web search
           Evaluation of precision improvement is ongoing
     More details in the paper (or Hughes 2005 paper)
17 January 2006                Hughes @ OpenRoad 2006       9
Metadata Descriptions
     Describing resources separately from their
     realization is required since the web based
     language-centric resources are not held locally
     Metadata creation is an effort intensive process
           Automatic description generation is well studied in the
           general digital libraries community (eg Paynter 2005)
     Some metadata elements are well supported by
     existing automatic metadata creation tools
     We focus particularly on language vs subject
     language metadata creation since it is of primary
     importance
17 January 2006                 Hughes @ OpenRoad 2006               10
Metadata Descriptions Status
     We use a combination of machine learning
     approaches to compare and classify a given
     resource against human curated gold standard data
     for known languages
           Primary data points: encoding, word n-grams, character n-
           grams
           Secondary data points: geographical referent colocation,
           lexical item occurrence, URI
     Currently described around 40% of the >1.6 million
     URIs found by crawler at probability of 0.8 or higher
     as threshold for acceptable language identification
           Computationally bound at present, but re-engineering
17 January 2006                Hughes @ OpenRoad 2006              11
Search Facilities
     Currently search delivered via OLAC Search Engine
     (http://www.language-archives.org/tools/search/)
     Features
           Web search style interface, UTF-8 support, no restrictions
           on string, operators, inline syntax
           Fuzzy string matching for geographical entities and
           language names
           ‘Click minimization’ strategy for empty search: pre-
           composed derivative queries
           Exploits Ethnologue and Getty ontologies
           Exploits linguistic knowledge (eg language families)
17 January 2006                Hughes @ OpenRoad 2006                   12
Search Facilities
     Localization-oriented interface
           XML core with XSL
           Entirely user preference driven with a default
           Post-query encoding/language change
           Currently code auditing for upgrading interface
           strings to XLIFF Portable Objects
           Interest for localization into French, Spanish,
           Bahasa Indonesia, Vietnamese, Thai
           More search architecture detail in Kamat and
           Hughes (2005)
17 January 2006             Hughes @ OpenRoad 2006           13
Language Search: Dinka




17 January 2006   Hughes @ OpenRoad 2006   14
Country Search: Togo




17 January 2006   Hughes @ OpenRoad 2006   15
Future Work
     Increased frequency of web crawling
     More efficient and reliable language identification
     End user documentation and accessibility
     API documentation for third party data consumers and
     documentation for service/interface customization
     Map based search GUI; better geographical context-
     aware search
     Linguistically or geographical proximity based
     language matching
     Basic Language Resource Kits (BLARK)
     Integration with MyLanguage
17 January 2006         Hughes @ OpenRoad 2006          16
Conclusion
     Language-centric broad coverage web search is a
     strongly motivated user function
     Major search providers do not focus on precision
     improvement per se, but can be incrementally
     improved through covert means
     A multilingual web and multilingual web users can
     be supported effectively, even down to low densities
     Interested in leveraging our existing research and
     service development in other ways

17 January 2006         Hughes @ OpenRoad 2006          17
Acknowledgements
     Research supported by the Australian
     Research Council under the funding program
     for Special Research Initiatives (E-Research)
     Grant SR0567353 “An Intelligent Search
     Infrastructure for Language Resources on the
     Web”.




17 January 2006         Hughes @ OpenRoad 2006       18

Weitere Àhnliche Inhalte

Was ist angesagt?

Git studynotes
Git studynotesGit studynotes
Git studynotesRichard Kuo
 
IFLA 2012 - OCLC Linked Data round table
IFLA 2012 - OCLC Linked Data round tableIFLA 2012 - OCLC Linked Data round table
IFLA 2012 - OCLC Linked Data round tableFigoblog
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked DataEUCLID project
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked DataEUCLID project
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
Charper.penn.20140411
Charper.penn.20140411Charper.penn.20140411
Charper.penn.20140411charper
 
Metadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentMetadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentDiane Hillmann
 
Querying Linked Data
Querying Linked DataQuerying Linked Data
Querying Linked DataEUCLID project
 
Big Linked Data - Creating Training Curricula
Big Linked Data - Creating Training CurriculaBig Linked Data - Creating Training Curricula
Big Linked Data - Creating Training CurriculaEUCLID project
 
LODLAM Landscape NOTES
LODLAM Landscape NOTESLODLAM Landscape NOTES
LODLAM Landscape NOTESShana McDanold
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
LDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data CategoriesLDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data CategoriesMenzo Windhouwer
 
Open library data and embrace the world library linked data
Open library data and embrace the world library linked dataOpen library data and embrace the world library linked data
Open library data and embrace the world library linked data皓仁 æŸŻ
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageOntotext
 
LODLAM Landscape
LODLAM LandscapeLODLAM Landscape
LODLAM LandscapeShana McDanold
 

Was ist angesagt? (17)

NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
 
Git studynotes
Git studynotesGit studynotes
Git studynotes
 
IFLA 2012 - OCLC Linked Data round table
IFLA 2012 - OCLC Linked Data round tableIFLA 2012 - OCLC Linked Data round table
IFLA 2012 - OCLC Linked Data round table
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
Charper.penn.20140411
Charper.penn.20140411Charper.penn.20140411
Charper.penn.20140411
 
Metadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentMetadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data Environment
 
Querying Linked Data
Querying Linked DataQuerying Linked Data
Querying Linked Data
 
Big Linked Data - Creating Training Curricula
Big Linked Data - Creating Training CurriculaBig Linked Data - Creating Training Curricula
Big Linked Data - Creating Training Curricula
 
LODLAM Landscape NOTES
LODLAM Landscape NOTESLODLAM Landscape NOTES
LODLAM Landscape NOTES
 
RDA and Linked Data. Gordon Dunsire
RDA and Linked Data. Gordon DunsireRDA and Linked Data. Gordon Dunsire
RDA and Linked Data. Gordon Dunsire
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
LDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data CategoriesLDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data Categories
 
Open library data and embrace the world library linked data
Open library data and embrace the world library linked dataOpen library data and embrace the world library linked data
Open library data and embrace the world library linked data
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
LODLAM Landscape
LODLAM LandscapeLODLAM Landscape
LODLAM Landscape
 

Andere mochten auch

A Real Love Story34
A Real Love Story34A Real Love Story34
A Real Love Story34guest0ecbb7
 
Week 9 Sponges
Week 9 SpongesWeek 9 Sponges
Week 9 SpongesCorey Topf
 
Schleswig June 07
Schleswig  June 07Schleswig  June 07
Schleswig June 07abfrench
 
Building Computational Grids with Apple’s Xgrid Middleware
Building Computational Grids with Apple’s Xgrid MiddlewareBuilding Computational Grids with Apple’s Xgrid Middleware
Building Computational Grids with Apple’s Xgrid MiddlewareBaden Hughes
 
0708 De gebruiker heeft altijd gelijk - user-centered design
0708 De gebruiker heeft altijd gelijk - user-centered design0708 De gebruiker heeft altijd gelijk - user-centered design
0708 De gebruiker heeft altijd gelijk - user-centered designHans Kemp
 
Zappos - ANA - 10-17-08
Zappos - ANA - 10-17-08Zappos - ANA - 10-17-08
Zappos - ANA - 10-17-08zappos
 
User Experience Design Introduction
User Experience Design   IntroductionUser Experience Design   Introduction
User Experience Design IntroductionHans Kemp
 
0708 IAD1 Q4 Hoorcollege 1
0708 IAD1 Q4 Hoorcollege 10708 IAD1 Q4 Hoorcollege 1
0708 IAD1 Q4 Hoorcollege 1Hans Kemp
 
Zappos - Community 2.0 Conference - 05-13-08
Zappos - Community 2.0 Conference  - 05-13-08Zappos - Community 2.0 Conference  - 05-13-08
Zappos - Community 2.0 Conference - 05-13-08zappos
 
For Sale Infiniti J30 in Greeley CO
For Sale Infiniti J30 in Greeley COFor Sale Infiniti J30 in Greeley CO
For Sale Infiniti J30 in Greeley COrteam
 
Iad1 0809 Q3 Hoorcollege 1 Structuur, Flow En Navigatie
Iad1 0809 Q3 Hoorcollege 1   Structuur, Flow En NavigatieIad1 0809 Q3 Hoorcollege 1   Structuur, Flow En Navigatie
Iad1 0809 Q3 Hoorcollege 1 Structuur, Flow En NavigatieHans Kemp
 
Week 26 Sponges
Week 26 SpongesWeek 26 Sponges
Week 26 SpongesCorey Topf
 
Daniela delgadoted talk
Daniela delgadoted talkDaniela delgadoted talk
Daniela delgadoted talkCorey Topf
 
Chapin Challenge
Chapin ChallengeChapin Challenge
Chapin ChallengeCorey Topf
 
Week 15 Sponges
Week 15 SpongesWeek 15 Sponges
Week 15 SpongesCorey Topf
 
How can innovation change brands and drive your business and communication fo...
How can innovation change brands and drive your business and communication fo...How can innovation change brands and drive your business and communication fo...
How can innovation change brands and drive your business and communication fo...Johan Ronnestam
 
Greek educational system
Greek educational systemGreek educational system
Greek educational systemGeorge Bekiaridis
 
0708 Iad2 Q4 Hoorcollege1
0708 Iad2 Q4 Hoorcollege10708 Iad2 Q4 Hoorcollege1
0708 Iad2 Q4 Hoorcollege1Hans Kemp
 

Andere mochten auch (20)

A Real Love Story34
A Real Love Story34A Real Love Story34
A Real Love Story34
 
Week 36
Week 36Week 36
Week 36
 
Week 9 Sponges
Week 9 SpongesWeek 9 Sponges
Week 9 Sponges
 
Schleswig June 07
Schleswig  June 07Schleswig  June 07
Schleswig June 07
 
Building Computational Grids with Apple’s Xgrid Middleware
Building Computational Grids with Apple’s Xgrid MiddlewareBuilding Computational Grids with Apple’s Xgrid Middleware
Building Computational Grids with Apple’s Xgrid Middleware
 
0708 De gebruiker heeft altijd gelijk - user-centered design
0708 De gebruiker heeft altijd gelijk - user-centered design0708 De gebruiker heeft altijd gelijk - user-centered design
0708 De gebruiker heeft altijd gelijk - user-centered design
 
Zappos - ANA - 10-17-08
Zappos - ANA - 10-17-08Zappos - ANA - 10-17-08
Zappos - ANA - 10-17-08
 
User Experience Design Introduction
User Experience Design   IntroductionUser Experience Design   Introduction
User Experience Design Introduction
 
0708 IAD1 Q4 Hoorcollege 1
0708 IAD1 Q4 Hoorcollege 10708 IAD1 Q4 Hoorcollege 1
0708 IAD1 Q4 Hoorcollege 1
 
Zappos - Community 2.0 Conference - 05-13-08
Zappos - Community 2.0 Conference  - 05-13-08Zappos - Community 2.0 Conference  - 05-13-08
Zappos - Community 2.0 Conference - 05-13-08
 
For Sale Infiniti J30 in Greeley CO
For Sale Infiniti J30 in Greeley COFor Sale Infiniti J30 in Greeley CO
For Sale Infiniti J30 in Greeley CO
 
Ux intro
Ux introUx intro
Ux intro
 
Iad1 0809 Q3 Hoorcollege 1 Structuur, Flow En Navigatie
Iad1 0809 Q3 Hoorcollege 1   Structuur, Flow En NavigatieIad1 0809 Q3 Hoorcollege 1   Structuur, Flow En Navigatie
Iad1 0809 Q3 Hoorcollege 1 Structuur, Flow En Navigatie
 
Week 26 Sponges
Week 26 SpongesWeek 26 Sponges
Week 26 Sponges
 
Daniela delgadoted talk
Daniela delgadoted talkDaniela delgadoted talk
Daniela delgadoted talk
 
Chapin Challenge
Chapin ChallengeChapin Challenge
Chapin Challenge
 
Week 15 Sponges
Week 15 SpongesWeek 15 Sponges
Week 15 Sponges
 
How can innovation change brands and drive your business and communication fo...
How can innovation change brands and drive your business and communication fo...How can innovation change brands and drive your business and communication fo...
How can innovation change brands and drive your business and communication fo...
 
Greek educational system
Greek educational systemGreek educational system
Greek educational system
 
0708 Iad2 Q4 Hoorcollege1
0708 Iad2 Q4 Hoorcollege10708 Iad2 Q4 Hoorcollege1
0708 Iad2 Q4 Hoorcollege1
 

Ähnlich wie Towards a Web Search Service for Minority Language Communities

Knowledge Organization System (KOS) for biodiversity information resources, G...
Knowledge Organization System (KOS) for biodiversity information resources, G...Knowledge Organization System (KOS) for biodiversity information resources, G...
Knowledge Organization System (KOS) for biodiversity information resources, G...Dag Endresen
 
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUOne Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUCourtney McDonald
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana
 
Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...
Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...
Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...Baden Hughes
 
Web-scale Discovery Implementation with the End User in Mind (SLA 2012)
Web-scale Discovery Implementation with the End User in Mind (SLA 2012)Web-scale Discovery Implementation with the End User in Mind (SLA 2012)
Web-scale Discovery Implementation with the End User in Mind (SLA 2012)Rafal Kasprowski
 
Datalift lod2-paris-24032011
Datalift lod2-paris-24032011Datalift lod2-paris-24032011
Datalift lod2-paris-24032011Datalift
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technologytechiaith
 
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...Menzo Windhouwer
 
Webscale Discovery with the Enduser in Mind
Webscale Discovery with the Enduser in Mind Webscale Discovery with the Enduser in Mind
Webscale Discovery with the Enduser in Mind Debra Kolah
 
2005 09 Dc Keynote
2005 09 Dc Keynote2005 09 Dc Keynote
2005 09 Dc KeynoteJohannes Keizer
 
An Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdfAn Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdfJackie Gold
 
Learning Object Annotation in Agricultural Learning Repositories
Learning Object Annotation in Agricultural Learning RepositoriesLearning Object Annotation in Agricultural Learning Repositories
Learning Object Annotation in Agricultural Learning RepositoriesHannes Ebner
 
Core presentation
Core presentationCore presentation
Core presentationpetrknoth
 
2012 Software Freedom Day Presentation about Koha ILMS
2012 Software Freedom Day Presentation about Koha ILMS2012 Software Freedom Day Presentation about Koha ILMS
2012 Software Freedom Day Presentation about Koha ILMSRYAN T.
 
Data Publishing in Archaeozoology
Data Publishing in ArchaeozoologyData Publishing in Archaeozoology
Data Publishing in ArchaeozoologySarah Whitcher Kansa
 

Ähnlich wie Towards a Web Search Service for Minority Language Communities (20)

Knowledge Organization System (KOS) for biodiversity information resources, G...
Knowledge Organization System (KOS) for biodiversity information resources, G...Knowledge Organization System (KOS) for biodiversity information resources, G...
Knowledge Organization System (KOS) for biodiversity information resources, G...
 
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUOne Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
 
Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...
Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...
Object Reuse and Exchange (ORE) : Experience in the Open Language Archives Co...
 
Aos Project and Realization
Aos Project and RealizationAos Project and Realization
Aos Project and Realization
 
Use and integration of controlled vocabularies (AGROVOC) in DSpace Repositories
Use and integration of controlled vocabularies (AGROVOC) in DSpace RepositoriesUse and integration of controlled vocabularies (AGROVOC) in DSpace Repositories
Use and integration of controlled vocabularies (AGROVOC) in DSpace Repositories
 
Web-scale Discovery Implementation with the End User in Mind (SLA 2012)
Web-scale Discovery Implementation with the End User in Mind (SLA 2012)Web-scale Discovery Implementation with the End User in Mind (SLA 2012)
Web-scale Discovery Implementation with the End User in Mind (SLA 2012)
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
LOD2: Guest presentation: French datalift project
LOD2: Guest presentation: French datalift projectLOD2: Guest presentation: French datalift project
LOD2: Guest presentation: French datalift project
 
Datalift lod2-paris-24032011
Datalift lod2-paris-24032011Datalift lod2-paris-24032011
Datalift lod2-paris-24032011
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technology
 
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
Collaboratively Defining Widely Accepted Linguistic Data Categories in the IS...
 
Webscale Discovery with the Enduser in Mind
Webscale Discovery with the Enduser in Mind Webscale Discovery with the Enduser in Mind
Webscale Discovery with the Enduser in Mind
 
2005 09 Dc Keynote
2005 09 Dc Keynote2005 09 Dc Keynote
2005 09 Dc Keynote
 
An Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdfAn Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdf
 
Exploring the Evolution and Diversity of Speech Datasets
Exploring the Evolution and Diversity of Speech DatasetsExploring the Evolution and Diversity of Speech Datasets
Exploring the Evolution and Diversity of Speech Datasets
 
Learning Object Annotation in Agricultural Learning Repositories
Learning Object Annotation in Agricultural Learning RepositoriesLearning Object Annotation in Agricultural Learning Repositories
Learning Object Annotation in Agricultural Learning Repositories
 
Core presentation
Core presentationCore presentation
Core presentation
 
2012 Software Freedom Day Presentation about Koha ILMS
2012 Software Freedom Day Presentation about Koha ILMS2012 Software Freedom Day Presentation about Koha ILMS
2012 Software Freedom Day Presentation about Koha ILMS
 
Data Publishing in Archaeozoology
Data Publishing in ArchaeozoologyData Publishing in Archaeozoology
Data Publishing in Archaeozoology
 

Mehr von Baden Hughes

Closing the Gap: Data Models for Documentary Linguistics
Closing the Gap: Data Models for Documentary LinguisticsClosing the Gap: Data Models for Documentary Linguistics
Closing the Gap: Data Models for Documentary LinguisticsBaden Hughes
 
Managing Perl Installations: A SysAdmin's View
Managing Perl Installations: A SysAdmin's ViewManaging Perl Installations: A SysAdmin's View
Managing Perl Installations: A SysAdmin's ViewBaden Hughes
 
If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...
If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...
If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...Baden Hughes
 
Functional Requirements for an Interlinear Text Editor
Functional Requirements for an Interlinear Text EditorFunctional Requirements for an Interlinear Text Editor
Functional Requirements for an Interlinear Text EditorBaden Hughes
 
Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...
Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...
Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...Baden Hughes
 
Disambiguating Advanced Computing for Humanities Researchers
Disambiguating Advanced Computing for Humanities ResearchersDisambiguating Advanced Computing for Humanities Researchers
Disambiguating Advanced Computing for Humanities ResearchersBaden Hughes
 
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Baden Hughes
 
Encoding and Presenting Interlinear Text Using XML Technologies
Encoding and Presenting Interlinear Text Using XML TechnologiesEncoding and Presenting Interlinear Text Using XML Technologies
Encoding and Presenting Interlinear Text Using XML TechnologiesBaden Hughes
 
Refactoring Metadata:
Refactoring Metadata:Refactoring Metadata:
Refactoring Metadata:Baden Hughes
 
Change Management and Versioning in Ontologies
Change Management and Versioning in OntologiesChange Management and Versioning in Ontologies
Change Management and Versioning in OntologiesBaden Hughes
 
The Effects of Cross-Pollination : How non-library mass market services are c...
The Effects of Cross-Pollination : How non-library mass market services are c...The Effects of Cross-Pollination : How non-library mass market services are c...
The Effects of Cross-Pollination : How non-library mass market services are c...Baden Hughes
 
Why Digitization Increases the Value of Print Collections
Why Digitization Increases the Value of Print CollectionsWhy Digitization Increases the Value of Print Collections
Why Digitization Increases the Value of Print CollectionsBaden Hughes
 

Mehr von Baden Hughes (12)

Closing the Gap: Data Models for Documentary Linguistics
Closing the Gap: Data Models for Documentary LinguisticsClosing the Gap: Data Models for Documentary Linguistics
Closing the Gap: Data Models for Documentary Linguistics
 
Managing Perl Installations: A SysAdmin's View
Managing Perl Installations: A SysAdmin's ViewManaging Perl Installations: A SysAdmin's View
Managing Perl Installations: A SysAdmin's View
 
If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...
If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...
If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The Univer...
 
Functional Requirements for an Interlinear Text Editor
Functional Requirements for an Interlinear Text EditorFunctional Requirements for an Interlinear Text Editor
Functional Requirements for an Interlinear Text Editor
 
Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...
Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...
Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Pro...
 
Disambiguating Advanced Computing for Humanities Researchers
Disambiguating Advanced Computing for Humanities ResearchersDisambiguating Advanced Computing for Humanities Researchers
Disambiguating Advanced Computing for Humanities Researchers
 
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
 
Encoding and Presenting Interlinear Text Using XML Technologies
Encoding and Presenting Interlinear Text Using XML TechnologiesEncoding and Presenting Interlinear Text Using XML Technologies
Encoding and Presenting Interlinear Text Using XML Technologies
 
Refactoring Metadata:
Refactoring Metadata:Refactoring Metadata:
Refactoring Metadata:
 
Change Management and Versioning in Ontologies
Change Management and Versioning in OntologiesChange Management and Versioning in Ontologies
Change Management and Versioning in Ontologies
 
The Effects of Cross-Pollination : How non-library mass market services are c...
The Effects of Cross-Pollination : How non-library mass market services are c...The Effects of Cross-Pollination : How non-library mass market services are c...
The Effects of Cross-Pollination : How non-library mass market services are c...
 
Why Digitization Increases the Value of Print Collections
Why Digitization Increases the Value of Print CollectionsWhy Digitization Increases the Value of Print Collections
Why Digitization Increases the Value of Print Collections
 

KĂŒrzlich hochgeladen

Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Servicediscovermytutordmt
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Centuryrwgiffor
 
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779Delhi Call girls
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLSeo
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Neil Kimberley
 
Call Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine ServiceCall Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine Serviceritikaroy0888
 
RSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataRSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataExhibitors Data
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...anilsa9823
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...rajveerescorts2022
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Dipal Arora
 
Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...Roland Driesen
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with CultureSeta Wicaksana
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageMatteo Carbone
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfAdmir Softic
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear RegressionRavindra Nath Shukla
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Roland Driesen
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsP&CO
 
HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsMichael W. Hawkins
 
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyThe Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyEthan lee
 

KĂŒrzlich hochgeladen (20)

Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Service
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Century
 
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023
 
Call Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine ServiceCall Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine Service
 
RSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataRSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors Data
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
 
Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with Culture
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear Regression
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and pains
 
HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael Hawkins
 
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyThe Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
 
VVVIP Call Girls In Greater Kailash âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Greater Kailash âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...
 

Towards a Web Search Service for Minority Language Communities

  • 1. Towards a Web Search Service for Minority Language Communities Baden Hughes Department of Computer Science and Software Engineering The University of Melbourne badenh@csse.unimelb.edu.au 17 January 2006 Hughes @ OpenRoad 2006 1
  • 2. Diversity in Australia Well recognised cultural and linguistic diversity of Australia’s population SIL Ethnologue 311 languages (14th edition, 2000) 318 languages (15th edition, 2005) Australia in top 10 countries for linguistic diversity ( = languages in a country / languages globally ) ABS: 364 languages (2005) Considerable number of low density languages used within immigrant communities 17 January 2006 Hughes @ OpenRoad 2006 2
  • 3. Inefficiency of Web Search General web search is a low precision activity in the best case scenario Google: 8 billion web pages Web search for materials in lesser-used languages is even lower precision than the general case Web search for minority (“low density”) languages is even lower precision again Mining the ‘long tail’ of the web is a specialist domain of research 17 January 2006 Hughes @ OpenRoad 2006 3
  • 4. Harvesting vs Enabling Previous work in linguistically-oriented data mining of web content to create derivative works: corpora, dictionaries None of these address the low precision issues for generalized web search Our work is aimed at increasing the likelihood that end users searching for resources in minority languages on the web will find useful results from searching Developing use-case specific tools for web search and leveraging existing broad coverage web search tools 17 January 2006 Hughes @ OpenRoad 2006 4
  • 5. Open Language Archives Community (OLAC) OLAC is a consortium of linguistic data archives http://www.language-archives.org/ 34 archives, 28K+ objects in catalogue OLAC metadata is based on Dublin Core, with extensions for specifically linguistically-oriented properties eg language, data type, subject language, linguistic subject OLAC is an Open Archives Initiative (OAI) subcommunity Uses standard OAI Protocol for Metadata Harvesting to promote data access and integration 17 January 2006 Hughes @ OpenRoad 2006 5
  • 6. In vs About OLAC Metadata crucially distinguishes between The language a resource is in (‘language’) The language a resource is about (‘subject language’) Such differentiation allows for additional precision in classifying, indexing and searching for low density language resources ‘In-ness’ is more interesting than ‘About-ness’ 17 January 2006 Hughes @ OpenRoad 2006 6
  • 7. Service Architecture Building on previous work in developing robust strategies for identifying web resources for lesser used languages on the web, the LangGator service architecture provides Language-centric web resource identification and acquisition Language-centric resource description Language-aware end-user resource discovery 17 January 2006 Hughes @ OpenRoad 2006 7
  • 8. Crawler Internals Crawl seeded by language name variants (Ethnologue), place and country names and variants (Getty TGN), lexical items (Rosetta) Programmatic queries against Google, Yahoo, A9, DogPile Essentially guided metasearch Resulting URIs merged and sorted using rank aggregation techniques Highly ranked documents from metasearch used for focused crawling around URI TF/IDF for low frequency items in found documents 17 January 2006 Hughes @ OpenRoad 2006 8
  • 9. Crawler Status Running intermittently since July 2004 on high bandwidth research infrastructure >1.6 million web resources have been identified in over 3000 languages Some exposed via standard OLAC search Majority exposed to standard search engines via DP9 gateway Full circle exploitation of web search Evaluation of precision improvement is ongoing More details in the paper (or Hughes 2005 paper) 17 January 2006 Hughes @ OpenRoad 2006 9
  • 10. Metadata Descriptions Describing resources separately from their realization is required since the web based language-centric resources are not held locally Metadata creation is an effort intensive process Automatic description generation is well studied in the general digital libraries community (eg Paynter 2005) Some metadata elements are well supported by existing automatic metadata creation tools We focus particularly on language vs subject language metadata creation since it is of primary importance 17 January 2006 Hughes @ OpenRoad 2006 10
  • 11. Metadata Descriptions Status We use a combination of machine learning approaches to compare and classify a given resource against human curated gold standard data for known languages Primary data points: encoding, word n-grams, character n- grams Secondary data points: geographical referent colocation, lexical item occurrence, URI Currently described around 40% of the >1.6 million URIs found by crawler at probability of 0.8 or higher as threshold for acceptable language identification Computationally bound at present, but re-engineering 17 January 2006 Hughes @ OpenRoad 2006 11
  • 12. Search Facilities Currently search delivered via OLAC Search Engine (http://www.language-archives.org/tools/search/) Features Web search style interface, UTF-8 support, no restrictions on string, operators, inline syntax Fuzzy string matching for geographical entities and language names ‘Click minimization’ strategy for empty search: pre- composed derivative queries Exploits Ethnologue and Getty ontologies Exploits linguistic knowledge (eg language families) 17 January 2006 Hughes @ OpenRoad 2006 12
  • 13. Search Facilities Localization-oriented interface XML core with XSL Entirely user preference driven with a default Post-query encoding/language change Currently code auditing for upgrading interface strings to XLIFF Portable Objects Interest for localization into French, Spanish, Bahasa Indonesia, Vietnamese, Thai More search architecture detail in Kamat and Hughes (2005) 17 January 2006 Hughes @ OpenRoad 2006 13
  • 14. Language Search: Dinka 17 January 2006 Hughes @ OpenRoad 2006 14
  • 15. Country Search: Togo 17 January 2006 Hughes @ OpenRoad 2006 15
  • 16. Future Work Increased frequency of web crawling More efficient and reliable language identification End user documentation and accessibility API documentation for third party data consumers and documentation for service/interface customization Map based search GUI; better geographical context- aware search Linguistically or geographical proximity based language matching Basic Language Resource Kits (BLARK) Integration with MyLanguage 17 January 2006 Hughes @ OpenRoad 2006 16
  • 17. Conclusion Language-centric broad coverage web search is a strongly motivated user function Major search providers do not focus on precision improvement per se, but can be incrementally improved through covert means A multilingual web and multilingual web users can be supported effectively, even down to low densities Interested in leveraging our existing research and service development in other ways 17 January 2006 Hughes @ OpenRoad 2006 17
  • 18. Acknowledgements Research supported by the Australian Research Council under the funding program for Special Research Initiatives (E-Research) Grant SR0567353 “An Intelligent Search Infrastructure for Language Resources on the Web”. 17 January 2006 Hughes @ OpenRoad 2006 18