The document describes NewsIndexer, a system that filters and categorizes news content using a specialized thesaurus and rulebase from Thesaurus Master and M.A.I. to manage the massive flow of information. NewsIndexer's vocabulary contains over 5,200 terms across nine levels to reflect typical news coverage, and its rulebase is customized for news topics to distinguish between homographs and apply precise taxonomy terms. Filtering the news data in this way reduces noise, disambiguates terms, limits unnecessary detail, and directs the data to targeted recipients for more accurate retrieval and filtering of information.
2. NewsIndexer – a case study in filtering
Filters / categorizes / tags news content
Manages massive information flow
Based on Thesaurus Master and M.A.I.
Specialized thesaurus
Specialized rulebase
3. NewsIndexer’s vocabulary
Broad and general subject matter
Reflects coverage of typical news publications
Over 5200 terms, nine levels deep
Six top level categories
Geographic terms
Starter vocabulary
Easily adapted and customized
4.
5. NewsIndexer’s brain
M.A.I. rulebase customized for news topics
Words in text trigger M.A.I. rules
Conditions in rules determine precise taxonomy term(s) to apply
Rules capture human knowledge and analysis
Rules use context to distinguish between homographs
Chicago Bears
Bear market
Bears in the woods
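The idea of context-driven rules can be illustrated with a minimal sketch. This is not M.A.I. rule syntax; the rule table, the clue words, and the taxonomy terms ("Football", "Financial markets", "Wildlife") are all hypothetical, chosen only to mirror the three "bear" examples above.

```python
import re

# Hypothetical rulebase: a trigger word maps to (context clues, taxonomy term) pairs.
# Neither the clue words nor the taxonomy terms come from the NewsIndexer vocabulary.
RULES = {
    "bear": [
        ({"chicago", "nfl", "quarterback", "touchdown"}, "Football"),
        ({"market", "stocks", "investors", "shares"}, "Financial markets"),
        ({"woods", "forest", "wildlife", "hikers"}, "Wildlife"),
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def suggest_terms(text):
    """Return the taxonomy terms whose rule conditions are met by the text."""
    words = set(tokenize(text))
    suggested = []
    for trigger, conditions in RULES.items():
        # The rule fires on the trigger word or its simple plural.
        if trigger in words or trigger + "s" in words:
            for clues, term in conditions:
                # A condition is met when any context clue appears in the text.
                if words & clues and term not in suggested:
                    suggested.append(term)
    return suggested
```

With these assumed rules, a sentence mentioning "bears" and "woods" yields only the wildlife term, while one mentioning "bear" and "market" yields only the financial term. A trigger word with no matching context yields nothing.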
6.
7.
8. Why filter?
Reduce noise to enhance retrieval precision
Disambiguate homographs to increase accuracy
Limit unnecessary detail to reduce data flow
Direct data to targeted recipients
9. Filter to cut noise
M.A.I. suggests terms as directed by rules
Index with most specific appropriate terms
Result: precision and accuracy in retrieval
10. Filter to disambiguate
Common words used with very different meanings in different contexts
Utilities –
electricity / water / sewer?
utility software?
Architecture –
of buildings?
of computer systems?
M.A.I. rule conditions differentiate concepts
Information Architect doesn’t want to retrieve building blueprints
11. I want it ALL!
Rulebase filters data, yields ALL terms that meet conditions of M.A.I. rules
Editor can select, reject and add terms
Most specific appropriate term – as chosen by editor – is saved with the document
Subject metadata
XML format
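The editorial step (select, reject, add) followed by saving subject metadata as XML might look like the sketch below. The element names `document`, `subjects`, and `subject`, and the function name, are assumptions for illustration; the slides do not specify the actual XML schema.

```python
import xml.etree.ElementTree as ET


def subject_metadata_xml(doc_id, suggested, rejected=(), added=()):
    """Apply the editor's choices to the suggested terms and serialize them.

    The XML layout here is hypothetical, not the NewsIndexer output format.
    """
    # Keep suggested terms the editor did not reject, in their original order.
    kept = [term for term in suggested if term not in rejected]
    # Append terms the editor added by hand, skipping duplicates.
    for term in added:
        if term not in kept:
            kept.append(term)

    root = ET.Element("document", {"id": doc_id})
    subjects = ET.SubElement(root, "subjects")
    for term in kept:
        ET.SubElement(subjects, "subject").text = term
    return ET.tostring(root, encoding="unicode")
```

For example, calling it with suggested terms `["Medicine", "Penicillin"]`, rejecting `"Medicine"` and adding `"Antibiotics"`, produces a record holding only `Penicillin` and `Antibiotics`.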
12. Red Sox
Crime
Baseball
Elections
Pharmaceuticals
Gun control
Health sciences
Medicine
Law
Antibiotics
Major League Baseball
Penicillin
Campaign finance
Politics
13. Taxonomy Top Term: Health conditions, Health sciences, Medical facilities
2nd level: Medicine (under Health sciences)
3rd level: Pharmaceuticals
4th level: Antibiotics
5th level: Penicillin
14. Filter to limit detail
Want all terms or a select few?
Roll up terms to the first, second, or third level in your taxonomy
Up-posting
Good for automatic indexing
Programmers can set filter to reduce detail
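Up-posting can be sketched as walking each term up its broader-term chain and stopping at the chosen level. The parent table below uses the health terms from the deck's example; the level numbering (Health sciences as level 1, Penicillin as level 5) and the function names are assumptions made for this sketch.

```python
# Broader-term links for the example branch; level numbering is an assumption:
# Health sciences (1) > Medicine (2) > Pharmaceuticals (3) > Antibiotics (4) > Penicillin (5)
PARENT = {
    "Medicine": "Health sciences",
    "Pharmaceuticals": "Medicine",
    "Antibiotics": "Pharmaceuticals",
    "Penicillin": "Antibiotics",
}


def path_from_top(term):
    """Return the chain of terms from the top term down to the given term."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def up_post(terms, max_level):
    """Roll each term up to at most max_level (1 = top term), dropping duplicates."""
    if max_level < 1:
        raise ValueError("max_level must be 1 or greater")
    result = []
    for term in terms:
        path = path_from_top(term)
        # Terms already at or above max_level are kept unchanged.
        rolled = path[min(len(path), max_level) - 1]
        if rolled not in result:
            result.append(rolled)
    return result
```

Under these assumptions, rolling `Penicillin` and `Antibiotics` up to level 3 collapses both into `Pharmaceuticals`, and rolling up to level 1 collapses the whole branch into `Health sciences`.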
15.
16. Pharmaceuticals
Health sciences
Medicine
Antibiotics
Penicillin
17. Pharmaceuticals
AND Antibiotics
AND Penicillin
Health sciences
Medicine
Antibiotics
Penicillin
18. Taxonomy Top Term: Health conditions, Health sciences, Medical facilities
2nd level: Medicine
3rd, 4th, and 5th levels: Pharmaceuticals, Antibiotics, Penicillin
Up-post to third level
Narrower terms go in Medicine bucket
19. No details – just the big picture
Index comprehensively and retain details
BUT
Display only general terms for end user
Health sciences > Medicine > Pharmaceuticals > Antibiotics > Penicillin
Display higher level term
Index with most specific
20. Health sciences
AND Medicine
AND Pharmaceuticals
AND Antibiotics
AND Penicillin
Pharmaceuticals
Medicine
Antibiotics
Penicillin
21. Up-post to top level
Penicillin
Antibiotics
Pharmaceuticals
Medicine
Health sciences
Narrower terms go in Health sciences bucket
22. Filter to direct data
User expresses interest in general topics
e.g., Technology, Environment, Law
Materials indexed with those topics or any of their Narrower Terms are forwarded
Applications:
User profiles
Interest groups
Specific departments
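Directing data by profile can be sketched as expanding each topic of interest to include all of its narrower terms, then forwarding any document indexed with a term in that expanded set. The narrower-term table, the profile names, and the function names below are hypothetical and serve only to illustrate the matching logic.

```python
# Hypothetical narrower-term links; not the actual NewsIndexer hierarchy.
NARROWER = {
    "Health sciences": ["Medicine"],
    "Medicine": ["Pharmaceuticals"],
    "Pharmaceuticals": ["Antibiotics"],
    "Antibiotics": ["Penicillin"],
    "Technology": ["Computer architecture", "Utility software"],
}


def expand(topic):
    """Return the topic together with all of its narrower terms, at any depth."""
    seen = {topic}
    stack = [topic]
    while stack:
        for child in NARROWER.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def route(doc_terms, profiles):
    """Return the recipients whose interests cover any of the document's index terms."""
    recipients = []
    for recipient, topics in profiles.items():
        interests = set().union(*(expand(topic) for topic in topics))
        if interests & set(doc_terms):
            recipients.append(recipient)
    return recipients
```

A profile that lists only `Health sciences` would therefore receive a document indexed with `Penicillin`, without the user ever having to name that specific term.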
23. Specialized filtering –
NewsIndexer and IPTC
International Press Telecommunications Council (IPTC) proposal for NewsCodes
Part of News Industry Text Format (NITF)
~1300 terms describe topics of news articles
Broad coverage (heavy on sports)
NewsIndexer rulebase can apply detailed NewsIndexer terms and/or IPTC NewsCodes
Comply with growing news standards
Achieve greater detail for news indexing
24. Thesaurus Master manages custom vocab
News feed
M.A.I. adds metadata (vocab in TM)
Cut noise, disambiguate
RESULT: ALL terms that meet M.A.I. rule conditions
Up-post to limit returns
RESULT: Higher level categories, reduced data stream for portal, targeted users, and other purposes
25. Filtering advantages
For the End User
Simpler, more manageable presentation of concepts
Consistent with typical user’s search strategy
Differentiated concepts associated with homographs
Targeted information according to user profile
For the Internal User
Documents retain subject metadata reflecting granular indexing
Precision search gets precision results