

Running head: TEXT MINING




                                      Text Mining



                                      Mark Sharp

      Rutgers University, School of Communication, Information and Library Studies


                                              Abstract

The general idea of text mining – getting small "nuggets" of desired information out of

"mountains" of textual data without having to read it all – is nearly as old as information retrieval

(IR) itself. Currently text mining is enjoying a surge of interest fueled by the popularity of the

Internet, the success of bioinformatics, and a rebirth of computational linguistics. It can be

viewed as one of a class of nontraditional IR strategies which attempt to treat entire text

collections holistically, avoid the bias of human queries, objectify the IR process with principled

algorithms, and "let the data speak for itself." These strategies share many techniques such as

semantic parsing and statistical clustering, and the boundaries between them are fuzzy.

Therefore in this paper several related concepts are briefly reviewed in addition to text mining

proper, including data mining, machine learning, natural language processing, text

summarization, template mining, theme finding, text categorization, clustering, filtering, text

visualization, and text compression. Current text mining systems per se appear to be fairly

primitive, but to have the following goals which may serve as a useful definition to distinguish

text mining from other IR concepts: (1) to operate on large, natural language text collections; (2)

to use principled algorithms more than heuristics and manual filtering; (3) to extract

phenomenological units of information (e.g., patterns) rather than or in addition to documents;

(4) to discover new knowledge. Interest in text mining for biomedical research purposes is

especially pervasive and can be viewed as a major new frontier in bioinformatics. Text mining

systems designed for use with science and technology text databases such as MEDLINE

currently seem to have an undue emphasis on expert human filtering which contradicts goal (2).

Whether this represents premature surrender to difficulty or a necessary temporary expedient

remains to be seen.


                                            Text Mining

                                         Why Text Mining?

       It has become a cliché to describe information space and the challenge of navigating it in

dramatic, even histrionic terms ("explosion," "avalanche," "flood," and the like), especially with

regard to scientific, technical, and scholarly literature. We moderns may like to think we are the

first to face this problem, but scientists have always complained about keeping up with their

literature (Saracevic, 2001). The promise of better science through better information technology

has been a major theme in information science since Vannevar Bush (1945) proposed his famous

Memex machine to deal with the "growing mountain of research."

       Text mining is data mining applied to textual data. Text is "unstructured, amorphous, and

difficult to deal with" but also "the most common vehicle for formal exchange of information."

Therefore, the "motivation for trying to extract information from it is compelling – even if success

is only partial …. Whereas data mining belongs in the corporate world because that's where most

databases are, text mining promises to move machine learning technology out of the companies

and into the home" as an increasingly necessary Internet adjunct (Witten & Frank, 2000) – i.e., as

"web data mining" (Hearst, 1997). Laender, Ribeiro-Neto, da Silva, and Teixeira (2001) provide a

current review of web data extraction tools.

       Text mining is one of a class of what I will call "nontraditional information retrieval (IR)

strategies." The goal of these strategies is to reduce the effort required of users to obtain useful

information from large computerized text data sources. Traditional IR often simultaneously

retrieves both "too little" information and "too much" text (Humphreys, Demetriou, & Gaizauskas,

2000). The nontraditional strategies represent a "broader definition of IR" and the view that "a

truly useful system must go beyond simple retrieval" (Liddy, 2000). I see them as treating the


entire database or collection more holistically, recognizing that the selectivity of anthropogenic

queries has a downside or bias which can be counterproductive to obtaining the best information,

and attempting to "objectify" the IR process with principled algorithms.1 I like to think that they

try to "let the data speak for itself."

         When I started to research this paper I made a list of all the IR concepts (traditional and

non-) that were explicitly related to text mining by the first wave of authorities I identified. It was

a daunting list (Table 1), but I thought it would be possible to rule them all either "in" or "out" and

thus define their boundaries and hierarchical relationships to text mining. However, it soon

became clear that the boundaries were fuzzy, the hierarchy was a mass of convoluted loops, and

even seemingly outlandish claims to text mining relevance had, on closer inspection, a grain of

truth.2 Therefore I decided to try to cover them all instead of focusing on text mining proper,

whatever that turned out to be. Fortunately, time and literature resource limitations intervened to

significantly curtail this plan. Hopefully the result will serve as a sensible compromise.



                                                   History of Text Mining

         H. P. Luhn (1958), in a seminal paper on automatic abstracting, noted "the resolving power

of significant words" in primary text. Lauren B. Doyle (1961) also captured the spirit of text

mining and related methods when he said that "natural characterization and organization of

information can come from analysis of frequencies and distributions of words in libraries"
("libraries" representing what we would now more generally call collections or corpora). Text

mining per se may be new, but the dream of training a computer to extract information from

"mountains" of textual data is nearly as old as IR itself.

Footnote 1: E.g., "'Objectivity' [means] the results solely depend on the outcome of the linguistic processing algorithms and statistical calculations" (Dorre, Gerstl, & Seiffert, 1999). I recognize that such computational exotica, stripped of their mathematical mystique, "can be regarded as a form of transformed cognitive structure" (Ingwersen & Willett, 1995) and are therefore ultimately just as human and arbitrary as the traditional methods. But I also believe that there can be degrees of objectivity (operationally defined as general validity or utility) and that in general abstract computational approaches will tend to be more objective.

Footnote 2: There is one website, however, that goes too far. Greenfield (2001) lists virtually every text processing and database technology I have ever heard of under the title "Text Mining." As a kind of rite of passage into the subject, Patrick Perrin asked me to look at it and tell him if all of that was really text mining, so apparently it's somewhat notorious in the field.

       Don R. Swanson (1988) articulated the idea that the scientific literature should be regarded

as a natural phenomenon worthy of "exploration, correlation, and synthesis." He contrasted

scientists' attitudes toward information usage with those of intelligence analysts.

       'To the working scientist or engineer, time spent gathering information or writing reports is
       often regarded as a wasteful encroachment on time that would otherwise be spent
       producing results that he believes to be new' [Weinberg et al, 1963] …. The intelligence
       analyst, by contrast, is much more intimate with the available base of recorded information.
       New knowledge, or finished intelligence, is seen as emerging from large numbers of
       individually unimportant but carefully hoarded fragments that were not necessarily
       recognized as related to one another at the time they were acquired. Use of stored data is
       intensively interactive; "information retrieval" is an inadequate and even misleading
       metaphor. The analyst is continually interacting with units of stored data as though they
       were pieces selected from a thousand scrambled jigsaw puzzles. Relevant patterns, not
       relevant documents, are sought.


Swanson called upon scientists to be more like intelligence analysts; to "take seriously the idea that

new knowledge is to be gained from the library as well as the laboratory [and] to develop attitudes

toward information indistinguishable from attitudes toward research itself."

       Not content to lecture scientists from a theoretical pedestal, by the time these words were

published Swanson had already put the idea into practice by developing a system to discover

meaningful new knowledge in the biomedical literature (see references in Swanson & Smalheiser,

1999). Software now called ARROWSMITH and freely available on the web

(http://kiwi.uchicago.edu) helps by finding common keywords and phrases in "complementary and

noninteractive" sets of articles or "literatures" and juxtaposing representative citations likely to

reveal interesting co-occurrences. Two literatures are "complementary if together they can reveal

useful information not apparent in the two sets considered separately" – e.g., one may reveal a


natural relationship between A and B, and the other a relationship between B and C, so that

together they suggest a relationship between A and C. The two literatures are "noninteractive" if

their articles do not cross-cite and are not co-cited elsewhere in the literature. Swanson has

discovered at least three biomedically important relationships using this system: between fish oil

and Raynaud's syndrome, magnesium and migraines and epilepsy, and arginine and somatomedin

C (Lindsay & Gordon, 1999). Most recently he has used it to identify several dozen viruses as

potential bioweapons (Swanson, Smalheiser, & Bookstein, 2001).
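
Swanson's A-B-C pattern is simple enough to sketch in code. The fragment below is a minimal illustration of the general idea, not ARROWSMITH itself; the tokenizer, stopword list, and toy titles are my own assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "a", "on", "with", "for", "to", "is", "by"}

def content_words(text):
    """Reduce a title or abstract to a set of lowercase content words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def linking_terms(literature_a, literature_c):
    """Candidate B-terms: words appearing in both literatures, ranked by how
    often each side mentions them (a crude complementarity signal)."""
    count_a = Counter(w for doc in literature_a for w in content_words(doc))
    count_c = Counter(w for doc in literature_c for w in content_words(doc))
    shared = set(count_a) & set(count_c)
    return sorted(shared, key=lambda w: count_a[w] * count_c[w], reverse=True)

# Toy titles in the spirit of the fish oil / Raynaud's discovery.
lit_a = ["Dietary fish oil reduces blood viscosity",
         "Fish oil and platelet aggregation"]
lit_c = ["Raynaud's syndrome and elevated blood viscosity",
         "Platelet aggregation in Raynaud's patients"]
print(linking_terms(lit_a, lit_c))
```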

       Swanson's system remains far from fully automated, it is highly medical domain-specific,

and to my knowledge Swanson has never referred to it as text mining. But I believe it meets the

criteria at least partially (see below), and Swanson has been recognized as an early pioneer by self-

described text mining practitioners Marti Hearst (1999) and Ronald Kostoff (1999). I would like

to go further and propose that, because of the ideas he expressed in his 1988 JASIS paper,

Swanson is the father of modern text mining.



                                           What is Text Mining?

       Text mining per se is new and is still defining itself. It "has the peculiar distinction of

having a name and a fair amount of hype but as yet almost no practitioners" (Hearst, 1999), and

most of the information about it on the web is "misleading" (Perrin, 2001). The mining metaphor

"implies extracting precious nuggets of ore from otherwise worthless rock" (Hearst, 1999), "gold

hidden in … mountains of textual data" (Dorre, Gerstl, & Seiffert, 1999), or the idea that "the

computer rediscovers information that was encoded in the text by its author" (IBM, 1998b).

       Hearst (1997, 1999) has argued for a narrow definition of text mining which distinguishes

it from "information access" (traditional IR). Traditional IR is concerned primarily with the


retrieval of documents (perhaps it should be called "DR"!) relevant to a user's information need,

but getting the desired information out of the documents is left entirely up to the user. According

to Hearst, data mining (of which text mining is a subtype, see below) not only deals directly with

the information, it tries to discover or derive new information from the data (text) which was

previously unknown even to the author(s) of the data (text[s]). She says "data mining is

opportunistic, whereas information access is goal-driven" and that IR tricks such as clustering,

finding terms for query expansion, and co-citation analysis are not text mining, although they can

aid it by improving the target dataset. Thus, IR can be viewed as a complementary technique

supporting text mining, rather than its broader term.

       Text mining always involves (a) getting some texts relevant to the domain of interest

(traditional IR); (b) representing the content of the text in some medium useful for processing

(natural language processing, statistical modeling, etc.); and (c) doing something with the

representation (finding associations, dominant themes, etc.) (Perrin, 2001).

       IBM is marketing a product named "Intelligent Miner for Text" (IBM, 1998a,b; Dorre et

al, 1999). It is a set of tools which "can be seen as information extractors which enrich

documents with information about their contents" in the form of structured metadata. "Features"

are classes of data which can be extracted, such as the language of the text, proper names, dates,

currency amounts, abbreviations, and "multiword terms" (significant phrases). The feature

extraction component is "fully automatic – the vocabulary is not predefined." It may operate on

single documents or on collections of documents. Word counts are based on normalization to

canonical forms (e.g., surgeries, surgical, and surgically might all be normalized to surgery).

The phrase extractor "uses a set of simple heuristics… based on a dictionary containing part-of-

speech information for English words [and] simple pattern matching to find expressions having


the noun phrase structures characteristic of technical terms. This process is much faster than

alternative approaches." There is also a clustering tool, a classification tool, and a search engine/

web crawler. The clustering similarity measure is based on "lexical affinities" – correlated

groups of words which appear frequently within a short distance of each other and which can be

used to label the clusters.
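
The "lexical affinity" measure can be approximated with nothing more than a sliding window over the token stream. The sketch below is my own illustration of the general technique rather than IBM's implementation; the window size and toy documents are assumptions.

```python
import re
from collections import Counter

def lexical_affinities(texts, window=5, min_count=2):
    """Count unordered word pairs that co-occur within a short window; the
    most frequent pairs can serve as human-readable labels for clusters."""
    pairs = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + window]:
                if v != w:
                    pairs[tuple(sorted((w, v)))] += 1
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

docs = ["Laparoscopic surgery reduced recovery time",
        "Recovery time after laparoscopic surgery was shorter",
        "Open surgery required longer recovery"]
print(lexical_affinities(docs))
```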

          Lindsay and Gordon (1999) and Kostoff (1999) have extended Swanson's approach

without calling it text mining, but Kostoff's other work explicitly uses that label and so he serves

as a kind of bridge. Swanson's system is essentially as follows: MEDLINE searches are done on

two subjects (say, magnesium and migraines) and the results (titles or abstracts) are dumped into

ARROWSMITH, which generates a list of all significant words and phrases common to the two

result sets, and uses this information to "juxtapose pairs of text passages for the user to consider

as possibly complementary" (Swanson & Smalheiser, 1999). Lindsay and Gordon (1999) added

lexical frequency statistics (tf*idf) to rank the common words and phrases by probable

discriminatory value, but their system, like Swanson's, still requires "human filters" at several

points.
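
The tf*idf ranking step is easy to reconstruct in outline. The following sketch is my own minimal version of the general idea, not Lindsay and Gordon's code; the tokenizer, the scoring, and the toy result sets are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def rank_shared_terms(set_a, set_b):
    """Rank words common to two retrieval sets by a summed tf*idf score, so
    that likely linking terms float to the top for human inspection."""
    docs = set_a + set_b
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    tf = Counter(w for doc in docs for w in tokenize(doc))
    shared = {w for d in set_a for w in tokenize(d)} & {w for d in set_b for w in tokenize(d)}
    scores = {w: tf[w] * math.log(n / df[w]) for w in shared}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

magnesium = ["Magnesium blocks vascular spasm", "Dietary magnesium affects vascular tone"]
migraine = ["Migraine involves vascular spasm", "Vascular reactivity in migraine patients"]
print(rank_shared_terms(magnesium, migraine))
```

Note how the idf factor pushes a word that occurs in every record (here "vascular") toward the bottom of the list, which is exactly the discriminatory effect the ranking is meant to add.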

          Kostoff and co-workers have published several papers on the Web describing various text

mining systems and applications. Losiewicz, Oard, and Kostoff (2000) describe a "TDM [text

data mining] architecture that unifies information retrieval from text collections, information

extraction from individual texts, knowledge discovery in databases, knowledge management in

organizations, and visualization of data and information." What they mean by "unifies" is

unclear, but this statement clearly betokens a broad view of text mining, almost as a synonym for

the entire family of nontraditional IR strategies. The "TDM architecture" they describe includes

subsystems for data collection (source selection and text retrieval), data warehousing


(information extraction and data storage), and data exploitation (data mining and presentation).

It thus appears to be a system for extracting and analyzing metadata. The authors discuss

linguistic analysis and numerous exotic pattern-finding techniques, but these appear to be long-

range goals. Current work focuses on the more pedestrian challenges of relevance feedback

("simulated nucleation"), bibliometrics, and phrase extraction and statistics. The system is "time

and labor intensive" by the authors' own admission, "requires the close involvement of technical

domain expert(s)" at every level of processing, and aims for a "main output [consisting of]

technical experts who have had their horizon and perspectives broadened substantially through

participation in the data mining process. The data mining tools, techniques and tangible products

are of secondary importance…"

       Kostoff, Toothman, Eberhart, and Humenik (2000) connect text mining to "database

tomography," a system for phrase extraction and proximity analysis. The authors capture the

spirit of text mining when they say "techniques that identify, select, gather, cull, and interpret

large amounts of technological information semi-autonomously can expand greatly the

capabilities of human beings…" The idea of "tomography" also evokes text visualization, an

important nontraditional IR strategy related to text mining (see below). The authors cite

unpublished studies showing that in "real-world text mining applications" there is a "strong

decoupling of the text mining research performer from the text mining user. The performer tended

to focus on exotic automated techniques, to the relative exclusion of the components of judgment

necessary for user credibility and acceptance." Users tended to favor simpler techniques, even if

it meant "reading copious numbers of articles." Database tomography aims to couple text mining

research and technology more closely with the user through "heavy involvement of topical

domain experts (either users or their proxies)" in the development of "strategic database maps"


on the "front end." "The authors believe that this is the proper use of automated techniques for

text mining: to augment and amplify the capabilities of the expert by providing insights to the

database structure and contents, not to replace the experts by a combination of machines and

non-experts."

       Kostoff and DeMarco (2001) define science and technology text mining as "the

extraction of information from technical literature." It has three components: information

retrieval (gathering relevant documents), information processing, and information integration.

"Information processing is the extraction of patterns from the retrieved records" by bibliometrics,

computational linguistics, and clustering. "Information integration is the synergistic combination

of the information processing computer output with the [human] reading of the retrieved relevant

records. The information processing output serves as a framework for the analysis, and the

insights from reading the records enhance the skeleton structure to provide a logical integrated

product." Again, "substantial manual labor" is noted, and technical details are not given, leaving

doubt as to what kind of and how much "computational linguistics" and "clustering" were

actually implemented. This work was also published under the title "Citation mining: Integrating

text mining and bibliometrics for research user profiling" by Kostoff, del Rio, Humenik, Garcia,

and Ramirez (2001).

       In all of Kostoff's articles, there is a disturbingly high ratio of shifting, florid, technical

jargon and speculation to actual accomplishment. He seems to be re-inventing several well

established techniques such as relevance feedback, co-citation analysis, and phrase extraction,

giving them flashy new names, and failing to cite prior work by others. It is often unclear where

the boundary is between the computer and human filtering, particularly in Kostoff's phrase

extraction process. Given the authors' constant emphasis on the importance of human judgment


it seems likely that they have not automated the phrase selection process at all, and therefore

have not added anything to classical word proximity analysis for phrase identification.

Unrestricted human filtering or intervention in what are supposed to be algorithmic processes is,

in some sense, a form of "fudging" or "cheating." It is antithetical to the goals of standardizing

and objectifying the IR process, and it is hard to see how it contributes anything progressive to

text mining research. This is not to disagree with Kostoff about the importance of domain

expertise and user credibility and acceptance, only to caution against using such concerns as a

figleaf for excessively primitive IR technology.

       Based on the foregoing, I propose the following criteria for a true text mining system.

The keywords are highlighted.

•   It must operate on large, natural language text collections.

•   It must use principled algorithms more than heuristics and manual filtering.

•   It must extract phenomenological units of information (e.g., patterns) rather than or in

    addition to documents.

•   It must discover new knowledge.

       It is to be expected that different systems will meet these criteria to different extents.

Currently Swanson's and Kostoff's systems are on shaky ground on at least the first two, possibly

three. Perhaps text mining, by these criteria, is still more dream than reality. So let's look at

some related concepts.



                                                 Data Mining

       It seems fairly noncontroversial that text mining is a subdiscipline of the broader and

slightly older field of data mining, the subdiscipline which deals with textual data. An


intermediate evolutionary lexical form, in fact, is "text data mining" (Hearst, 1999; Losiewicz et al,

2000). The mining metaphor implying "extracting precious nuggets of ore from otherwise

worthless rock" is actually more appropriate for text mining than for data mining, which tends to

deal with trends and patterns across whole databases (Hearst, 1999).

       Data mining is considered a synonym for "knowledge discovery in databases" (KDD) by

some writers (e.g. Hearst, 1999) and as a narrower term by others (e.g. Liddy, 2000). The most

cited definition of KDD is that given by Fayyad, Piatetsky-Shapiro, and Smyth (1996, cited by Qin,

2000, and Hearst, 1997): the nontrivial process of identifying valid, novel, potentially useful, and

ultimately understandable patterns in data. "Information archaeology" is a synonym for both data

mining and KDD, according to Hearst (1999). Two unusually practical, down-to-earth books on

data mining are Witten and Frank (2000) and Han and Kamber (2001) (Perrin, 2001).

       Data mining usually deals with structured data, but text is usually fairly unstructured. The

crux of the text mining problem, then, can be viewed as imposing structure on text to make it

amenable to the analytic techniques of data mining. This is often conceptualized as extracting

metadata from text (Losiewicz et al, 2000).



                                              Machine Learning

       Data mining is based on a variety of computational techniques, some of which fall under

the rubric of machine learning. Examples are decision trees, neural networks, and association rules

(clustering). In this context, machine learning involves "the acquisition of structural descriptions

from examples [which] can be used for prediction, explanation, and understanding." When the

description can be used to classify the examples, all three are enabled, unlike purely statistical

modeling which only supports prediction. By some views, however, machine learning is little


more than practical statistics as it evolved in the field of computer science; i.e., with an emphasis

on searching "through a space of possible concept descriptions for one that fits the data" (Witten &

Frank, 2000).

       From a broader artificial intelligence (AI) perspective, machine learning is one of the four

capabilities needed for an AI system such as a robot to pass the "Turing test" – that is, to appear

logical, rational, and intelligent to an intelligent human interrogator. In this context machine

learning involves the ability "to adapt to new circumstances and to detect and extrapolate patterns"

(Russell & Norvig, 1995).

       From a biomedical research perspective, Mjolsness and DeCoste (2001) define machine

learning as "the study of computer algorithms capable of learning to improve their performance of

a task on the basis of their own previous experience" primarily through pattern recognition and

statistical inference. They see a legitimate future role for it in "every element of scientific method,

from hypothesis generation to model construction to decisive experimentation." Text mining

could help with the "high data volumes" involved in literature searching. However, most work to

date has focused on experimental data reduction such as visualization of high-dimensional vector

data resulting from gene expression microarray studies (see footnote 6).



                                        Natural Language Processing

       Natural language processing (NLP) or understanding (NLU) is the branch of linguistics

which deals with computational models of language. A brief history is given by Bates (1995).

Its motivations are both scientific (to better understand language) and practical (to build

intelligent computer systems). NLP has several levels of analysis: phonological (speech),

morphological (word structure), syntactic (grammar), semantic (meaning of multiword


structures, especially sentences), pragmatic (sentence interpretation), discourse (meaning of

multi-sentence structures), and world (how general knowledge affects language usage) (Allen,

1995). When applied to IR, NLP could in principle combine the computational (Boolean, vector

space, and probabilistic) models' practicality with the cognitive model's willingness to wrestle

with meaning. NLP can differentiate how words are used, for example by sentence parsing and part-

of-speech tagging, and thereby might add discriminatory power to statistical text analysis.

Clearly, NLP could be a powerful tool for text mining. Interest in it for that purpose is

widespread but the jury remains out.

       Rau (1988) described an early NLP system named SCISOR which was developed by

General Electric. Limited applicability to "constrained domains" was emphasized; SCISOR was

programmed to deal only with information on corporate mergers. Input (news stories, etc.) was

described as being converted to "conceptual format" permitting natural language interrogation

(i.e., question answering) and summarization. SCISOR employed a parallel strategy of top-down

(expectation-driven conceptual analysis) and bottom-up (partial linguistic analysis) parsing.

Parsing is the identification of subjects, verbs, objects, phrases, modifiers, etc., within sentences.

Computerized parsing of free text "is an extremely difficult and challenging problem," according

to Rau. The two parsers in SCISOR interacted with a domain-specific knowledge base

containing grammatical and lexical information. The double parsing strategy of SCISOR

allowed flexibility to perform in-depth analysis when complete grammatical and lexical

knowledge is available, and superficial analysis when unknown words and syntax are

encountered, giving the system robustness. The top-down parser could also be used for text

skimming (looking for particular pieces of information).

       However, semantic analysis "is very expensive and furthermore depends on a lot of


domain-dependent knowledge that has to be constructed manually or obtained from other sources"

(IBM, 1998a). Early NLP's image also suffered from the poor performance of phrase-based

indexing in comparison with stemmed single words in the Cranfield and SMART tests (Salton,

1992). Interest in NLP revived when request-oriented (as opposed to document-oriented) IR came

of age and it was realized that the limitations of the linguistic techniques did not prevent them from

being effective within restricted subject domains (Ingwersen and Willett, 1995). Unlike its more

successful sibling field of speech recognition, NLP has the severe disadvantages of diffuse goals

and lack of robust machine learning algorithms (Bates, 1995). There seems to be wide consensus

that NLP is still not competitive with statistical approaches to traditional IR, but that it may be

practical and even critical for applications such as phrase extraction and text summarization. Even

Salton, the godfather of statistical IR, said, "In the absence of deep linguistic analysis methods that

are applicable to unrestricted subject areas, it is not possible to build intellectually satisfactory text

summaries" (Salton, Allan, Buckley, & Singhal, 1994).

        Liz Liddy (2000, 2001) has become a prominent advocate for NLP in text mining. Her

definition of the goal of text mining, in fact, is "capturing semantic information" as tabular

metadata amenable to statistical data mining techniques. In her work, NLP includes stemming

(morphological level), part-of-speech tagging (syntactic level), phrase and proper name

extraction (semantic level), and disambiguation (discourse level). Goals include automating text

mark-up for hypertext linkages in digital libraries, and machine learning algorithms for text

classification (see below).

        A "reverse flow" of purely statistical methods to NLP has been going on since about

1990 and has made "substantial contributions" (Kantor, 2001), increasing interest in hybrid

approaches (Marcus, 1995; Losee, 2001a; Perrin, 2001). Statistical enrichment has been shown


to significantly improve the accuracy of proper name classification, part-of-speech tagging, word

sense disambiguation, and parsing under certain conditions (Marcus, 1995), and tagging and

disambiguation improve probabilistic document retrieval ranking discrimination by some parts of

speech (Losee, 2001a). Ultimately, lexical statistics are a reflection of term dependencies which

in turn reflect natural languages' relation to "naturally occurring dependencies in the physical

world" (Losee, 2001b). However, higher-level NLP proved far inferior to "shallow" tricks like

stemming and query expansion in improving the performance of an advanced IR system under

rigorous test conditions (Perez-Carballo & Strzalkowski, 2000).

       Computational linguistics is used as a synonym for NLP by some writers and as a

narrower term by others. According to Hearst (1999), it is the branch of NLP which deals with

finding statistical patterns in large text collections to inform algorithms for NLP techniques such

as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation; i.e.,

computational linguistics is a form of text mining. Thus, to Hearst and Liddy, text mining

subserves NLP, rather than the reverse. Both Hearst and Liddy refer often to metadata as being

the bridge between NLP and statistics. They both envision text mining as a component of a full-

featured information access system which also includes source detection, content retrieval, and

analytical aids such as text visualization (see below).

       A major problem in text analysis is "dangling anaphors" – pronouns and demonstratives

(this, that, the latter, etc.) which refer back to other sentences (Johnson, Paice, Black, & Neal,

1993). Therefore a good job for NLP would be to detect anaphors and search backwards to

resolve their referent. In the language of logic, this might be called identifying the point in the

text where each significant new proposition begins. In 1993, that was beyond available text

processing capabilities, so the authors had to exclude anaphoric sentences from further analysis


regardless of their information content.
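
A crude version of such an anaphor screen needs little more than a check on how each sentence begins. The sketch below is my own illustration; the marker list is an assumption, not the one Johnson, Paice, Black, and Neal used.

```python
import re

# Words that often signal an unresolved back-reference when they open a sentence.
ANAPHOR_MARKERS = {"it", "this", "that", "these", "those", "they", "such", "the latter"}

def is_anaphoric(sentence):
    """Flag a sentence that starts with a pronoun or demonstrative and so
    probably depends on an earlier sentence for its meaning."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    first_one = tokens[0] if tokens else ""
    first_two = " ".join(tokens[:2])
    return first_one in ANAPHOR_MARKERS or first_two in ANAPHOR_MARKERS

sentences = ["Text mining extracts patterns from large corpora.",
             "This makes it attractive for literature-based discovery.",
             "The latter is Swanson's term."]
print([s for s in sentences if not is_anaphoric(s)])   # keeps only the first sentence
```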

        In summary, all this activity and interest raise hopes, but NLP still "has not delivered the

goods" (Saracevic, 2001) and so the jury remains out.



                                        Text Summarization

        An obvious example of text mining would be to find previously unknown natural

correlations by looking at co-occurrences of themes in a corpus of texts. Before one can do that,

of course, one must identify the themes. A theme being a form of summary, automated theme-

finding is a form of automatic text summarization (or automatic abstracting), a proud old IR

tradition.

        Johnson, Paice, Black, and Neal (1993) trace the history of automatic abstract generation

from Luhn (1958), who proposed extracting sentences based on their computed word content

weights, and Baxendale (1958, cited by Johnson et al, 1993), who drew attention to the

importance of the first and last sentences of paragraphs. Edmundson (1969, cited by Johnson et

al, 1993) found that both of these methods were inferior to extraction on the basis of cues (bonus

words and stigma words). Paice (1981, cited by Johnson et al, 1993) sharpened Edmundson's

idea of cues to "indicator constructs" such as In this paper we show that…

        Johnson et al (1993) built an NLP-based auto-abstracting system which selected non-

anaphoric, indicator-containing sentences and ran them through a bottom-up parser, dictionary-

based part-of-speech tagger (noun, verb, etc.) and morphology-based tagger (-ly = adverb, etc.).

Each word was then indexed by its sentence number, position within the sentence, part of speech,

verb tense if applicable, and whether it was plural or singular. The result was then "cleaned


up" by a set of corrective heuristics and a grammar-based tag disambiguator3. A global parser

then identified noun phrases based on definitive cues such as being separated by a preposition

(e.g., the primary factor in public health), and then parsed the sentence. The resulting sample

abstract was "far from perfect" as the authors admitted, but it was a plausible condensation down

to 22% of the original text size. Since 22% is an inadequate degree of data reduction for most

text summarization needs, the next step might be to take a page from statistical IR and develop

ways of ranking the selected sentences.
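
A toy version of that next step, combining Paice-style indicator constructs with Luhn-style word-frequency weights to rank candidate sentences, might look like the following. It is a sketch under my own assumptions (the cue phrases, weights, and sample passage are invented), not the Johnson et al system.

```python
import re
from collections import Counter

CUES = ("in this paper we show", "we conclude that", "the results suggest")
STOPWORDS = {"the", "of", "and", "in", "a", "we", "that", "to", "is", "this", "for"}

def rank_sentences(text, top_n=2):
    """Score sentences by cue-phrase presence plus the summed corpus frequency
    of their significant words; return the top sentences in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS)
    def score(s):
        words = [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        cue_bonus = 10 if any(c in s.lower() for c in CUES) else 0
        return cue_bonus + sum(freq[w] for w in words)
    top = sorted(sentences, key=score, reverse=True)[:top_n]
    return [s for s in sentences if s in top]

sample = ("Text mining is applied to MEDLINE abstracts. "
          "Parsing unrestricted text remains difficult. "
          "In this paper we show that cue phrases improve sentence selection from text.")
print(rank_sentences(sample))
```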



                                              Template Mining

        SCISOR's (Rau, 1988) text summarization capabilities were based on filling in values

specified by domain-dependent, manually formulated "scripts" – e.g., company A offered B

dollars per share in a takeover bid for company C on date D. The values were extracted from

raw text by parsing and stored in relational data tables. Then summaries of the parsed data

values could be written by a natural language generator. This seems to be a form of template

mining, where the script or metadata table field structure constitutes the template.
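
A regular-expression template in this spirit, filling the slots of Rau's hypothetical merger script, could be sketched as follows; the pattern, field names, and example sentence are my own illustrative assumptions.

```python
import re

# Hypothetical template: company A offered B dollars per share in a takeover bid for company C on date D.
TAKEOVER = re.compile(
    r"(?P<bidder>[A-Z][\w&.]*(?: [A-Z][\w&.]*)*) offered "
    r"\$(?P<price_per_share>[\d.,]+) per share "
    r"in a takeover bid for (?P<target>[A-Z][\w&.]*(?: [A-Z][\w&.]*)*) "
    r"on (?P<date>\w+ \d{1,2}, \d{4})")

def fill_template(sentence):
    """Return a structured record (dict of slot values) if the sentence matches."""
    match = TAKEOVER.search(sentence)
    return match.groupdict() if match else None

print(fill_template("Acme Corp offered $42.50 per share in a takeover bid for "
                    "Widget Industries on March 3, 1988."))
```

When the text matches, the captured groups become a row in a relational table, which is the move that makes the extracted values available to a natural language generator or to downstream data mining.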

        Chowdhury (1999) describes template mining as a form of information extraction using

NLP "to extract data directly from the text if either the data and/or text surrounding the data form

recognizable patterns. When text matches a template, the system extracts data according to the

instructions associated with that template." Chowdhury traces its history from the mid-1960s

Linguistic String Project at New York University, where "fact retrieval" was conducted against

template data mined from natural language text, up to its current (1999) use in the AltaVista and



Ask Jeeves web search engines. He cites some of the same work I reviewed under NLP and

below (the Rau, Paice, and Gaizauskas groups), perhaps implying that template mining is a

general term for NLP-based metadata approaches to text mining. He also cites Croft (1995) in

reference to the U.S. Advanced Research Projects Agency (ARPA) initiative in this area, the

Message Understanding Conferences (MUCs).

Footnote 3: An example of a sentence with intractable tag ambiguity would be Rice flies like sand, which could refer to the behavior of grain or insects (Allen, 1995, p. 13). Such a sentence would require higher (pragmatic and discourse) levels of analysis to disambiguate.

       To facilitate template mining, Chowdhury recommends "standardization in the

presentation and layout of information within digital documents" through the use of templates for

document creation. But this is contrary to the spirit of text mining, which is to liberate both the

creators and the users of text from as much tedium and artificiality as possible. Like Kostoff's

unrestricted reliance on human filters, it represents a form of surrender in the face of difficulty –

hopefully premature!



                                           Theme Finding

       Salton, Allan, Buckley, and Singhal (1994) looked at how traditional IR models can be

applied to theme generation and text summarization. The authors derived the notion of passage

retrieval from the problem of ranking vector matches when the vectors are of different lengths,

e.g. very short queries against long documents, or clustering documents of different sizes. One

solution is to decompose the documents into subunits of roughly equal size, called "passages." A

common passage unit is a paragraph.

       The passages may be converted to normalized vectors and compared. Those with

similarities above a certain threshold (which may be chosen to deliver a desired degree of

abstraction) are considered connected. If the documents are plotted as arcs on the circumference

of a circle and their component passages connected by straight lines in accordance with their


vector similarities, the resulting starburst pattern can convey themes within and between

documents. These themes can be focused by expressing each triangle of passage similarities

as a centroid and doing similarity calculations on the centroids.

        One may want to compute an estimate of the "most important" passages for the purpose

of selective text traversal ("skimming") or text summarization. Such passages might be

identified as (a) having a large number of above-threshold similarity connections, (b) strategic

position (e.g., the first paragraph in each section), or (c) high similarity to some reference node.

The last criterion (c) is called "depth first" selection. In practice, all three of these criteria can be

combined; e.g., start with some desired passage (as in "more like this"), go to the most similar

sectional heading passage, then go to its strongest link, then select the other densely connected

nodes in that cluster in chronological order. For text summarization, repetition can be edited out

on the basis of similarities between sentences or other subunits which are "too high."
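
The passage-linking step can be sketched with standard vector-space machinery. The fragment below uses scikit-learn's TF-IDF vectorizer as a stand-in for Salton's term weighting; the similarity threshold and the toy paragraphs are my own assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link_passages(passages, threshold=0.1):
    """Connect passages whose tf-idf cosine similarity exceeds a threshold and
    rank passages by their number of connections (one notion of importance)."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(passages)
    sims = cosine_similarity(vectors)
    n = len(passages)
    links = [(i, j, round(float(sims[i, j]), 3))
             for i in range(n) for j in range(i + 1, n) if sims[i, j] >= threshold]
    degree = [sum(1 for i, j, _ in links if p in (i, j)) for p in range(n)]
    ranking = sorted(range(n), key=lambda p: degree[p], reverse=True)
    return links, ranking

paragraphs = ["Fish oil lowers blood viscosity and platelet aggregation.",
              "Raynaud's syndrome involves abnormally high blood viscosity.",
              "Automatic abstracting ranks sentences by word frequency."]
links, ranking = link_passages(paragraphs)
print(links)     # above-threshold passage pairs and their similarities
print(ranking)   # passage indices ordered by number of connections
```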



                                          Text Categorization

        Text categorization should not be considered a form of text mining because it is a

"boiling down" of document content to "pre-defined labels" which "does not lead to discovery of

new information" since "presumably the person who wrote the document knew what it was

about," according to Hearst (1999). Presumably she would also rule out text summarization and

auto-indexing for the same reason. She makes exceptions, however, for cases where the goal of

categorization is to find "unexpected patterns" or "new events" because these "tell us something

about the world, outside of the text collection itself" and therefore qualify as new information.

        I would argue, however, that it is not so easy to predict where "new information" will

come from, that novelty is in the eye of the beholder, and that any form of text data reduction is a


form of separating "precious nuggets" from "worthless rock" according to the human

idiosyncrasies of whoever is doing the separating, be it a traditional library cataloguer/indexer or

a vector space modeler. This is not to say that cataloguing, indexing, and other IR tools are all

text mining, but just to highlight the fuzziness of the boundaries between them.



                                              Clustering

       Clustering can be used to classify texts or passages in natural categories that arise from

statistical, lexical, and semantic analysis rather than the arbitrarily pre-determined categories of

traditional manual indexing systems. In the context of text mining, it is the derivation of the

categories which is of interest, since this is a form of theme finding and therefore text

summarization. Once the texts are clustered on the basis of common themes, it may also be useful

to correlate their divergent themes, a la Swanson. Texts may also be clustered on the basis of

length, cost, date, etc. (IBM, 1998b), or bibliographic data such as author, institution, or country of

origin (Kostoff, 1999). Computational aspects of clustering are reviewed by Witten and Frank

(2000, Section 6.6).
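
A minimal clustering-plus-labeling pass, in which the categories emerge from the term statistics rather than being predefined, might look like the sketch below. It assumes a reasonably recent scikit-learn; the number of clusters and the toy documents are my own choices.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_label(texts, n_clusters=2, n_label_terms=3):
    """Cluster documents on tf-idf vectors and label each cluster with the
    highest-weighted terms of its centroid, a crude form of theme finding."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()
    labels = [[terms[i] for i in center.argsort()[::-1][:n_label_terms]]
              for center in km.cluster_centers_]
    return km.labels_, labels

docs = ["Magnesium deficiency and migraine attacks",
        "Migraine frequency falls with magnesium supplements",
        "Fish oil improves blood viscosity in Raynaud's syndrome",
        "Blood viscosity and platelet aggregation respond to fish oil"]
assignments, themes = cluster_and_label(docs)
print(assignments)   # cluster index assigned to each document
print(themes)        # top centroid terms characterizing each cluster
```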



                                                   Filtering

       E-mail filtering is often mentioned as an example of text mining (e.g., Witten and Frank,

2000). The relevance of related techniques such as name recognition, theme finding, and text

categorization is obvious, and it is even possible to imagine software which modifies its own

filtering criteria by discovering new patterns in the whole e-mail stream. However, I was unable

to find reports of any actual work on such a system.

       Belkin and Croft (1992) built a model of information filtering (IF) based on Belkin's


famous anomalous states of knowledge (ASK) model of IR. In a side-by-side comparison, the

two (IF and IR) appear strikingly similar, the biggest difference being the "stable, long-term…

regular information interests" of IF compared to the "periodic… information need or ASK" of

IR. Extending the side-by-side modeling to Bayesian inference networks, the authors arrive at

another striking comparison: the IF network looks exactly like an upside-down IR network! That

is, in IR multiple documents are percolating down to a single user, while in IF each single

incoming document is percolating down to multiple users. However, the authors reject this

analogy for reasons not entirely clear to me.4



                                              Text Visualization

        Text visualization shares text mining's goals of using computational transformations to

reduce the cognitive effort of dealing with large text corpora, highlight patterns across

documents, and help discover new knowledge. Text mining implies homing in on "precious

nuggets" whereas text visualization seems to be concerned with the "big picture," but in practice

both may be regarded as elements of a holistic approach to multi-text corpora. The text mining

systems of Hearst, Kostoff, and Liddy all have explicit text visualization components.

        Wise (1999) developed a text visualization paradigm for intelligence analysis named

Spatial Paradigm for Information Retrieval and Exploration (SPIRE) "to find a means of

‘visualizing text’ in order to reduce information processing load and to improve productivity" by

representing large numbers of documents to permit "rapid retrieval, categorization, abstraction,

and comparison, without the requirement to read them all." The theory behind SPIRE was that



humans’ most highly evolved perceptual abilities are those involved in interpreting "visual

features of the natural world."

Footnote 4: They seem to feel that "P(oj|pi)", the probability that the incoming document will satisfy the information need given a user's filtering profile, is poorly understood compared to the conventional Bayesian need-query-document relationships, but I'm not sure the latter are so well-understood, either.

Therefore the goal was to represent text as natural, ecological

images from our early hominid past which require no "prolonged training to appreciate and use"

such as star fields or landscapes (Figure 1). This transformation was accomplished using

standard vector space algorithms and involves clustering and text summarization. SPIRE is an

excellent example of how a cognitive theory can be helpful in inspiring IR innovation and

guiding system development, despite its apparent lack of commercial success.5
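
The underlying transformation, from high-dimensional term vectors to a two-dimensional "star field," can be sketched with ordinary vector-space tools. This is my own minimal illustration using scikit-learn, not the SPIRE implementation, and the toy documents are assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def starfield_coordinates(texts):
    """Project tf-idf document vectors onto two dimensions so that lexically
    similar documents land near each other in a plotted 'star field'."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

docs = ["Magnesium and migraine prophylaxis",
        "Migraine treatment with magnesium supplements",
        "Fish oil and Raynaud's syndrome",
        "Platelet aggregation responds to fish oil"]
for doc, (x, y) in zip(docs, starfield_coordinates(docs)):
    print(f"({x:+.2f}, {y:+.2f})  {doc}")
```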



                                            Text Compression

        As mentioned at the beginning, I started this paper by trying to narrow the definition and

scope of text mining by differentiating it from other nontraditional IR strategies (Table 1). One

by one, however, the other strategies refused to be cleanly differentiated, and the foregoing

polyglot review is the result. The only concept I thought I had succeeded in banishing from the

scope of text mining was data compression, which showed up in the title of a single citation in a

literature search performed for me by Melissa Yonteck. Data compression, a la PKZIP, was

surely not related in any meaningful way to text mining, Yonteck and I agreed. Here at last was

something I could confidently rule out.

        But on page 334, Witten and Frank (2000), in discussing statistical character-based

models for token classification (names, dates, money amounts, etc.), note that "there is a close

connection with prediction and compression: the number of bits required to compress an item

with respect to a model can be interpreted as the negative logarithm of the probability with which

that item is produced by the model." That is, text compression algorithms might function as


token classifiers in reverse! So I give up. Text mining appears to be related to just about

everything on my original list.

Footnote 5: Cartia, Inc., which was marketing the ThemeScape™ software (Figure 2, downloaded Fall 2000), no longer has any detectable presence on the Web.
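
The Witten and Frank observation can be made concrete: give each token class its own character-level model and assign a new token to whichever model would encode it in the fewest bits, that is, with the highest probability. The sketch below is my own toy illustration using Laplace-smoothed character bigrams; the classes, training strings, and alphabet are assumptions.

```python
import math
from collections import Counter

class CharBigramModel:
    """Laplace-smoothed character bigram model. cost(s) approximates the bits
    an ideal coder driven by this model would need to encode the string s."""
    def __init__(self, examples, alphabet):
        self.alphabet_size = len(alphabet)
        self.bigrams, self.context = Counter(), Counter()
        for ex in examples:
            padded = "^" + ex
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.context[a] += 1

    def cost(self, s):
        bits = 0.0
        padded = "^" + s
        for a, b in zip(padded, padded[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.context[a] + self.alphabet_size)
            bits += -math.log2(p)          # bits = -log2(probability)
        return bits

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789$,./ "
models = {
    "date":  CharBigramModel(["12/05/1999", "03/22/2001", "7/4/1998"], ALPHABET),
    "money": CharBigramModel(["$9.10", "$4.50", "$1,200.00"], ALPHABET),
    "word":  CharBigramModel(["mining", "text", "retrieval", "cluster"], ALPHABET),
}

def classify(token):
    """Assign the class whose model 'compresses' the token into the fewest bits."""
    return min(models, key=lambda name: models[name].cost(token.lower()))

print(classify("11/30/2002"), classify("$9.99"), classify("corpus"))
```

On these toy models, each test token should gravitate toward the class whose training strings it most resembles, which is the "token classification in reverse" idea in miniature.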



                                      Biomedical Applications

       My interest in text mining is motivated primarily by the belief that it can be fruitfully

applied to biomedical literature, specifically the MEDLINE database, to discover new knowledge.

I see text analysis as a major new frontier in bioinformatics, whose smashing success in the area of

gene sequence analysis is based, after all, on nothing more than algorithms for finding and

comparing patterns in the four-letter language of DNA. Swanson's work has focused on

MEDLINE, and Hearst (1999) has also declared a research interest in "automating the discovery of

the function of newly sequenced genes" by determining which novel genes are "co-expressed with

already understood genes which are known to be involved in disease."

       Humphreys, Demetriou, and Gaizauskas (2000) used information extraction, defined as

"extracting information about predefined classes of entities and relationships from natural

language texts and placing this information into a structured representation called a template" [is it

therefore template mining?], to build a database of information about enzymes, metabolic

pathways, and protein structure from full text biomedical research articles. The LaSIE (Large

Scale Information Extraction) system includes modules for datatype recognition (names, dates,

etc.), co-reference resolution (pronouns, anaphors, metonyms, etc.), and different types of template

filling. It does linguistic analysis at all levels up to discourse using lexical knowledge,

morphology, and grammars to identify significant words. The enzyme and metabolic pathway

variant of LaSIE is called (of course) EMPathIE and fills the following template fields: enzyme

name, EC (Enzyme Commission) number, organism, pathway, compounds involved and their roles


(substrate, product, cofactor, etc.), and, interestingly, compounds not involved. Optional fields

include concentration and temperature. The PASTA variant deals with protein structure

information such as which amino acid residues occupy given positions, active and binding sites,

secondary structure, subunits, interactions with other molecules, source organism, and SCOP

category. The prototype has been tested on only six journal papers, so it is far from satisfying the

large text corpus requirement for true text mining, but the authors make no such claim.

        The U.S. National Institutes of Health (NIH) have also gotten involved. Tanabe, Scherf,

Smith, Lee, Hunter, and Weinstein (1999) developed a system named MedMiner to help them sort

out the thousands of gene expression correlations resulting from microarray experiments6 to

separate "interesting biological stories" from mere epiphenomena and statistical coincidences. The

first module gathers the relevant texts by querying PubMed (MEDLINE) and GeneCards (an

Israeli gene information database) on the expressed genes. [Gene names generally make good

search words because they are different from normal English words, e.g. "JAK3".] The second

module filters the retrieved texts by user-specifiable relevance criteria based on classical proximity

or term frequency scores (NLP criteria being regarded as too computationally expensive). The

third module is a "carefully designed user interface" to facilitate access to the most likely-to-be-

interesting documents.

        Despite the name, then, MedMiner is not a true text mining system, but rather a search and

display enhancement to PubMed (which offers only flat Boolean search logic, unranked retrieval,

and no integration with GeneCards, although it is integrated with other gene and protein

databases). Like Kostoff's system, it is designed to deal with highly technical information by

assisting expert users in their traditional IR tasks rather than attempting to automate them
completely. MedMiner is freely available online at http://discover.nci.nih.gov.

Footnote 6: Basically, a square chip coated with an array of known DNA sequences at known locations on the chip is dipped into a broth containing the expressed messenger RNA (mRNA) from cells under given conditions. The mRNA is labeled so that when it binds to its complementary DNA on the chip the gene expression pattern is revealed. Gifford (2001) briefly reviewed the direct application of data visualization to gene expression data not involving any text.

        Another NIH group, Rindflesch, Hunter, and Aronson (1999), developed a true NLP

system named ARBITER for mining molecular binding terms from MEDLINE. ARBITER

attempts to identify noun phrases representing molecular entities such as drugs, receptors,

enzymes, toxins, genes, messenger molecules, etc., and their structural features (box, chain,

sequence, subunit, etc.) likely to be involved in binding. ARBITER makes use of MeSH indexing,

the lexical and semantic knowledge bases of the Unified Medical Language System (UMLS) and

GenBank, co-word adjacency to forms of bind, and a variety of linguistic strategies to deal with

acronyms, anaphors, modifiers, coordinated phrases, and nested phrases (e.g., "…a previously

unrecognized coiled-coil domain within the C terminus of the PKD1 gene product, polycystin, and

demonstrate…"). A test on a small sample (116 abstracts containing a form of bind, one month's

worth from MEDLINE) yielded 72% recall and 79% precision of manually marked binding terms.

While terminology extraction might be considered a fairly trivial form of text mining, it is

obviously a logical step toward the mining of binding relationships (A binds B) which would have

enormous potential for knowledge discovery.
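
A crude approximation of the adjacency heuristic, which simply pulls out the words within a few tokens of a form of bind, can be sketched as follows; this is my own illustration, not ARBITER, and the window size and example sentence are assumptions.

```python
import re

BIND_FORMS = re.compile(r"^(bind|binds|binding|bound)\W*$", re.IGNORECASE)

def binding_contexts(sentence, window=4):
    """Return the word spans immediately to the left and right of each form
    of 'bind', as raw material for candidate binding-term extraction."""
    tokens = sentence.split()
    spans = []
    for i, tok in enumerate(tokens):
        if BIND_FORMS.match(tok):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            spans.append((left, tok, right))
    return spans

print(binding_contexts(
    "The coiled-coil domain of polycystin binds to the PKD1 gene product."))
```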

       Stapley and Benoit (2000) developed a system named “BioBiblioMetrics” (Stapley,

2000) which uses text visualization to suggest functional clusters of genes from the yeast

Saccharomyces cerevisiae. The system uses a subset of MEDLINE records containing the

yeast's name, a lexical knowledge base of all the known, nontrivial yeast genes and their aliases

from the SGD (Saccharomyces Gene Database), and a matrix of gene name pair co-occurrence

statistics. When one does a search on a gene name or function (e.g. "DNA replication"), the co-

occurring genes are displayed in a graph with “nodes” representing genes and edge lengths

between the nodes representing biological proximity (Figure 2). Nodes are hypertext-linked to


sequence databases, and edges to those MEDLINE documents that generated them, creating a

biomedical information “landscape” and inference network. BioBiblioMetrics is freely available

online at http://www.bmm.icnet.uk/~stapleyb/biobib/.
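
The co-occurrence matrix at the heart of this approach is straightforward to sketch; the fragment below is my own toy illustration, not Stapley and Benoit's code, and the abstracts, gene list, and threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

def gene_cooccurrence_edges(abstracts, gene_names, min_count=1):
    """Count how often pairs of known gene names appear in the same abstract;
    pairs above the threshold become edges in a gene association graph."""
    edges = Counter()
    for text in abstracts:
        lowered = text.lower()
        present = sorted({g for g in gene_names if g.lower() in lowered})
        for pair in combinations(present, 2):
            edges[pair] += 1
    return {pair: n for pair, n in edges.items() if n >= min_count}

abstracts = ["CDC28 and CLN3 regulate progression through Start.",
             "CLN3 activity depends on CDC28 kinase function.",
             "POL1 is required for DNA replication."]
genes = ["CDC28", "CLN3", "POL1"]
print(gene_cooccurrence_edges(abstracts, genes))   # e.g. {('CDC28', 'CLN3'): 2}
```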

       Other MEDLINE text mining papers which I did not have a chance to review in full

involve dictionary-controlled natural language processing for extraction of drug-gene relationships

(Rindflesch, Tanabe, Weinstein, & Hunter, 2000); statistical term strength analysis (Wilbur &

Yang, 1996); statistical text classification and a relational machine-learning method (Craven &

Kumlien, 1999); statistical identification of key phrases against an evolutionary protein family

background (Andrade & Valencia, 1997 & 1998); pre-specified protein names and a limited set of

action verbs (Blaschke, Andrade, Ouzounis, & Valencia, 1999); and a proprietary information

extraction system (Thomas, Milward, Ouzounis, Pulman, & Carroll, 2000). Futrelle (2001a)

provides online full-text access to many biomedical text mining papers, including those from the

hard-to-get 2000 and 2001 Pacific Symposia on Biocomputing.

       Bob Futrelle (2001a,b) has organized a large "bio-NLP" information network and

enunciated a radical vision which includes several of the themes of this paper, such as the

analogy between text and genome analysis, and the long history of information extraction in its

many guises. He sees the challenge as "understanding the nature of biological text, whatever that

turns out to be, linguistic theories notwithstanding." He seems to feel that the traditional rules

and grammars of Chomskian linguistics are more hindrance than help.

       Frankly, a fresh new approach is needed, fueled by the conviction that language is a
       biological phenomenon, not a logical phenomenon. By this we mean that the nature of
       language is as messy as the genome. The data and observed phenomena in all their richness
       and variety are dominant and cannot be subsumed by any elegant theories. This means that in
       many ways, biologists have far better hopes of cracking the NLP problem than the
       computational linguists, who are focused on mathematics and logic. Even when they look
       at data, it is primarily as grist for their math mills.


Futrelle recommends, for example, building visualization tools such as a protein noun phrase

highlighter which could be used to "assemble a large collection of the standard textual

expression forms [and] map these onto the query forms for which they are the answers."

       But Futrelle also goes beyond immediate practical needs. Like Wise (1999), he has a

coherent theory based on the biological nature of language.

       By this I mean that language is a communicative capability of living organisms that has
       evolved from deep biological roots and from social interactions over millions, and
       ultimately, billions of years. I claim that language is not logical and mathematical,
       because that's not the nature of the organism (us) that exhibits the language capability.
       An example of this is found in our vocabularies. A technically skilled adult will have a
       vocabulary of over 100,000 words, basically all memorized. The meaning of "bear" or
       "ship" does not follow from the characters that make them up. We simply commit them
       to memory. Linguists would like us to believe that our natural ability to "parse" is
       radically different and can be explained as a rule-based system.

       My radical view is that we understand language not by generalization to abstract rules as
       much as by retaining examples and generalizing from them as needed. This is quite
       within our capacity, given our 100,000 word vocabularies. We also do reason. I would
       claim, again in the biological view, that this is done more by "imagined life" than by
       logic. Humans have superb abilities to remember events and to build detailed mental
       plans for future activities …. So we need to build this type of reasoning into our systems.

       The analogy to genomics is clear. The coding of a particular protein by a particular

sequence of DNA bases is just an accident of evolution. Whatever rules now appear to prevail

(such as "zinc fingers" for DNA-binding proteins) can only be derived empirically, by looking

for patterns within the data. Purely logical approaches must wait for a richer knowledge base.

Only now, after the massive effort of half a century of molecular genetic research, sequencing

whole genomes, and building databases and tools such as GenBank, GeneCards, and Proteome,

can we begin to think about prediction of protein structure and function from sequence data

alone. Biological linguistics now stands at the beginning of a comparably arduous journey.
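
       On the genomic side of the analogy, "deriving rules empirically" ends up looking something
like the sketch below: one classical zinc finger signature (the C2H2 type), written as a regular
expression roughly equivalent to its PROSITE pattern, is matched against a protein sequence. The
sequence fragment is invented; the pattern itself was distilled from many observed sequences, not
deduced from first principles.

import re

# The C2H2 zinc finger signature, as a regular expression roughly equivalent
# to the PROSITE pattern C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
ZINC_FINGER = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

def find_zinc_fingers(sequence):
    """Return (start, end) spans of candidate zinc finger motifs in a protein sequence."""
    return [m.span() for m in ZINC_FINGER.finditer(sequence)]

# Invented fragment containing one motif-like stretch.
print(find_zinc_fingers("MKETAYCPECGKSFSQSSNLQKHQRTHTGEKP"))
# -> [(6, 27)]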

       These considerations put Swanson's, Kostoff's, Tanabe's, and Chowdhury's reliance on

human expertise and manual filtering in a better light. Perhaps they do not represent premature

surrender to difficulty so much as a necessary but hopefully temporary expedient. Perhaps they

are keeping "the human in the loop" (Kantor) only long enough to "study the human to learn

what to put in the machine" (Saracevic, 2001). This surprising interface between biomedical text

mining and the cognitive tradition in IR would make a worthy topic for another paper.

                                            References

       Allen, J. (1995). Natural Language Understanding, Second Edition. Redwood City, CA:

Benjamin/Cummings.

       Andrade, M. A., & Valencia A. (1997). Automatic annotation for biological sequences

by extraction of keywords from MEDLINE abstracts. Development of a prototype system.

Proceedings of the international conference on intelligent systems for molecular biology 5:25-32.

       Andrade, M. A., & Valencia, A. (1998). Automatic extraction of keywords from

scientific text: application to the knowledge domain of protein families. Bioinformatics, 14,

600-607.

       Bates, M. (1995). Models of natural language understanding. Proceedings of the

National Academy of Sciences, 92, 9977-9982.

       Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval:

Two sides of the same coin? Communications of the ACM, 35, 29-38.

       Blaschke, C., Andrade, M. A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction

of biological information from scientific text: protein-protein interactions. Proceedings of the

International Conference on Intelligent Systems for Molecular Biology, pp. 60-67.

       Bush, V. (1945). As We May Think. Atlantic Monthly, 176 (11), 101-108.

       Cartia, Inc. (2000). ThemeScape product suite. Formerly online: http://www.cartia.com/

products/index.html [no longer accessible].

       Chowdhury, G. G. (1999). Template mining for information extraction from digital

documents. Library Trends, 48, 182-208.

       Craven, M., & Kumlien, J. (1999). Constructing biological knowledge bases by

extracting information from text sources. Proceedings of the International Conference on

Intelligent Systems for Molecular Biology, pp. 77-86.

       Dorre, J., Gerstl, P., & Seiffert, R. (1999). Text mining: Finding nuggets in mountains of

textual data. KDD-99, Association for Computing Machinery.

       Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the

Association for Computing Machinery, 8, 223-239.

       Fan, W. (2001). Text mining, web mining, information retrieval and extraction from the

WWW references. Online: http://www-personal.umich.edu/~wfan/text_mining.html

       Futrelle, R. P. (2001a). Natural language processing of biology texts. Online:

http://www.ccs.neu.edu/home/futrelle/bionlp/

       Futrelle, R. P. (2001b). The past, present and future of biology text understanding.

Presented at the Conference on Biological Research with Information Extraction (BRIE), Tivoli

Gardens, Copenhagen, Denmark, July 26. Online:

http://www.ccs.neu.edu/home/futrelle/brie2001/index.html

       Gifford, D. K. (2001). Blazing pathways through genetic mountains. Science, 293,

2049-2051.

       Greenfield, L. (2001). Text mining. Online: http://www.dwinfocenter.org/docum.html

       Hearst, M. (1997). Distinguishing between web data mining and information access.

Presentation for the Panel on Web Data Mining, KDD 97, August 16, Newport Beach, CA.

Online: http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm

       Hearst, M. (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th

Annual Meeting of the Association for Computational Linguistics, University of Maryland, June

20-26, 1999 (invited paper). Online: http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-

tdm.html

          Hearst, M. (2001). About TextTiling. Online:

http://www.sims.berkeley.edu/~hearst/tiling-about.html

          Humphreys, K., Demetriou, G., & Gaizauskas, R. (2000). Bioinformatics applications of

information extraction for scientific journal articles. Journal of Information Science, 26, 75-85.

          IBM (1998a). Text analysis tools. Slide #8 of Intelligent Miner for Text Overview.

Online:

http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23over/im4t23over8.htm

          IBM (1998b). Text mining technology: Turning information into knowledge: A white

paper from IBM. Daniel Tkach (Ed.). Online:

http://www-4.ibm.com/software/data/iminer/fortext/download/whiteweb.pdf

          Ingwersen, P., & Willett, P. (1995). An introduction to algorithmic and cognitive

approaches for information retrieval. Libri, 45, 160-177.

          Johnson, F. C., Paice, C. D., Black, W. J., & Neal, A. P. (1993). The application of

linguistic processing to automatic abstract generation. Journal of Document and Text

Management, 1, 215-241.

          Kantor, P. B. (2001). Lecture K: Natural language concepts. Information Retrieval class,

Rutgers University, School of Communication, Information, and Library Studies, New

Brunswick, NJ.

          Kostoff, R. N. (1999). Science and technology innovation. Technovation, 19. Online:

http://www.dtic.mil/dtic/kostoff/Swanson2.txt

          Kostoff, R. N., & DeMarco, R. A. (2001). Information extraction from scientific

literature with text mining. Analytical Chemistry (in press). Online:

http://www.onr.navy.mil/sci_tech/special/technowatch/kdocs/anchem2/txt

       Kostoff, R. N., del Rio, J. A., Humenik, J. A., Garcia, E. O., & Ramirez, A. M. (2001).

Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of

the American Society for Information Science, 52, 1148-1156.

       Kostoff, R. N., Toothman, D. R., Eberhart, H. J., & Humenik, J. A. (2000). Text mining

using database tomography and bibliometrics: A review. Online:

http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm

       KRDL (2001). Text mining: transforming raw text into actionable knowledge (white

paper). Kent Ridge Digital Labs. Online: http://textmining.krdl.org.sg/

       Laender, A. H. F., Ribeiro-Neto, B., da Silva, A. S., & Teixeira, J. S. (2001). A brief

survey of web data extraction tools. In press.

       Liddy, E. D. (2000). Text mining. Bulletin of the American Society for Information

Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/liddy.html

       Liddy, E. D. (2001). Data mining, meta-data, and digital libraries. DIMACS Workshop

on Data Analysis and Digital Libraries, May 17, Center for Discrete Mathematics and

Theoretical Computer Science, Rutgers University, New Brunswick, NJ.

       Lindsay, R. K., & Gordon, M. D. (1999). Literature-based discovery by lexical statistics.

Journal of the American Society for Information Science, 50, 574-587.

       Losee, R. M. (2001a). Natural language processing in support of decision-making:

phrases and part-of-speech tagging. Information Processing and Management, 37, 769-787.

       Losee, R. M. (2001b). Term dependence: A basis for Luhn and Zipf models. Journal of

the American Society for Information Science, 52, 1019-1025.

       Losiewicz, P., Oard, D. W., & Kostoff, R. N. (2000). Textual data mining to support

science and technology management. Online:

http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm

       Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of

Research and Development, 2, 159-165.

       Marcus, M. (1995). New trends in natural language processing: Statistical natural

language processing. Proceedings of the National Academy of Sciences, 92, 10052-10059.

       Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State of the art and

future prospects. Science, 293, 2051-2055.

       Perez-Carballo, J., & Strzalkowski, T. (2000). Natural language information retrieval:

Progress report. Information Processing and Management, 37, 155-178.

       Perrin, P. (2001). Personal communication, Molecular Systems research group, Merck &

Co., Inc., Rahway, NJ.

       Qin, J. (2000). Working with data: Discovering knowledge through mining and analysis.

Bulletin of the American Society for Information Science, 27. Online:

http://www.asis.org/Bulletin/Oct-00/qin.html

       Rau, L. F. (1988). Conceptual information extraction and retrieval from natural language

input. In RIAO 88, pp. 424-437. Paris: Centre des Hautes Etudes Internationales d'Informatique

Documentaire, 1997, General Electric, USA.

       Rindflesch, T. C., Hunter, L., & Aronson, A. R. (1999). Mining molecular binding

terminology from biomedical text. Proceedings of the American Medical Informatics

Association Symposium, 1999, 127-131. Online:

http://www.amia.org/pubs/symposia/D005564.PDF

       Rindflesch, T. C., Tanabe, L., Weinstein, J. N., & Hunter, L. (2000). EDGAR: extraction

of drugs, genes and relations from the biomedical literature. Pacific Symposium on

Biocomputing, 2000, 517-528.

       Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper

Saddle River, NJ: Prentice Hall.

       Salton, G. (1992). The state of retrieval systems evaluation. Information Processing and

Management, 28, 441-449.

       Salton, G., Allan, J., Buckley, C., & Singhal, A. (1994). Automatic analysis, theme

generation, and summarization of machine-readable texts. Science, 264, 1421-1426.

       Saracevic, T. (2001). Personal communication and class discussions, Seminar in

Information Studies, Rutgers University, School of Communication, Information and Library

Studies, New Brunswick, NJ.

       SDM (2001). Text mining 2002 [workshop prospectus]. Second SIAM International

Conference on Data Mining, Arlington, VA, April 13, 2002. Online:

http://www.cs.utk.edu/tmw02/

       Sneiderman, C. A., Rindflesch, T. C., & Aronson, A. R. (1996). Finding the findings:

identification of findings in medical literature using restricted natural language processing.

Proceedings of the American Medical Informatics Association Annual Fall Symposium, 1996,

239-243.

       Stapley, B. J. (2000). BioBiblioMetrics [On-line]. Available: http://www.bmm.icnet.uk/

~stapleyb/biobib/

       Stapley, B. J., & Benoit, G. (2000). Biobibliometrics: information retrieval and

visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on

Biocomputing, 2000, 529-540.

       Swanson, D. R. (1988). Historical note: Information retrieval and the future of an

illusion. Journal of the American Society for Information Science, 39, 92-98.

       Swanson, D. R., & Smalheiser, N. R. (1997). An interactive system for finding

complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91,

183-203.

       Swanson, D. R., & Smalheiser, N. R. (1999). Implicit text linkages between Medline

records: Using Arrowsmith as an aid to scientific discovery. Library Trends, 48, 48-51.

       Swanson, D. R., Smalheiser, N. R., & Bookstein, A. (2001). Information discovery from

complementary literatures: Categorizing viruses as potential weapons. Journal of the American

Society for Information Science and Technology, 52, 797-812.

       Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L., & Weinstein, J. N. (1999).

MedMiner: An Internet text-mining tool for biomedical information, with application to gene

expression profiling. BioTechniques, 27, 1210-1217.

       Thomas, J., Milward, D., Ouzounis, C., Pulman, S., & Carroll, M. (2000). Automatic

extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing,

2000, 541-552.

       Wilbur, W. J., & Yang, Y. (1996). An analysis of statistical term strength and its use in

the indexing and retrieval of molecular biology texts. Computers in Biology and Medicine,

26, 209-222.

       Wise, J. A. (1999). The ecological approach to text visualization. Journal of the

American Society for Information Science, 50, 1224-1233.

       Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and

Techniques with Java Implementations. San Francisco: Morgan Kaufmann (Academic Press).

Table 1.

Initial List of Information Retrieval (IR) Concepts Related to Text Mining.



IR concept                           Authority (see References)

Artificial intelligence              Fan; Perrin

Bioinformatics                       Futrelle; Perrin

Citation mining                      Kostoff

Computational Linguistics            Fan; Hearst

Conceptual Graphs                    KRDL

Data Abstraction                     Fan

Data Mining                          Fan; Perrin; SDM

Database Tomography                  Kostoff

Document Mining                      Fan

Domain Knowledge                     KRDL

Electronic Commerce                  Fan

Factor Analysis                      SDM

Information Access                   Hearst

Information Extraction               Chowdhury; Fan; Futrelle; Kostoff; Perrin

Information filtering                Fan

Information Integration              Fan

Information Retrieval                Fan; Perrin

Information Visualization/Mapping Futrelle; Fan; SDM

Intelligent Agents ("bots")          Fan

Knowledge Discovery                  Fan

Knowledge Extraction                 Perrin

Knowledge Representation             Perrin

Language Identification              IBM

Machine Learning                     Fan; Futrelle; Perrin

Metadata Generation                  SDM

Natural language processing          Fan; Futrelle; Perrin; Rindflesch; Saracevic

Ontologies/Vocabularies/Lexicons     Futrelle

Phrase Extraction                    Fan

Question Answering                   Futrelle

Resource Discovery                   Fan

Resource Indexing                    Fan

Semantic Modeling                    Perrin; SDM

Semantic Processing                  Rindflesch

Statistical Language Modeling        Fan

Stemming                             SDM

Syntactic Processing                 Saracevic

Template Mining                      Chowdhury; KRDL

Text Analysis                        Futrelle; IBM

Text Classification/Categorization   Fan; Hearst (distinct); IBM; SDM

Text Clustering                      Fan; IBM

Text Data Mining                     Hearst; Kostoff

Text Parsing                         SDM

Text Purification                SDM

Text Segmentation/"TextTiling"   Hearst; SDM

Text Summarization               Futrelle; IBM; Saracevic; SDM

Text Understanding               Futrelle; Fan

Web Data Mining                  Hearst

Web Mining                       Fan

Web Utilization Mining           Fan



Figure 1. ThemeScape™ visualization of a collection of 4,314 Y2K debate forum documents
(Cartia, 2000, expired website).



Figure 2. BioBiblioMetrics retrieval from a search on “DNA repair” and “recombination”
(Stapley, 2000).

text_mining.doc

  • 1. Text Mining 1 Running head: TEXT MINING Text Mining Mark Sharp Rutgers University, School of Communication, Information and Library Studies
  • 2. Text Mining 2 Abstract The general idea of text mining – getting small "nuggets" of desired information out of "mountains" of textual data without having to read it all – is nearly as old as information retrieval (IR) itself. Currently text mining is enjoying a surge of interest fueled by the popularity of the Internet, the success of bioinformatics, and a rebirth of computational linguistics. It can be viewed as one of a class of nontraditional IR strategies which attempt to treat entire text collections holistically, avoid the bias of human queries, objectify the IR process with principled algorithms, and "let the data speak for itself." These strategies share many techniques such as semantic parsing and statistical clustering, and the boundaries between them are fuzzy. Therefore in this paper several related concepts are briefly reviewed in addition to text mining proper, including data mining, machine learning, natural language processing, text summarization, template mining, theme finding, text categorization, clustering, filtering, text visualization, and text compression. Current text mining systems per se appear to be fairly primitive, but to have the following goals which may serve as a useful definition to distinguish text mining from other IR concepts: (1) to operate on large, natural language text collections; (2) to use principled algorithms more than heuristics and manual filtering; (3) to extract phenomenological units of information (e.g., patterns) rather than or in addition to documents; (4) to discover new knowledge. Interest in text mining for biomedical research purposes is especially pervasive and can be viewed as a major new frontier in bioinformatics. Text mining systems designed for use with science and technology text databases such as MEDLINE currently seem to have an undue emphasis on expert human filtering which contradicts goal (2). Whether this represents premature surrender to difficulty or a necessary temporary expedient remains to be seen.
  • 4. Text Mining 4 Text Mining Why Text Mining? It has become a cliché to describe information space and the challenge of navigating it in dramatic, even histrionic terms ("explosion," "avalanche," "flood," and the like), especially with regard to scientific, technical, and scholarly literature. We moderns may like to think we are the first to face this problem, but scientists have always complained about keeping up with their literature (Saracevic, 2001). The promise of better science through better information technology has been a major theme in information science since Vannevar Bush (1945) proposed his famous Memex machine to deal with the "growing mountain of research." Text mining is data mining applied to textual data. Text is "unstructured, amorphous, and difficult to deal with" but also "the most common vehicle for formal exchange of information." Therefore, the "motivation for trying to extract information from it is compelling – even if success is only partial …. Whereas data mining belongs in the corporate world because that's where most databases are, text mining promises to move machine learning technology out of the companies and into the home" as an increasingly necessary Internet adjunct (Witten & Frank, 2000) – i.e., as "web data mining" (Hearst, 1997). Laender, Ribeiro-Neto, da Silva, and Teixeira (2001) provide a current review of web data extraction tools. Text mining is one of a class of what I will call "nontraditional information retrieval (IR) strategies." The goal of these strategies is to reduce the effort required of users to obtain useful information from large computerized text data sources. Traditional IR often simultaneously retrieves both "too little" information and "too much" text (Humphreys, Demetriou, & Gaizauskas, 2000). The nontraditional strategies represent a "broader definition of IR" and the view that "a truly useful system must go beyond simple retrieval" (Liddy, 2000). I see them as treating the
  • 5. Text Mining 5 entire database or collection more holistically, recognizing that the selectivity of anthropogenic queries has a downside or bias which can be counterproductive to obtaining the best information, and attempting to "objectify" the IR process with principled algorithms.1 I like to think that they try to "let the data speak for itself." When I started to research this paper I made a list of all the IR concepts (traditional and non-) that were explicitly related to text mining by the first wave of authorities I identified. It was a daunting list (Table 1), but I thought it would be possible to rule them all either "in" or "out" and thus define their boundaries and hierarchical relationships to text mining. However, it soon became clear that the boundaries were fuzzy, the hierarchy was a mass of convoluted loops, and even seemingly outlandish claims to text mining relevance had, on closer inspection, a grain of truth.2 Therefore I decided to try to cover them all instead of focusing on text mining proper, whatever that turned out to be. Fortunately, time and literature resource limitations intervened to significantly curtail this plan. Hopefully the result will serve as a sensible compromise. History of Text Mining H. P. Luhn (1958), in a seminal paper on automatic abstracting, noted "the resolving power of significant words" in primary text. Lauren B. Doyle (1961) also captured the spirit of text mining and related methods when he said that "natural characterization and organization of information can come from analysis of frequencies and distributions of words in libraries" 1 E.g., "'Objectivity' [means] the results solely depend on the outcome of the linguistic processing algorithms and statistical calculations" (Dorre, Gerstl, & Seiffert, 1999). I recognize that such computational exotica, stripped of their mathematical mystique, "can be regarded as a form of transformed cognitive structure" (Ingwersen & Willett, 1995) and are therefore ultimately just as human and arbitrary as the traditional methods. But I also believe that there can be degrees of objectivity (operationally defined as general validity or utility) and that in general abstract computational approaches will tend to be more objective. 2 There is one website, however, that goes too far. Greenfield (2001) lists virtually every text processing and database technology I have ever heard of under the title "Text Mining." As a kind of rite of passage into the subject, Patrick Perrin asked me to look at it and tell him if all of that was really text mining, so apparently it's somewhat notorious in the field.
  • 6. Text Mining 6 ("libraries" representing what we would now more generally call collections or corpora). Text mining per se may be new, but the dream of training a computer to extract information from "mountains" of textual data is nearly as old as IR itself. Don R. Swanson (1988) articulated the idea that the scientific literature should be regarded as a natural phenomenon worthy of "exploration, correlation, and synthesis." He contrasted scientists' attitudes toward information usage with those of intelligence analysts. 'To the working scientist or engineer, time spent gathering information or writing reports is often regarded as a wasteful encroachment on time that would otherwise be spent producing results that he believes to be new' [Weinberg et al, 1963] …. The intelligence analyst, by contrast, is much more intimate with the available base of recorded information. New knowledge, or finished intelligence, is seen as emerging from large numbers of individually unimportant but carefully hoarded fragments that were not necessarily recognized as related to one another at the time they were acquired. Use of stored data is intensively interactive; "information retrieval" is an inadequate and even misleading metaphor. The analyst is continually interacting with units of stored data as though they were pieces selected from a thousand scrambled jigsaw puzzles. Relevant patterns, not relevant documents, are sought. Swanson called upon scientists to be more like intelligence analysts; to "take seriously the idea that new knowledge is to be gained from the library as well as the laboratory [and] to develop attitudes toward information indistinguishable from attitudes toward research itself." Not content to lecture scientists from a theoretical pedestal, by the time these words were published Swanson had already put the idea into practice by developing a system to discover meaningful new knowledge in the biomedical literature (see references in Swanson & Smalheiser, 1999). Software now called ARROWSMITH and freely available on the web (http://kiwi.uchicago.edu) helps by finding common keywords and phrases in "complementary and noninteractive" sets of articles or "literatures" and juxtaposing representative citations likely to reveal interesting co-occurrences. Two literatures are "complementary if together they can reveal useful information not apparent in the two sets considered separately" – e.g., one may reveal a
  • 7. Text Mining 7 natural relationship between A and B, and the other a relationship between B and C, so that together they suggest a relationship between A and C. The two literatures are "noninteractive" if their articles do not cross-cite and are not co-cited elsewhere in the literature. Swanson has discovered at least three biomedically important relationships using this system: between fish oil and Raynaud's syndrome, magnesium and migraines and epilepsy, and arginine and somatomedin C (Lindsay & Gordon, 1999). Most recently he has used it to identify several dozen viruses as potential bioweapons (Swanson, Smalheiser, & Bookstein, 2001). Swanson's system remains far from fully automated, it is highly medical domain-specific, and to my knowledge Swanson has never referred to it as text mining. But I believe it meets the criteria at least partially (see below), and Swanson has been recognized as an early pioneer by self- described text mining practitioners Marti Hearst (1999) and Ronald Kostoff (1999). I would like to go further and propose that, because of the ideas he expressed in his 1988 JASIS paper, Swanson is the father of modern text mining. What is Text Mining? Text mining per se is new and is still defining itself. It "has the peculiar distinction of having a name and a fair amount of hype but as yet almost no practitioners" (Hearst, 1999), and most of the information about it on the web is "misleading" (Perrin, 2001). The mining metaphor "implies extracting precious nuggets of ore from otherwise worthless rock" (Hearst, 1999), "gold hidden in … mountains of textual data" (Dorre, Gerstl, & Seiffert, 1999), or the idea that "the computer rediscovers information that was encoded in the text by its author" (IBM, 1998b). Hearst (1997, 1999) has argued for a narrow definition of text mining which distinguishes it from "information access" (traditional IR). Traditional IR is concerned primarily with the
  • 8. Text Mining 8 retrieval of documents (perhaps it should be called "DR"!) relevant to a user's information need, but getting the desired information out of the documents is left entirely up to the user. According to Hearst, data mining (of which text mining is a subtype, see below) not only deals directly with the information, it tries to discover or derive new information from the data (text) which was previously unknown even to the author(s) of the data (text[s]). She says "data mining is opportunistic, whereas information access is goal-driven" and that IR tricks such as clustering, finding terms for query expansion, and co-citation analysis are not text mining, although they can aid it by improving the target dataset. Thus, IR can be viewed as a complementary technique supporting text mining, rather than its broader term. Text mining always involves (a) getting some texts relevant to the domain of interest (traditional IR); (b) representing the content of the text in some medium useful for processing (natural language processing, statistical modeling, etc.); and (c) doing something with the representation (finding associations, dominant themes, etc.) (Perrin, 2001). IBM is marketing a product named "Intelligent Miner for Text" (IBM, 1998a,b; Dorre et al, 1999). It is a set of tools which "can be seen as information extractors which enrich documents with information about their contents" in the form of structured metadata. "Features" are classes of data which can be extracted, such as the language of the text, proper names, dates, currency amounts, abbreviations, and "multiword terms" (significant phrases). The feature extraction component is "fully automatic – the vocabulary is not predefined." It may operate on single documents or on collections of documents. Word counts are based on normalization to canonical forms (e.g., surgeries, surgical, and surgically might all be normalized to surgery). The phrase extractor "uses a set of simple heuristics… based on a dictionary containing part-of- speech information for English words [and] simple pattern matching to find expressions having
  • 9. Text Mining 9 the noun phrase structures characteristic of technical terms. This process is much faster than alternative approaches." There is also a clustering tool, a classification tool, and a search engine/ web crawler. The clustering similarity measure is based on "lexical affinities" – correlated groups of words which appear frequently within a short distance of each other and which can be used to label the clusters. Lindsay and Gordon (1999) and Kostoff (1999) have extended Swanson's approach without calling it text mining, but Kostoff's other work explicitly uses that label and so he serves as a kind of bridge. Swanson's system is essentially as follows: MEDLINE searches are done on two subjects (say, magnesium and migraines) and the results (titles or abstracts) are dumped into ARROWSMITH, which generates a list of all significant words and phrases common to the two result sets, and uses this information to "juxtapose pairs of text passages for the user to consider as possibly complementary" (Swanson & Smalheiser, 1999). Lindsay and Gordon (1999) added lexical frequency statistics (tf*idf) to rank the common words and phrases by probable discriminatory value, but their system, like Swanson's, still requires "human filters" at several points. Kostoff and co-workers have published several papers on the Web describing various text mining systems and applications. Losiewicz, Oard, and Kostoff (2000) describe a "TDM [text data mining] architecture that unifies information retrieval from text collections, information extraction from individual texts, knowledge discovery in databases, knowledge management in organizations, and visualization of data and information." What they mean by "unifies" is unclear, but this statement clearly betokens a broad view of text mining, almost as a synonym for the entire family of nontraditional IR strategies. The "TDM architecture" they describe includes subsystems for data collection (source selection and text retrieval), data warehousing
  • 10. Text Mining 10 (information extraction and data storage), and data exploitation (data mining and presentation). It thus appears to be a system for extracting and analyzing metadata. The authors discuss linguistic analysis and numerous exotic pattern-finding techniques, but these appear to be long- range goals. Current work focuses on the more pedestrian challenges of relevance feedback ("simulated nucleation"), bibliometrics, and phrase extraction and statistics. The system is "time and labor intensive" by the authors' own admission, "requires the close involvement of technical domain experts(s)" at every level of processing, and aims for a "main output [consisting of] technical experts who have had their horizon and perspectives broadened substantially through participation in the data mining process. The data mining tools, techniques and tangible products are of secondary importance…" Kostoff, Toothman, Eberhart, and Humenik (2000) connect text mining to "database tomography," a system for phrase extraction and proximity analysis. The authors capture the spirit of text mining when they say "techniques that identify, select, gather, cull, and interpret large amounts of technological information semi-autonomously can expand greatly the capabilities of human beings…" The idea of "tomography" also evokes text visualization, an important nontraditional IR strategy related to text mining (see below). The authors cite unpublished studies showing that in "real-world text mining applications" there is a "strong de- coupling of the text mining research performer from the text mining user. The performer tended to focus on exotic automated techniques, to the relative exclusion of the components of judgment necessary for user credibility and acceptance." Users tended to favor simpler techniques, even if it meant "reading copious numbers of articles." Database tomography aims to couple text mining research and technology more closely with the user through "heavy involvement of topical domain experts (either users or their proxies)" in the development of "strategic database maps"
  • 11. Text Mining 11 on the "front end." "The authors believe that this is the proper use of automated techniques for text mining: to augment and amplify the capabilities of the expert by providing insights to the database structure and contents, not to replace the experts by a combination of machines and non-experts." Kostoff and DeMarco (2001) define science and technology text mining as "the extraction of information from technical literature." It has three components: information retrieval (gathering relevant documents), information processing, and information integration. "Information processing is the extraction of patterns from the retrieved records" by bibliometrics, computational linguistics, and clustering. "Information integration is the synergistic combination of the information processing computer output with the [human] reading of the retrieved relevant records. The information processing output serves as a framework for the analysis, and the insights from reading the records enhance the skeleton structure to provide a logical integrated product." Again, "substantial manual labor" is noted, and technical details are not given, leaving doubt as to what kind of and how much "computational linguistics" and "clustering" were actually implemented. This work was also published under the title "Citation mining: Integrating text mining and biliometrics for research user profiling" by Kostoff, del Rio, Humenik, Garcia, and Ramirez (2001). In all of Kostoff's articles, there is a disturbingly high ratio of shifting, florid, technical jargon and speculation to actual accomplishment. He seems to be re-inventing several well established techniques such as relevance feedback, co-citation analysis, and phrase extraction, giving them flashy new names, and failing to cite prior work by others. It is often unclear where the boundary is between the computer and human filtering, particularly in Kostoff's phrase extraction process. Given the authors' constant emphasis on the importance of human judgment
  • 12. Text Mining 12 it seems likely that they have not automated the phrase selection process at all, and therefore have not added anything to classical word proximity analysis for phrase identification. Unrestricted human filtering or intervention in what are supposed to be algorithmic processes is, in some sense, a form of "fudging" or "cheating." It is antithetical to the goals of standardizing and objectifying the IR process, and it is hard to see how it contributes anything progressive to text mining research. This is not to disagree with Kostoff about the importance of domain expertise and user credibility and acceptance, only to caution against using such concerns as a figleaf for excessively primitive IR technology. Based on the foregoing, I propose the following criteria for a true text mining system. The keywords are highlighted. • It must operate on large, natural language text collections. • It must use principled algorithms more than heuristics and manual filtering. • It must extract phenomenological units of information (e.g., patterns) rather than or in addition to documents. • It must discover new knowledge. It is to be expected that different systems will meet these criteria to different extents. Currently Swanson's and Kostoff's systems are on shaky ground on at least the first two, possibly three. Perhaps text mining, by these criteria, is still more dream than reality. So let's look at some related concepts. Data Mining It seems fairly noncontroversial that text mining is a subdiscipline of the broader and slightly older field of data mining, the subdiscipline which deals with textual data. An
  • 13. Text Mining 13 intermediate evolutionary lexical form, in fact, is "text data mining" (Hearst, 1999; Losiewicz et al, 2000). The mining metaphor implying "extracting precious nuggets of ore from otherwise worthless rock" is actually more appropriate for text mining than for data mining, which tends to deal with trends and patterns across whole databases (Hearst, 1999). Data mining is considered a synonym for "knowledge discovery in databases" (KDD) by some writers (e.g. Hearst, 1999) and as a narrower term by others (e.g. Liddy, 2000). The most cited definition of KDD is that given by Fayyad, Piatesky-Shapiro, and Smyth (1996, cited by Qin, 2000, and Hearst, 1997): the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. "Information archaeology" is a synonym for both data mining and KDD, according to Hearst (1999). Two unusually practical, down-to-earth books on data mining are Witten and Frank (2000) and Han and Kamber (2001) (Perrin, 2001). Data mining usually deals with structured data, but text is usually fairly unstructured. The crux of the text mining problem, then, can be viewed as imposing structure on text to make it amenable to the analytic techniques of data mining. This is often conceptualized as extracting metadata from text (Losiewicz et al, 2000). Machine Learning Data mining is based on a variety of computational techniques, some of which fall under the rubric of machine learning. Examples are decision trees, neural networks, and association rules (clustering). In this context, machine learning involves "the acquisition of structural descriptions from examples [which] can be used for prediction, explanation, and understanding." When the description can be used to classify the examples, all three are enabled, unlike purely statistical modeling which only supports prediction. By some views, however, machine learning is little
  • 14. Text Mining 14 more than practical statistics as it evolved in the field of computer science; i.e., with an emphasis on searching "through a space of possible concept descriptions for one that fits the data" (Witten & Frank, 2000). From a broader artificial intelligence (AI) perspective, machine learning is one of the four capabilities needed for an AI system such as a robot to pass the "Turing test" – that is, to appear logical, rational, and intelligent to an intelligent human interrogator. In this context machine learning involves the ability "to adapt to new circumstances and to detect and extrapolate patterns" (Russell & Norvig, 1995). From a biomedical research perspective, Mjolsness and DeCoste (2001) define machine learning is "the study of computer algorithms capable of learning to improve their performance of a task on the basis of their own previous experience" primarily through pattern recognition and statistical inference. They see a legitimate future role for it in "every element of scientific method, from hypothesis generation to model construction to decisive experimentation." Text mining could help with the "high data volumes" involved in literature searching. However, most work to date has focused on experimental data reduction such as visualization of high-dimensional vector data resulting from gene expression microarray studies (see footnote 6, p. 25). Natural Language Processing Natural language processing (NLP) or understanding (NLU) is the branch of linguistics which deals with computational models of language. A brief history is given by Bates (1995). Its motivations are both scientific (to better understand language) and practical (to build intelligent computer systems). NLP has several levels of analysis: phonological (speech), morphological (word structure), syntactic (grammar), semantic (meaning of multiword
  • 15. Text Mining 15 structures, especially sentences), pragmatic (sentence interpretation), discourse (meaning of multi-sentence structures), and world (how general knowledge affects language usage) (Allen, 1995). When applied to IR, NLP could in principle combine the computational (Boolean, vector space, and probabilistic) models' practicality with the cognitive model's willingness to wrestle with meaning. NLP can differentiate how words are used such as by sentence parsing and part- of-speech tagging, and thereby might add discriminatory power to statistical text analysis. Clearly, NLP could be a powerful tool for text mining. Interest in it for that purpose is widespread but the jury remains out. Rau (1988) described an early NLP system named SCISOR which was developed by General Electric. Limited applicability to "constrained domains" was emphasized; SCISOR was programmed to deal only with information on corporate mergers. Input (news stories, etc.) was described as being converted to "conceptual format" permitting natural language interrogation (i.e., question answering) and summarization. SCISOR employed a parallel strategy of top-down (expectation-driven conceptual analysis) and bottom-up (partial linguistic analysis) parsing. Parsing is the identification of subjects, verbs, objects, phrases, modifiers, etc., within sentences. Computerized parsing of free text "is an extremely difficult and challenging problem," according to Rau. The two parsers in SCISOR interacted with a domain-specific knowledge base containing grammatical and lexical information. The double parsing strategy of SCISOR allowed flexibility to perform in-depth analysis when complete grammatical and lexical knowledge is available, and superficial analysis when unknown words and syntax are encountered, giving the system robustness. The top-down parser could also be used for text skimming (looking for particular pieces of information). However, semantic analysis "is very expensive and furthermore depends on a lot of
  • 16. Text Mining 16 domain-dependent knowledge that has to be constructed manually or obtained from other sources" (IBM, 1998a). Early NLP's image also suffered from the poor performance of phrase-based indexing in comparison with stemmed single words in the Cranfield and SMART tests (Salton, 1992). Interest in NLP revived when request-oriented (as opposed to document-oriented) IR came of age and it was realized that the limitations of the linguistic techniques did not prevent them from being effective within restricted subject domains (Ingwersen and Willett, 1995). Unlike its more successful sibling field of speech recognition, NLP has the severe disadvantages of diffuse goals and lack of robust machine learning algorithms (Bates, 1995). There seems to be wide consensus that NLP is still not competitive with statistical approaches to traditional IR, but that it may be practical and even critical for applications such as phrase extraction and text summarization. Even Salton, the godfather of statistical IR, said, "In the absence of deep linguistic analysis methods that are applicable to unrestricted subject areas, it is not possible to build intellectually satisfactory text summaries" (Salton, Allan, Buckley, & Singhal, 1994). Liz Liddy (2000, 2001) has become a prominent advocate for NLP in text mining. Her definition of the goal of text mining, in fact, is "capturing semantic information" as tabular metadata amenable to statistical data mining techniques. In her work, NLP includes stemming (morphological level), part-of-speech tagging (syntactic level), phrase and proper name extraction (semantic level), and disambiguation (discourse level). Goals include automating text mark-up for hypertext linkages in digital libraries, and machine learning algorithms for text classification (see below). A "reverse flow" of purely statistical methods to NLP has been going on since about 1990 and has made "substantial contributions" (Kantor, 2001), increasing interest in hybrid approaches (Marcus, 1995; Losee, 2001a; Perrin, 2001). Statistical enrichment has been shown
to significantly improve the accuracy of proper name classification, part-of-speech tagging, word sense disambiguation, and parsing under certain conditions (Marcus, 1995), and tagging and disambiguation improve probabilistic document retrieval ranking discrimination by some parts of speech (Losee, 2001a). Ultimately, lexical statistics are a reflection of term dependencies which in turn reflect natural languages' relation to "naturally occurring dependencies in the physical world" (Losee, 2001b). However, higher-level NLP proved far inferior to "shallow" tricks like stemming and query expansion in improving the performance of an advanced IR system under rigorous test conditions (Perez-Carballo & Strzalkowski, 2000).

Computational linguistics is used as a synonym for NLP by some writers and as a narrower term by others. According to Hearst (1999), it is the branch of NLP which deals with finding statistical patterns in large text collections to inform algorithms for NLP techniques such as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation; i.e., computational linguistics is a form of text mining. Thus, to Hearst and Liddy, text mining subserves NLP, rather than the reverse. Both Hearst and Liddy refer often to metadata as being the bridge between NLP and statistics. They both envision text mining as a component of a full-featured information access system which also includes source detection, content retrieval, and analytical aids such as text visualization (see below).

A major problem in text analysis is "dangling anaphors" – pronouns and demonstratives (this, that, the latter, etc.) which refer back to other sentences (Johnson, Paice, Black, & Neal, 1993). Therefore a good job for NLP would be to detect anaphors and search backwards to resolve their referent. In the language of logic, this might be called identifying the point in the text where each significant new proposition begins. In 1993, that was beyond available text processing capabilities, so the authors had to exclude anaphoric sentences from further analysis regardless of their information content.

In summary, all this activity and interest raise hopes, but NLP still "has not delivered the goods" (Saracevic, 2001) and so the jury remains out.
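To illustrate just how shallow a "shallow" trick like stemming can be, the fragment below is a minimal suffix-stripping sketch in Python. The rule list is invented for illustration and is only a crude stand-in for a real stemmer such as Porter's; it is not part of any system reviewed here.

```python
# Toy suffix-stripping stemmer: a crude approximation of the "shallow"
# normalization that statistical IR systems rely on.
SUFFIXES = ["ational", "ization", "fulness", "ingly", "ation", "ness",
            "ment", "ions", "ing", "ies", "ly", "ed", "es", "s"]

def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:  # list is ordered roughly longest-first
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

if __name__ == "__main__":
    for w in ["summarization", "binding", "enzymes", "categories", "parsed"]:
        print(w, "->", stem(w))
```

Crude as it is, this kind of conflation is what the stemmed-single-word baselines in the evaluations cited above were built on.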
Text Summarization

An obvious example of text mining would be to find previously unknown natural correlations by looking at co-occurrences of themes in a corpus of texts. Before one can do that, of course, one must identify the themes. A theme being a form of summary, automated theme-finding is a form of automatic text summarization (or automatic abstracting), a proud old IR tradition.

Johnson, Paice, Black, and Neal (1993) trace the history of automatic abstract generation from Luhn (1958), who proposed extracting sentences based on their computed word content weights, and Baxendale (1958, cited by Johnson et al, 1993), who drew attention to the importance of the first and last sentences of paragraphs. Edmundson (1969, cited by Johnson et al, 1993) found that both of these methods were inferior to extraction on the basis of cues (bonus words and stigma words). Paice (1981, cited by Johnson et al, 1993) sharpened Edmundson's idea of cues to "indicator constructs" such as In this paper we show that…
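To make Luhn's word-weight idea concrete, here is a minimal extractive-summarization sketch in Python. The frequency-based scoring is a simplified stand-in for Luhn's significance-word clustering, and the stopword list and function names are invented for illustration; it is not a reconstruction of Luhn's or Johnson et al.'s systems.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "that", "this", "for", "it", "on", "as", "with", "be"}

def luhn_style_summary(text: str, n_sentences: int = 2) -> list[str]:
    """Rank sentences by the average corpus frequency of their content words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    weights = Counter(w for w in words if w not in STOPWORDS)

    def score(sentence: str) -> float:
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower())
                  if w not in STOPWORDS]
        return sum(weights[w] for w in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Present the extracted sentences in their original order.
    return [s for s in sentences if s in ranked]
```

A real system would add the positional and cue-phrase evidence that Baxendale, Edmundson, and Paice introduced.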
Johnson et al (1993) built a NLP-based auto-abstracting system which selected non-anaphoric, indicator-containing sentences and ran them through a bottom-up parser, dictionary-based part-of-speech tagger (noun, verb, etc.) and morphology-based tagger (-ly = adverb, etc.). Each word was then indexed by its sentence number, position within the sentence, part of speech, verb tense if applicable, and whether it was plural or singular. The result was then "cleaned up" by a set of corrective heuristics and a grammar-based tag disambiguator3. A global parser then identified noun phrases based on definitive cues such as being separated by a preposition (e.g., the primary factor in public health), and then parsed the sentence. The resulting sample abstract was "far from perfect" as the authors admitted, but it was a plausible condensation down to 22% of the original text size. Since 22% is an inadequate degree of data reduction for most text summarization needs, the next step might be to take a page from statistical IR and develop ways of ranking the selected sentences.

Footnote 3: An example of a sentence with intractable tag ambiguity would be Rice flies like sand, which could refer to the behavior of grain or insects (Allen, 1995, p. 13). Such a sentence would require higher (pragmatic and discourse) levels of analysis to disambiguate.

Template mining

SCISOR's (Rau, 1988) text summarization capabilities were based on filling in values specified by domain-dependent, manually formulated "scripts" – e.g., company A offered B dollars per share in a takeover bid for company C on date D. The values were extracted from raw text by parsing and stored in relational data tables. Then summaries of the parsed data values could be written by a natural language generator. This seems to be a form of template mining, where the script or metadata table field structure constitutes the template.

Chowdhury (1999) describes template mining as a form of information extraction using NLP "to extract data directly from the text if either the data and/or text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to the instructions associated with that template." Chowdhury traces its history from the mid-1960s Linguistic String Project at New York University, where "fact retrieval" was conducted against template data mined from natural language text, up to its current (1999) use in the AltaVista and
Ask Jeeves web search engines. He cites some of the same work I reviewed under NLP and below (the Rau, Paice, and Gaizauskas groups), perhaps implying that template mining is a general term for NLP-based metadata approaches to text mining. He also cites Croft (1995) in reference to the U.S. Advanced Research Projects Agency (ARPA) initiative in this area, the Message Understanding Conferences (MUCs).

To facilitate template mining, Chowdhury recommends "standardization in the presentation and layout of information within digital documents" through the use of templates for document creation. But this is contrary to the spirit of text mining, which is to liberate both the creators and the users of text from as much tedium and artificiality as possible. Like Kostoff's unrestricted reliance on human filters, it represents a form of surrender in the face of difficulty – hopefully premature!
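A minimal sketch of the template idea, in the spirit of SCISOR's merger script described above: when a sentence matches the pattern, its slots become fields in a data record. The regular expression, field names, and example sentence are invented for illustration and are not drawn from any of the systems cited.

```python
import re

# Hypothetical template for sentences like:
#   "Acme Corp. offered 45 dollars per share in a takeover bid for
#    Widget Inc. on 12 March 1999."
MERGER_TEMPLATE = re.compile(
    r"(?P<buyer>[A-Z][\w.&' ]+?) offered (?P<price>[\d.]+) dollars per share "
    r"in a takeover bid for (?P<target>[A-Z][\w.&' ]+?) on (?P<date>[\w ]+)\."
)

def fill_template(text: str) -> list[dict]:
    """Return one relational-style record per sentence that matches the template."""
    return [m.groupdict() for m in MERGER_TEMPLATE.finditer(text)]

if __name__ == "__main__":
    story = ("Acme Corp. offered 45 dollars per share in a takeover bid "
             "for Widget Inc. on 12 March 1999.")
    print(fill_template(story))
```

The brittleness is obvious: only text that happens to match the hand-built pattern yields a record, which is exactly why Chowdhury is tempted to standardize the documents themselves.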
Theme Finding

Salton, Allan, Buckley, and Singhal (1994) looked at how traditional IR models can be applied to theme generation and text summarization. The authors derived the notion of passage retrieval from the problem of ranking vector matches when the vectors are of different lengths, e.g. very short queries against long documents, or clustering documents of different sizes. One solution is to decompose the documents into subunits of roughly equal size, called "passages." A common passage unit is a paragraph. The passages may be converted to normalized vectors and compared. Those with similarities above a certain threshold (which may be chosen to deliver a desired degree of abstraction) are considered connected.

If the documents are plotted as arcs on the circumference of a circle and their component passages connected by straight lines in accordance with their vector similarities, the resulting starburst pattern can convey themes within and between documents. These themes can be focused by expressing each triangle of passage similarities as a centroid and doing similarity calculations on the centroids.

One may want to compute an estimate of the "most important" passages for the purpose of selective text traversal ("skimming") or text summarization. Such passages might be identified as (a) having a large number of above-threshold similarity connections, (b) strategic position (e.g., the first paragraph in each section), or (c) high similarity to some reference node. The last criterion (c) is called "depth first" selection. In practice, all three of these criteria can be combined; e.g., start with some desired passage (as in "more like this"), go to the most similar sectional heading passage, then go to its strongest link, then select the other densely connected nodes in that cluster in chronological order. For text summarization, repetition can be edited out on the basis of similarities between sentences or other subunits which are "too high."
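The core passage-linking computation can be sketched very simply, assuming paragraphs as passages, raw term-frequency vectors, cosine similarity, and an arbitrary threshold; Salton et al.'s actual term weighting and thresholds are more sophisticated than this.

```python
import math
import re
from collections import Counter
from itertools import combinations

def term_vector(passage: str) -> Counter:
    """Raw term-frequency vector for one passage."""
    return Counter(re.findall(r"[a-z]+", passage.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link_passages(paragraphs: list[str], threshold: float = 0.3) -> list[tuple]:
    """Index pairs of passages whose similarity clears the threshold ('connected')."""
    vectors = [term_vector(p) for p in paragraphs]
    return [(i, j) for i, j in combinations(range(len(vectors)), 2)
            if cosine(vectors[i], vectors[j]) >= threshold]
```

Passages (or their centroids) that accumulate many above-threshold links are then the natural candidates for themes, skimming targets, or summary sentences.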
Text Categorization

Text categorization should not be considered a form of text mining because it is a "boiling down" of document content to "pre-defined labels" which "does not lead to discovery of new information" since "presumably the person who wrote the document knew what it was about," according to Hearst (1999). Presumably she would also rule out text summarization and auto-indexing for the same reason. She makes exceptions, however, for cases where the goal of categorization is to find "unexpected patterns" or "new events" because these "tell us something about the world, outside of the text collection itself" and therefore qualify as new information.

I would argue, however, that it is not so easy to predict where "new information" will come from, that novelty is in the eye of the beholder, and that any form of text data reduction is a form of separating "precious nuggets" from "worthless rock" according to the human idiosyncrasies of whoever is doing the separating, be it a traditional library cataloguer/indexer or a vector space modeler. This is not to say that cataloguing, indexing, and other IR tools are all text mining, but just to highlight the fuzziness of the boundaries between them.

Clustering

Clustering can be used to classify texts or passages in natural categories that arise from statistical, lexical, and semantic analysis rather than the arbitrarily pre-determined categories of traditional manual indexing systems. In the context of text mining, it is the derivation of the categories which is of interest, since this is a form of theme finding and therefore text summarization. Once the texts are clustered on the basis of common themes, it may also be useful to correlate their divergent themes, a la Swanson. Texts may also be clustered on the basis of length, cost, date, etc. (IBM, 1998b), or bibliographic data such as author, institution, or country of origin (Kostoff, 1999). Computational aspects of clustering are reviewed by Witten and Frank (2000, Section 6.6).

Filtering

E-mail filtering is often mentioned as an example of text mining (e.g., Witten and Frank, 2000). The relevance of related techniques such as name recognition, theme finding, and text categorization is obvious, and it is even possible to imagine software which modifies its own filtering criteria by discovering new patterns in the whole e-mail stream. However, I was unable to find reports of any actual work on such a system.
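As a rough sketch of the statistical machinery such a filter or categorizer could rest on, here is a tiny multinomial naive Bayes text classifier in Python. It is a generic textbook technique with invented names, not a description of any system cited in this paper.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayesFilter:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)   # class label -> word frequencies
        self.doc_counts = Counter()                # class label -> number of documents
        self.vocabulary = set()

    @staticmethod
    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text: str, label: str) -> None:
        tokens = self.tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())

        def log_posterior(label: str) -> float:
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.doc_counts[label] / total_docs)
            for token in self.tokenize(text):
                score += math.log((counts[token] + 1) / total)
            return score

        return max(self.doc_counts, key=log_posterior)
```

Training it on a stream of messages labelled, say, "route to analyst" versus "discard" would reproduce the stable, long-term profile character of filtering discussed next.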
Belkin and Croft (1992) built a model of information filtering (IF) based on Belkin's famous anomalous states of knowledge (ASK) model of IR. In a side-by-side comparison, the two (IF and IR) appear strikingly similar, the biggest difference being the "stable, long-term… regular information interests" of IF compared to the "periodic… information need or ASK" of IR. Extending the side-by-side modeling to Bayesian inference networks, the authors arrive at another striking comparison: the IF network looks exactly like an upside-down IR network! That is, in IR multiple documents are percolating down to a single user, while in IF each single incoming document is percolating down to multiple users. However, the authors reject this analogy for reasons not entirely clear to me.4

Footnote 4: They seem to feel that "P(oj|pi)", the probability that the incoming document will satisfy the information need given a user's filtering profile, is poorly understood compared to the conventional Bayesian need-query-document relationships, but I'm not sure the latter are so well-understood, either.

Text Visualization

Text visualization shares text mining's goals of using computational transformations to reduce the cognitive effort of dealing with large text corpora, highlight patterns across documents, and help discover new knowledge. Text mining implies homing in on "precious nuggets" whereas text visualization seems to be concerned with the "big picture," but in practice both may be regarded as elements of a holistic approach to multi-text corpora. The text mining systems of Hearst, Kostoff, and Liddy all have explicit text visualization components.

Wise (1999) developed a text visualization paradigm for intelligence analysis named Spatial Paradigm for Information Retrieval and Exploration (SPIRE) "to find a means of 'visualizing text' in order to reduce information processing load and to improve productivity" by representing large numbers of documents to permit "rapid retrieval, categorization, abstraction, and comparison, without the requirement to read them all." The theory behind SPIRE was that
humans' most highly evolved perceptual abilities are those involved in interpreting "visual features of the natural world." Therefore the goal was to represent text as natural, ecological images from our early hominid past which require no "prolonged training to appreciate and use" such as star fields or landscapes (Figure 1). This transformation was accomplished using standard vector space algorithms and involves clustering and text summarization. SPIRE is an excellent example of how a cognitive theory can be helpful in inspiring IR innovation and guiding system development, despite its apparent lack of commercial success.5

Footnote 5: Cartia, Inc., which was marketing the ThemeScape™ software (Figure 1, downloaded Fall 2000), no longer has any detectable presence on the Web.

Text Compression

As mentioned at the beginning, I started this paper by trying to narrow the definition and scope of text mining by differentiating it from other nontraditional IR strategies (Table 1). One by one, however, the other strategies refused to be cleanly differentiated, and the foregoing polyglot review is the result. The only concept I thought I had succeeded in banishing from the scope of text mining was data compression, which showed up in the title of a single citation in a literature search performed for me by Melissa Yonteck. Data compression, a la PKZIP, was surely not related in any meaningful way to text mining, Yonteck and I agreed. Here at last was something I could confidently rule out.

But on page 334, Witten and Frank (2000), in discussing statistical character-based models for token classification (names, dates, money amounts, etc.), note that "there is a close connection with prediction and compression: the number of bits required to compress an item with respect to a model can be interpreted as the negative logarithm of the probability with which that item is produced by the model." That is, text compression algorithms might function as token classifiers in reverse!
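A sketch of that prediction/compression connection, assuming one smoothed character bigram model per token class: a token is assigned to whichever class's model would encode it in the fewest bits. The classes, training strings, and smoothing choice are invented for illustration and are not taken from Witten and Frank's implementation.

```python
import math
from collections import Counter, defaultdict

class CharModel:
    """Character bigram model; code length in bits approximates -log2 P(token)."""

    def __init__(self, examples: list[str]):
        self.bigrams = defaultdict(Counter)
        self.alphabet = set("^$")
        for ex in examples:
            chars = "^" + ex + "$"              # start/end markers
            self.alphabet.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[prev][cur] += 1

    def code_length(self, token: str) -> float:
        bits, chars = 0.0, "^" + token + "$"
        for prev, cur in zip(chars, chars[1:]):
            counts = self.bigrams[prev]
            # add-one smoothing so unseen transitions stay finite
            p = (counts[cur] + 1) / (sum(counts.values()) + len(self.alphabet))
            bits += -math.log2(p)
        return bits

def classify(token: str, models: dict[str, CharModel]) -> str:
    """Pick the class whose model 'compresses' the token best."""
    return min(models, key=lambda label: models[label].code_length(token))

if __name__ == "__main__":
    models = {"date": CharModel(["12/03/1999", "01/07/2001", "30/11/1998"]),
              "gene": CharModel(["JAK3", "PKD1", "BRCA2", "TP53"])}
    print(classify("05/09/2000", models), classify("CDK2", models))
```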
So I give up. Text mining appears to be related to just about everything on my original list.

Biomedical Applications

My interest in text mining is motivated primarily by the belief that it can be fruitfully applied to biomedical literature, specifically the MEDLINE database, to discover new knowledge. I see text analysis as a major new frontier in bioinformatics, whose smashing success in the area of gene sequence analysis is based, after all, on nothing more than algorithms for finding and comparing patterns in the four-letter language of DNA. Swanson's work has focused on MEDLINE, and Hearst (1999) has also declared a research interest in "automating the discovery of the function of newly sequenced genes" by determining which novel genes are "co-expressed with already understood genes which are known to be involved in disease."

Humphreys, Demetriou, and Gaizauskas (2000) used information extraction, defined as "extracting information about predefined classes of entities and relationships from natural language texts and placing this information into a structured representation called a template" [is it therefore template mining?], to build a database of information about enzymes, metabolic pathways, and protein structure from full text biomedical research articles. The LaSIE (Large Scale Information Extraction) system includes modules for datatype recognition (names, dates, etc.), co-reference resolution (pronouns, anaphors, metonyms, etc.), and different types of template filling. It does linguistic analysis at all levels up to discourse using lexical knowledge, morphology, and grammars to identify significant words. The enzyme and metabolic pathway variant of LaSIE is called (of course) EMPathIE and fills the following template fields: enzyme name, EC (Enzyme Commission) number, organism, pathway, compounds involved and their roles
(substrate, product, cofactor, etc.), and, interestingly, compounds not involved. Optional fields include concentration and temperature. The PASTA variant deals with protein structure information such as which amino acid residues occupy given positions, active and binding sites, secondary structure, subunits, interactions with other molecules, source organism, and SCOP category. The prototype has been tested on only six journal papers, so it is far from satisfying the large text corpus requirement for true text mining, but the authors make no such claim.

The U.S. National Institutes of Health (NIH) have also gotten involved. Tanabe, Scherf, Smith, Lee, Hunter, and Weinstein (1999) developed a system named MedMiner to help them sort out the thousands of gene expression correlations resulting from microarray experiments6 to separate "interesting biological stories" from mere epiphenomena and statistical coincidences. The first module gathers the relevant texts by querying PubMed (MEDLINE) and GeneCards (an Israeli gene information database) on the expressed genes. [Gene names generally make good search words because they are different from normal English words, e.g. "JAK3".] The second module filters the retrieved texts by user-specifiable relevance criteria based on classical proximity or term frequency scores (NLP criteria being regarded as too computationally expensive). The third module is a "carefully designed user interface" to facilitate access to the most likely-to-be-interesting documents.

Footnote 6: Basically, a square chip coated with an array of known DNA sequences at known locations on the chip is dipped into a broth containing the expressed messenger RNA (mRNA) from cells under given conditions. The mRNA is labeled so that when it binds to its complementary DNA on the chip the gene expression pattern is revealed. Gifford (2001) briefly reviewed the direct application of data visualization to gene expression data not involving any text.

Despite the name, then, MedMiner is not a true text mining system, but rather a search and display enhancement to PubMed (which offers only flat Boolean search logic, unranked retrieval, and no integration with GeneCards, although it is integrated with other gene and protein databases). Like Kostoff's system, it is designed to deal with highly technical information by assisting expert users in their traditional IR tasks rather than attempting to automate them completely. MedMiner is freely available online at http://discover.nci.nih.gov.
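The filtering step can be sketched as a simple scoring pass over retrieved abstracts. The sketch below combines raw term frequency with sentence-level co-occurrence as a crude proxy for proximity; the weighting, threshold, and function names are invented and are not MedMiner's actual criteria.

```python
import re

def sentence_cooccurrence(abstract: str, gene: str, keyword: str) -> int:
    """Count sentences in which the gene name and a keyword of interest co-occur."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract)
    return sum(1 for s in sentences
               if gene.lower() in s.lower() and keyword.lower() in s.lower())

def relevance_score(abstract: str, gene: str, keywords: list[str]) -> float:
    """Combine raw term frequency of the gene with keyword co-occurrence hits."""
    term_freq = len(re.findall(re.escape(gene), abstract, flags=re.IGNORECASE))
    proximity = sum(sentence_cooccurrence(abstract, gene, k) for k in keywords)
    return term_freq + 2.0 * proximity      # arbitrary illustrative weighting

def filter_abstracts(abstracts: list[str], gene: str,
                     keywords: list[str], threshold: float = 3.0) -> list[str]:
    """Keep only the abstracts likely to be worth an expert's attention."""
    return [a for a in abstracts if relevance_score(a, gene, keywords) >= threshold]
```

Cheap scores of this kind are exactly the sort of criteria the authors preferred over NLP on computational-cost grounds.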
Another NIH group, Rindflesch, Hunter, and Aronson (1999), developed a true NLP system named ARBITER for mining molecular binding terms from MEDLINE. ARBITER attempts to identify noun phrases representing molecular entities such as drugs, receptors, enzymes, toxins, genes, messenger molecules, etc., and their structural features (box, chain, sequence, subunit, etc.) likely to be involved in binding. ARBITER makes use of MeSH indexing, the lexical and semantic knowledge bases of the Unified Medical Language System (UMLS) and GenBank, co-word adjacency to forms of bind, and a variety of linguistic strategies to deal with acronyms, anaphors, modifiers, coordinated phrases, and nested phrases (e.g., "…a previously unrecognized coiled-coil domain within the C terminus of the PKD1 gene product, polycystin, and demonstrate…"). A test on a small sample (116 abstracts containing a form of bind, one month's worth from MEDLINE) yielded 72% recall and 79% precision of manually marked binding terms. While terminology extraction might be considered a fairly trivial form of text mining, it is obviously a logical step toward the mining of binding relationships (A binds B) which would have enormous potential for knowledge discovery.
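ARBITER itself depends on full noun-phrase analysis and the UMLS and GenBank knowledge bases; the fragment below is only a crude illustration of the co-word-adjacency idea, pulling out capitalized token runs near forms of bind. The window size, patterns, and example sentence are invented.

```python
import re

BIND_FORMS = re.compile(r"\b(bind|binds|bound|binding)\b", re.IGNORECASE)
# Crude stand-in for a noun phrase: a short run of capitalized or all-caps tokens.
ENTITY = re.compile(r"\b(?:[A-Z][A-Za-z0-9-]*\s?){1,4}\b")

def binding_candidates(sentence: str, window: int = 40) -> list[str]:
    """Return capitalized token runs within `window` characters of a form of 'bind'."""
    hits = []
    for m in BIND_FORMS.finditer(sentence):
        nearby = sentence[max(0, m.start() - window): m.end() + window]
        hits.extend(e.strip() for e in ENTITY.findall(nearby) if e.strip())
    return hits

if __name__ == "__main__":
    s = "The PKD1 gene product, polycystin, binds the coiled-coil domain of TSC2."
    print(binding_candidates(s))
```

Even this toy version shows why the real system needs its lexical knowledge bases: without them there is no way to tell a gene product from an ordinary capitalized word.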
Stapley and Benoit (2000) developed a system named "BioBiblioMetrics" (Stapley, 2000) which uses text visualization to suggest functional clusters of genes from the yeast Saccharomyces cerevisiae. The system uses a subset of MEDLINE records containing the yeast's name, a lexical knowledge base of all the known, nontrivial yeast genes and their aliases from the SGD (Saccharomyces Gene Database), and a matrix of gene name pair co-occurrence statistics. When one does a search on a gene name or function (e.g. "DNA replication"), the co-occurring genes are displayed in a graph with "nodes" representing genes and edge lengths between the nodes representing biological proximity (Figure 2). Nodes are hypertext-linked to sequence databases, and edges to those MEDLINE documents that generated them, creating a biomedical information "landscape" and inference network. BioBiblioMetrics is freely available online at http://www.bmm.icnet.uk/~stapleyb/biobib/.

Other MEDLINE text mining papers which I did not have a chance to review in full involve dictionary-controlled natural language processing for extraction of drug-gene relationships (Rindflesch, Tanabe, Weinstein, & Hunter, 2000); statistical term strength analysis (Wilbur & Yang, 1996); statistical text classification and a relational machine-learning method (Craven & Kumlien, 1999); statistical identification of key phrases against an evolutionary protein family background (Andrade & Valencia, 1997 & 1998); pre-specified protein names and a limited set of action verbs (Blaschke, Andrade, Ouzounis, & Valencia, 1999); and a proprietary information extraction system (Thomas, Milward, Ouzounis, Pulman, & Carroll, 2000). Futrelle (2001a) provides online full-text access to many biomedical text mining papers, including those from the hard-to-get 2000 and 2001 Pacific Symposia on Biocomputing.

Bob Futrelle (2001a,b) has organized a large "bio-NLP" information network and enunciated a radical vision which includes several of the themes of this paper, such as the analogy between text and genome analysis, and the long history of information extraction in its many guises. He sees the challenge as "understanding the nature of biological text, whatever that turns out to be, linguistic theories notwithstanding." He seems to feel that the traditional rules and grammars of Chomskian linguistics are more hindrance than help.

Frankly, a fresh new approach is needed, fueled by the conviction that language is a biological phenomenon, not a logical phenomenon. By this we mean that the nature of language is as messy as the genome. The data and observed phenomena in all their richness and variety are dominant and cannot be subsumed by any elegant theories. This means that in many ways, biologists have far better hopes of cracking the NLP problem than the computational linguists, who are focused on mathematics and logic. Even when they look at data, it is primarily as grist for their math mills.
Futrelle recommends, for example, building visualization tools such as a protein noun phrase highlighter which could be used to "assemble a large collection of the standard textual expression forms [and] map these onto the query forms for which they are the answers." But Futrelle also goes beyond immediate practical needs. Like Wise (1999), he has a coherent theory based on the biological nature of language.

By this I mean that language is a communicative capability of living organisms that has evolved from deep biological roots and from social interactions over millions, and ultimately, billions of years. I claim that language is not logical and mathematical, because that's not the nature of the organism (us) that exhibits the language capability. An example of this is found in our vocabularies. A technically skilled adult will have a vocabulary of over 100,000 words, basically all memorized. The meaning of "bear" or "ship" does not follow from the characters that make them up. We simply commit them to memory. Linguists would like us to believe that our natural ability to "parse" is radically different and can be explained as a rule-based system. My radical view is that we understand language not by generalization to abstract rules as much as by retaining examples and generalizing from them as needed. This is quite within our capacity, given our 100,000 word vocabularies. We also do reason. I would claim, again in the biological view, that this is done more by "imagined life" than by logic. Humans have superb abilities to remember events and to build detailed mental plans for future activities …. So we need to build this type of reasoning into our systems.

The analogy to genomics is clear. The coding of a particular protein by a particular sequence of DNA bases is just an accident of evolution. Whatever rules now appear to prevail (such as "zinc fingers" for DNA-binding proteins) can only be derived empirically, by looking for patterns within the data. Purely logical approaches must wait for a richer knowledge base. Only now, after the massive effort of half a century of molecular genetic research, sequencing whole genomes, and building databases and tools such as GenBank, Gene Cards, and Proteome, can we begin to think about prediction of protein structure and function from sequence data alone. Biological linguistics now stands at the beginning of a comparably arduous journey.

These considerations put Swanson's, Kostoff's, Tanabe's, and Chowdhury's reliance on human expertise and manual filtering in a better light. Perhaps they do not represent premature
surrender to difficulty so much as a necessary but hopefully temporary expedient. Perhaps they are keeping "the human in the loop" (Kantor) only long enough to "study the human to learn what to put in the machine" (Saracevic, 2001). This surprising interface between biomedical text mining and the cognitive tradition in IR would make a worthy topic for another paper.
References

Allen, J. (1995). Natural Language Understanding, Second Edition. Redwood City, CA: Benjamin/Cummings.
Andrade, M. A., & Valencia, A. (1997). Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5, 25-32.
Andrade, M. A., & Valencia, A. (1998). Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics, 14(7), 600-607.
Bates, M. (1995). Models of natural language understanding. Proceedings of the National Academy of Sciences, 92, 9977-9982.
Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35, 29-38.
Blaschke, C., Andrade, M. A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction of biological information from scientific text: protein-protein interactions. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pp. 60-67.
Bush, V. (1945). As We May Think. Atlantic Monthly, 176(11), 101-108.
Cartia, Inc. (2000). ThemeScape product suite. Formerly online: http://www.cartia.com/products/index.html [no longer accessible].
Chowdhury, G. G. (1999). Template mining for information extraction from digital documents. Library Trends, 48, 182-208.
Craven, M., & Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. Proceedings of the International Conference on
Intelligent Systems for Molecular Biology, pp. 77-86.
Dorre, J., Gerstl, P., & Seiffert, R. (1999). Text mining: Finding nuggets in mountains of textual data. KDD-99, Association for Computing Machinery.
Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the Association for Computing Machinery, 8, 223-239.
Fan, W. (2001). Text mining, web mining, information retrieval and extraction from the WWW references. Online: http://www-personal.umich.edu/~wfan/text_mining.html
Futrelle, R. P. (2001a). Natural language processing of biology texts. Online: http://www.ccs.neu.edu/home/futrelle/bionlp/
Futrelle, R. P. (2001b). The past, present and future of biology text understanding. Presented at the Conference on Biological Research with Information Extraction (BRIE), Tivoli Gardens, Copenhagen, Denmark, July 26. Online: http://www.ccs.neu.edu/home/futrelle/brie2001/index.html
Gifford, D. K. (2001). Blazing pathways through genetic mountains. Science, 293, 2049-2051.
Greenfield, L. (2001). Text mining. Online: http://www.dwinfocenter.org/docum.html
Hearst, M. (1997). Distinguishing between web data mining and information access. Presentation for the Panel on Web Data Mining, KDD 97, August 16, Newport Beach, CA. Online: http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm
Hearst, M. (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). Online: http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
Hearst, M. (2001). About TextTiling. Online: http://www.sims.berkeley.edu/~hearst/tiling-about.html
Humphreys, K., Demetriou, G., & Gaizauskas, R. (2000). Bioinformatics applications of information extraction for scientific journal articles. Journal of Information Science, 26, 75-85.
IBM (1998a). Text analysis tools. Slide #8 of Intelligent Miner for Text Overview. Online: http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23over/im4t23over8.htm
IBM (1998b). Text mining technology: Turning information into knowledge: A white paper from IBM. Daniel Tkach (Ed.). Online: http://www-4.ibm.com/software/data/iminer/fortext/download/whiteweb.pdf
Ingwersen, P., & Willett, P. (1995). An introduction to algorithmic and cognitive approaches for information retrieval. Libri, 45, 160-177.
Johnson, F. C., Paice, C. D., Black, W. J., & Neal, A. P. (1993). The application of linguistic processing to automatic abstract generation. Journal of Document and Text Management, 1, 215-241.
Kantor, P. B. (2001). Lecture K: Natural language concepts. Information Retrieval class, Rutgers University, School of Communication, Information, and Library Studies, New Brunswick, NJ.
Kostoff, R. N. (1999). Science and technology innovation. Technovation, 19. Online: http://www.dtic.mil/dtic/kostoff/Swanson2.txt
Kostoff, R. N., & DeMarco, R. A. (2001). Information extraction from scientific literature with text mining. Analytical Chemistry (in press). Online: http://www.onr.navy.mil/sci_tech/special/technowatch/kdocs/anchem2/txt
Kostoff, R. N., del Rio, J. A., Humenik, J. A., Garcia, E. O., & Ramirez, A. M. (2001). Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of the American Society for Information Science, 52, 1148-1156.
Kostoff, R. N., Toothman, D. R., Eberhart, H. J., & Humenik, J. A. (2000). Text mining using database tomography and bibliometrics: A review. Online: http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm
KRDL (2001). Text mining: transforming raw text into actionable knowledge (white paper). Kent Ridge Digital Labs. Online: http://textmining.krdl.org.sg/
Laender, A. H. F., Ribeiro-Neto, B., da Silva, A. S., & Teixeira, J. S. (2001). A brief survey of web data extraction tools. In press.
Liddy, E. D. (2000). Text mining. Bulletin of the American Society for Information Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/liddy.html
Liddy, E. D. (2001). Data mining, meta-data, and digital libraries. DIMACS Workshop on Data Analysis and Digital Libraries, May 17, Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University, New Brunswick, NJ.
Lindsay, R. K., & Gordon, M. D. (1999). Literature-based discovery by lexical statistics. Journal of the American Society for Information Science, 50, 574-587.
Losee, R. M. (2001a). Natural language processing in support of decision-making: phrases and part-of-speech tagging. Information Processing and Management, 37, 769-787.
Losee, R. M. (2001b). Term dependence: A basis for Luhn and Zipf models. Journal of the American Society for Information Science, 52, 1019-1025.
Losiewicz, P., Oard, D. W., & Kostoff, R. N. (2000). Textual data mining to support science and technology management. Online:
http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
Marcus, M. (1995). New trends in natural language processing: Statistical natural language processing. Proceedings of the National Academy of Sciences, 92, 10052-10059.
Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State of the art and future prospects. Science, 293, 2051-2055.
Perez-Carballo, J., & Strzalkowski, T. (2000). Natural language information retrieval: Progress report. Information Processing and Management, 37, 155-178.
Perrin, P. (2001). Personal communication, Molecular Systems research group, Merck & Co., Inc., Rahway, NJ.
Qin, J. (2000). Working with data: Discovering knowledge through mining and analysis. Bulletin of the American Society for Information Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/qin.html
Rau, L. F. (1988). Conceptual information extraction and retrieval from natural language input. In RIAO 88, pp. 424-437. Paris: Centre des Hautes Etudes Internationales d'Informatique Documentaire, 1997, General Electric, USA.
Rindflesch, T. C., Hunter, L., & Aronson, A. R. (1999). Mining molecular binding terminology from biomedical text. Proceedings of the American Medical Informatics Association Symposium, 1999, 127-131. Online: http://www.amia.org/pubs/symposia/D005564.PDF
Rindflesch, T. C., Tanabe, L., Weinstein, J. N., & Hunter, L. (2000). EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pacific Symposium on
Biocomputing, 2000, 517-528.
Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.
Salton, G. (1992). The state of retrieval systems evaluation. Information Processing and Management, 28, 441-449.
Salton, G., Allan, J., Buckley, C., & Singhal, A. (1994). Automatic analysis, theme generation, and summarization of machine-readable texts. Science, 264, 1421-1426.
Saracevic, T. (2001). Personal communication and class discussions, Seminar in Information Studies, Rutgers University, School of Communication, Information and Library Studies, New Brunswick, NJ.
SDM (2001). Text mining 2002 [workshop prospectus]. Second SIAM International Conference on Data Mining, Arlington, VA, April 13, 2002. Online: http://www.cs.utk.edu/tmw02/
Sneiderman, C. A., Rindflesch, T. C., & Aronson, A. R. (1996). Finding the findings: identification of findings in medical literature using restricted natural language processing. Proceedings of the American Medical Informatics Association Annual Fall Symposium, 1996, 239-243.
Stapley, B. J. (2000). BioBiblioMetrics [On-line]. Available: http://www.bmm.icnet.uk/~stapleyb/biobib/
Stapley, B. J., & Benoit, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on Biocomputing, 2000, 529-540.
Swanson, D. R. (1988). Historical note: Information retrieval and the future of an
illusion. Journal of the American Society for Information Science, 39, 92-98.
Swanson, D. R., & Smalheiser, N. R. (1997). An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91, 183-203.
Swanson, D. R., & Smalheiser, N. R. (1999). Implicit text linkages between Medline records: Using Arrowsmith as an aid to scientific discovery. Library Trends, 48, 48-51.
Swanson, D. R., Smalheiser, N. R., & Bookstein, A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science and Technology, 52, 797-812.
Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L., & Weinstein, J. N. (1999). MedMiner: An Internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques, 27, 1210-1217.
Thomas, J., Milward, D., Ouzounis, C., Pulman, S., & Carroll, M. (2000). Automatic extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing, 2000, 541-552.
Wilbur, W. J., & Yang, Y. (1996). An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Computers in Biology and Medicine, 26(3), 209-222.
Wise, J. A. (1999). The ecological approach to text visualization. Journal of the American Society for Information Science, 50(13), 1224-1233.
Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann (Academic Press).
Table 1. Initial List of Information Retrieval (IR) Concepts Related to Text Mining.

IR concept: Authority (see References)
Artificial intelligence: Fan; Perrin
Bioinformatics: Futrelle; Perrin
Citation mining: Kostoff
Computational Linguistics: Fan; Hearst
Conceptual Graphs: KRDL
Data Abstraction: Fan
Data Mining: Fan; Perrin; SDM
Database Tomography: Kostoff
Document Mining: Fan
Domain Knowledge: KRDL
Electronic Commerce: Fan
Factor Analysis: SDM
Information Access: Hearst
Information Extraction: Chowdhury; Fan; Futrelle; Kostoff; Perrin
Information filtering: Fan
Information Integration: Fan
Information Retrieval: Fan; Perrin
Information Visualization/Mapping: Futrelle; Fan; SDM
Intelligent Agents ("bots"): Fan
Knowledge Discovery: Fan
Knowledge Extraction: Perrin
Knowledge Representation: Perrin
Language Identification: IBM
Machine Learning: Fan; Futrelle; Perrin
Metadata Generation: SDM
Natural language processing: Fan; Futrelle; Perrin; Rindflesch; Saracevic
Ontologies/Vocabularies/Lexicons: Futrelle
Phrase Extraction: Fan
Question Answering: Futrelle
Resource Discovery: Fan
Resource Indexing: Fan
Semantic Modeling: Perrin; SDM
Semantic Processing: Rindflesch
Statistical Language Modeling: Fan
Stemming: SDM
Syntactic Processing: Saracevic
Template Mining: Chowdhury; KRDL
Text Analysis: Futrelle; IBM
Text Classification/Categorization: Fan; Hearst (distinct); IBM; SDM
Text Clustering: Fan; IBM
Text Data Mining: Hearst; Kostoff
Text Parsing: SDM
Text Purification: SDM
Text Segmentation/"TextTiling": Hearst; SDM
Text Summarization: Futrelle; IBM; Saracevic; SDM
Text Understanding: Futrelle; Fan
Web Data Mining: Hearst
Web Mining: Fan
Web Utilization Mining: Fan
Figure 1. ThemeScape™ visualization of a collection of 4,314 Y2K debate forum documents (Cartia, 2000, expired website).
Figure 2. BioBiblioMetrics retrieval from a search on "DNA repair" and "recombination" (Stapley, 2000).