Requirements often call for searching inside documents such as Microsoft Word, PDF, or XML files, or even long text stored in an Oracle table.
The usual answer is to buy an expensive tool on the market or to build a complex in-house solution that ties up many people on a long project.
There is another solution, rich in features, already within our reach.
Oracle Database, in its EE, SE, and PE editions, includes an option for indexing text and documents, embedded in or referenced from tables, with features for handling stop lists, prefix search, accents, case, references, XML, HTML, and tags.
This option is called Oracle Text and this presentation is an overview of its resources and features.
2. Agenda
• Introduction
• Oracle Text Overview
• Types of Index
• Text Query Application
• Document Presentation and Highlighting
• Document Samples
• Oracle Text Indexing Process
• Indexing Classes
• Examples
• Contains Operators
• POC
• Training & Reference
• Questions
3. Introduction
I am a forward-looking Information Systems Architect with a
solid Oracle DBA background comprising the daily
infrastructure tasks of the DBA, several projects as a Data
Modeler, and performance management projects.
I started in the mainframe business and soon took a deep dive
into application development for Oracle databases. After
acquiring an Oracle certification, I worked on performance
enhancement for applications using Oracle databases and then
spent several years as an infrastructure DBA. Later I worked
on data modeling projects and, more recently, on a performance
management project covering both the application and database layers.
4. “The limits of my language
mean the limits of my world.”
Ludwig Wittgenstein
5. What is Oracle Text
• A database option that extends text indexing
• It is a free option for Oracle DB (EE, SE, and PE)
• Has cataloging, referencing, and classification features
• Deals with tags, such as HTML or XML
• Extends indexing to:
  • Documents stored in tables or referenced from them
  • PDF, MS Word, XML, text, ...
  • Data types such as BLOB, BFILE, CLOB, LONG, ...
  • Even web pages, stored or referenced
7. Types of Index
Index: CONTEXT
Query operator: CONTAINS
Description: Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as Microsoft Word, HTML, XML, or plain text. You can customize your index in a variety of ways.

Index: CTXCAT
Query operator: CATSEARCH
Description: Use this index type to improve mixed query performance. Suitable for querying small text fragments with structured criteria like dates, item names, and prices that are stored across columns.

Index: CTXRULE
Query operator: MATCHES
Description: Use to build a document classification application. You create this index on a table of queries, where each query has a classification. Single documents (plain text, HTML, or XML) can be classified by using the MATCHES operator.
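The CTXCAT and CTXRULE index types can be illustrated with a short sketch. The tables `auction` and `queries`, the index set `auction_iset`, and all index names are hypothetical and are not part of the original slides; the sample rows exist only for illustration.

```sql
-- CTXCAT: mixed query over a short text column plus a structured column.
-- Hypothetical table, not from the slides.
CREATE TABLE auction (
  item_id NUMBER PRIMARY KEY,
  title   VARCHAR2(100),
  price   NUMBER
);

-- The index set lists the structured columns CATSEARCH may filter on.
BEGIN
  CTX_DDL.CREATE_INDEX_SET('auction_iset');
  CTX_DDL.ADD_INDEX('auction_iset', 'price');
END;
/

CREATE INDEX auction_titlex ON auction(title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('INDEX SET auction_iset');

-- Text criterion ('camera') combined with a structured criterion (price).
SELECT title, price
FROM   auction
WHERE  CATSEARCH(title, 'camera', 'price < 200') > 0;

-- CTXRULE: the index is built on a table of queries, one per category.
-- Hypothetical table, not from the slides.
CREATE TABLE queries (
  query_id     NUMBER,
  query_string VARCHAR2(80)
);

INSERT INTO queries VALUES (1, 'oracle');
INSERT INTO queries VALUES (2, 'database and performance');
COMMIT;

CREATE INDEX queryx ON queries(query_string)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify one document: returns the ids of the stored queries it satisfies.
SELECT query_id
FROM   queries
WHERE  MATCHES(query_string,
               'Oracle announced a new database release.') > 0;
```

Note the direction of the last query: with CONTAINS a query is run against many documents, whereas with MATCHES one document is run against many stored queries.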
9. Document Presentation and Highlighting
Output                                                 Procedure
Plain text version, no highlights                      CTX_DOC.FILTER
HTML version of document, no highlights                CTX_DOC.FILTER
Highlighted document, plain text version               CTX_DOC.MARKUP
Highlighted document, HTML version                     CTX_DOC.MARKUP
Highlight offset information for plain text version    CTX_DOC.HIGHLIGHT
Highlight offset information for HTML version          CTX_DOC.HIGHLIGHT
Theme summaries and gist of document                   CTX_DOC.GIST
List of themes in document                             CTX_DOC.THEMES
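As a rough illustration of the highlighting procedures, the sketch below calls CTX_DOC.MARKUP with an in-memory CLOB result. The index name `docs_idx`, the document key '1', the query term, and the `<<` / `>>` highlight tags are assumptions made for the example; it presumes a CONTEXT index on a table whose primary key identifies the document.

```sql
SET SERVEROUTPUT ON

DECLARE
  l_markup CLOB;  -- receives the marked-up document
BEGIN
  CTX_DOC.MARKUP(
    index_name => 'docs_idx',   -- hypothetical CONTEXT index
    textkey    => '1',          -- primary key of the document to mark up
    text_query => 'oracle',     -- terms to highlight
    restab     => l_markup,     -- in-memory result instead of a result table
    plaintext  => TRUE,         -- plain text version rather than HTML
    starttag   => '<<',         -- placed before each highlighted term
    endtag     => '>>');        -- placed after each highlighted term

  -- Print the first 200 characters of the highlighted text.
  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(l_markup, 200, 1));

  -- The result CLOB is a temporary LOB and should be released.
  DBMS_LOB.FREETEMPORARY(l_markup);
END;
/
```

CTX_DOC.FILTER and CTX_DOC.HIGHLIGHT follow a similar calling pattern; the Text Reference lists the exact parameters for each.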
12. Indexing Classes
Class           Description
Datastore       How are your documents stored?
Filter          How can the documents be converted to plain text?
Lexer           What language is being indexed?
Wordlist        How should stem and fuzzy queries be expanded?
Storage         How should the index data be stored?
Stop List       What words or themes are not to be indexed?
Section Group   How are document sections defined?
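Each class is configured by creating a named preference with CTX_DDL and then passing that name in the PARAMETERS clause of CREATE INDEX. The slides do not show how the preferences address_lx, address_wl, address_sl, and address_st used in the example DDL were defined, so the attribute values below (accent and case handling, prefix lengths, the stopword 'ST', and the tablespace `users`) are assumptions chosen only to illustrate the mechanism.

```sql
BEGIN
  -- Lexer: how text is broken into tokens.
  CTX_DDL.CREATE_PREFERENCE('address_lx', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('address_lx', 'BASE_LETTER', 'YES');  -- ignore accents
  CTX_DDL.SET_ATTRIBUTE('address_lx', 'MIXED_CASE', 'NO');    -- ignore case

  -- Wordlist: query expansion, here with a prefix index for 'ABC%' searches.
  CTX_DDL.CREATE_PREFERENCE('address_wl', 'BASIC_WORDLIST');
  CTX_DDL.SET_ATTRIBUTE('address_wl', 'PREFIX_INDEX', 'TRUE');
  CTX_DDL.SET_ATTRIBUTE('address_wl', 'PREFIX_MIN_LENGTH', '3');
  CTX_DDL.SET_ATTRIBUTE('address_wl', 'PREFIX_MAX_LENGTH', '6');

  -- Stoplist: words that are not indexed (assumed example word).
  CTX_DDL.CREATE_STOPLIST('address_sl', 'BASIC_STOPLIST');
  CTX_DDL.ADD_STOPWORD('address_sl', 'ST');

  -- Storage: where the index tables are created (assumed tablespace).
  CTX_DDL.CREATE_PREFERENCE('address_st', 'BASIC_STORAGE');
  CTX_DDL.SET_ATTRIBUTE('address_st', 'I_TABLE_CLAUSE', 'TABLESPACE users');
END;
/
```

Preferences must exist before the CREATE INDEX statement that names them, and changing one later generally means rebuilding the index.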
14. Example DDL and Query
Index creation and statistics:

DROP INDEX dbapp.IDX_st_address_3;
CREATE INDEX dbapp.IDX_st_address_3 ON
dbapp.st_address(NOM_st_address)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER address_lx
WORDLIST address_wl
STOPLIST address_sl
STORAGE address_st')
PARALLEL 8;
COMMIT;

BEGIN
SYS.DBMS_STATS.GATHER_TABLE_STATS (
OwnName => 'dbapp'
,TabName => 'st_address'
,Estimate_Percent => NULL
,Method_Opt => 'FOR ALL INDEXED COLUMNS SIZE AUTO '
,Degree => 8
,Cascade => TRUE
,No_Invalidate => FALSE);
END;
/

First query:

SET DEFINE OFF;
SELECT NOM_st_address
FROM dbapp.st_address
WHERE CONTAINS (NOM_st_address, 'ST&MAJOR&OSCAR&STONE', 1) > 0;

Plan
SELECT STATEMENT CHOOSE Cost: 18 Bytes: 118 Cardinality: 1
2 TABLE ACCESS BY INDEX ROWID dbapp.st_address_3 Cost: 18 Bytes: 118 Cardinality: 1
1 DOMAIN INDEX dbapp.st_address_3 Cost: 15

NOM_st_address
--------------------------------------------------
OSCAR STONE MAJOR

Second query:

SET DEFINE OFF;
SELECT NOM_st_address
FROM dbapp.st_address
WHERE CONTAINS (NOM_st_address, 'ST&OSCAR&STONE', 1) > 0;

NOM_st_address
--------------------------------------------------
JOSE OSCAR STONE
........
OSCAR STONE MAJOR
OSCAR WEBBER STONE
28 rows selected.
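The third argument of CONTAINS in the queries above is a label that ties the predicate to a SCORE call, which makes it possible to rank the results. A minimal sketch, reusing the table and index from this slide; the prefix term STON% and the explicit synchronization call are illustrative additions that do not appear in the original.

```sql
-- Keep SQL*Plus from treating & as a substitution variable.
SET DEFINE OFF;

-- Rank matches by relevance; SCORE(1) refers to the CONTAINS call labeled 1.
-- STON% is a prefix search: any token beginning with STON.
SELECT SCORE(1) AS relevance, NOM_st_address
FROM   dbapp.st_address
WHERE  CONTAINS (NOM_st_address, 'OSCAR & STON%', 1) > 0
ORDER  BY SCORE(1) DESC;

-- By default a CONTEXT index does not reflect inserts and updates
-- until it is synchronized.
BEGIN
  CTX_DDL.SYNC_INDEX('dbapp.IDX_st_address_3');
END;
/
```

How often to synchronize is a design choice: frequent syncs keep searches current but fragment the index, so a periodic job is a common compromise.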
16. Training
Resources on the Oracle website
• Text Application Developer's Guide
http://docs.oracle.com/cd/B10501_01/text.920/a96517/toc.htm
• Text Reference
http://docs.oracle.com/cd/B10501_01/text.920/a96518/toc.htm