Nov 29, 2010

Oracle Text Search - Index tuning for handling HTML content

We have used Oracle Text Search in our web application's secure messaging module for years. Recently, when we added rich text formatting to that module, we discovered that the Text Search feature was no longer working correctly because of the HTML tags inside the CLOB content.

So we had to fix the index to make search work again. Basically, we added a few parameters when creating the index so that it handles HTML properly.


--old
CREATE INDEX "MSG_CONTENT_TEXT_I" ON "MESSAGE_CONTENT" ("CONTENT") INDEXTYPE IS "CTXSYS"."CONTEXT";

--corrected one (drop the old index first if it still exists, since the name is reused)
CREATE INDEX "MSG_CONTENT_TEXT_I" ON "MESSAGE_CONTENT" ("CONTENT") INDEXTYPE IS "CTXSYS"."CONTEXT" PARAMETERS ('FILTER CTXSYS.NULL_FILTER SECTION GROUP CTXSYS.HTML_SECTION_GROUP');


Where,
* NULL_FILTER: No filtering required. Use for indexing plain text, HTML, or XML documents.
* HTML_SECTION_GROUP: Use this group type for indexing HTML documents and for defining sections in HTML documents. With it, the indexer treats HTML tags as markup instead of indexing them as searchable text.
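
To check that the rebuilt index behaves as expected, a query along the following lines can be run. This is only a sketch: the ID column and the search term 'invoice' are hypothetical, and only the table, column, and index names come from the statements above. It also assumes the index was created without a SYNC parameter, so it has to be synchronized manually after new rows are added.

```sql
-- Bring the index up to date after inserts or updates.
-- Requires EXECUTE privilege on CTX_DDL.
BEGIN
  CTX_DDL.SYNC_INDEX('MSG_CONTENT_TEXT_I');
END;
/

-- ID is a hypothetical key column; 'invoice' is an example search term.
-- The label 1 ties SCORE(1) to this CONTAINS clause.
SELECT ID, SCORE(1) AS RELEVANCE
  FROM MESSAGE_CONTENT
 WHERE CONTAINS(CONTENT, 'invoice', 1) > 0
 ORDER BY SCORE(1) DESC;
```

If the section group is doing its job, a search for a word that only appears as a tag or attribute name should no longer match messages that merely contain that markup.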

Life is good again!

Here is a good reference document:
http://download.oracle.com/docs/cd/B19306_01/text.102/b14218/cdatadic.htm