ctxhx process and Oracle Text indexes [message #680343] |
Fri, 08 May 2020 04:51 |
John Watson
Messages: 8944 Registered: January 2010 Location: Global Village
|
Senior Member |
|
|
I have a process $OH/ctx/bin/ctxhx that has been taking 100% of a core for several hours. It seems to be associated with this call (object names obfuscated),
CREATE INDEX XXX_docs_content ON XXX_docs(blob_content) INDEXTYPE IS CTXSYS.CONTEXT
There are about 45000 rows in the table, total file size is 39GB. I can't find much about this ctxhx process, but I think it is used to extract text from the blob docs which are mostly .PDF with some .DOCX too.
Is there some way to monitor what is going on? Or predict how long it might take? There is nothing in v$session_longops, there doesn't seem to be much happening in terms of segment space usage, no logs I can find.
Thank you for any insight.
|
|
|
Re: ctxhx process and Oracle Text indexes [message #680349 is a reply to message #680343] |
Fri, 08 May 2020 16:13 |
|
Barbara Boehmer
Messages: 9097 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Ctxhx.exe is used to convert files to html, so that Oracle Text can tokenize them to index them. For example, if I run:
host ctxhx "Alpha List.pdf" test.txt
then I get an html file in test.txt. If you suspect a problem with a particular file or file type, then you can use this to test such a file.
To monitor the progress of indexing, you can:
EXEC CTX_OUTPUT.START_LOG ('test.log')
CREATE INDEX XXX_docs_content ON XXX_docs(blob_content) INDEXTYPE IS CTXSYS.CONTEXT
/
EXEC CTX_OUTPUT.END_LOG
While this is running, you can display the contents of test.log. On my system I can type it from a Windows command prompt. On my system it is in:
C:\app\baboehme\product\12.1.0\dbhome_2\ctx\log\test.log
It may be in a different location on your system, depending on operating system, Oracle version, and whether you have changed the default log_directory parameter.
There are also other parameters that can be set for the ctx_output besides just start_log and end_log.
I suspect your problem is with some unrecognized document format, such as something from a very new version of pdf of docx that the Oracle filter that uses ctxhx does not yet support.
|
|
|
|
Re: ctxhx process and Oracle Text indexes [message #680516 is a reply to message #680361] |
Tue, 19 May 2020 10:45 |
John Watson
Messages: 8944 Registered: January 2010 Location: Global Village
|
Senior Member |
|
|
Just in case it is ever useful to anyone, the job took about 9 hours to index about 45000 blobs totalling about 35GB. That was on an AWS t3a.medium instance. During that time, the ctxhx process was running continuously (or rather, continually starting, running,stopping, re-starting) and occupied 100% of one vCPU. Very little activity in the DB. I'll upload the log, which is a trace file nowadays: you can give it a name, but that gets ignored.
|
|
|