Clue Extraction Instruction File

A Clue Extraction Instruction File, also called an SCE file, is used by DataScraper to extract clues from target Web pages. Files of this type are stored on the DataStore server in the folder $CATALINE/work/DataStore/context/extraction/config/<theme_name>/, and their names end with .sce.xml. The structure of these files is as follows:

<?xml version="1.0" encoding="UTF-8"?>

<spider-clue-extraction>

<theme>testTheme</theme>

<scope>

<path-type>a</path-type>

<path>

/html/body/p/a

<context>//*[@id='blueFrame']</context>

<context>//*[@id='rightFrame']</context>

</path>

<relative>//*[@id='listbottom']</relative>

<clue-type>newthread</clue-type>

<target-theme>

<name>newTheme</name>

<url-prefix><![CDATA[1]]></url-prefix>

<prefix-position>hostname+pathname</prefix-position>

</target-theme>

</scope>

</spider-clue-extraction>

Where:

path-type can take one of the following values:
- a: the XPath expression contained in path locates an HTML A element.
- href: the XPath expression contained in path locates the href attribute of an HTML A element.
- scope: not used.
- relative: used for clues of type Relative.
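
As a rough illustration of the difference between the a and href path types, the following Python sketch applies both to a small, made-up XHTML fragment. It uses only the standard library's xml.etree.ElementTree, whose limited XPath support cannot select attributes directly, so the href case is emulated by splitting off a trailing /@href step. The helper name extract_urls, the sample page, and this emulation are assumptions made for illustration; they are not DataScraper's actual implementation.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page used only for this illustration.
PAGE = """<html><body>
<p><a href="/thread/1">First</a></p>
<p><a href="/thread/2">Second</a></p>
</body></html>"""


def extract_urls(page, path_type, path):
    """Return the URLs that a rule with the given path-type would yield.

    Hypothetical helper: it only mimics the documented meaning of the
    'a' and 'href' values and is not part of DataScraper.
    """
    root = ET.fromstring(page)
    if path_type == "href":
        # The path points at the attribute itself, e.g. /html/body/p/a/@href.
        element_path, _, attribute = path.rpartition("/@")
    elif path_type == "a":
        # The path points at the A element; the URL is read from its href.
        element_path, attribute = path, "href"
    else:
        raise ValueError(f"unsupported path-type: {path_type}")
    # Assumption: the path is absolute and starts at the root element.
    # ElementTree paths are relative to the root, so drop the "/html" step.
    relative = "." + element_path[len("/" + root.tag):]
    return [a.get(attribute) for a in root.findall(relative)]


print(extract_urls(PAGE, "a", "/html/body/p/a"))
print(extract_urls(PAGE, "href", "/html/body/p/a/@href"))
```

In this sketch both calls yield the same list of link targets; the only difference is whether the XPath expression stops at the A element or continues to its href attribute.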
