Clue Extraction Instruction File

A Clue Extraction Instruction File, also called an SCE file, is used by DataScraper to extract clues from target Web pages. Files of this type are stored in the DataStore server's folder $CATALINE/work/DataStore/context/extraction/config/<theme_name>/, and their names carry the suffix .sce.xml. Their structure is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<spider-clue-extraction>
<theme>testTheme</theme> <!-- theme name -->
<scope> <!-- scope or juston(not used) -->
<from>HTML</from> <!-- Where clues are extracted from. HTML means from Web pages; transDOM means from intermediate DOMs -->
<path-type>a</path-type> <!-- explained in the list following this XML sample -->
<path> <!-- an XPath expression to locate clues -->
/html/body/p/a
<context>//*[@id='blueFrame']</context> <!-- If the target clues are enclosed in nested HTML IFRAME/FRAME elements, the context elements specify all of them in nesting order. -->
<context>//*[@id='rightFrame']</context>
</path>
<relative>//*[@id='listbottom']</relative> <!-- If path-type is relative, this denotes the parent element. -->
<clue-type>newthread</clue-type> <!-- clue types: newthread or inthread -->
<target-theme> <!-- If the clue type is newthread, the name of the new theme must be assigned. -->
<name>newTheme</name>
<url-prefix><![CDATA[1]]></url-prefix> <!-- For clues of type Pattern Clue, this is the prefix of the target URLs. -->
<prefix-position>hostname+pathname</prefix-position> <!-- not used -->
</target-theme>
</scope>
</spider-clue-extraction>

Where

  • path-type can take one of the following values:
    • a: the XPath expression in path locates an HTML A element.
    • href: the XPath expression in path locates the href attribute of an HTML A element.
    • scope: not used.
    • relative: used for clues of type Relative; the parent element is given by relative.
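
As an illustration of the href value, the fragment below sketches a scope whose path selects the href attribute directly instead of the A element. The XPath form /html/body/p/a/@href is an assumption made for this example, and the theme names are placeholders reused from the sample above. Whether DataScraper accepts a file that leaves out the context, relative, url-prefix and prefix-position elements is also an assumption; they are omitted here only because the sample describes them as conditional or not used.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<spider-clue-extraction>
  <theme>testTheme</theme> <!-- placeholder theme name -->
  <scope>
    <from>HTML</from> <!-- extract clues from Web pages -->
    <path-type>href</path-type> <!-- path locates the href attribute, not the A element -->
    <path>
      /html/body/p/a/@href <!-- assumed XPath form for selecting the attribute -->
    </path>
    <clue-type>newthread</clue-type>
    <target-theme>
      <name>newTheme</name> <!-- placeholder name of the new theme -->
    </target-theme>
  </scope>
</spider-clue-extraction>
```

With path-type set to a, the same clues would instead be addressed by the element path /html/body/p/a, as in the sample above.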