Clues can be classified from two viewpoints: (1) how DataScraper uses them, and (2) how they are extracted.
Clue types on how to use clues
The clues are classified into the following two types:
- In-thread clue: a clue that is extracted and consumed within the same data extraction task executed by DataScraper. For example, suppose a forum's topics are paginated and a link named "Next Page" takes users to the next page. DataScraper tries to extract all topics in a single task, so it creates an in-thread clue each time it turns a page. All of these clues are produced and consumed within the task and are never stored in DataStore's database.
- New-thread clue: a clue that is extracted from the current page and stored in DataStore's database. As a result, a new SpiderClue record with the status value start is inserted into the table. Clues of this type are not consumed by the current data extraction task. When DataScraper starts a new data extraction task, it retrieves a clue with status start from the DataStore server.
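The difference between the two types can be illustrated with a minimal Python sketch. It is not DataScraper's actual implementation: the function run_task, the callback fetch, and the list standing in for the SpiderClue table are all hypothetical names chosen for illustration. Only the record name SpiderClue and the status value start come from the description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SpiderClue:
    """Stand-in for a SpiderClue record (fields other than status are assumed)."""
    url: str
    theme: str
    status: str = "start"


def run_task(start_url, fetch, clue_table):
    """Run one extraction task.

    fetch(url) is a hypothetical callback returning a tuple
    (records, in_thread_urls, new_thread_clues), where new_thread_clues
    is a list of (url, theme) pairs.
    """
    pending = deque([start_url])   # in-thread clues live only in memory
    seen = {start_url}
    results = []
    while pending:
        url = pending.popleft()
        records, in_thread, new_thread = fetch(url)
        results.extend(records)
        # In-thread clues are consumed by this same task.
        for next_url in in_thread:
            if next_url not in seen:
                seen.add(next_url)
                pending.append(next_url)
        # New-thread clues are stored with status "start" for a later task.
        for clue_url, theme in new_thread:
            clue_table.append(SpiderClue(clue_url, theme))
    return results
```

In this sketch a "Next Page" link would be returned in in_thread_urls and followed immediately, while links to pages of another theme would be returned in new_thread_clues and only recorded.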
Clue types on how to extract clues
The clues are classified into the following five types:
- Info Clue: Clues of this type are extracted using data extraction rules instead of clue extraction rules. They can be viewed as by-products of extracting data snippets from the Web.
- Single Clue: Clues of this type are extracted from fixed locations in the target HTML page, so changes to the structure of the HTML document strongly affect the rules for Single Clues.
- Marker Clue: Clues of this type are found under specific marks within a specific scope of the target HTML page. For example, if a link "Next Page" is used to navigate to another page, the string "Next Page" can act as a mark. A clue can then be extracted from the href attribute of the HTML A element that encloses the string.
- Pattern Clue: Clues of this type are extracted within a specific scope of the target HTML page by matching specific patterns in URLs. For example, all URLs within a specific scope that contain the string "http://www.gooseeker.com/en" are extracted.
- Relative Clue: Clues of this type are extracted within a specific scope of the target HTML page by referring to specific HTML DOM nodes. For example, if topics in a forum are paginated, the labels [1] [2] [3] .. are presented for navigating between pages, and the current page number is [2], then the DOM node denoting [2] is selected as the referee and a clue can be extracted from the DOM node denoting [3]. The referee and the referrer must be siblings. Compared with Marker Clues, clues of this type are more sensitive to changes in the HTML page's structure, so this type should only be used when no marker clue is available.
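The Marker, Pattern and Relative types above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the rule engine that MetaStudio generates: the function names, the sample snippet and its URL paths are made up, the scope is taken to be a single parent element, the pattern is matched as a URL prefix, and the referee is located by its text rather than by a mapped DOM node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pager fragment acting as the "scope".
SNIPPET = """
<div id="pager">
  <span>[1]</span>
  <b>[2]</b>
  <a href="http://www.gooseeker.com/en/forum?page=3">[3]</a>
  <a href="http://www.gooseeker.com/en/forum?page=3">Next Page</a>
  <a href="http://example.com/ads">Sponsor</a>
</div>
"""


def marker_clues(scope, mark, full_match=True):
    """Marker Clue: href of each A element whose text matches the mark."""
    hits = []
    for a in scope.iter("a"):
        text = (a.text or "").strip()
        matched = (text == mark) if full_match else (mark in text)
        if matched and a.get("href"):
            hits.append(a.get("href"))
    return hits


def pattern_clues(scope, prefix):
    """Pattern Clue: every href in the scope that starts with the prefix."""
    return [a.get("href") for a in scope.iter("a")
            if a.get("href", "").startswith(prefix)]


def relative_clue(scope, current_text):
    """Relative Clue: href of the sibling that follows the referee node."""
    children = list(scope)  # direct children, so all of them are siblings
    for i, node in enumerate(children[:-1]):
        if (node.text or "").strip() == current_text:
            return children[i + 1].get("href")
    return None


scope = ET.fromstring(SNIPPET)
next_by_marker = marker_clues(scope, "Next Page")
site_links = pattern_clues(scope, "http://www.gooseeker.com/en")
next_by_relation = relative_clue(scope, "[2]")
```

Note that relative_clue depends on the position of nodes inside the parent element, which is why this type is more sensitive to layout changes than marker_clues, which depends only on the link text.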
Clue's attributes
Clues of any of the above five types can be either in-thread or new-thread clues. When a new clue is created on the Clue Editor work board, its type is chosen by selecting one of five radio buttons, one for each type. After a radio button is selected, focus moves to the tab window named after that type. This window presents GUI elements for operators to set the clue's attributes; the other tab windows are empty.
- Info Clue
- Property Name: is not editable. Its value is the concatenation of the bucket name and the property name, separated by the character ".".
- Target Theme: is filled in automatically by MetaStudio with the name of the current theme by default. In most cases the target theme differs from the current one, so operators should change the default. When DataScraper finds clues of this type, it tells the MetaCamp server to allocate a new theme. If the theme has already been allocated, MetaCamp does nothing. Otherwise the MetaCamp server inserts a new record, whose status is torecognize, into the theme table. Operators can push the query button to check whether a name has already been allocated.
- Single Clue
- Target Theme: is filled in automatically by MetaStudio with the name of the current theme by default. In most cases the target theme differs from the current one, so operators should change the default. When DataScraper finds clues of this type, it tells the MetaCamp server to allocate a new theme. If the theme has already been allocated, MetaCamp does nothing. Otherwise the MetaCamp server inserts a new record, whose status is torecognize, into the theme table. Operators can push the query button to check whether a name has already been allocated.
- Marker Clue
- Marker Row No.: is the serial number, as shown in the DOM tree viewer, of the DOM node acting as the mark. The number cannot be entered manually. It is set via marker mapping, by choosing Clue Mapping>>Marker Mapping from the right-click pop-up menu over the DOM tree viewer. Only a text node enclosed by an HTML A element can be mapped.
- Marker Value: can be entered manually or set by marker mapping. In the DOM tree viewer, after a DOM node has been selected, choose Clue Mapping>>Marker Mapping from the right-click pop-up menu to map the mark, so that its value is entered into the Marker Value edit box automatically. In some cases the value should be modified, as described in detail in MetaStudio User's Guide#Map clues.
- Matching rule of Marker Value: is set by ticking the checkbox. When the checkbox is ticked, a full match rule in the format [path]="[value]" is used. When it is not ticked, a partial match rule in the format contains([path],"[value]") is used.
- Target Theme: is filled in automatically by MetaStudio with the name of the current theme by default. In most cases the target theme differs from the current one, so operators should change the default. When DataScraper finds clues of this type, it tells the MetaCamp server to allocate a new theme. If the theme has already been allocated, MetaCamp does nothing. Otherwise the MetaCamp server inserts a new record, whose status is torecognize, into the theme table. Operators can push the query button to check whether a name has already been allocated.
- Pattern Clue: A URL pattern is a string of characters that should appear in the target URLs. It is a method for picking URLs within an HTML scope. On most sites, URLs with the same pattern point to pages with the same data schema. If a site breaks this principle, this method should not be used. In the current release, the pattern must appear as a prefix of the URLs. On MetaStudio's Clue Editor work board, multiple patterns can be defined and placed in the pattern list. Right-clicking the list pops up a menu for managing the patterns, e.g. inserting or deleting a pattern. A pattern record has the following three fields:
- No.: is the serial number of the pattern, which plays no part in calculating clue extraction rules. To the left of the number is a checkbox; ticking it selects the pattern record for other operations, e.g. deletion.
- Loc Prefix: is the pattern string that appears as a prefix of URLs. The string can be entered manually or set by pattern mapping. In the DOM tree viewer, after a DOM node has been selected, choose Clue Mapping>>Pattern Mapping>>xxx from the right-click pop-up menu to map a pattern, where xxx is the serial number of the pattern. During mapping, MetaStudio extracts the value of the href attribute of an HTML A element and enters it into this edit box. The operator may have to delete the part of the value that follows the pattern string.
- Target Theme: is filled in automatically by MetaStudio with the name of the current theme by default. In most cases the target theme differs from the current one, so operators should change the default. When DataScraper finds clues of this type, it tells the MetaCamp server to allocate a new theme. If the theme has already been allocated, MetaCamp does nothing. Otherwise the MetaCamp server inserts a new record, whose status is torecognize, into the theme table. Operators can push the query button to check whether a name has already been allocated.
- Relative Clue: is mainly used by DataScraper to turn pages, so clues of this type are usually in-thread. Clues of this type have the following attributes:
- Current Node: is the DOM node that contains a value denoting the current page's number. After the node has been located in the DOM tree viewer, choose Clue Mapping>>Relation Mapping>>Current Node from the right-click pop-up menu to map the referee. As a result, the node's serial number appears after the label Current Node.
- Next Node: is the DOM node that contains a value denoting the next page's number. The node should be the older sibling of the referee node. It is often an HTML A element. After the node has been located in the DOM tree viewer, choose Clue Mapping>>Relation Mapping>>Relative Node from the right-click pop-up menu to map the referrer. As a result, the node's serial number appears after the label Relative Node.
- Target Theme: is filled in automatically by MetaStudio with the name of the current theme by default. In most cases the target theme differs from the current one, so operators should change the default. When DataScraper finds clues of this type, it tells the MetaCamp server to allocate a new theme. If the theme has already been allocated, MetaCamp does nothing. Otherwise the MetaCamp server inserts a new record, whose status is torecognize, into the theme table. Operators can push the query button to check whether a name has already been allocated.
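Two behaviours described in the attribute lists above can be sketched in Python: the two formats of the Marker Value matching rule, and the allocate-if-absent handling of a Target Theme. Both functions are hypothetical illustrations. The real rule generator and the MetaCamp server are not shown in this document, the path text() and the theme name forum_topic are made-up examples, and a plain dictionary stands in for the theme table.

```python
def marker_predicate(path, value, full_match):
    """Build the matching rule for a Marker Value.

    Ticked checkbox   -> full match:    [path]="[value]"
    Unticked checkbox -> partial match: contains([path],"[value]")
    Values that themselves contain a double quote are not handled here.
    """
    if full_match:
        return f'{path}="{value}"'
    return f'contains({path},"{value}")'


def allocate_theme(theme_table, name):
    """Insert a theme with status torecognize unless it already exists.

    Returns True when a new record was inserted and False when the
    name had already been allocated (in which case nothing changes).
    """
    if name in theme_table:
        return False
    theme_table[name] = {"status": "torecognize"}
    return True


full_rule = marker_predicate("text()", "Next Page", True)
partial_rule = marker_predicate("text()", "Next", False)

themes = {}
allocate_theme(themes, "forum_topic")   # inserts a new record
allocate_theme(themes, "forum_topic")   # already allocated, no change
```

Repeating the allocation call is harmless in this sketch, which mirrors the statement that MetaCamp does nothing when the theme has already been allocated.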