To extract all DOM nodes or all text content within a scope, the operator should set the block attribute for the property. The steps are as follows:
- Create a property
- Map a DOM node to the property via MetaStudio. The node and the DOM sub-tree below it define the HTML scope.
- Set rules to filter nodes within this scope.
Rules can be set in two ways:
- Specific: Select the Specific radio button to use pre-defined filtering rules. The following types of content can be filtered and extracted:
  - All: The whole DOM sub-tree is extracted, including the node that marks the boundary of the scope. This rule can be used to copy a fragment of an HTML document.
  - Image: All HTML IMG elements within the scope are extracted, for example to collect the images of an online gallery.
  - Text: All text content within the scope is extracted. The values of all text nodes are concatenated, with space characters as delimiters.
- General: Select the General radio button to define customized filtering rules. The operator enters an XPath expression that is relative to the node mapped to this property, then chooses whether HTML fragments or concatenated text are to be extracted by selecting the corresponding radio button. If HTML fragments are to be extracted, MetaStudio generates a data extraction rule containing an xsl:copy-of command. Otherwise, the rule contains an xsl:value-of command.
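As a rough illustration of what the three Specific rules select, the following Python sketch applies them to a small, well-formed fragment using only the standard library. It is not MetaStudio's implementation: the fragment, the function name extract and the decision to skip whitespace-only text nodes are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the <div> plays the role of the
# DOM node mapped to the property (the boundary of the scope).
FRAGMENT = """<div id="gallery">
  <h2>Holiday</h2>
  <p>Two photos: <img src="a.jpg"/> and <img src="b.jpg"/></p>
</div>"""


def extract(scope, rule):
    """Mimic the Specific rules All, Image and Text on one scope node."""
    if rule == "All":
        # Whole sub-tree, boundary node included, serialized as markup.
        return ET.tostring(scope, encoding="unicode")
    if rule == "Image":
        # Every img element inside the scope, in document order.
        return list(scope.iter("img"))
    if rule == "Text":
        # Text node values joined with single spaces. Dropping
        # whitespace-only nodes is an assumption of this sketch.
        chunks = (text.strip() for text in scope.itertext())
        return " ".join(chunk for chunk in chunks if chunk)
    raise ValueError(f"unknown rule: {rule}")


scope = ET.fromstring(FRAGMENT)
image_sources = [img.get("src") for img in extract(scope, "Image")]
text_content = extract(scope, "Text")
```

Here image_sources holds the src values of both images and text_content holds the concatenated text of the scope, while extract(scope, "All") returns the markup of the whole sub-tree starting at the div element.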
Note: To define customized filtering rules correctly, operators should be familiar with XPath and XSLT. Internally, MetaStudio generates one of the following expressions:
<xsl:copy-of select="XPath expression for the mapped-to node/customized XPath expression"/>
Or
<xsl:value-of select="XPath expression for the mapped-to node/customized XPath expression"/>
A customized filtering rule must therefore remain a valid XPath expression once it is inserted into one of the two expressions above. In the current release, free-form XSLT commands are not permitted. The feature cannot be used in scenarios where a DOM tree is to be extracted with some of its sub-trees excluded.
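The way the two expressions above are assembled can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function build_rule and the sample XPath values are hypothetical, and MetaStudio's real generator is not documented here. The sketch joins the XPath of the mapped-to node and the customized XPath with a slash, then picks the command according to the operator's choice.

```python
from xml.sax.saxutils import quoteattr


def build_rule(mapped_xpath, custom_xpath, as_html):
    """Return an XSLT command string for a General filtering rule.

    mapped_xpath: XPath of the node mapped to the property (assumed input).
    custom_xpath: operator's XPath, relative to the mapped node.
    as_html:      True for HTML fragments, False for concatenated text.
    """
    if not custom_xpath:
        raise ValueError("a customized XPath expression is required")
    command = "xsl:copy-of" if as_html else "xsl:value-of"
    # Same pattern as in the text: mapped-to node XPath / customized XPath.
    select = f"{mapped_xpath}/{custom_xpath}"
    # quoteattr adds the surrounding quotes and escapes XML special characters.
    return f"<{command} select={quoteattr(select)}/>"


# Hypothetical values for illustration only.
html_rule = build_rule("//div[@id='gallery']", ".//img", as_html=True)
text_rule = build_rule("//div[@id='gallery']", "h2", as_html=False)
```

With these sample values, html_rule is an xsl:copy-of command selecting all img descendants of the mapped div, and text_rule is an xsl:value-of command selecting its h2 children. Because the customized part is inserted verbatim, any syntax error in it makes the whole select expression invalid.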
GUI operations
On the Extraction Editor work board of MetaStudio, the block attribute can be set in two ways:
- Set the attribute in the property's attribute editing window. First select the property by ticking its top-left checkbox, then choose Property>>Edit from the right-click pop-up menu to open the attribute editing window. Once the block checkbox is ticked, the Filters groupbox appears.
- Set the attribute in the block editing window. Tick the block checkbox directly in the property mapping table to open a block editing window, in which the Filters groupbox is shown.
Both ways have the same effect when defining filtering rules. For a newly created property, the first way is preferable because all attributes can be set at the same time.
In the Filters groupbox, first select the Specific or General radio button to choose which type of rule to define. With Specific, filtering rules are set with a few mouse clicks. With General, the operator must enter an XPath expression.