Map property

After the bucket has been built, MetaStudio must be told which DOM node the data for each property will be extracted from. We call this process mapping, or property mapping, to distinguish it from clue mapping, which is described in the next chapter.

To make data extraction rules more robust against changes in Web page structure, MetaSeeker does not rely only on locating a DOM node absolutely. It provides the following three mapping approaches:

  • Data mapping maps the DOM node holding the data to be extracted directly to a property. That is, the data is located absolutely. Extraction rules generated this way are prone to failure when the page structure changes.
  • FreeFormat mapping maps a DOM node with FreeFormat marks to a property. The data is not necessarily extracted directly from this DOM node; the node acts mainly as a reference. Data mapping can be performed in addition, to tell MetaSeeker exactly where to find the data. This approach is more robust.
  • Replica mapping maps two sibling DOM nodes to one property so that the duplication parameters of the multiple instances to be extracted can be calculated. The property must be a container node.
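The difference in robustness between absolute locating and mark-based locating can be illustrated with a small sketch. This is not MetaSeeker's implementation: it uses Python's standard xml.etree.ElementTree module, the page snippets and path expressions are invented for the example, and it assumes, purely for illustration, that a FreeFormat mark is an attribute such as class.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a redesign.
OLD_PAGE = """
<html><body>
  <div><span class="price">19.90</span></div>
</body></html>
"""

# The same page after a redesign: an extra wrapper <div> was inserted.
NEW_PAGE = """
<html><body>
  <div><div><span class="price">19.90</span></div></div>
</body></html>
"""

# Position-based path, comparable to locating the node absolutely.
ABSOLUTE_PATH = "./body/div/span"
# Mark-based path (assumption: the mark is the class attribute).
MARK_PATH = ".//*[@class='price']"


def extract(page, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    root = ET.fromstring(page.strip())
    node = root.find(path)
    return node.text if node is not None else None


print(extract(OLD_PAGE, ABSOLUTE_PATH))  # found on the old page
print(extract(NEW_PAGE, ABSOLUTE_PATH))  # no longer found after the redesign
print(extract(NEW_PAGE, MARK_PATH))      # still found via the mark
```

The absolute path stops matching as soon as one level is added to the page, while the mark-based path keeps working as long as the mark itself is unchanged.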

All of the above operations are started from the right-click pop-up menu of the DOM tree viewer and are described in detail in the following sections.



Map data

The Map Data item of the right-click pop-up menu over the DOM tree viewer is constructed dynamically. Its sub-menu items correspond to the leaf nodes in the bucket. After selecting a DOM node in the DOM tree viewer, click one of the sub-menu items to map this DOM node to the corresponding property. As a result, the column node in the property edit area displays the serial number of the DOM node.

There are two ways to select a DOM node in the DOM tree viewer:

  • Ordinary selection: Expand the DOM tree level by level until the target node is found.
  • Reverse selection: In the embedded browser, click the target data snippet. The DOM tree is expanded and the corresponding DOM node is selected. This function is disabled by default. To enable it, tick the Reverse Selection checkbox on the tool bar. From then on, the mouse click event handler is overridden by a customized one that positions the target HTML node on the tree.

Although text nodes are mapped to properties in most cases, other nodes, e.g. elements and attributes, can also be mapped. Which types of node can be mapped is determined by the attributes of the target property, as described in detail in the MetaStudio Senior User's Handbook. If a mapping is invalid, an alert window pops up to show the reason. If the root cause still cannot be found, please ask the community for help on the MetaSeeker Toolkit forum or contact us directly.

Note: Nesting of HTML nodes may reduce the positioning precision of reverse selection. In some cases the node found may be an ancestor of the target node, so the user must make sure that the selected node is the wanted one. MetaStudio provides a convenient aid for this check: the border of the selected node flashes in red three times, so the user only needs to watch which border flashes.



Map FreeFormat

If the DOM node from which the data will be extracted has FreeFormat marks, FreeFormat mapping can be performed from this node to the property. Alternatively, if the DOM node does not have a FreeFormat mark but one of its ancestors does, the ancestor node can be mapped instead. The mapping operation can be performed whether or not the property is a container node.

Note: The scope on the Web page in which FreeFormat marks are searched for is limited. Every container in the bucket represents a block on the page, delimited by its outermost element. This block is the scope to be searched. If the user maps from a DOM node outside this scope, an alert window pops up while MetaStudio is calculating the data extraction rules.
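The rule described above (use the node's own mark, otherwise fall back to a marked ancestor, and never leave the container's block) can be sketched as follows. This is only an illustration under stated assumptions, not MetaStudio's actual algorithm: the function name find_marked_node is hypothetical, and the example assumes that a FreeFormat mark is a class or id attribute.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: these attributes count as FreeFormat marks.
MARK_ATTRIBUTES = ("class", "id")


def find_marked_node(scope_root, target):
    """Return target or its nearest marked ancestor inside scope_root.

    Returns None if target lies outside the scope or if no node on the
    way up to scope_root carries a mark.
    """
    # ElementTree nodes have no parent pointer, so build a child -> parent map
    # restricted to the scope (the block delimited by the outermost element).
    parent_of = {child: parent for parent in scope_root.iter() for child in parent}
    if target is not scope_root and target not in parent_of:
        return None  # the node was selected outside the container's block
    node = target
    while node is not None:
        if any(attr in node.attrib for attr in MARK_ATTRIBUTES):
            return node
        node = parent_of.get(node)  # None once scope_root has been passed
    return None


page = ET.fromstring(
    "<body>"
    "<div id='outside'><p>x</p></div>"
    "<ul class='item'><li><b>Title</b></li></ul>"
    "</body>"
)
container = page.find("ul")          # block represented by a container
target = container.find("./li/b")    # unmarked node holding the data
outside = page.find("./div/p")       # node outside the container's block

print(find_marked_node(container, target) is container)  # marked ancestor is used
print(find_marked_node(container, outside))               # out of scope
```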

Use one of the two approaches described in the section above to select a DOM node. Right-click over the DOM tree viewer and click Map FreeFormat in the pop-up menu to open a sub-menu made up of the names of the properties. Click one of the sub-menu items to map the FreeFormat mark on the selected DOM node to that property. On the bucket structure tree, the fields FreeFormat and Type are filled with the value and the type of the FreeFormat mark respectively.

After FreeFormat has been mapped, data mapping can still be performed on the property. As stated before, FreeFormat marks act as references and help to extract data precisely. If data mapping has not been performed, the FreeFormat mapping operation automatically sets the block attribute for the property with the filter Text, meaning that all textual content enclosed by this element will be extracted. If this would extract too much useless content, the user can cancel the block attribute and perform exact data mapping instead.
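The difference between extracting all text enclosed by an element and extracting from one exactly mapped node can be pictured with a short sketch. It uses Python's standard xml.etree.ElementTree module on an invented snippet and only mimics the effect described above; how MetaSeeker implements the block attribute internally is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked element containing the wanted data plus extra text.
block = ET.fromstring(
    '<div class="product">Widget <span class="price">19.90</span> '
    "<em>free shipping</em></div>"
)

# Block-style extraction: every piece of text enclosed by the element.
block_text = "".join(block.itertext())

# Exact extraction: only the text of the one node that holds the data.
exact_text = block.find("./span").text

print(block_text)  # Widget 19.90 free shipping
print(exact_text)  # 19.90
```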



Map replicas

As stated before (see the words in italics in the section Steps), the approach to mapping replicas has changed greatly. Currently, only container nodes can be mapped for extracting multiple instances. If FreeFormat marks can be mapped to the container nodes, they are preferred because they improve the robustness of the data extraction rules.

If replicas are to be used to extract multiple instances, they must first be enabled for the specific property by taking the following steps:

  • Select the container node on the bucket structure tree
  • Right-click over the Replica Management area to pop up a menu. If replicas have not been enabled, the menu label is Enable 2nd; otherwise it is Disable 2nd. At this point the label should be Enable 2nd. After it is clicked, the color of the area changes, indicating that replicas are enabled.
  • On the DOM tree viewer, select the DOM node representing the first instance. Right-click and choose the pop-up menu item Map Replica->First. As a result, the serial number of the DOM node is displayed after the label, 0, of the first radio button.
  • On the DOM tree viewer, select the sibling DOM node representing the second instance. Right-click and choose the pop-up menu item Map Replica->Second. As a result, the serial number of the DOM node is displayed after the label, 1, of the second radio button.
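Why two sibling nodes are enough to describe all instances can be sketched in a few lines. The example below is a conceptual illustration only: it assumes that the duplication parameters amount to a start position and a step among the children of the common parent, which the handbook does not confirm, and the function name replica_instances is hypothetical.

```python
import xml.etree.ElementTree as ET


def replica_instances(parent, first, second):
    """Enumerate all instances implied by two mapped sibling nodes.

    Assumption: the positions of the first and second instance among the
    parent's children give a start index and a step, and every later
    instance sits one step further and has the same tag.
    """
    children = list(parent)
    start = children.index(first)
    step = children.index(second) - start
    if step <= 0:
        raise ValueError("the second instance must follow the first one")
    return [child for child in children[start::step] if child.tag == first.tag]


# Hypothetical result table: a header row followed by three data rows.
table = ET.fromstring(
    "<table>"
    "<tr><th>Name</th></tr>"
    "<tr><td>A</td></tr>"
    "<tr><td>B</td></tr>"
    "<tr><td>C</td></tr>"
    "</table>"
)
rows = list(table)

# Map the first and second data rows, as in the two Map Replica steps above.
found = replica_instances(table, rows[1], rows[2])
print([row[0].text for row in found])  # ['A', 'B', 'C']
```

The header row is skipped because enumeration starts at the node mapped as the first instance.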