Modify the structure of the FreeFormat bucket

Right-clicking the bucket structure tree opens a pop-up menu whose items all modify the structure of the tree. If the bucket was created by clicking the newBckt button, most of the subsequent operations insert properties. In contrast, if the bucket was created by recognizing FreeFormat, the subsequent operations tailor the tree.

All operations that tailor and modify the tree are initiated by clicking an item of the right-button pop-up menu over the selected property.

Tips: The bucket structure tree, the DOM tree viewer and the embedded browser are linked through DOM nodes. If a property marked with a FreeFormat is selected, the corresponding DOM node is selected automatically in the DOM tree viewer. At the same time, the corresponding element in the embedded browser is highlighted with a flashing red border.

A few commonly used operations are described below. Others can be found in the MetaStudio Senior User's Handbook.



Create a new property

To create a new property, select the pop-up menu item Create over the bucket structure tree. It has the following three submenu items:

  • Before: A new property is inserted before the selected node. The current and the new properties are siblings.
  • After: A new property is inserted after the selected node. The current and the new properties are siblings.
  • As Child: A child property is created for the selected node. If the selected node is already a container node, the new property is placed first among its children. Otherwise, the selected node is turned into a container node.
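
The three Create options can be sketched as operations on an ordinary tree. The following is only an illustration, not MetaStudio's own code: the class Prop and the function create are hypothetical names introduced here, and a node is assumed to be a container whenever it has children.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # eq=False: nodes are compared by identity
class Prop:
    """Hypothetical stand-in for one node of the bucket structure tree."""
    name: str
    children: List["Prop"] = field(default_factory=list)
    parent: Optional["Prop"] = field(default=None, repr=False)

    @property
    def is_container(self) -> bool:
        # Assumption: a container is simply a node that has children.
        return bool(self.children)

    def add_child(self, child: "Prop", index: Optional[int] = None) -> "Prop":
        child.parent = self
        if index is None:
            self.children.append(child)
        else:
            self.children.insert(index, child)
        return child


def create(selected: Prop, name: str, where: str) -> Prop:
    """Create a property relative to the selected node.

    where is "before", "after" or "as_child", mirroring the submenu items.
    """
    new = Prop(name)
    if where == "as_child":
        # Placed first; a leaf node becomes a container as a side effect.
        return selected.add_child(new, 0)
    if where not in ("before", "after"):
        raise ValueError(f"unknown position: {where!r}")
    parent = selected.parent
    if parent is None:
        raise ValueError("the root node has no siblings")
    i = parent.children.index(selected)
    return parent.add_child(new, i if where == "before" else i + 1)


root = Prop("bucket")
title = root.add_child(Prop("title"))
create(title, "price", "after")     # sibling after "title"
create(title, "image", "before")    # sibling before "title"
create(title, "text", "as_child")   # "title" becomes a container
```

After these calls the children of the root are image, title and price, in that order, and title holds the single child text.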

After any of the menu items above is clicked, a dialog window pops up in which the user can set the attributes of the property.

  • property: a textbox for entering the name of the property. A meaningful name is recommended. The name may contain the characters permitted in XML tag names. Space characters are also permitted, but they are replaced by underscores when the name is recorded in the data and clue extraction instruction files and in the extraction result files. Keep this in mind when processing the extraction results.
  • key: a checkbox that specifies whether the data must always appear on the target page. If it is ticked, a corresponding data schema recognition rule is recorded in the Data Schema Recognition Rule File. If the required data cannot be located when the page is checked against the rule, the page is considered unrecognizable. In summary, assigning the key attribute to appropriate properties can improve the precision of extraction. It can also make the data and clue extraction rules more robust against changes in Web page structure, because more conditions are checked when data is extracted.

    Tips:

    • If the key attribute is set for a property, all key attributes of the nested containers holding this property will be set automatically.
    • If the key attribute is to be cancelled for a property, a dialog pops up asking if the key attributes of all nested containers holding this property should be cancelled too.
    • The key attribute of a property can be cancelled only if all key attributes of the nested containers holding this property have already been cancelled.

  • clue: a checkbox that specifies whether a clue should be created from the extracted data. If it is ticked, an Info clue is automatically added to the Clue Editor work board.
  • url: a checkbox that specifies whether the extracted data should be treated as a URL. The checkbox is ticked automatically if clue is ticked. If only the path and resource name parts of the URL are extracted from the target page, the protocol name, e.g. http, and the domain name must be added by the software that processes the extraction results in order to form a complete URL. Note that they are NOT assembled automatically by the Web data extractor DataScraper. The processing software should read the Data Structure Specification File (GEM) for this data schema and get the type value of the property from it. In the GEM file, the type value is link.
  • block: a checkbox that specifies whether an HTML segment is to be extracted from the target page. If it is ticked, a dialog showing detailed options pops up. The options act as filters that control how the segment is extracted. For example, the filter All extracts the whole HTML segment, while the filter Image extracts all IMG elements within the scope. Please refer to the MetaStudio Senior User's Handbook for detailed information.

    Notes: If the bucket is created by recognizing FreeFormat, the block attribute of every leaf node is ticked by default. The option is Text, meaning that all textual content within the scope delimited by the mapped DOM node is extracted. If the wanted text is mixed with a lot of useless text, the block attribute should be cancelled and Map Data should be performed instead to extract the data more precisely.
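
Two of the attributes above affect the software that processes the extraction results: property names are recorded with underscores instead of spaces, and partial URLs are not completed by DataScraper. The following is a minimal sketch of such post-processing, under stated assumptions. The functions recorded_name and complete_links are hypothetical, the result record is assumed to be already loaded into a dictionary, and the set of properties whose type is link is assumed to have been read from the GEM file beforehand (reading that file is not shown because its layout is not described here).

```python
from urllib.parse import urljoin


def recorded_name(display_name: str) -> str:
    """Return the name as it appears in the result files (spaces become underscores)."""
    return display_name.replace(" ", "_")


def complete_links(record: dict, link_props: set, page_url: str) -> dict:
    """Complete partial URLs in one extracted record.

    record     -- hypothetical mapping of recorded property name to extracted text
    link_props -- names of properties whose type is link (assumed known from the GEM file)
    page_url   -- address of the page the record was extracted from
    """
    fixed = dict(record)
    for name in link_props:
        if name in fixed and fixed[name]:
            # urljoin supplies the protocol and domain name from page_url
            # and leaves URLs that are already absolute unchanged.
            fixed[name] = urljoin(page_url, fixed[name])
    return fixed


page = "http://www.example.com/list/index.html"
raw = {recorded_name("detail page"): "/item/42.html", "title": "Sample"}
result = complete_links(raw, {"detail_page"}, page)
# result["detail_page"] is now "http://www.example.com/item/42.html"
```

The standard library function urljoin is used here because it also handles addresses relative to the current directory of the page, which simple string concatenation would get wrong.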



Delete properties

In most cases, a bucket created by recognizing FreeFormat contains many irrelevant nodes, which should be deleted. Following the steps in Recognizing FreeFormat, the node named box1 is irrelevant to the theme. Select the node and click the right-button pop-up menu item Delete. A dialog pops up asking whether the whole sub-tree should be deleted. If not, only the selected node is deleted and all its children become children of the deleted node's parent.
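
The two outcomes of the Delete dialog can be sketched as follows. This is an illustration only: Node and delete are hypothetical names, and the sketch assumes that promoted children take the position of the deleted node among its former siblings, which the description above does not specify.

```python
class Node:
    """Hypothetical stand-in for one node of the bucket structure tree."""

    def __init__(self, name, children=None):
        self.name = name
        self.parent = None
        self.children = []
        for child in children or []:
            self.add(child)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def delete(node, whole_subtree):
    """Delete node; either drop its sub-tree or promote its children."""
    parent = node.parent
    if parent is None:
        raise ValueError("the root node cannot be deleted in this sketch")
    # Locate the node by identity, not by name.
    i = next(k for k, c in enumerate(parent.children) if c is node)
    if whole_subtree:
        del parent.children[i]
    else:
        for child in node.children:
            child.parent = parent
        # Assumption: the children replace the deleted node in place.
        parent.children[i:i + 1] = node.children
        node.children = []
    node.parent = None


root = Node("bucket", [Node("box1", [Node("title"), Node("price")]),
                       Node("footer")])
delete(root.children[0], whole_subtree=False)
# root now holds title, price and footer; box1 is gone.
```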



Rename a property

When FreeFormat is recognized, the properties are named after the values of the HTML attributes class and id. Most of the time, these values are semantically meaningless, so it is better to change them to meaningful names. Another reason to rename properties is that duplicate property names are not permitted within a bucket. The values of class and id attributes are often duplicated, so the properties recognized from them should all be renamed.

In the property edit area at the top of the bucket structure tree, the textbox named property is used to edit the property's name. The new name is accepted as soon as the GUI focus moves away. Alternatively, the user can double-click the property in the bucket structure tree to open a property editing dialog, where the new name can be entered in a similar textbox.



Select the type of FreeFormat

Currently, only class and id are used as FreeFormat marks. If a DOM node has both id and class, id is selected by default. A field named Type on the bucket structure tree denotes the type, which can take one of the following values:

  • +id-class: denotes that both id and class are present and id is selected.
  • -id+class: denotes that both id and class are present and class is selected.
  • +class: denotes that only class is present and it is selected.
  • -class: denotes that only class is present and it is not selected yet.
  • +id: denotes that only id is present and it is selected.
  • -id: denotes that only id is present and it is not selected yet.
  • null: denotes that neither id nor class is present.
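
The values of the Type field can be summarized as a lookup table. The sketch below only restates the list above in code form; TYPE_TABLE and decode_type are hypothetical names and are not part of MetaStudio.

```python
# Each Type value maps to (marks present on the DOM node, selected mark or None).
TYPE_TABLE = {
    "+id-class": (frozenset({"id", "class"}), "id"),
    "-id+class": (frozenset({"id", "class"}), "class"),
    "+class": (frozenset({"class"}), "class"),
    "-class": (frozenset({"class"}), None),
    "+id": (frozenset({"id"}), "id"),
    "-id": (frozenset({"id"}), None),
    "null": (frozenset(), None),
}


def decode_type(type_value: str):
    """Return (present marks, selected mark) for a Type value."""
    try:
        return TYPE_TABLE[type_value]
    except KeyError:
        raise ValueError(f"unknown Type value: {type_value!r}") from None


present, selected = decode_type("-id+class")
# present contains "id" and "class"; selected is "class"
```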

Notes: id implies uniqueness, although a document that contains mistakes may have multiple ids with the same value. As a FreeFormat mark, id works well if the property marked with it is instantiated only once when data is extracted from the target page. In contrast, if multiple instances are to be extracted for this property, id should not be used.

The following operations manipulate FreeFormat marks:

  • Clear FreeFormat: initiated by clicking the right-button pop-up menu item over the bucket structure tree. The operation removes all FreeFormat marks from the property if none of them is suitable. Afterwards, Data Mapping should be performed to define a rule that extracts data from a specific DOM node into this property.
  • Map FreeFormat: initiated by clicking the right-button pop-up menu item over the DOM tree viewer. It maps a DOM node with FreeFormat marks to the property, so that MetaStudio can generate an extraction rule that extracts data by referring to this mark. The next section describes it in detail.
  • Select a FreeFormat mark: initiated by double-clicking the node in the bucket structure tree and ticking the checkbox for the desired mark. If neither checkbox is ticked, the FreeFormat is cleared.