Right-clicking the bucket structure tree opens a pop-up menu whose items all modify the structure of the tree. If the bucket was created by pressing the newBckt button, most of the subsequent operations insert properties. In contrast, if the bucket was created by recognizing FreeFormat, the subsequent operations mostly tailor the tree.
All operations that tailor or modify the tree are initiated by clicking an item of the right-button pop-up menu over the selected property.
Tips: The bucket structure tree, the DOM tree viewer and the embedded browser are correlated through DOM nodes. If a property marked with a FreeFormat is selected, the corresponding DOM node in the DOM tree viewer is selected automatically. At the same time, the corresponding element in the embedded browser is highlighted with a flashing red border.
A few commonly used operations are described below. The others can be found in the MetaStudio Senior User's Handbook.
To create a new property, select the pop-up menu item Create over the bucket structure tree. It has the following three sub-menu items:
After any of the menu items above has been clicked, a dialog window pops up in which the user can set the attributes of the property.
key: a checkbox that specifies whether the data must always appear on the target page. If it is ticked, a corresponding data schema recognition rule is recorded in the Data Schema Recognition Rule File. If the required data cannot be located when a page is checked against the rule, the page is considered unrecognizable. In summary, assigning the key attribute to appropriate properties improves the precision of extraction. It also makes the data and clue extraction rules more robust against changes in Web page structure, because more conditions are checked when data is extracted.
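The effect of the key attribute can be illustrated with a minimal Python sketch. It is not MetaStudio's implementation: the Property class, the page_is_recognizable function and the property names are hypothetical, and the only behavior taken from the text is that a page is rejected when the data of any key property cannot be located.

```python
from dataclasses import dataclass


@dataclass
class Property:
    name: str
    key: bool = False  # hypothetical flag mirroring the "key" checkbox


def page_is_recognizable(properties, located_names):
    """Return True only if every key property was located on the page."""
    return all(p.name in located_names for p in properties if p.key)


# Hypothetical bucket: two key properties and one optional property.
props = [
    Property("title", key=True),
    Property("price", key=True),
    Property("banner"),
]

complete_page = page_is_recognizable(props, {"title", "price"})
broken_page = page_is_recognizable(props, {"title", "banner"})
print(complete_page)  # True: all key data was located
print(broken_page)    # False: the key property "price" is missing
```

The more properties are marked as key, the more conditions a page has to satisfy before any data is extracted from it.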
Tips:
block: a checkbox that specifies whether an HTML segment is to be extracted from the target page. If it is ticked, a dialog showing detailed options pops up. The options act as filters that determine how the segment is extracted. For example, the filter All extracts the whole HTML segment, while the filter Image extracts all IMG elements within the scope. Please refer to the MetaStudio Senior User's Handbook for detailed information.
Notes: If the bucket is created by recognizing FreeFormat, the attribute block of every leaf node is ticked by default. The option is Text, meaning that all textual content within the scope delimited by the mapped-from DOM node is extracted. If the wanted textual content is mixed with a lot of useless content, the attribute block should be unticked. Instead, Map Data should be performed to extract the data more precisely.
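The three filters named above (All, Image and Text) can be pictured with the following Python sketch, which uses only the standard library. The function apply_block_filter and the sample segment are hypothetical and show the idea of a filter, not how MetaStudio extracts segments internally.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collect text fragments and IMG sources from an HTML segment."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs).get("src"))

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


def apply_block_filter(segment, filter_name):
    """Hypothetical helper: apply one of the filters All, Image or Text."""
    if filter_name == "All":
        return segment  # the whole HTML segment, unchanged
    collector = _Collector()
    collector.feed(segment)
    collector.close()
    if filter_name == "Image":
        return collector.images  # sources of all IMG elements in the scope
    if filter_name == "Text":
        return " ".join(collector.texts)  # all textual content in the scope
    raise ValueError(f"unknown filter: {filter_name!r}")


segment = '<div><h2>Blue Widget</h2><img src="w.png"><p>Only $5</p></div>'
print(apply_block_filter(segment, "Image"))  # ['w.png']
print(apply_block_filter(segment, "Text"))   # Blue Widget Only $5
```

The Text result also shows why the default can be too coarse: every piece of text inside the scope is returned, wanted or not.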
In most cases, a bucket created by recognizing FreeFormat contains many irrelevant nodes that should be deleted. Continuing the steps described in Recognizing FreeFormat, the node named box1 is irrelevant to the theme. Select the node, then click the right-button pop-up menu item Delete. A dialog pops up asking whether the whole sub-tree should be deleted. If not, only the selected node is deleted and all its children become children of the deleted node's parent.
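The two ways of deleting a node can be sketched in Python as follows. The Node class and the delete_node function are hypothetical. The sketch also assumes that promoted children take the position of the deleted node among its siblings, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by content
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def delete_node(node, whole_subtree):
    """Delete a node, either with its sub-tree or promoting its children."""
    parent = node.parent
    if parent is None:
        raise ValueError("the root node cannot be deleted in this sketch")
    index = parent.children.index(node)
    if whole_subtree:
        del parent.children[index]
    else:
        for child in node.children:
            child.parent = parent
        # Assumption: the children replace the deleted node in place.
        parent.children[index:index + 1] = node.children
        node.children = []
    node.parent = None


# Hypothetical tree: root -> box1 -> (title, price), root -> footer
root = Node("root")
box1 = root.add(Node("box1"))
box1.add(Node("title"))
box1.add(Node("price"))
root.add(Node("footer"))

delete_node(box1, whole_subtree=False)
print([c.name for c in root.children])  # ['title', 'price', 'footer']
```

Calling delete_node(box1, whole_subtree=True) on the same starting tree would instead leave only footer under root.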
When FreeFormat is recognized, the properties are named after the values of the HTML attributes class and id. Most of the time, these values are semantically meaningless, so it is better to change them to meaningful names. Another reason to change property names is that duplicate property names are not permitted within a bucket. A page often contains many duplicate class and id values. If they are recognized as FreeFormat properties, they should all be renamed.
In the property edit area at the top of the bucket structure tree, the textbox named property is for editing the property's name. The new name is accepted whenever the GUI focus moves away. Alternatively, the user can double-click the property on the bucket structure tree to open a property editing dialog, where the new name can be entered into a similar textbox.
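The rule that property names must be unique within a bucket can be sketched in Python. The function rename_property and the sample names are hypothetical, and how MetaStudio itself reports a duplicate name is not described here.

```python
def rename_property(names, old, new):
    """Return the names with `old` replaced by `new`, rejecting duplicates."""
    if old not in names:
        raise KeyError(old)
    if new != old and new in names:
        raise ValueError(f"duplicate property name: {new!r}")
    return [new if name == old else name for name in names]


# Hypothetical names taken from class and id values during recognition.
names = ["box1", "c12", "hd"]
names = rename_property(names, "c12", "price")
print(names)  # ['box1', 'price', 'hd']

try:
    rename_property(names, "hd", "price")
except ValueError as error:
    print(error)  # duplicate property name: 'price'
```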
Currently, only class and id are used as FreeFormat marks. If a DOM node has both id and class, id is selected by default. A field named Type on the bucket structure tree denotes the type, which can take one of the following values:
Notes: An id implies uniqueness, although a document that contains mistakes may have several ids with the same value. As a FreeFormat mark, id works well if the property marked with it is instantiated as a singleton when data is extracted from the target page. In contrast, if multiple instances are to be extracted for this property, id should not be used.
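The default choice of mark and the advice about multiple instances can be summarized in a short Python sketch. The function choose_mark and its multiple_instances parameter are hypothetical. In MetaStudio the switch away from id is made by the user, not automatically.

```python
def choose_mark(attrs, multiple_instances=False):
    """Pick a FreeFormat mark from a DOM node's attributes.

    Returns a (type, value) pair, or None if the node has neither
    id nor class. id is preferred by default, but it is skipped when
    the property is expected to have multiple instances.
    """
    candidates = ("class",) if multiple_instances else ("id", "class")
    for mark_type in candidates:
        value = attrs.get(mark_type)
        if value:
            return mark_type, value
    return None


node_attrs = {"id": "main", "class": "box"}
print(choose_mark(node_attrs))                           # ('id', 'main')
print(choose_mark(node_attrs, multiple_instances=True))  # ('class', 'box')
```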
The following operations manipulate FreeFormat marks: