Recognize FreeFormat

The first step in defining data extraction rules is to create a bucket. As stated in Work board layout, a new bucket is created by pushing the button newBckt. Thereafter, properties can be inserted into the bucket manually, one at a time. This section introduces a shortcut for building a FreeFormat bucket.

The document of a Web page contains many HTML elements with class or id attributes. Although these attributes, working with CSS, exist for particular presentation effects, they also imply some semantics, which can be used to build a FreeFormat bucket when defining a data schema. That is the purpose of recognizing FreeFormat. In the extreme case, if the document is natively formatted with a kind of semantic structure, e.g. a microformat, only a few mouse clicks are needed to define a data schema by recognizing FreeFormat.
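The idea of treating class and id attributes as semantic marks can be sketched in a few lines of Python. This is only an illustration of the principle, not how MetaStudio implements recognition: the MarkCollector class, the collect_marks and outline helpers, and the sample HTML (including the child class names title and address and the id phone) are all made up for this example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MarkCollector(HTMLParser):
    """Hypothetical helper: nests elements that carry class or id into a tree."""

    def __init__(self):
        super().__init__()
        self.root = {"mark": "(root)", "children": []}
        # Stack entries: (tag name, tree node, or None if the element is unmarked).
        self.stack = [("(root)", self.root)]

    def _nearest_marked(self):
        # The closest open ancestor that carries a mark becomes the parent.
        for _, node in reversed(self.stack):
            if node is not None:
                return node
        return self.root

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        mark = attrs.get("class") or attrs.get("id")
        node = None
        if mark:
            node = {"mark": mark, "children": []}
            self._nearest_marked()["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append((tag, node))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def collect_marks(html):
    """Return the list of top-level marked nodes found in the HTML text."""
    parser = MarkCollector()
    parser.feed(html)
    parser.close()
    return parser.root["children"]


def outline(nodes, depth=0):
    """Render the mark tree as indented lines, two spaces per level."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["mark"])
        lines.extend(outline(node["children"], depth + 1))
    return lines


# Invented sample markup; only the outer class value is taken from this manual.
SAMPLE = """
<div class="itemBox nobox2">
  <h3 class="title"><a href="#">Example Co.</a></h3>
  <p><span class="address">Example City</span><br>
     <span id="phone">000</span></p>
</div>
"""

print("\n".join(outline(collect_marks(SAMPLE))))
```

Elements without class or id (the a and p elements in the sample) are skipped, and their marked descendants attach to the nearest marked ancestor. The result is a single tree rooted at "itemBox nobox2" with the children title, address and phone, which is the kind of structure a recognized bucket could mirror.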



How to

Following the steps stated in section Name the theme, enable reverse selection and locate a DOM node holding the information about a company, e.g. the literal string “Shanghai Sunshine Int'l Tdg...”, by clicking the string directly in the embedded browser of MetaStudio. A specific DOM node is then selected automatically in the DOM tree viewer. Note that the GUI focus may not move to the selected DOM node in the tree window, so the scroll bar may need to be dragged to find the selected row. The selected node should be double-checked to make sure it is exactly the one wanted, because in some cases the automatically selected node is an ancestor of the wanted node, which results in imprecise mapping if not corrected. In this example, a DIV node with the class value "itemBox nobox2" is selected. Right-click the row in the DOM tree viewer to pop up the menu and select the item Recognize FreeFormat. Enter the name of the newly created bucket, e.g. company, and optionally tick the checkbox under the textbox holding the bucket name to make the recognized tree root the top container. After the operation has been submitted, a folded bucket structure tree is displayed on the Bucket Editor work board.

Notes: When recognizing FreeFormat marks finds one single tree, the checkbox that makes the recognized tree root the top container is enabled. If it is ticked, the recognized root becomes the top container of the bucket. Otherwise, the root is put into the top container as a child. Whether or not the checkbox is ticked does not affect the performance or precision of the data extraction engine. The difference appears only in the extraction result files, where one more level of nested XML tags is recorded if the checkbox has not been ticked.
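The effect of the checkbox on the result files can be pictured with a small Python sketch. The element names, field values and overall layout below are assumptions made for illustration; the actual result files produced by the extraction engine may look different. The build_result and depth functions are hypothetical helpers, not part of MetaStudio.

```python
import xml.etree.ElementTree as ET


def build_result(bucket_name, root_mark, fields, root_is_top_container):
    """Build an illustrative result tree for one extracted record."""
    top = ET.Element(bucket_name)
    if root_is_top_container:
        # Checkbox ticked: the recognized root is the top container itself.
        parent = top
    else:
        # Checkbox not ticked: the recognized root is a child of the top container.
        parent = ET.SubElement(top, root_mark)
    for name, value in fields.items():
        ET.SubElement(parent, name).text = value
    return top


def depth(elem):
    """Number of nesting levels in the tree, counting the element itself."""
    return 1 + max((depth(child) for child in elem), default=0)


# Invented field names and values.
fields = {"title": "Example Co.", "address": "Example City"}

ticked = build_result("company", "itemBox", fields, root_is_top_container=True)
unticked = build_result("company", "itemBox", fields, root_is_top_container=False)

print(ET.tostring(ticked, encoding="unicode"))
print(ET.tostring(unticked, encoding="unicode"))
print(depth(ticked), depth(unticked))
```

Both trees carry the same fields. The unticked variant simply wraps them in one extra element, so its nesting depth is greater by one.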

After recognition, the new bucket may contain many useless FreeFormat marks. Of course, there are cases in which all recognized properties are wanted; content annotated with microformats is one such case. The following section shows how to tailor the tree.

Notes: If too many useless properties are recognized, recognition is not a suitable way to create a new bucket, because deleting the useless nodes from the tree costs too much time. In that case, it is better to create a new bucket by pushing the button newBckt and to insert the necessary properties manually.