FreeFormat

FreeFormat is an approach to semantically annotating Web pages. It has been fully supported since MetaStudio V3. As a result, the MetaSeeker toolkit can define data extraction rules and extract data from the Web more efficiently and more precisely. At the same time, the extraction rules are more robust against changes in the target Web pages. The term FreeFormat appears in many papers and books published by GooSeeker. The following are some major scenarios.



FreeFormat Bucket

The MetaSeeker toolkit uses the word bucket for a structured container that specifies the data schema of a group of Web pages and stores the data extracted from those pages. Versions before V3 provided only one type of bucket, the ListBucket, which specifies data schemas in the form of two-dimensional tables. This bucket is limited, because a flat table cannot express nested data. From version 3 onward, a new bucket named the FreeFormat Bucket is supported, which specifies data schemas as tree-like structures. This bucket implements the FreeFormat approach to annotating and extracting Web content. The FreeFormat bucket is more powerful for specifying data schemas and extracting data from the Web, because every Web page is modelled as a DOM tree, so a tree-shaped schema makes data mapping and extraction more straightforward. At the same time, MetaStudio is easier to operate, as shown in the MetaStudio User's Guide.



FreeFormat Property

Like all other types of bucket, a FreeFormat bucket is made up of a group of properties, which are called FreeFormat properties and are organized into a tree. By convention, the parts that make up a tree are called nodes, so FreeFormat node is an alias of FreeFormat property. FreeFormat properties can be classified in the following two ways.

FreeFormat properties can be classified into the following two categories according to whether they have FreeFormat marks:

  • ones without a FreeFormat mark: The property cannot be located by referring to a FreeFormat mark. Instead, it is located with an absolute XPath expression. To extract data for this property, data mapping must be performed to define the data extraction rules;
  • ones with FreeFormat marks: The property can be located by referring to a FreeFormat mark. As a result, the data extraction rules are more robust against changes in the structure of the Web page.
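
The difference in robustness between the two categories can be sketched in a few lines of Python. This is only an illustration and not MetaSeeker's implementation: the two sample pages, the path strings and the extract helper are all hypothetical, and the positional path merely stands in for an absolute XPath expression, since Python's xml.etree.ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical page as first seen when the rule was defined.
ORIGINAL = """<html><body>
<div><span class="title">Widget</span></div>
<div><span class="price">9.99</span></div>
</body></html>"""

# Hypothetical redesign: an extra banner div shifts every position by one.
REDESIGNED = """<html><body>
<div>Banner</div>
<div><span class="title">Widget</span></div>
<div><span class="price">9.99</span></div>
</body></html>"""

# Stands in for an absolute XPath: it depends on element positions.
POSITIONAL_PATH = "body/div[2]/span"
# Anchored on a class attribute, which plays the role of a mark here.
MARK_PATH = ".//*[@class='price']"


def extract(page, path):
    """Return the text of the first node matching path, or None."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


for label, page in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
    print(label, extract(page, POSITIONAL_PATH), extract(page, MARK_PATH))
```

On the original page both paths reach the price. On the redesigned page the positional path silently lands on the title, while the mark-based path still reaches the price.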

Alternatively, FreeFormat properties can be classified into the following two categories according to their roles in a FreeFormat bucket:

  • container nodes: They represent sub-containers in a bucket. They always have one or more child nodes. Containers can be nested to any depth. The top container is the bucket itself.
  • leaf nodes: They are the leaves of the tree. They cannot contain child nodes. They hold the extracted data.
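
The container/leaf distinction maps naturally onto a small tree data structure. The following is a minimal sketch under assumed names: the Node class, the leaf_paths helper and the sample product schema are hypothetical and do not reflect how MetaStudio stores a bucket.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A hypothetical FreeFormat property: a container or a leaf."""
    name: str
    children: List["Node"] = field(default_factory=list)
    value: Optional[str] = None  # extracted data, meaningful on leaves only

    @property
    def is_container(self):
        # Containers always have at least one child; leaves have none.
        return bool(self.children)


def leaf_paths(node, prefix=""):
    """List the path of every leaf, i.e. every slot that can hold data."""
    path = f"{prefix}/{node.name}"
    if not node.is_container:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# The top container is the bucket itself; "offer" is a nested container.
bucket = Node("product", [
    Node("name"),
    Node("offer", [Node("price"), Node("currency")]),
])

print(leaf_paths(bucket))
```

A schema like this cannot be flattened into a single two-dimensional table without losing the grouping that the nested "offer" container expresses.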



FreeFormat Mark

Some metadata in HTML documents, e.g. tags and attributes, can be viewed as marks of semantic annotations. The FreeFormat approach can recognize these marks and anchor the annotations onto them. For example, the HTML attributes class and id are often recognized as marks. In fact, the FreeFormat approach recognizes microformats very efficiently and makes full use of them to reformat Web pages and to extract data from them. Furthermore, FreeFormat is more extensible, since it can make use of not only microformats but also most HTML attributes. With the help of FreeFormat marks, the MetaSeeker toolkit can extract data from the Web efficiently, and the extraction rules are robust against changes in the structure of the Web page.



Types of FreeFormat marks

MetaStudio V3.x supports only two types of FreeFormat marks, i.e. class and id, both of which are HTML attributes. When defining data extraction rules, either class or id can be chosen. In some cases, it may be better to choose neither. How to choose the marks is explained in detail in MetaStudio User's Guide V3.1.0.
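
To make the two mark types concrete, the sketch below lists every element in a page that carries an id or class attribute, i.e. every element that could serve as an anchor. It is an illustration only: the sample page and the candidate_marks helper are hypothetical, and the sketch does not attempt to decide which mark is the better choice, which is the subject of the User's Guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment with a mix of marked and unmarked elements.
PAGE = """<html><body>
<div id="main">
  <h1 class="fn">Widget</h1>
  <p>Plain paragraph</p>
  <span class="price">9.99</span>
</div>
</body></html>"""


def candidate_marks(page):
    """Return (tag, attribute, value) for each id or class found, in document order."""
    marks = []
    for element in ET.fromstring(page).iter():
        for attribute in ("id", "class"):
            value = element.get(attribute)
            if value:
                marks.append((element.tag, attribute, value))
    return marks


for tag, attribute, value in candidate_marks(PAGE):
    print(f"<{tag}> {attribute}={value}")
```

The p element carries neither attribute, so it offers no mark and would have to be located by its position instead.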