Steps

With FreeFormat implemented, defining data extraction rules has become much simpler. If the target Web page natively has a formal semantic structure, e.g. a page with embedded microformats, only a few mouse clicks are needed to define the data extraction rules. The performance and robustness of the MetaSeeker toolkit have also improved sharply. As a result, the FreeFormat bucket has completely replaced ListBucket in the current release, V3.1.0.

A FreeFormat bucket has a tree-like structure. Each tree node represents a property and falls into one of the following two categories:

  • container node: has one or more children.
  • leaf node: has no children.
In the following sections, we take the data schema named Product_Category_mdiChina/default as an example to explain the features of a FreeFormat bucket.

  • Containers can be nested.
  • Together with its children, a container at a given nesting level can be viewed as a sub-block; the whole semantic content block is made up of such sub-blocks.
  • For a container at a given nesting level, either a single instance or multiple instances of the corresponding sub-block can be extracted from a target Web page.
  • Single instance means there is only one content block with the specified structure on the target Web page. For example, the top container, category, in the bucket named Product_Category_mdiChina is a single-instance container.
  • Multiple instances mean there are multiple content blocks with the specified structure on the target Web page. For example, the inner container, d1, has multiple instances.
  • Leaf nodes are always singly instantiated.
  • To extract multiple instances, one of the following approaches should be taken:
    • assigning a FreeFormat mark of type class to a container by performing Map FreeFormat. This works because multiple HTML elements may share the same class attribute.
    • enabling replicas and performing Map Replica. Replica is MetaSeeker's proprietary approach to calculating the duplication parameters of multiple instances: one container node is mapped twice, from two sibling DOM nodes. Please refer to MetaStudio User's Guide#Map Replicas for detailed information.
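
The features listed above can be pictured with a small data model. The following is only an illustrative sketch in Python, not MetaSeeker's actual implementation or file format: the class name BucketNode, the multiple flag, and the leaf names name and link are assumptions made for this example, while category and d1 are the containers from the example schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BucketNode:
    """One property in a FreeFormat-style bucket tree (hypothetical model)."""
    name: str
    multiple: bool = False  # True if several instances may exist on one page
    children: List["BucketNode"] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Leaf nodes are always singly instantiated.
        if self.multiple and not self.children:
            raise ValueError(f"leaf node {self.name!r} cannot have multiple instances")

    @property
    def is_container(self) -> bool:
        # A container has one or more children; a leaf has none.
        return bool(self.children)


def outline(node: BucketNode, depth: int = 0) -> List[str]:
    """Return one indented line per node, showing its category and instance mode."""
    kind = "container" if node.is_container else "leaf"
    count = "multiple" if node.multiple else "single"
    lines = [f"{'  ' * depth}{node.name} ({kind}, {count})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# "category" and "d1" come from the example schema; the leaf names are hypothetical.
category = BucketNode("category", children=[
    BucketNode("d1", multiple=True, children=[
        BucketNode("name"),
        BucketNode("link"),
    ]),
])

print("\n".join(outline(category)))
```

In this model the nesting of containers is just the nesting of children lists, and the rule that leaf nodes are always singly instantiated is enforced when a node is created.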

Replica has been available since MetaStudio V2.x for extracting multiple instances, e.g. multiple products on a product catalog page. In this example, two sibling products, usually the 1st and 2nd, are mapped to the same property. MetaStudio then calculates the duplication parameters, i.e. the start point and the period, which are used to generate the data extraction rules. In V2.x this approach was cumbersome because every property had to be mapped twice, and it could extract only two-dimensional tables. From V3.x on, replica has been optimized in the FreeFormat bucket: only container nodes, not every property, need to be mapped twice, and only when multiple instances are to be extracted with the replica approach. In addition, trees as well as two-dimensional tables can be extracted with the FreeFormat bucket, as described in the following chapters.
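
To make the idea of a start point and a period more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each mapped sibling is identified by its position among the children of a common parent DOM node; how MetaStudio actually derives and stores the duplication parameters is not described here, and the function names below are hypothetical.

```python
from typing import List, Tuple


def duplication_parameters(first_index: int, second_index: int) -> Tuple[int, int]:
    """Derive (start, period) from the positions of two mapped sibling nodes.

    Assumption: positions are zero-based indexes among the parent's children,
    and the second mapped sibling comes after the first one.
    """
    if first_index < 0 or second_index <= first_index:
        raise ValueError("expected 0 <= first_index < second_index")
    return first_index, second_index - first_index


def instance_positions(start: int, period: int, sibling_count: int) -> List[int]:
    """List every sibling position that the (start, period) pair selects."""
    return list(range(start, sibling_count, period))


# Hypothetical page: 8 sibling nodes, where products alternate with separators.
# The 1st and 2nd products sit at positions 1 and 3.
start, period = duplication_parameters(1, 3)
print(start, period)                         # 1 2
print(instance_positions(start, period, 8))  # [1, 3, 5, 7]
```

Under this assumption, mapping only two siblings is enough to locate every further instance, which is why a container needs to be mapped just twice.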