Most B2B, B2C, C2C, or yellow-page sites provide a page that classifies their information into many categories. As discussed in the previous section, this page is the entry point for crawling the site. Clues are first extracted from each category item; MetaSeeker then follows those clues to extract the commodity lists or business-entity lists in each category. The general phases for extracting all business information from a site of this kind are as follows:
This chapter focuses on Phase 1. The other phases are covered in "Phase 2: extract catalog" and "Phase 3: extract detailed information". The target site is http://www.made-in-china.com.
Note: MetaStudio currently provides only ListBucket, which describes the data schema of a two-dimensional table. On category pages, the categories are organized as a tree whose sub-trees represent sub-categories. Because ListBucket cannot capture the relations between categories and their sub-categories, only clues are extracted in this phase.
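To make the note concrete, here is a minimal Python sketch of what it means to reduce a category tree to a flat, two-dimensional list of clues. It is an illustration only and not MetaSeeker's or MetaStudio's actual data format: the `Category` class, the `iter_clues` function, and the `example.com` URLs are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Category:
    """Hypothetical node of a category tree as shown on a category page."""
    name: str
    url: str  # the clue: a link to follow in a later phase
    children: List["Category"] = field(default_factory=list)


def iter_clues(node: Category) -> Iterator[Tuple[str, str]]:
    """Yield one flat (name, url) row per category, depth-first.

    The parent/child relation is deliberately not recorded, which mirrors
    the limitation of a two-dimensional table described in the note.
    """
    yield (node.name, node.url)
    for child in node.children:
        yield from iter_clues(child)


# Placeholder data; these are not real paths on the target site.
tree = Category("Machinery", "http://example.com/machinery", [
    Category("Pumps", "http://example.com/machinery/pumps"),
    Category("Valves", "http://example.com/machinery/valves", [
        Category("Ball Valves", "http://example.com/machinery/valves/ball"),
    ]),
])

clue_rows = list(iter_clues(tree))
for name, url in clue_rows:
    print(f"{name}\t{url}")
```

Each row keeps only a category name and its link, so the result fits a table-shaped schema, while the information about which category contains which sub-category is lost.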
The following steps are taken in this phase: