Tools used: DataScraper, a Web data and clue extraction tool.
The following steps are taken to extract clues with DataScraper:
The following figure shows the GUI before the crawl command is submitted.
After clues for the commodity categories have been extracted, the status of theme ComYellowPage_mic_en can be viewed by choosing Statistics from the right-click pop-up menu.
On most eCommerce or yellow page sites, all categories and their sub-categories are listed on a single page, so all clues can be extracted from that page in one cycle of the data extraction workflow. In contrast, the tree-like structure that relates a category to its sub-categories is hard to extract, because MetaStudio has only one type of bucket, i.e. ListBucket.
This sample site is an exception: only the top-level categories are listed on this page. Following the link for a category leads to a page that contains its second-level categories. As a result, the target theme of the clues extracted from the current page is named ComYellowPage_mic_en_l2, meaning the theme for the second-level categories. The steps for defining clue extraction rules described in this phase must be repeated once more for the new theme; they are given in Appendix A.
After the sub-category pages have been extracted, many clues belonging to theme ComList_mic_en are created, from which the commodity lists can then be extracted. Phase 2 shows how to extract commodity lists.
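The two extraction cycles described above can be pictured with a small Python sketch. It is only an illustration of the idea that each cycle turns the links on a page into clues for the next theme; it does not use or reproduce DataScraper or MetaStudio. The `LinkCollector` class, the `extract_clues` function, the clue dictionary layout, and the `example.com` pages are all assumptions made up for this example. Only the theme names come from this tutorial.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_clues(html, base_url, target_theme):
    """One extraction cycle: every link on the page becomes a clue (hypothetical layout)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        {"theme": target_theme, "name": name, "url": urljoin(base_url, href)}
        for name, href in parser.links
    ]


# Hypothetical pages standing in for the sample site (not its real URLs or markup).
TOP = "http://example.com/"
PAGES = {
    TOP: '<a href="/cat/tools">Tools</a><a href="/cat/toys">Toys</a>',
    "http://example.com/cat/tools": '<a href="/list/drills">Drills</a>',
    "http://example.com/cat/toys": '<a href="/list/dolls">Dolls</a>'
                                   '<a href="/list/kites">Kites</a>',
}

# Cycle 1: the top page lists only top-level categories; each link becomes
# a clue whose target theme is the second-level theme.
level2_clues = extract_clues(PAGES[TOP], TOP, "ComYellowPage_mic_en_l2")

# Cycle 2: each second-level page yields clues for the commodity-list theme.
list_clues = []
for clue in level2_clues:
    list_clues.extend(extract_clues(PAGES[clue["url"]], clue["url"], "ComList_mic_en"))

for clue in list_clues:
    print(clue["theme"], clue["name"], clue["url"])
```

Note that both cycles produce flat lists of clues, which mirrors the single list-style bucket mentioned earlier: the hierarchy is walked one level per cycle rather than captured as a tree in one pass.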