Scenario overview

This scenario consists of three sequential phases, each of which belongs to a specific theme. The first phase extracts a large number of clues for the second phase, and the second phase does the same for the third. As a result, MetaSeeker crawls the Web wider and wider, as a spider does in nature. How far MetaSeeker can go is determined by the operators when they define the data schemas and the data and clue extraction rules.
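The way clues pass from one phase to the next can be sketched roughly as follows. This is only an illustrative model, not MetaSeeker's actual implementation: the in-memory SITE table, the theme names in PHASES and the crawl function are all hypothetical, and real extraction rules are defined by operators rather than hard-coded.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a Web site: URL -> (theme, data, outgoing links).
# In this toy model the outgoing links play the role of extracted clues.
SITE: Dict[str, Tuple[str, str, List[str]]] = {
    "/categories": ("category_list", "all categories", ["/cat/tools", "/cat/toys"]),
    "/cat/tools": ("product_list", "tools", ["/item/hammer", "/item/saw"]),
    "/cat/toys": ("product_list", "toys", ["/item/kite"]),
    "/item/hammer": ("product_detail", "hammer", []),
    "/item/saw": ("product_detail", "saw", []),
    "/item/kite": ("product_detail", "kite", []),
    # Not linked from any crawled page, so it is never reached.
    "/item/orphan": ("product_detail", "orphan", []),
}

# Hypothetical themes, one per phase, processed in order.
PHASES: List[str] = ["category_list", "product_list", "product_detail"]


def crawl(entrance: str) -> Dict[str, List[str]]:
    """Run the phases in sequence, feeding each phase's clues to the next."""
    clues = [entrance]
    results: Dict[str, List[str]] = {}
    for theme in PHASES:
        next_clues: List[str] = []
        seen = set()
        for url in clues:
            page = SITE.get(url)
            # Skip unknown pages and pages that do not match this phase's theme.
            if page is None or page[0] != theme:
                continue
            _, data, links = page
            results.setdefault(theme, []).append(data)
            # Collect clues for the next phase, dropping duplicates.
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_clues.append(link)
        clues = next_clues
    return results


if __name__ == "__main__":
    for theme, items in crawl("/categories").items():
        print(theme, items)
```

In this sketch, a page that no clue leads to (here "/item/orphan") is never visited, which mirrors how the chosen entrance and the clue extraction rules bound the crawl.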

First, operators must find a suitable entrance from which to crawl a specific site; the entrance determines how far MetaSeeker can go. In practice the entrance is easy to find, because every site wants all visitors, including crawlers, to find as many of its resources as possible.

Forums, blogs, yellow pages and news portals all have such entrances, normally called portals, e.g., pages listing all active forum topics, all new blog entries, all news titles, or categories of commodities and business entities. Because almost all sites now aim to be friendly to search engines, and MetaSeeker is a special kind of Web crawler, it is straightforward to find such entrances for MetaSeeker.

There are still a few sites, or regions within a site, that are designed to be visited by human beings rather than by Web crawlers. For example, a challenge may be triggered when a page is visited. MetaSeeker cannot visit pages of this kind.

The following chapters describe the three phases used to extract data on commodities and business entities.