There are many aliases for "Web crawler", for example Web spider, Web robot, or Web worm. While the DataScraper tool in MetaSeeker acts much like a Web crawler, the MetaSeeker toolkit is different from an ordinary Web crawler.
An ordinary Web crawler begins by loading a Web page that serves as its entry point into a specific area of the Web. After processing the page, e.g. storing it on disk, it looks for clues, i.e. links (hyperlinks, or Web links), that lead it to further HTML pages. How far it follows those clues, i.e. the crawling depth, is set by the user. As it crawls further and broader, it behaves just like a real spider crawling on a real web. If you ask a Web crawler what it has found, it can only tell you that it has a great amount of data and clues; unfortunately, it knows nothing about what the data is about. If you only want to set up an information retrieval system, e.g. a search engine based on full-text indexing, a crawler satisfies your requirement of downloading as many HTML pages as possible. If, on the other hand, you want to manipulate the information on the pages differently according to its meta data (what the data is about), an ordinary Web crawler can do little for you.
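The link-following step described above can be sketched in a few lines of standard-library Python. This is a generic illustration, not MetaSeeker code; the sample HTML and URLs are made up:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag -- the 'clues' a crawler follows."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the current page's URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = '<html><body><a href="/a.html">A</a> <a href="http://other.example/b">B</a></body></html>'
print(extract_links(page, "http://example.com/index.html"))
```

A real crawler would enqueue each discovered URL, fetch it, and repeat, stopping when the user-chosen crawling depth is reached.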
In contrast, MetaSeeker tries to define and recognize the data schema (data structure, or meta data) of Web pages, which is specified with XML tags in Data Schema Specification Files. Keep in mind that the schema only uncovers what the data is about, not the meaning of the data, i.e. what the data IS. MetaSeeker is not an artificial intelligence toolkit: it cannot recognize data schemas autonomously. What it can do is help users define and recognize the data schemas of Web pages, since it provides a friendly GUI and many supporting facilities; the user must tell it which data is about what. While this is simple compared to some solutions based on artificial intelligence technology, it is one of the most effective, flexible, and robust products.
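The distinction between "what the data is about" and "what the data is" can be illustrated with a toy extraction step. The pattern and the tag names below are invented for illustration; they are not MetaSeeker's actual schema format:

```python
import re

# Illustrative only: a hand-written pattern standing in for the schema a
# user would define through a GUI. Markup and field names are made up.
SNIPPET = '<div class="item"><span class="t">Blue Mug</span><span class="p">$4.50</span></div>'

def apply_schema(html):
    """Turn raw markup into schema-tagged XML.

    The result says the first field is a 'title' and the second a 'price'
    (what the data is ABOUT), but nothing about what a mug or a dollar
    means (what the data IS)."""
    m = re.search(r'<span class="t">(.*?)</span><span class="p">(.*?)</span>', html)
    title, price = m.group(1), m.group(2)
    return f"<item><title>{title}</title><price>{price}</price></item>"

print(apply_schema(SNIPPET))
```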
In summary, while MetaSeeker needs the user's direction to specify what each piece of data on a Web page is about, the extracted data is well formatted with meta information that lets computers manipulate it exactly and semantically. For example, the extracted data can easily be transformed and aggregated into HTML or XML documents of different structures using an XSLT engine. As another example, the extracted data can be aggregated and presented on portals or mashup services automatically. None of this can be done by an ordinary crawler.
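As a rough sketch of the transformation step, the snippet below walks schema-tagged XML and aggregates it into an HTML table. A real deployment would hand the XML to an XSLT engine; the element-by-element walk in plain Python plays the same role here, and the record format is hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction result: each record carries its meta data as tags.
extracted = """
<products>
  <product><name>Widget</name><price>9.99</price></product>
  <product><name>Gadget</name><price>19.99</price></product>
</products>
"""

def to_html_table(xml_text):
    """Aggregate schema-tagged records into one HTML table.

    Because every field is labelled, the code can place each value in the
    right column without guessing -- exactly what raw crawled HTML does
    not allow."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall("product"):
        name = product.findtext("name")
        price = product.findtext("price")
        rows.append(f"<tr><td>{name}</td><td>{price}</td></tr>")
    return "<table>" + "".join(rows) + "</table>"

print(to_html_table(extracted))
```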
Tip: MetaSeeker can also download whole Web pages onto hard disks, as an ordinary Web crawler does. However, MetaSeeker cannot preserve the folder hierarchy of the target site, a feature some Web crawlers offer for mirroring target sites.
An HTML wrapper is a must for extracting data from the Web: it transforms the original HTML pages and filters out useless data. A template may be fed into a wrapper to direct how it transforms the target page. Since the Web is a vast repository of HTML pages with widely different meanings and formats, a great number of wrappers must be implemented, each for a specific Web site or even a specific Web page. The situation becomes more complicated when the wrappers are written in all kinds of programming languages: we programmers keep reinventing the wheel. It is therefore a good idea to implement a factory that generates a series of wrappers. MetaSeeker can act as just such a wrapper factory, and it has several distinguishing characteristics, as follows: