DataScraper Overview

DataScraper is a tool that continuously extracts data from the Web. It is driven by the data extraction rules that MetaStudio generates and records in Data and Clue Extraction Instruction Files. The resulting XML-formatted files are stored on the DataStore server. DataScraper provides a Harvest Manager and an Index Manager based on the Lucene v2.3.2 engine, both of which have GUIs for ease of use.

DataScraper is one of the four tools in the MetaSeeker toolkit.



DataScraper is an advanced screen scraper

Screen scrapers extract the required information from target Web pages and filter out the noise. The core of a screen scraper is an HTML wrapper, which transforms unstructured target Web pages into structured ones. The required information is then picked from the structured content. Because page formats vary widely from one site to another, a separate wrapper must be written for each target. As a result, it is costly for information service providers to implement and maintain so many wrappers, which are coded in all kinds of programming languages and reuse little from one another.
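To make the maintenance cost concrete, here is a minimal sketch of a hand-coded HTML wrapper in Python. It is not part of MetaSeeker; the page layout, the class names (product, name, price) and the ProductWrapper and extract names are all assumptions invented for illustration. Note how the layout of one particular site is hard-wired into the code, so a second site with different markup would need a second wrapper.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Hand-coded wrapper for ONE hypothetical page layout (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.records = []    # structured output: one dict per product
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            # A new product entry starts here.
            self.records.append({})
        elif tag == "span" and cls in ("name", "price") and self.records:
            # Text that follows belongs to this field.
            self._field = cls

    def handle_data(self, data):
        # Text outside a recognised field (ads, whitespace) is noise and is dropped.
        if self._field:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


# Hypothetical target page: one noise block plus two product entries.
PAGE = """
<div id="ad">Buy now!</div>
<ul>
  <li class="product"><span class="name">Kettle</span> <span class="price">19.90</span></li>
  <li class="product"><span class="name">Toaster</span> <span class="price">34.50</span></li>
</ul>
"""


def extract(page):
    """Run the wrapper over a page and return the structured records."""
    wrapper = ProductWrapper()
    wrapper.feed(page)
    wrapper.close()
    return wrapper.records
```

Calling extract(PAGE) returns a list of two dictionaries, one per product, while the advertisement text is discarded. Any change to the site's markup means editing this code.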

The MetaSeeker toolkit offers a way to reduce operating costs and lets service providers focus on their core business. MetaSeeker is structured modularly so that the data extraction procedure is split between two components: one for data schema definition and one, driven by a workflow engine, for Web data extraction. The former is packaged as MetaStudio and the latter as DataScraper. Operators are completely freed from coding HTML wrappers for different target pages. All they need to do is define, or redefine, the data schemas for the target pages through the MetaStudio GUI. MetaStudio then automatically generates new data extraction rules, which are fed into DataScraper to start a new extraction process. Operators can edit data schemas and data extraction rules in a GUI-based editor, much like editing a document. The advanced features are described in more detail on the MetaSeeker Introduction page.
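The separation of schema definition from the extraction engine can be sketched as follows. This is only an illustration of the general idea, written in Python: the RULES dictionary, its keys, the run_rules function and the sample page are all hypothetical, and they do not reproduce the actual format of MetaStudio's Data and Clue Extraction Instruction Files or DataScraper's internals. The point is that the engine stays generic, and supporting a new page layout means supplying new rules as data rather than writing new code.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction rules, expressed as data rather than code.
# "record" locates each repeating entry; "fields" maps a schema field
# to a path relative to that entry.
RULES = {
    "record": ".//li[@class='product']",
    "fields": {
        "name": "span[@class='name']",
        "price": "span[@class='price']",
    },
}

# Hypothetical target page, assumed here to be well-formed XHTML so that
# the standard-library XML parser can read it.
PAGE_XML = (
    "<html><body>"
    "<div id='ad'>Buy now!</div>"
    "<ul>"
    "<li class='product'><span class='name'>Kettle</span>"
    "<span class='price'>19.90</span></li>"
    "<li class='product'><span class='name'>Toaster</span>"
    "<span class='price'>34.50</span></li>"
    "</ul>"
    "</body></html>"
)


def run_rules(page_xml, rules):
    """Generic engine: apply the given rules to a page and return XML text."""
    root = ET.fromstring(page_xml)
    result = ET.Element("records")
    for node in root.findall(rules["record"]):
        record = ET.SubElement(result, "record")
        for field, path in rules["fields"].items():
            hit = node.find(path)
            value = ""
            if hit is not None and hit.text:
                value = hit.text.strip()
            ET.SubElement(record, field).text = value
    return ET.tostring(result, encoding="unicode")
```

Calling run_rules(PAGE_XML, RULES) yields an XML string with one record element per product. Pointing the same function at a differently structured page only requires a different rules dictionary.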




Resources

  1. If you want to know how to deploy DataScraper, please visit the MetaSeeker Installation Guide.
  2. If you want to learn how to operate DataScraper, please visit the DataScraper User's Guide.
  3. If you are trying to extract product lists or yellow pages, please follow the steps shown on page MetaSeeker Cook Book#Scenario 1 and Scenario 2.
  4. If you want to learn more about the internals of MetaSeeker, please visit Inside MetaSeeker.