Data and clue extraction from Web pages

After the data schema for a Web page has been defined, MetaStudio creates a series of Data and Clue Extraction Instruction Files (DCEIF) and a Data and Clue Extraction Workflow File (DCEWF). The former tell DataScraper where and how to extract data and clues; the latter drives DataScraper's workflow engine. The workflow engine runs in cycles, and the operator sets the number of cycles when starting the engine. At the beginning of every cycle, DataScraper loads a target page whose URL is taken from the next clue to be crawled. The workflow then runs according to the DCEWF to extract data and new clues from the target page. The extracted data are stored on the DataStore server as XML files, and the extracted clues are stored in the database on the server.
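The cycle described above can be illustrated with a small, self-contained sketch. It is not DataScraper's actual code or API: the names `Store`, `run_engine`, `load_page`, and `extract` are hypothetical, the DataStore server is replaced by an in-memory object, and the extraction rules that a DCEIF would supply are reduced to a plain function. Pages are faked with a dictionary so that the example runs without network access.

```python
from collections import deque
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Store:
    """Hypothetical stand-in for the DataStore server."""
    xml_files: list = field(default_factory=list)  # extracted data, one XML string per page
    clues: deque = field(default_factory=deque)    # clues (URLs) waiting to be crawled
    seen: set = field(default_factory=set)         # every clue ever recorded

    def add_clue(self, url):
        # Record a clue only once so the same page is not queued twice.
        if url not in self.seen:
            self.seen.add(url)
            self.clues.append(url)


def to_xml(url, record):
    """Serialize one extracted record as an XML string."""
    root = ET.Element("page", {"url": url})
    for name, value in record.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def run_engine(store, load_page, extract, cycles):
    """Run at most `cycles` cycles; return the number actually completed."""
    completed = 0
    while completed < cycles and store.clues:
        url = store.clues.popleft()         # take the next clue to be crawled
        page = load_page(url)               # load the target page
        record, new_clues = extract(page)   # extract data and new clues
        store.xml_files.append(to_xml(url, record))
        for clue in new_clues:
            store.add_clue(clue)
        completed += 1
    return completed


# Fake pages used in place of real Web requests (assumption for the demo).
PAGES = {
    "http://example.com/a": {"title": "A", "links": ["http://example.com/b", "http://example.com/c"]},
    "http://example.com/b": {"title": "B", "links": ["http://example.com/a"]},
    "http://example.com/c": {"title": "C", "links": []},
}


def fake_load(url):
    return PAGES[url]


def fake_extract(page):
    # Plays the role of the extraction instructions: one data field plus the links as clues.
    return {"title": page["title"]}, page["links"]


store = Store()
store.add_clue("http://example.com/a")
done = run_engine(store, fake_load, fake_extract, cycles=2)
print(done, list(store.clues))
```

In this sketch the operator's cycle count is the `cycles` argument, and the engine stops early if no clues remain. Clues found but not yet crawled stay queued in the store for a later run.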
