Steps to define a data schema

To extract Web data with the MetaSeeker toolkit, users define a data schema against a sample page and generate data extraction rules and clue extraction rules. These rules are uploaded to the MetaCamp server in the form of data extraction and clue extraction instruction files.

The instruction files contain a series of commands that direct DataScraper to extract data continuously from the Web. Although the extraction process resembles crawling the Web with an ordinary crawler, MetaSeeker extracts precise data snippets and clues from the Web and stores the results in XML files that preserve their semantics. This requires precise data and clue extraction rules to direct DataScraper. To define a data schema and generate the data and clue extraction rules, take the following steps:

  • Choose a sample page and load it;
  • Name the theme;
  • Define a data schema and data extraction rules;
  • Define clue extraction rules;
  • Define schema recognition rules.
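
The steps above can be illustrated with a minimal sketch. It does not use MetaStudio or the MetaSeeker instruction file format; it only imitates the idea with Python's standard library. The sample page, the theme name product_list, the THEME dictionary layout, and the functions matches_schema and extract are all hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Step 1: a (hypothetical) sample page, kept well-formed so the
# standard-library XML parser can load it.
SAMPLE_PAGE = """
<html><body>
  <div class="product">
    <span class="name">Widget</span>
    <span class="price">9.99</span>
  </div>
  <div class="product">
    <span class="name">Gadget</span>
    <span class="price">19.50</span>
  </div>
  <a class="next" href="/list?page=2">Next</a>
</body></html>
"""

# Steps 2-5: a hypothetical theme. This layout is an assumption made for
# illustration; it is not the MetaSeeker instruction file format.
THEME = {
    "name": "product_list",                      # step 2: theme name
    "record": ".//div[@class='product']",        # step 3: data schema root
    "fields": {                                  # step 3: data extraction rules
        "name": "span[@class='name']",
        "price": "span[@class='price']",
    },
    "clues": ".//a[@class='next']",              # step 4: clue extraction rule
    "recognition": [".//div[@class='product']"], # step 5: schema recognition
}


def matches_schema(page, theme):
    """Return True if every recognition path finds a node on the page."""
    return all(page.find(path) is not None for path in theme["recognition"])


def extract(page, theme):
    """Apply the data and clue rules; return (result XML element, clue URLs)."""
    result = ET.Element(theme["name"])
    for node in page.findall(theme["record"]):
        record = ET.SubElement(result, "record")
        for field, path in theme["fields"].items():
            ET.SubElement(record, field).text = node.findtext(path, default="")
    clues = [link.get("href") for link in page.findall(theme["clues"])]
    return result, clues


page = ET.fromstring(SAMPLE_PAGE.strip())
if matches_schema(page, THEME):
    result, clues = extract(page, THEME)
    print(ET.tostring(result, encoding="unicode"))
    print(clues)
```

In this sketch the clues are the links a crawler would follow next, and the recognition rule decides whether a newly loaded page fits the theme before any extraction is attempted.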

Defining a data schema for a sample page is also how users analyze and come to understand the semantics of the page. A page can be very large, with an HTML document that contains a great number of elements and attributes, so analysis remains complex work even though MetaStudio provides many convenient tools. As a result, users should expect to run several cycles of "analyze-verify-reanalyze" before they fully understand the semantics. The number of cycles depends on how complex the page's semantic structure is.
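
One cycle of analysis and verification can be sketched as follows. This is an illustration under stated assumptions, not a MetaStudio feature: the two sample pages, the rule dictionaries, and the verify function are hypothetical. The idea is to apply the draft rules to more than one sample page, see which fields come back empty, and revise the rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages sharing one semantic structure but with
# slightly different markup for the price.
SAMPLE_PAGES = {
    "page_a": "<html><body><h1>Widget</h1>"
              "<span class='price'>9.99</span></body></html>",
    "page_b": "<html><body><h1>Gadget</h1>"
              "<em class='price'>19.50</em></body></html>",
}

# First analysis: draft rules written while looking at page_a only.
DRAFT_RULES = {"title": ".//h1", "price": ".//span[@class='price']"}

# Reanalysis: the price rule no longer depends on the tag name.
REVISED_RULES = {"title": ".//h1", "price": ".//*[@class='price']"}


def verify(rules, pages):
    """Return {page name: [fields whose rule extracted no text]}."""
    failures = {}
    for name, markup in pages.items():
        root = ET.fromstring(markup)
        missing = [
            field
            for field, path in rules.items()
            if not (root.findtext(path) or "").strip()
        ]
        if missing:
            failures[name] = missing
    return failures


print(verify(DRAFT_RULES, SAMPLE_PAGES))    # {'page_b': ['price']}
print(verify(REVISED_RULES, SAMPLE_PAGES))  # {}
```

The more varied the pages that share a theme, the more such rounds are likely to be needed before the rules hold for all of them.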