Define data extraction rules

Data extraction rules tell DataScraper how to locate data snippets on target Web pages and how to format and store the results. After the user defines the data extraction rules, MetaStudio transforms them into XML instructions, which are stored in data extraction instruction files and consumed by DataScraper.

Defining data extraction rules involves the following major operations:

  • Map data snippets on the HTML page to properties. This tells MetaStudio where to locate the data snippets and which properties in the bucket should store them. All the mapping rules are transformed into XPath expressions and XSLT instructions, which are stored in the data extraction instruction files.
  • Specify the attributes of the properties. The attributes instruct DataScraper to extract data more precisely.
  • Set up the rules for recognizing Web pages that match a specific data schema.
As a result of the above operations, MetaStudio generates the Data and Clue Extraction Instruction Files (MAP and GEM files) and uploads them to the DataStore server.
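
The first operation, mapping page snippets to properties through XPath expressions, can be illustrated with a minimal sketch. It is not MetaStudio's actual file format or API: the sample page, the `MAPPING` dictionary, and the `extract` function are all hypothetical, and Python's standard `xml.etree.ElementTree` module (which supports only a subset of XPath) stands in for the XPath/XSLT processing that DataScraper performs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration).
PAGE = """
<html>
  <body>
    <div class="product">
      <h1>Blue Widget</h1>
      <span class="price">19.99</span>
    </div>
  </body>
</html>
"""

# Hypothetical mapping rules: property name -> XPath that locates the snippet.
MAPPING = {
    "title": ".//div[@class='product']/h1",
    "price": ".//div[@class='product']/span[@class='price']",
}

def extract(page_source, mapping):
    """Apply each XPath and store the matched text under its property name."""
    root = ET.fromstring(page_source.strip())
    bucket = {}
    for prop, xpath in mapping.items():
        node = root.find(xpath)
        # A property whose XPath matches nothing is stored as None.
        has_text = node is not None and node.text
        bucket[prop] = node.text.strip() if has_text else None
    return bucket

result = extract(PAGE, MAPPING)
print(result)
```

Each entry in the mapping plays the role of one mapping rule: the key names the property in the bucket, and the value says where on the page its data snippet is found.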

Note: As of V3.x, FreeFormat has been implemented. As a result, the ListBucket bucket type has been replaced by FreeFormat on the Bucket Editor work board. ListBucket is provided only by MetaSeeker Enterprise, for maintaining legacy Data Schemas created with earlier versions. If you are a MetaSeeker Enterprise user, please refer to the MetaStudio User's Guide V2.0.