Monitor status

DataScraper provides several facilities for monitoring its working status.

Logs

The Output region shows a list of messages and warnings. Each log entry has five columns:

  • Time: the time at which the log entry was emitted.
  • Level: one of four levels, each represented by a digit:
    1. Debug: numbered 1; the log entry is for debugging purposes.
    2. Information: numbered 2; something operators should know about.
    3. Warning: numbered 3; something unexpected happened, but it does not affect DataScraper's operation.
    4. Error: numbered 4; the operator must take action to handle the fault.
  • Clue Id: the id allocated by DataStore's database manager to a SpiderClue record.
  • Processor Name: the name of the processor that emitted the log message.
  • Message: the message body.
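
For operators who copy log entries out of the Output region for further processing, the five columns can be modelled roughly as follows. This is only a sketch: the class and field names are illustrative and not part of DataScraper, and storing the Clue Id as an integer and the time as text are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class LogLevel(IntEnum):
    """The four level digits listed above."""
    DEBUG = 1
    INFORMATION = 2
    WARNING = 3
    ERROR = 4


@dataclass
class LogEntry:
    """One row of the log list (hypothetical representation)."""
    time: str             # emission time, kept as displayed text
    level: LogLevel       # digit 1..4
    clue_id: int          # assumed to be numeric
    processor_name: str   # processor that emitted the message
    message: str          # message body


def needs_operator_action(entry: LogEntry) -> bool:
    """Only Error entries require the operator to handle a fault."""
    return entry.level == LogLevel.ERROR
```

A level digit read from the list can be converted with LogLevel(3), which yields LogLevel.WARNING.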

Log entries for selecting data schema

If multiple data schemas have been defined for a theme, DataScraper must select the one that matches the structure of the current page. Several log messages record this selection process. They may appear in the following format:

The AAAth validating rule in BBB didn't pass in CCCst inthread cycle

where:

  • AAA is the serial number of the data schema being tried.
  • BBB is the name of the Data Schema Recognition Rule File for this data schema.
  • CCC is the number of the page on which a data schema is being selected again.
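
If such messages are collected as text, the three fields can be pulled out with a regular expression. The sketch below assumes that AAA and CCC are decimal numbers followed by an ordinal suffix (st, nd, rd or th) and that the rest of the wording is exactly as shown above; the function and type names are illustrative only.

```python
import re
from typing import NamedTuple, Optional

# Assumed wording: taken from the message format shown above.
_RULE_FAILED = re.compile(
    r"The (\d+)(?:st|nd|rd|th) validating rule in (.+?) didn't pass "
    r"in (\d+)(?:st|nd|rd|th) inthread cycle"
)


class RuleFailure(NamedTuple):
    schema_number: int   # AAA
    rule_file: str       # BBB
    page_number: int     # CCC


def parse_rule_failure(message: str) -> Optional[RuleFailure]:
    """Return the AAA, BBB and CCC fields, or None if the message has another format."""
    match = _RULE_FAILED.search(message)
    if match is None:
        return None
    return RuleFailure(int(match.group(1)), match.group(2), int(match.group(3)))
```

Messages that do not follow this format yield None, so the function can be applied to every entry in the log list.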

When a theme has multiple data schemas, log messages like the one above do not indicate a fault. Only when every data schema has been tried without success is the following log message emitted:

Suitable schema file(dsd) cannot be found for this SpiderClue in CCCst inthread cycle

where CCC is the number of the page on which the data schemas have been tried.

This message means that none of the data schemas matched the structure of the current page, so no data was extracted from it. The status of the SpiderClue record with this Clue Id is set to unknownschema. The operator can load the page manually into MetaStudio to analyze its data structure; it may be necessary to define one more data schema for this theme, using this page as the sample.
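
The difference between the harmless per-schema messages and the final failure message can be checked mechanically. The following sketch scans the message bodies logged for one SpiderClue and reports the page number only when the "cannot be found" message is present. It assumes the wording shown above and an ordinal suffix after CCC; the function name is illustrative.

```python
import re
from typing import Iterable, Optional

# Assumed wording: taken from the failure message shown above.
_NO_SCHEMA = re.compile(
    r"Suitable schema file\(dsd\) cannot be found for this SpiderClue "
    r"in (\d+)(?:st|nd|rd|th) inthread cycle"
)


def page_without_schema(messages: Iterable[str]) -> Optional[int]:
    """Return CCC from the first 'no suitable schema' message, or None.

    Messages about a single validating rule not passing are ignored,
    because they do not indicate a fault on their own.
    """
    for message in messages:
        match = _NO_SCHEMA.search(message)
        if match:
            return int(match.group(1))
    return None
```

A return value of None means that schema selection did not ultimately fail for the messages given.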

Note: This release has no GUI-based way to query SpiderClue records whose status is unknownschema. The operator must access the MySQL database system and query the record by its id.
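
The lookup by id might be scripted along the following lines. The table name spider_clue and the column names id and status are hypothetical, because the actual DataStore schema is not described here, and an in-memory SQLite database stands in for MySQL so that the sketch runs without a server. With a MySQL driver the placeholder is usually %s rather than ?.

```python
import sqlite3

# HYPOTHETICAL names: replace with the real DataStore table and columns.
TABLE = "spider_clue"


def clue_status(conn, clue_id):
    """Return the status of the record with the given id, or None if absent."""
    row = conn.execute(
        f"SELECT status FROM {TABLE} WHERE id = ?", (clue_id,)
    ).fetchone()
    return row[0] if row else None


def make_demo_db():
    """Build a throwaway database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {TABLE} (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        f"INSERT INTO {TABLE} VALUES (?, ?)",
        [(1, "unknownschema"), (2, "placeholder_status")],
    )
    return conn
```

The id passed to clue_status is the value shown in the Clue Id column of the log entry.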



Progress of data extraction task

The progress of a data extraction task is shown on the Status panel of DataScraper. The following items are displayed:

  • Theme: the theme name.
  • Start: the start time of the task.
  • Total Clues: the total number of clues to be crawled.
  • Left: the number of clues still to be crawled. When it reaches 0, the task is finished.
  • MetaCamp Server: the connection status of the MetaCamp server. If the image is displayed, the connection is OK.
  • DataStore Server: the connection status of the DataStore server. If the image is displayed, the connection is OK.
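
The Total Clues and Left values are enough to estimate how far a task has advanced. A minimal sketch follows, assuming that Total Clues does not change while the task runs; the function names are illustrative and not part of DataScraper.

```python
def progress(total_clues: int, left: int) -> float:
    """Fraction of clues already crawled, from 0.0 to 1.0."""
    if total_clues < 0 or not 0 <= left <= total_clues:
        raise ValueError("expected 0 <= left <= total_clues")
    if total_clues == 0:
        return 1.0  # nothing to crawl, so nothing is outstanding
    return (total_clues - left) / total_clues


def is_finished(left: int) -> bool:
    """The task is finished when no clues are left."""
    return left == 0
```

For example, a task with 200 clues in total and 50 left is 75% complete.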