Workflow processors

The following processors are supported in the current release:

Processor Name Description
MigrateWorksBucket If the extracted data are paginated, DataScraper has to turn the pages one by one, which it does by means of the in-thread clue concept. That is, when data snippets are extracted from the current page, an in-thread clue pointing to the next page is also extracted and put in the global context container. When every processor has been scheduled once, one cycle is finished. At that point DataScraper decides whether to run another cycle by checking the global context container for an in-thread clue. If there is one, DataScraper starts another cycle, whose first processor, MigrateWorksBucket, migrates the in-thread clue from the global context container to its run-time memory. If there is none, DataScraper stops this thread and may start another thread if more clues that are not of the in-thread type are waiting to be crawled.
FetchSpiderClue This processor first tries to fetch an in-thread clue from the run-time memory, where MigrateWorksBucket may have placed one. If there is none, it tries to fetch a clue from the DataStore server.
LoadHtmlPage The processor loads the page targeted by the current clue.
FindDataSchema_Plain The processor checks whether the current page's structure matches the data schema defined by MetaStudio. If it does, the data and clue extraction instruction files are loaded into run-time memory and the page is extracted by the succeeding processors. Otherwise, the clue's status is changed to unknownschema on the DataStore server.
ExtractWebNodeData_Simp The processor extracts data snippets from the current page according to the Data Extraction Instruction File (MAP file) and formats the result into an XML document, which is stored temporarily in the global context container.
ValidateExtraction The processor validates the result against the Data Schema Recognition Rule File (DSD file). If the result fails validation, the clue's status is changed to unknownschema on the DataStore server and the result is discarded.
SaveFile_Simp The processor fetches the result file and sends it to the DataStore server.
ExtractSpiderClue_Simp The processor extracts clues from the current page according to the Clue Extraction Instruction File (SCE file). Clues that are not of the in-thread type are stored on the DataStore server, while the in-thread clue is stored in the global context container. During extraction, if a clue's theme name differs from the current theme, the processor asks the MetaCamp server to allocate a new record for that theme; the theme's status is torecognize if its data schema has not been defined.
ConfirmSpiderClue_Simp The processor asks the DataStore server to change the status of the current clue to extracted, because it has been crawled.
CleanWorksBucket The processor cleans out-of-date data from the global context container without touching the in-thread clue.
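
The cycle described in the table can be sketched as follows. This is a minimal illustration in Python, not DataScraper's actual code: the function names, the dictionaries standing in for the global context container and the run-time memory, the queue standing in for the DataStore server, and the sample page data are all assumptions made for the example. Schema matching, validation, and clue status updates are left out to keep the loop visible.

```python
from collections import deque

# Hypothetical paginated content: each page has data and an optional next page.
PAGES = {
    "page1": {"data": ["a", "b"], "next": "page2"},
    "page2": {"data": ["c"], "next": None},
}


def run_thread(first_clue, pages):
    """Run one crawl thread and return (saved results, number of cycles)."""
    context = {}                 # stands in for the global context container
    memory = {}                  # stands in for the run-time memory
    queue = deque([first_clue])  # stands in for clues on the DataStore server
    saved = []                   # stands in for results sent to the DataStore server

    def migrate_works_bucket():
        # Move a pending in-thread clue from the context into run-time memory.
        memory["clue"] = context.pop("in_thread_clue", None)

    def fetch_spider_clue():
        # Prefer the migrated in-thread clue; otherwise take one from the queue.
        if memory["clue"] is None and queue:
            memory["clue"] = queue.popleft()

    def load_and_extract():
        # Load the page, keep its data, and record the next-page clue if any.
        page = pages[memory["clue"]]
        context["result"] = list(page["data"])
        if page["next"] is not None:
            context["in_thread_clue"] = page["next"]

    def save_file():
        saved.append((memory["clue"], context["result"]))

    def clean_works_bucket():
        # Drop out-of-date data but leave the in-thread clue untouched.
        context.pop("result", None)

    chain = [
        migrate_works_bucket,
        fetch_spider_clue,
        load_and_extract,
        save_file,
        clean_works_bucket,
    ]

    cycles = 0
    while True:
        for processor in chain:   # every processor is scheduled once per cycle
            processor()
        cycles += 1
        if "in_thread_clue" not in context:
            break                 # no next page: this thread stops
    return saved, cycles


if __name__ == "__main__":
    results, cycle_count = run_thread("page1", PAGES)
    print(cycle_count, results)
```

In this sketch the only state that survives from one cycle to the next is the in-thread clue, which is why the cleaning step must not remove it.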