Work flow processors | GooSeeker

Work flow processors

Fri, 09/05/2008 - 11:39 — Fuller

The following processors are supported in the current release:

Processor Name	Description
MigrateWorksBucket	If the extracted data are paginated, DataScraper has to turn the pages one by one, which it does using the concept of an in-thread clue. That is, when data snippets are extracted from the current page, an in-thread clue pointing to the next page is also extracted and put in the global context container. When every processor has been scheduled once, one cycle is finished. At that point, DataScraper determines whether another cycle is needed by checking the global context container for an in-thread clue. If there is one, DataScraper starts another cycle, whose first processor is MigrateWorksBucket; it migrates the in-thread clue from the global context container to its run-time memory. If there is none, DataScraper stops this thread and may start another thread if clues of other types (not in-thread) are still waiting to be crawled.
FetchSpiderClue	The processor first tries to fetch an in-thread clue from the run-time memory, where MigrateWorksBucket may have placed one. If there is none, it tries to fetch a clue from the DataStore server.
LoadHtmlPage	The processor loads the page targeted by the current clue.
FindDataSchema_Plain	The processor checks whether the current page's structure matches the data schema defined by MetaStudio. If it does, the data and clue extraction instruction files are loaded into run-time memory and the page is extracted by the succeeding processors. Otherwise, the clue's status is changed to unknownschema on the DataStore server.
ExtractWebNodeData_Simp	The processor extracts data snippets from the current page according to the Data Extraction Instruction File (MAP file) and formats the result into an XML document, which is stored temporarily in the global context container.
ValidateExtraction	The processor validates the result against the Data Schema Recognition Rule File (DSD file). If the result fails validation, the clue's status is changed to unknownschema on the DataStore server and the result is discarded.
SaveFile_Simp	The processor fetches the result file and sends it to the DataStore server.
ExtractSpiderClue_Simp	The processor extracts clues from the current page according to the Clue Extraction Instruction File (SCE file). Clues that are not of the in-thread type are stored on the DataStore server, while the in-thread clue is stored in the global context container. During extraction, if a clue's theme name differs from the current theme, the processor asks the MetaCamp server to allocate a new record for that theme; the record's status is torecognize if the theme's data schema has not been defined yet.
ConfirmSpiderClue_Simp	The processor asks the DataStore server to change the status of the current clue to extracted, because it has been crawled.
CleanWorksBucket	The processor cleans out-of-date data from the global context container without touching the in-thread clue.
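
The cycle described for MigrateWorksBucket and FetchSpiderClue can be illustrated with a small simulation. This is only a sketch and not DataScraper's actual code: the State class, the function names, the datastore list and the next_page mapping are all hypothetical stand-ins, and the processors that load, validate and save pages are left out.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical stand-ins for the containers named in the table."""
    context: dict = field(default_factory=dict)   # global context container
    memory: dict = field(default_factory=dict)    # run-time memory
    crawled: list = field(default_factory=list)   # pages visited, for inspection


def migrate_works_bucket(state):
    # Move the in-thread clue (if any) from the context container to memory.
    state.memory["clue"] = state.context.pop("in_thread_clue", None)


def fetch_spider_clue(state, datastore):
    # Prefer the migrated in-thread clue; otherwise ask the fake DataStore.
    if state.memory.get("clue") is None and datastore:
        state.memory["clue"] = datastore.pop(0)


def extract_spider_clue(state, next_page):
    # Record the page, then store a next-page clue in the context container.
    page = state.memory["clue"]
    state.crawled.append(page)
    if page in next_page:
        state.context["in_thread_clue"] = next_page[page]


def run_thread(datastore, next_page):
    """Run cycles until no in-thread clue is left; return the pages crawled."""
    state = State()
    while True:
        migrate_works_bucket(state)
        fetch_spider_clue(state, datastore)
        if state.memory["clue"] is None:
            break  # nothing to crawl at all
        extract_spider_clue(state, next_page)
        if "in_thread_clue" not in state.context:
            break  # cycle finished with no in-thread clue: stop this thread
    return state.crawled
```

Calling run_thread with a datastore holding one paginated listing and one unrelated clue walks through every page of the listing and leaves the unrelated clue on the fake server, where a later thread could pick it up.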
