The following processors are supported in the current release:
Processor Name | Description |
MigrateWorksBucket | If the extracted data are paginated, DataScraper has to turn the pages one by one, which it does by means of the in-thread clue. When data snippets are extracted from the current page, an in-thread clue pointing to the next page is extracted as well and put in the global context container. Once every processor has been scheduled once, a cycle is complete. DataScraper then checks the global context container for an in-thread clue to decide whether another cycle is needed. If there is one, DataScraper starts another cycle whose first processor, MigrateWorksBucket, migrates the in-thread clue from the global context container to its run-time memory. If there is none, DataScraper stops this thread and may start another thread if more clues that are not of the in-thread type are waiting to be crawled. |
FetchSpiderClue | The processor first tries to fetch an in-thread clue from the run-time memory, where MigrateWorksBucket may have placed one. If there is none, it tries to fetch a clue from the DataStore server. |
LoadHtmlPage | The processor loads the page targeted by the current clue. |
FindDataSchema_Plain | The processor checks whether the current page's structure matches the data schema defined by MetaStudio. If it does, the data and clue extraction instruction files are loaded into run-time memory and the page is extracted by the succeeding processors. Otherwise, the clue's status is changed to unknownschema on the DataStore server. |
ExtractWebNodeData_Simp | The processor extracts data snippets from the current page according to the Data Extraction Instruction File (MAP file) and formats the result into an XML document, which is stored temporarily in the global context container. |
ValidateExtraction | The processor validates the result against the Data Schema Recognition Rule File (DSD file). If the result fails validation, the clue's status is changed to unknownschema on the DataStore server and the result is discarded. |
SaveFile_Simp | The processor fetches the result file and sends it to the DataStore server. |
ExtractSpiderClue_Simp | The processor extracts clues from the current page according to the Clue Extraction Instruction File (SCE file). Clues that are not of the in-thread type are stored on the DataStore server, while the in-thread clue is stored in the global context container. During extraction, if a clue's theme name differs from the current theme, the processor asks the MetaCamp server to allocate a new record for that theme; the record's status is torecognize if the theme's data schema has not been defined yet. |
ConfirmSpiderClue_Simp | The processor asks the DataStore server to change the status of the current clue to extracted, because the clue has been crawled. |
CleanWorksBucket | The processor cleans out-of-date data from the global context container without touching the in-thread clue. |
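
The way the processor chain repeats for paginated data can be illustrated with a minimal Python sketch. It is not DataScraper's actual code: the `Context` class, the `in_thread_clue` key, the `PAGES` data, and all function names are hypothetical stand-ins, and only a few of the processors in the table are modelled. It assumes the chain is rerun for as long as an in-thread clue is left in the global context container.

```python
from typing import Dict, List, Optional, Tuple

IN_THREAD_KEY = "in_thread_clue"  # hypothetical key name


class Context:
    """Stand-in for the global context container and the run-time memory."""

    def __init__(self) -> None:
        self.global_container: Dict[str, object] = {}
        self.runtime: Dict[str, object] = {}


def run_thread(store_clues: List[str],
               pages: Dict[str, Tuple[str, Optional[str]]]) -> List[str]:
    """Run the chain repeatedly until no in-thread clue is left."""
    ctx = Context()
    saved: List[str] = []  # stands in for results sent to the DataStore server

    def migrate_works_bucket() -> None:
        # Move the in-thread clue (if any) into run-time memory.
        ctx.runtime["clue"] = ctx.global_container.pop(IN_THREAD_KEY, None)

    def fetch_spider_clue() -> None:
        # Fall back to the clue store when there is no in-thread clue.
        if ctx.runtime.get("clue") is None and store_clues:
            ctx.runtime["clue"] = store_clues.pop(0)

    def extract() -> None:
        clue = ctx.runtime.get("clue")
        if clue is None:
            return
        data, next_page = pages[clue]
        ctx.global_container["result"] = data
        if next_page is not None:
            # The next page becomes the in-thread clue for the next pass.
            ctx.global_container[IN_THREAD_KEY] = next_page

    def save_file() -> None:
        if "result" in ctx.global_container:
            saved.append(str(ctx.global_container["result"]))

    def clean_works_bucket() -> None:
        # Drop stale data but keep the in-thread clue.
        kept = ctx.global_container.get(IN_THREAD_KEY)
        ctx.global_container.clear()
        if kept is not None:
            ctx.global_container[IN_THREAD_KEY] = kept

    chain = [migrate_works_bucket, fetch_spider_clue, extract,
             save_file, clean_works_bucket]
    while True:
        for processor in chain:
            processor()
        if IN_THREAD_KEY not in ctx.global_container:
            break  # no next page: the thread stops
    return saved


# Hypothetical three-page listing: clue -> (data snippet, next page or None)
PAGES = {
    "page1": ("a", "page2"),
    "page2": ("b", "page3"),
    "page3": ("c", None),
}

if __name__ == "__main__":
    print(run_thread(["page1"], PAGES))
```

In this sketch a single clue taken from the store drives three passes of the chain, one per page, and the thread stops after the page that yields no in-thread clue.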