Data Extraction Pattern

A data schema, which specifies the semantics of the data blocks on target pages, is used to generate a series of data and clue extraction rules. In some complex cases, however, data blocks with the same semantics are presented in different formats on the same page. Take a product list on an eCommerce site as an example. On a single page, some products might show their weight and others might not. If the property weight is a key, i.e. it always exists in the result files, the rules might fail when they encounter a product without a weight. To resolve this problem, multiple Data Extraction Patterns are defined. This means that one Bucket takes several different presentation patterns, and the bucket's properties might have different attributes in each pattern. In the example above, the attribute null of the property weight takes the value FALSE in one pattern and the value TRUE in another.
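The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule format the tool actually generates: the dictionaries, the property names name and price, and the functions matches and extract are all hypothetical. Each pattern maps a property of the bucket to its null attribute, and a data block is extracted with the first pattern whose required properties are all present.

```python
from typing import Optional

# Hypothetical model: each pattern maps a property name to its "null" attribute
# (True = the property may be absent, False = it must be present).
PATTERN_WITH_WEIGHT = {"name": False, "price": False, "weight": False}
PATTERN_WITHOUT_WEIGHT = {"name": False, "price": False, "weight": True}
PATTERNS = [PATTERN_WITH_WEIGHT, PATTERN_WITHOUT_WEIGHT]


def matches(pattern: dict, block: dict) -> bool:
    """A pattern matches when every non-nullable property is present in the block."""
    return all(
        nullable or block.get(prop) is not None
        for prop, nullable in pattern.items()
    )


def extract(block: dict, patterns: list = PATTERNS) -> Optional[dict]:
    """Try each pattern in order and return the record from the first match."""
    for pattern in patterns:
        if matches(pattern, block):
            # Every property of the bucket appears in the result;
            # properties missing from the block are emitted as None.
            return {prop: block.get(prop) for prop in pattern}
    return None  # no pattern fits this data block


# Two products from the same page, one of them without a weight.
products = [
    {"name": "Kettle", "price": "19.90", "weight": "1.2 kg"},
    {"name": "Gift card", "price": "25.00"},
]
records = [extract(p) for p in products]
```

With only PATTERN_WITH_WEIGHT defined, the second product would not be extracted at all; adding the second pattern lets it through while the weight property still exists in every result record.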

Multiple Data Extraction Patterns are used only when data blocks with the SAME semantics, i.e. a single data schema, appear in different presentation formats on the same page. This case should be distinguished from the one in which a page contains multiple data blocks with different semantics, for which multiple data schemas should be defined.
