Phase 3: extract detailed information

While data and clues are extracted in Phase 2, a new theme, Product_mic_en, is created. By following the clues that belong to this theme, the pages with detailed product information can be visited and extracted. This phase continues from Phase 2. Because the steps for defining a data schema and data extraction rules are the same as those described in Phase 2, only the significant steps are described in this chapter.

Recognize the theme

On the Theme List work board, select the row for Product_mic_en, right-click it, and choose recognize from the pop-up menu to load a sample page. Figure 1 shows the Theme Editor work board after the menu item is clicked.


Figure 1

Tips: As the theme list grows, scanning it row by row for a specific theme becomes time-consuming. MetaStudio has a query feature that finds a subset of themes. A query condition can be an exact theme name or a string containing the wildcard character *, e.g. *Page*, ComPage* or *Page.
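
The wildcard conditions in the tip can be illustrated with a minimal sketch. It does not show how MetaStudio implements its query feature; it only mimics the matching behaviour with Python's standard fnmatch module. The function name query_themes and every theme name except Product_mic_en are hypothetical. Note that fnmatch also treats ? and [...] as special characters, which the tip does not mention.

```python
from fnmatch import fnmatchcase


def query_themes(theme_names, condition):
    """Return the theme names matching a condition in which * stands for any run of characters."""
    # fnmatchcase compares case-sensitively and must match the whole name,
    # so a condition without * behaves as an exact-name query.
    return [name for name in theme_names if fnmatchcase(name, condition)]


# Hypothetical theme list; only Product_mic_en comes from this tutorial.
themes = ["Product_mic_en", "ComPage_list", "HomePage", "NewsPageIndex"]

print(query_themes(themes, "*Page*"))          # names containing "Page"
print(query_themes(themes, "ComPage*"))        # names starting with "ComPage"
print(query_themes(themes, "*Page"))           # names ending with "Page"
print(query_themes(themes, "Product_mic_en"))  # exact name
```

Placing the * on both sides, only at the end, or only at the start corresponds to a "contains", "starts with", or "ends with" query respectively.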



Define data and clue extraction rules

Two buckets are created: one for the detailed product information and another for the titles and clues of other products. The former has only one replica. In contrast, the latter has two replicas because it is formatted as a list. Figure 2 shows the Bucket Editor work board for bucket ProductInfo after mapping, and Figure 3 shows it for bucket OtherProduct. Figure 4 shows the information on the Info clue created when bucket OtherProduct was defined.


Figure 2



Figure 3



Figure 4


In bucket ProductInfo, the attribute block of property description is set and its type is Text, which means that all text within the scope is extracted.
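
What "all text within a scope" means can be sketched outside MetaStudio. The example below is an illustration under stated assumptions, not the tool's implementation: it assumes a well-formed page fragment, treats the element with a given id as the scope, and joins every piece of text found inside it. The function name extract_block_text, the sample page, and the id value "description" are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real product pages will differ.
SAMPLE_PAGE = (
    "<div>"
    "<h1>Sample microphone</h1>"
    "<div id='description'>"
    "<p>Condenser capsule.</p>"
    "<ul><li>USB output</li><li>Includes stand</li></ul>"
    "</div>"
    "<div id='other'>Other products</div>"
    "</div>"
)


def extract_block_text(page, scope_id):
    """Return all text inside the element whose id equals scope_id, joined by spaces."""
    root = ET.fromstring(page)
    scope = root.find(f".//*[@id='{scope_id}']")
    if scope is None:
        return ""
    # itertext() walks the scope element and all of its descendants in document
    # order, so nested markup is flattened into one run of text.
    return " ".join(piece.strip() for piece in scope.itertext() if piece.strip())


print(extract_block_text(SAMPLE_PAGE, "description"))
# prints: Condenser capsule. USB output Includes stand
```

The heading and the "other" block lie outside the chosen scope, so their text is not included in the result.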