Add or delete properties

A newly created bucket contains no properties. Users should add properties to the bucket first; the properties can be modified or deleted later.

Right-clicking the property mapping table, in the left column of the work board, opens a two-level popup menu. Currently the first level has only one menu item:

  • Property: Moving the cursor over it expands the second-level menu, through which the user can create, edit, and delete a property.

Add a new property

Clicking the menu item Property->Create creates a new property and opens a dialog in which the user can edit the attributes of this property. They are:

  • property name: Each property must be given a name, a character string that conveys its meaning. The string should consist of English letters, in upper or lower case. White space is permitted between words, but it is replaced with underscores in the data and clue extraction instruction files and in the result files, because the name serves as the tag of an XML element in the instruction file and white space is not permitted in tags. Keep this in mind if you write software that processes the instruction files or the extraction results.
  • key: marks the property as a key. Unlike a key in a relational database, a key here means that the property must exist on the target page. There are two kinds of keys, Validation keys and Data keys. A Validation key is used to recognize target pages against the current data schema: if its data snippet cannot be located on the target page, the page is considered unrecognizable, the status of the SpiderClue record is changed to unknownschema, and the data extraction thread stops. A Data key is used to validate the extraction result: the result is discarded if no data snippet is found for this property. The Validation key is stricter and covers the Data key. If neither key is set and no data snippet is found on the target page, the value of this property is filled with the reserved word geometa_NAV.
  • clue: marks the property as a source of clues. If the checkbox is ticked, DataScraper extracts a clue from this property during the clue extraction phase. For example, if the property's value is a URL, the URL string is stored as the extracted data and a new clue is also created for that URL. After the checkbox has been ticked on this work board, an Info clue is inserted on the Clue Editor work board.
  • url: marks the property's value as a URL. Such a property is handled specially; for example, a URL that was originally a relative path to a Web resource is completed into an absolute URL with the protocol name, e.g. http. If a property has the clue attribute, this checkbox is ticked automatically.
  • block: means that a fragment of the HTML document, rather than a single HTML DOM node, is extracted. When an HTML fragment is extracted, MetaSeeker provides a filtering mechanism through which only specific types of DOM nodes in the fragment are extracted, for example, all IMG elements and their attributes under a DIV element. The supported block types are listed in MetaStudio Senior User's Handbook#Property's attributes.
  • null: marks the property as skipped. A property with this attribute is skipped during data extraction and does not appear in the extraction result files. This attribute is used when there are multiple patterns for a single data schema and each pattern skips different properties.
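
For readers who plan to write software around the instruction files or the extraction results, the name, url, and key rules above can be sketched in Python. This is only an illustration under stated assumptions, not MetaSeeker code: the function names and the return label "discard_result" are hypothetical, the manual does not say whether a run of several white spaces becomes one underscore or several (one is assumed here), and the standard library's urljoin stands in for whatever URL completion MetaSeeker performs.

```python
import re
from urllib.parse import urljoin

NAV = "geometa_NAV"  # reserved word named in this manual


def property_name_to_tag(name: str) -> str:
    """Turn a property name into an XML-tag-safe string.

    Assumption: each run of white space becomes a single underscore.
    """
    return re.sub(r"\s+", "_", name.strip())


def complete_url(page_url: str, value: str) -> str:
    """Complete a relative URL against the URL of the page it was found on.

    An already absolute URL is returned unchanged.
    """
    return urljoin(page_url, value)


def handle_missing_snippet(validation_key: bool, data_key: bool) -> str:
    """Describe what happens when no data snippet is found for a property.

    The returned labels are hypothetical, except "unknownschema" (the
    SpiderClue status) and the reserved word geometa_NAV.
    """
    if validation_key:
        # Page is not recognized; extraction of this page stops.
        return "unknownschema"
    if data_key:
        # The extraction result is discarded.
        return "discard_result"
    # Not a key: the property's value is filled with the reserved word.
    return NAV
```

For example, property_name_to_tag("company page") returns "company_page", and complete_url("http://www.example.com/list/index.html", "../company/42.html") returns "http://www.example.com/company/42.html" (example.com is a placeholder host).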

After the attributes of the property have been edited and the Save button has been clicked, a new property row is appended to the end of the property mapping table. To insert the new row before an existing one, select the existing row in advance by ticking the checkbox at the beginning of that row.
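
The append-or-insert behavior can be modeled with a plain list. The sketch below is a hypothetical model of the property mapping table, not MetaStudio's implementation; the function name add_property_row is invented for illustration, and it assumes that at most one row is ticked when the new property is saved.

```python
def add_property_row(rows, new_row, ticked_index=None):
    """Return a new row list with new_row added.

    If no row is ticked (ticked_index is None), the new row is appended
    to the end; otherwise it is inserted before the ticked row.
    """
    result = list(rows)  # copy, so the caller's list is left untouched
    if ticked_index is None:
        result.append(new_row)
    else:
        result.insert(ticked_index, new_row)
    return result
```

With rows ["name", "introduction"], adding "credit" with no row ticked gives ["name", "introduction", "credit"], while adding "company page" with row 1 ticked gives ["name", "company page", "introduction"].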

When a property row is first created, the serial number of the mapped DOM node, shown in the node column of the property mapping table, is empty. The serial number is assigned when a data snippet on the sample page is mapped to the property, as described in the chapter Map property.

Delete properties

Selecting the menu item Property->Delete deletes one or more property rows. Tick the rows to be deleted in advance; otherwise nothing is done.

Note: Each bucket must contain at least one key property. Otherwise, previewing the instruction files or uploading the data schema to the MetaCamp server fails.
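
The deletion rule and the note about key properties can be checked together in a small sketch. The Property class and both function names are hypothetical and only model the behavior described here; treating a property as a key when either the Validation or the Data checkbox is set is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Property:
    name: str
    validation_key: bool = False
    data_key: bool = False


def delete_ticked(properties, ticked):
    """Return the rows that remain after deleting the ticked row indices.

    With no row ticked, nothing is deleted.
    """
    return [p for i, p in enumerate(properties) if i not in ticked]


def has_key_property(properties) -> bool:
    """Check that the bucket still has at least one key property."""
    return any(p.validation_key or p.data_key for p in properties)
```

A tool could run has_key_property on the result of delete_ticked and warn the user before a preview or upload is attempted with no key left in the bucket.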



Exercises

Click the newBckt button and enter company as the bucket's name.

The bucket's name appears in the GEM and MAP instruction files as the Bean's name.

Create the following properties and their attributes:

Property Name      Key                 Clue  Url  Block  Null
name               Validation & Data   No    No   No     No
company page       Validation & Data   Yes   Yes  No     No
introduction       Validation & Data   No    No   No     No
business type      No                  No    No   No     No
register date      No                  No    No   No     No
register capital   No                  No    No   No     No
credit             No                  No    No   No     No