Why are there control codes in the results?

It is well known that the control codes LF and CR (0x0A and 0x0D) do not produce line breaks in an HTML page: the HTML standard treats them as ordinary white space in the page source. Nevertheless, these control codes are found in the result files generated by MetaSeeker, mainly when data snippets are extracted for a property whose block attribute is of type text.

If the author of an HTML document wants to insert line breaks within a paragraph, he has to append the HTML tag <br> to each line to be broken. For example, an author who wants to present many lines of code appends a <br> to each line. After the page has been loaded into a browser, these tags are translated into the codes LF and CR.

MetaSeeker uses the XSLT instruction xsl:value-of to extract all the text in this block. As a result, the control codes are also extracted and stored.

Advantages

With these control codes in place, a program can post-process the results, e.g. translate the control codes back into <br> tags. Without them, the program would not know where the lines should be broken.
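As a rough illustration of how a program could use the control codes to recover the original lines, here is a minimal Python sketch. The function name split_lines and the sample snippet are hypothetical and not part of MetaSeeker; the sketch also assumes that line ends may appear as CR LF, a lone CR, or a lone LF.

```python
import re


def split_lines(extracted):
    """Split extracted text wherever a CR LF pair, a lone CR, or a lone LF occurs."""
    # Try the two-character sequence first so CR LF counts as one break.
    return re.split(r"\r\n|\r|\n", extracted)


# Hypothetical extracted snippet: three lines of code separated by CR LF.
snippet = "int a = 1;\r\nint b = 2;\r\nreturn a + b;"
lines = split_lines(snippet)
```

Each element of lines then corresponds to one line of the original block, so the program can process or re-tag the lines individually.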

Disadvantages

If the extracted results are to be pasted into a new HTML page, e.g. in an enterprise portal service, they should not be used directly. A filter should first translate the control codes into suitable tags or characters.
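Such a filter might look like the following Python sketch. It is only one possible approach, not something MetaSeeker provides: the function name text_to_html is hypothetical, and the sketch assumes the extracted value is plain text, so it escapes HTML special characters before turning each line end into a <br> tag.

```python
import html
import re


def text_to_html(extracted):
    """Prepare extracted plain text for embedding in an HTML page."""
    # Escape &, <, > and quotes so the text cannot be read as markup.
    escaped = html.escape(extracted)
    # Replace each CR LF, lone CR, or lone LF with a <br> tag; the trailing
    # newline only keeps the generated HTML source readable.
    return re.sub(r"\r\n|\r|\n", "<br>\n", escaped)


# Hypothetical extracted value containing a CR LF line end.
fragment = text_to_html("a < b\r\nc & d")
```

Escaping before inserting the tags matters: done in the opposite order, the freshly inserted <br> tags would themselves be escaped and shown as literal text.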

Alternatives

The whole HTML fragment, with its original elements and attributes untouched, can be extracted instead by defining a property whose block attribute is of type all.