An Explanation of Web Information Extraction

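An article that has been circulating online recently devotes much of its length to Web information extraction. I think it is a good piece, so I am excerpting the following passage: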

How Web Scraping Works

How a Page Scraper Works

Web Scraping is essentially reverse engineering of HTML pages. It can also be thought of as parsing out chunks of information from a page. Web pages are coded in HTML, which uses a tree-like structure to represent the information. The actual data is mingled with layout and rendering information and is not readily available to a computer. Scrapers are the programs that “know” how to get the data back from a given HTML page. They work by learning the details of the particular markup and figuring out where the actual data is. For example, in the illustration below the scraper extracts URLs from the del.icio.us page. By applying such a scraper, it is possible to discover what URLs are tagged with any given tag.

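Web scraping is essentially reverse engineering of HTML pages; it can also be viewed as parsing a page. Web pages are encoded in HTML, which represents information as a tree structure. The actual data is mixed with layout code and rendering information, so a computer cannot use it directly. A scraper is a program that "knows" how to get the data out of a given HTML page. It locates the actual data by analyzing the particular markup the page uses. For example, the illustration below shows how a scraper extracts URLs from a del.icio.us page. With such a scraper, we can find the links that are tagged with any given tag.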
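
To make the idea concrete, here is a minimal sketch of such a scraper in Python, using only the standard library's html.parser module. The markup in SAMPLE_PAGE is invented for illustration and is not the real del.icio.us page structure; the class name "bookmark" and the helper names BookmarkLinkParser and extract_bookmark_urls are likewise assumptions. The point is only that the scraper "knows" one detail of the markup (data links carry a particular class) and uses it to separate data from layout.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for illustration; not the real del.icio.us page.
SAMPLE_PAGE = """
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <ul class="bookmarks">
    <li><a class="bookmark" href="http://example.com/a">First</a></li>
    <li><a class="bookmark" href="http://example.org/b">Second</a></li>
  </ul>
</body></html>
"""


class BookmarkLinkParser(HTMLParser):
    """Collects href values from <a> tags whose class list includes "bookmark"."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Ignore everything except anchor tags.
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        # Navigation links lack the assumed "bookmark" class and are skipped.
        if "bookmark" in classes and href:
            self.urls.append(href)


def extract_bookmark_urls(html):
    """Return the bookmark URLs found in the given HTML string, in page order."""
    parser = BookmarkLinkParser()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    print(extract_bookmark_urls(SAMPLE_PAGE))
```

A scraper like this is tied to the markup it was written for: if the site renames the class or restructures the list, the rule in handle_starttag has to be updated, which is why the passage describes scraping as reverse engineering.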


Source: a translation by a senior classmate