Inferring templates from a set of strings

Published 2024-11-14 01:14:40

I am indexing a set of websites which have a very large number of pages (tens of millions) that are generated from a small number of templates. I am looking for an algorithm to learn the templates that the pages were generated from and to match templates to pages so that I need to store only the variable part and a template reference for each page fetched.

The algorithm need not produce the greatest compression possible, but it should hopefully become better as it sees more pages and it should behave gracefully when faced with a page generated using a previously unseen template.

I would greatly appreciate any references to literature or existing libraries.

I could run a general purpose compression algorithm on batches of pages. The reason I do not want to do that is that the data of interest to me would be in the variable part of the pages and so the template approach would allow me to retrive it without uncompressing. I want to able to recreate the full page if needed both to ensure future replicability and to guard against bugs in my scraping program.

Comments (4)

爱格式化 2024-11-21 01:14:40

In some circles, this problem is known as "HTML Wrapper Induction" or "Wrapper Learning". In this paper you can find an interesting -- albeit old -- review along with links to some commercial applications: http://www.xrce.xerox.com/Research-Development/Historical-projects/IWRAP-Intelligent-Wrapper-Learning-Tools

You may be interested in this Python library: http://code.google.com/p/templatemaker/ "Well, say you want to get the raw data from a bunch of Web pages that use the same template -- like restaurant reviews on Yelp.com, for instance. You can give templatemaker an arbitrary number of HTML files, and it will create the "template" that was used to create those files." (http://www.holovaty.com/writing/templatemaker/)
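
For a quick feel of the library, usage along the lines of the blog post quoted above looks roughly like the sketch below; the Template/learn/as_text/extract names are taken from that write-up and may not match the current code exactly.

# Rough sketch based on the templatemaker blog post; treat the method
# names as assumptions drawn from that write-up rather than a verified API.
from templatemaker import Template

t = Template()
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')

# The marker character fills the "holes" where the learned pages differ.
print(t.as_text('!'))                      # e.g. '<b>! and !</b>'

# Pull the variable parts out of a new page that matches the template.
print(t.extract('<b>red and green</b>'))   # e.g. ('red', 'green')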

Also, another Python library called scrapy seems to have a wrapper induction library: http://dev.scrapy.org/wiki/Scrapy09Changes#Addedwrapperinductionlibrary

I can't tell much about the algorithms, though. If you want to implement one yourself, this looks like a good starting point: http://portal.acm.org/citation.cfm?id=1859138 It features both wrapper induction and online learning, so you can start to classify pages as you continue in the crawling process.

风向决定发型 2024-11-21 01:14:40

Have you considered clone detection? These tools determine how copy-and-pasted code has been reused repeatedly, which sounds a lot like what you want. Presumably these pages have been created this way, or perhaps generated as "template instances". Such clone detectors figure out the commonalities automatically.

Some clone detectors match maximal-length identical substrings. The problem with this is that trivial differences in the web structure (extra whitespace, newlines, HTML comments) prevent many perfectly valid matches, so you miss cases. This also has the defect that it only finds identical strings, not templates with variation points. Algorithms that do induction over strings (the other poster's answer) may suffer from these issues.

What you really want are matches over the structures that make up the languages (of the web pages).

Many clone detectors ("token-based") find common token code sequences that vary in at most one token. These can often ignore whitespace by virtue of using a language-specific lexer. What you get are templates whose variable points are single tokens.
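
As a toy illustration of that idea (not how any particular clone detector is implemented), you could tokenize two pages and let a generic sequence matcher report the common token runs and the gaps between them; the crude regex tokenizer below is purely an assumption.

# Toy sketch: find common token runs and variation points between two pages.
# The whitespace-discarding regex "lexer" is a deliberate simplification.
import re
from difflib import SequenceMatcher

def tokens(html):
    # Split into tags and non-whitespace words, dropping whitespace entirely.
    return re.findall(r'<[^>]+>|[^<\s]+', html)

a = tokens('<p>Price: <b>10</b> USD</p>')
b = tokens('<p>Price: <b>25</b> USD</p>')

sm = SequenceMatcher(a=a, b=b, autojunk=False)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'equal':
        print('template :', a[i1:i2])
    else:
        print('variation:', a[i1:i2], '->', b[j1:j2])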

In my experience, variations are more often language substructures (expressions, statement sequences, ...) and so you want your clone detection based on that. Our CloneDR tool finds clones using language grammars to drive the process, that is, it does detection on abstract syntax trees obtained by parsing the pages. (Several key clone detection papers, including the one on how CloneDR works, are listed on that Wikipedia page.)

In your case, the language grammar is going to be a mixture of the languages that make up your web pages, e.g. HTML, JavaScript, and whatever dynamic language you use to generate web text dynamically. (We have clone detectors for dirty HTML, JavaScript, C#, Java, JSP, PHP, [not one yet for Perl but close!]...). You can see some clone-detection reports for different languages (unfortunately, not one for HTML) at the link.

The CloneDR results are exactly the commonality (templates), the variation points, and how the variation points differ.

九公里浅绿 2024-11-21 01:14:40

It seems to me that, since HTML is in essence the structured representation of a web page, your best approach would be to ignore textual content altogether and concentrate on identifying both similar and repeating page structures.

Due to the design requirements of a single web site, each of its pages will contain the same group of feature substructures--such as menus, side bars, breadcrumbs, titles, footers, etc.--so that, at a certain depth, all pages within the same web site will be structurally the same, and below this depth page structures will differ. By identifying this "similarity depth", you should be able to isolate those portions of a page which are structurally invariant across the entire site corpus.

Then, for pages which present varying data in the same format, like product descriptions or other data-backed pages, the entire structure of the page will be the same, only varying in textual content. By masking off the invariant features, those shared by all pages, you should be able to reduce the page to only the portion or portions of interest.

The rest is just normalizing HTML (using HTMLTidy) and comparing DOM trees.
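
A minimal sketch of that approach in Python, reducing each page to its tag structure down to a fixed depth and ignoring all text; the depth cut-off value and the use of html.parser are assumptions, not recommendations.

# Minimal sketch: reduce a page to a structural signature (tags only, text
# ignored) so that pages generated from the same template compare equal.
from html.parser import HTMLParser

class StructureExtractor(HTMLParser):
    def __init__(self, max_depth=4):
        super().__init__()
        self.max_depth = max_depth   # the "similarity depth" cut-off
        self.depth = 0
        self.shape = []
    def handle_starttag(self, tag, attrs):
        if self.depth < self.max_depth:
            self.shape.append(('open', tag, self.depth))
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1
        if self.depth < self.max_depth:
            self.shape.append(('close', tag, self.depth))

def structure(html, max_depth=4):
    parser = StructureExtractor(max_depth)
    parser.feed(html)
    return tuple(parser.shape)

# Pages that share a template and differ only in text get the same signature.
page1 = '<html><body><div id="menu">Home</div><div>Widget A</div></body></html>'
page2 = '<html><body><div id="menu">Home</div><div>Widget B</div></body></html>'
print(structure(page1) == structure(page2))   # True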

预谋 2024-11-21 01:14:40

Given a large number of pages that use a single template, we would expect to find that the longest common subsequence (LCS) of these pages corresponds closely to the template "shell". (If we only have 2 or 3 pages, then letters in the text that happen to appear in the same order in each will creep into the LCS as well -- that isn't a showstopper however.)

Unfortunately, finding the LCS of k sequences requires time exponential in k; however, it's possible to produce an approximation by computing LCS(a, LCS(b, LCS(c, ...))), where each LCS() operation on 2 sequences of length n takes O(n^2) time. In fact I expect this approximation to do a better job at weeding out spurious textual subsequences than solving the problem exactly would.
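
A sketch of that approximation: a standard O(n^2) dynamic-programming LCS over two sequences, folded left to right across the pages. All names and the toy pages are illustrative only.

from functools import reduce

def lcs(a, b):
    # Standard dynamic-programming LCS of two sequences, O(len(a) * len(b)).
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # Walk the table backwards to recover one longest common subsequence.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def approx_multi_lcs(pages):
    # Fold LCS(a, LCS(b, LCS(c, ...))) pairwise across all pages.
    return reduce(lcs, pages)

pages = ['<p>Hello, Alice!</p>', '<p>Hello, Bob!</p>', '<p>Hello, Carol!</p>']
print(''.join(approx_multi_lcs(pages)))   # '<p>Hello, !</p>', the "shell"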

Up to this point I've talked about a hypothetical situation in which all of the files use the same template. But the problem we have is that there are multiple templates. To address this, I propose a clustering algorithm in which we build up clusters of files as we go. For each cluster we will maintain the LCS of all files so far included in that cluster, computed using the pairwise approximation given above; call this clcs[i] for the ith cluster. For each file in turn, for each cluster i we compute the LCS of the file with clcs[i], and compare the length of this LCS with the length of clcs[i]. If this length is not much less than the length of clcs[i], then this file is "a good fit", so it is added to cluster i and the LCS just computed becomes the new clcs[i]. If no existing cluster produces a good-enough fit for the file, then a new cluster is created, containing just this file (and its LCS, which is equal to the file itself).

Regarding "not much less than": this is a weighting factor that will need some tweaking. Obviously when a new cluster is freshly born and contains just a single file, we would expect that another file produced using the same template will have an LCS with it that is quite a bit shorter than that cluster's LCS so far, so we should tolerate quite a drop in LCS length. As the cluster grows in size, its overall LCS should stabilise at the template "shell", so we should be less inclined to add a new file to the cluster if it decreases the LCS length considerably.
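
A sketch of that clustering loop, reusing the lcs() helper and the toy pages from the previous sketch; the keep_ratio threshold stands in for "not much less than" and its value is purely an assumption that would need the tuning described above.

def cluster_pages(all_pages, keep_ratio=0.7):
    # Greedy clustering: a page joins the first cluster whose running LCS it
    # does not shrink below keep_ratio of its current length; otherwise it
    # starts a new cluster of its own.
    clusters = []                      # list of (clcs, member_pages) pairs
    for page in all_pages:
        for idx, (clcs, members) in enumerate(clusters):
            candidate = lcs(clcs, page)
            if len(candidate) >= keep_ratio * len(clcs):
                members.append(page)
                clusters[idx] = (candidate, members)
                break
        else:
            clusters.append((list(page), [page]))
    return clusters

# A fairly loose ratio is used here because, as noted above, a cluster holding
# a single whole page shrinks a lot when its first sibling arrives.
for clcs, members in cluster_pages(pages):
    print(''.join(clcs), '->', len(members), 'pages')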

This algorithm will in general produce a different set of clusters if files are presented in a different order, so it may be a good idea to try a few different orders and check that the same clusters appear.

Finally, to extract the relevant, varying information from a given file in a cluster, the following algorithm works:

def variable_parts(file_chars, lcs):
    # Characters that line up with the next LCS character belong to the
    # template; anything that does not is variable content.
    i, out = 0, []
    for ch in file_chars:
        if i < len(lcs) and ch == lcs[i]:   # template character, consume it
            i += 1
        else:
            out.append(ch)                  # variable content
    return ''.join(out)
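
To meet the questioner's requirement of recreating the full page later, the same walk can also record where each variable run sits relative to the LCS; note this is an extension of the idea above, not part of the original answer. It reuses approx_multi_lcs() and the toy pages from the earlier sketch.

def split_page(file_chars, shell):
    # Record the variable runs as (position_in_shell, text) pairs.
    i, gaps, buf = 0, [], []
    for ch in file_chars:
        if i < len(shell) and ch == shell[i]:
            if buf:
                gaps.append((i, ''.join(buf)))
                buf = []
            i += 1
        else:
            buf.append(ch)
    if buf:
        gaps.append((i, ''.join(buf)))
    return gaps

def rebuild_page(shell, gaps):
    # Interleave the stored variable runs back into the template shell.
    out, prev = [], 0
    for pos, text in gaps:
        out.append(''.join(shell[prev:pos]))
        out.append(text)
        prev = pos
    out.append(''.join(shell[prev:]))
    return ''.join(out)

shell = approx_multi_lcs(pages)
page = '<p>Hello, Alice!</p>'
assert rebuild_page(shell, split_page(page, shell)) == page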