Tool for batch comparing/diffing HTML

Posted 2024-09-07 18:16:34 · 175 words · 8 views · 0 comments

I have a lot of HTML files (tens of thousands, gigabytes' worth) scraped from a server. I want to check that the server produces the same results after some modifications, while ignoring differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of numbers, etc.

Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.

(Oh, and it needs to run under Linux.)

Comments (3)

剩余の解释 2024-09-14 18:16:34

You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program files (HTML is a special case), builds abstract syntax trees representing the essential structure of each file, and compares programs for similarity.
Because it compares essential program structure, it ignores inessential differences such as comments and whitespace, and determines that two code segments are either identical or that one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.

In your case, you would be looking for files in system A that are essentially clones (exact or near misses) of files in system B. As a general rule, if a file a is a variant of file b (e.g., with a few changes), CloneDR will report it as a clone and show the exact differences.

At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.

It doesn't run under Linux, but I assume your problem is hard enough to solve that this isn't what you are optimizing for.
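
To illustrate the general idea of comparing structure rather than raw text, here is a minimal Python sketch. It is not CloneDR's algorithm and does not build a real syntax tree: it only uses the standard library's `html.parser` to turn each document into a flat list of tags, sorted attributes, and whitespace-collapsed text, and then compares those lists. The names `StructureCollector`, `structure`, and `same_structure` are made up for this example.

```python
from html.parser import HTMLParser


class StructureCollector(HTMLParser):
    """Collect tags, sorted attributes, and whitespace-collapsed text.

    Comments are dropped because handle_comment is not overridden.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so that their order in the source does not matter.
        ordered = tuple(sorted(attrs, key=lambda kv: (kv[0], kv[1] or "")))
        self.tokens.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace; skip text that is only whitespace.
        text = " ".join(data.split())
        if text:
            self.tokens.append(("text", text))


def structure(html):
    """Return the normalized token list for one HTML document."""
    collector = StructureCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


def same_structure(html_a, html_b):
    """True if both documents yield the same normalized token list."""
    return structure(html_a) == structure(html_b)
```

This treats two pages as equal when they differ only in indentation, line breaks, attribute order, or comments. It does not report near misses the way a clone detector would; it only answers "same or not" for one pair of documents.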

娇柔作态 2024-09-14 18:16:34

I use WinMerge a lot on Windows, and from what I can see some people like Meld on Linux, so perhaps that could do the trick for you:
http://meld.sourceforge.net/

Other examples I saw from a quick Google search were Kompare, xxdiff.sourceforge.net, and kdiff3.sourceforge.net

(I could only post 1 link, so I wrote the addresses for xxdiff and kdiff3 as text.)

时光倒影 2024-09-14 18:16:34

Beyond Compare is commercial software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI-based but handles thousands of files very well. It lets you specify unimportant changes with regular expressions, as well as ignore whitespace (at the beginning, middle, and end of a line). The feature set is very extensive; check out the trial download.

I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!
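
For readers who want to see what "unimportant changes specified with regular expressions" can look like in a script, here is a rough Python sketch. It has nothing to do with how Beyond Compare works internally. The timestamp pattern, the `*.html` glob, and the names `IGNORE_PATTERNS`, `normalize`, and `changed_files` are all assumptions chosen for illustration; the patterns would need adjusting to the actual pages.

```python
import re
from pathlib import Path

# Each (pattern, replacement) pair erases one kind of unimportant difference.
# These patterns are assumptions; adapt them to the real data.
IGNORE_PATTERNS = [
    # Timestamps such as 2024-09-07 18:16:34 or 2024-09-07T18:16:34.
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"), "<TIMESTAMP>"),
    # Whitespace between tags, so missing newlines do not count.
    (re.compile(r">\s+<"), "><"),
    # Any remaining run of whitespace becomes a single space.
    (re.compile(r"\s+"), " "),
]


def normalize(text):
    """Apply every ignore pattern in order and trim the result."""
    for pattern, replacement in IGNORE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip()


def changed_files(old_dir, new_dir):
    """List relative paths under old_dir that are missing from new_dir
    or still differ after normalization."""
    old_root, new_root = Path(old_dir), Path(new_dir)
    changed = []
    for old_path in sorted(old_root.rglob("*.html")):
        rel = old_path.relative_to(old_root)
        new_path = new_root / rel
        if not new_path.is_file():
            changed.append(rel.as_posix())
            continue
        old_text = old_path.read_text(encoding="utf-8", errors="replace")
        new_text = new_path.read_text(encoding="utf-8", errors="replace")
        if normalize(old_text) != normalize(new_text):
            changed.append(rel.as_posix())
    return changed
```

Calling `changed_files("before", "after")` on two scrape directories (hypothetical names) returns only the files that still differ once the listed patterns are filtered out. Files that exist only in the new directory are not reported by this sketch.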
