Checking for document duplicates and similar documents in a document management application
Update: I have now written a PHP extension called php_ssdeep for the ssdeep C API to facilitate fuzzy hashing and hash comparisons in PHP natively. More information can be found over at my blog. I hope this is helpful to people.
I am involved in writing a custom document management application in PHP on a Linux box that will store various file formats (potentially thousands of files), and we need to be able to check whether a text document has been uploaded before, to prevent duplication in the database.
Essentially when a user uploads a new file we would like to be able to present them with a list of files that are either duplicates or contain similar content. This would then allow them to choose one of the pre-existing documents or continue uploading their own.
Similar documents would be determined by looking through their content for similar sentences and perhaps a dynamically generated list of keywords. We can then display a percentage match to the user to help them find the duplicates.
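One possible way to turn "similar sentences" into a percentage is the Jaccard index over each document's set of sentences. The sketch below is only an illustration of that idea, not a package recommendation: the function names `sentences` and `percent_match` are made up for this example, the sentence splitter is deliberately naive, and Python is used purely for illustration.

```python
import re


def sentences(text):
    """Split text into a set of lowercased sentences with collapsed whitespace."""
    # Naive split on sentence-ending punctuation; abbreviations are not handled.
    parts = re.split(r"[.!?]+", text)
    return {" ".join(p.lower().split()) for p in parts if p.strip()}


def percent_match(a, b):
    """Percentage of distinct sentences shared by both texts (Jaccard index)."""
    sa, sb = sentences(a), sentences(b)
    if not sa and not sb:
        return 100.0  # two empty documents are treated as identical
    return 100.0 * len(sa & sb) / len(sa | sb)


doc_a = "The cat sat. The dog ran. Birds fly."
doc_b = "The cat sat. The dog ran! Fish swim."
print(percent_match(doc_a, doc_b))  # 2 shared of 4 distinct sentences -> 50.0
```

This only counts exact sentence matches after normalization, so a sentence with a single changed word counts as different. That limitation is the kind of case fuzzy hashing is meant to handle.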
Can you recommend any packages for this process and any ideas of how you might have done this in the past?
I think direct duplicates can be detected by taking all the text content and

- Stripping whitespace
- Removing punctuation
- Converting to lower or upper case

then forming an MD5 hash to compare against any new documents. Stripping those items out should help catch duplicates that would otherwise be missed if, for example, the user edits a document to add extra paragraph breaks. Any thoughts?
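The normalization steps listed above can be sketched as follows. This is a minimal sketch in Python for illustration (the application in question is PHP); the function name `normalized_md5` is hypothetical, and only ASCII punctuation is removed, which is an assumption about the documents being compared.

```python
import hashlib
import string


def normalized_md5(text):
    """MD5 of the text after removing punctuation and whitespace and lowercasing."""
    # Remove ASCII punctuation (assumption: other scripts would need more care).
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments drops every run of whitespace, including newlines.
    collapsed = "".join(no_punct.split())
    return hashlib.md5(collapsed.lower().encode("utf-8")).hexdigest()


original = "Hello, World!\nThis is a test."
edited = "hello world\n\n\nthis is a TEST"
print(normalized_md5(original) == normalized_md5(edited))  # True
```

Storing this digest in an indexed column would let the duplicate check at upload time be a single lookup. It still only catches documents whose normalized text is identical, so any real change in wording produces a different hash.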
This process could also run as a nightly job, and we could notify users of any duplicates when they next log in, if the computational requirement is too great to run in real time. Real time would be preferred, however.
Comments (2)
Update: I have now written a PHP extension called php_ssdeep for the ssdeep C API to facilitate fuzzy hashing and hash comparisons in PHP natively. More information can be found over at my blog. I hope this is helpful to people.
I have found a program that does what its creator, Jesse Kornblum, calls "Fuzzy Hashing". In short, it computes a hash of a file that can be used to detect similar files or identical matches.
The theory behind it is documented here: Identifying almost identical files using context triggered piecewise hashing
ssdeep is the name of the program, and it can be run on Windows or Linux. It was intended for use in forensic computing, but it seems well enough suited to our purposes. I ran a short test on an old Pentium 4 machine, and it took about 3 seconds to go through a 23MB hash file (hashes for just under 135,000 files) looking for matches against two files. That time includes creating the hashes for the two files I was searching against.
I'm working on a similar problem in web2project, and after asking around and digging, I came to the conclusion that "the user doesn't care". Having duplicate documents doesn't matter to the user as long as they can find their own document by its own name.
That being said, here's the approach I'm taking:
Throughout all of this, we don't tell the user it was a duplicate... they don't care. It's us (developers, db admins, etc) that care.
And yes, this works even if they upload a new version of the file later. First, you delete the reference to the file, then - just like in garbage collection - you only delete the old file if there are zero references to it.
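The reference-counting behaviour described in the last paragraph could look roughly like the sketch below. It is inferred only from that paragraph and is not the web2project implementation. The class `DedupStore`, its in-memory dictionaries, and the choice of SHA-256 as the content hash are all assumptions made for illustration; a real system would keep the references in database tables and the files on disk.

```python
import hashlib


class DedupStore:
    """Stores each distinct file once and counts how many documents refer to it."""

    def __init__(self):
        self.blobs = {}  # content hash -> file bytes (stored once)
        self.refs = {}   # content hash -> number of documents pointing at it
        self.names = {}  # document name -> content hash

    def upload(self, name, data):
        # A new version under an existing name first drops the old reference.
        if name in self.names:
            self.delete(name)
        digest = hashlib.sha256(data).hexdigest()  # assumed hash choice
        if digest not in self.blobs:
            self.blobs[digest] = data
        self.refs[digest] = self.refs.get(digest, 0) + 1
        self.names[name] = digest

    def delete(self, name):
        digest = self.names.pop(name)
        self.refs[digest] -= 1
        # Like garbage collection: remove the file only when nothing refers to it.
        if self.refs[digest] == 0:
            del self.refs[digest]
            del self.blobs[digest]


store = DedupStore()
store.upload("alice/report.txt", b"same bytes")
store.upload("bob/copy.txt", b"same bytes")
print(len(store.blobs))  # 1: both names share one stored file
store.delete("alice/report.txt")
print(len(store.blobs))  # 1: bob's reference keeps the file alive
```

Each user still sees their own document under its own name, while the storage layer silently keeps a single copy.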