Document comparison engine / search
I have a large number of document files, including .pdf, .one, .doc, .docx, etc. I am trying to find a way to compare the text contents of the files to look for duplicates or near matches. I have a site built on a LAMP stack that users upload the files to. I could either compare the documents on upload or run a cron job. I have seen Apache Lucene mentioned in a similar context, and Zend Search Lucene seems to be a powerful PHP version of it, but they are more search-oriented than comparison-oriented. Would there be a way to leverage these for comparison purposes?
Thanks,
Chris
I think comparing files for exact matches would likely be tons easier than comparing for near matches. It would likely take a combination of approaches.

Right off the bat, I would use something like hash_file() to get a hash of each file's contents. That gives you a super short representation of the file's contents, which you can match against other files' hashes to find exact duplicates. For "near duplicate" comparisons, you could try hashing different derived values, or gather some coarse statistics about the files, such as strlen() of the extracted text or something along those lines. Hopefully this is helpful. Sounds like a challenge for sure.
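To make the two-tier idea concrete, here is a minimal sketch in Python (the same logic maps to PHP's hash_file() on a LAMP stack): a content hash catches exact duplicates, and word-shingle overlap with Jaccard similarity (a common near-duplicate technique, my suggestion rather than something from the answer above) scores near matches. This assumes the text has already been extracted from the .pdf/.doc files by some other tool, which neither the question nor the answer covers.

```python
import hashlib

def file_hash(path):
    """Exact-duplicate fingerprint: SHA-256 of the raw file bytes
    (the Python analogue of PHP's hash_file())."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def shingles(text, k=5):
    """Set of k-word shingles used for near-duplicate comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k])
            for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two documents that differ by one trailing word score close to 1.0,
# so a threshold (say 0.8) flags them as near duplicates.
doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river"
similarity = jaccard(shingles(doc1, k=3), shingles(doc2, k=3))
```

In a cron job you would store each file's hash (and shingle set, or a compressed sketch of it such as MinHash) in a database table, so each new upload is compared against stored fingerprints instead of re-reading every file.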