Using PHP to find the likelihood of plagiarism across multiple entries

Posted 2024-11-08 01:37:43, 298 words, 0 views, 0 comments

I am working on a web application that tracks helpdesk entries. We want to find a way to prevent people from copying and pasting their notes regarding common issues - we want original helpdesk entries to be written for every trouble-call.

In any case, we have thousands of entries, and some of them are similar. I am trying to find a way of comparing them all to each other and pointing out any entries that are very similar to others, e.g. 80% likely to be a direct copy.

I've looked into similar_text() and a few other built-in PHP functions, but I am interested in hearing if anyone else has done something similar before. I don't believe I can use similar_text() efficiently since I need to compare multiple entries against each other, not two strings.

Any input is appreciated.

Comments (3)

暮年慕年 2024-11-15 01:37:44

I do think similar_text() would do what you want. As long as your machine has enough memory to handle the comparisons, it should work fine. Also look at levenshtein() and soundex().
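
A minimal sketch of what that all-pairs comparison could look like, assuming the entries are already loaded into an array keyed by entry id. The function name findSimilarPairs() and the array shape are hypothetical, chosen only for illustration; similar_text() is the built-in, and its third argument receives the similarity as a percentage.

```php
<?php
// Hypothetical helper: compare every entry with every other entry once and
// return the pairs whose similar_text() percentage meets the threshold.
// $entries is assumed to map entry id => note text.
function findSimilarPairs(array $entries, float $threshold = 80.0): array
{
    $ids = array_keys($entries);
    $count = count($ids);
    $pairs = [];

    for ($i = 0; $i < $count; $i++) {
        // Start at $i + 1 so each unordered pair is checked only once.
        for ($j = $i + 1; $j < $count; $j++) {
            $percent = 0.0;
            similar_text($entries[$ids[$i]], $entries[$ids[$j]], $percent);
            if ($percent >= $threshold) {
                $pairs[] = [
                    'a' => $ids[$i],
                    'b' => $ids[$j],
                    'percent' => $percent,
                ];
            }
        }
    }

    return $pairs;
}

// Usage with made-up sample notes.
$entries = [
    1 => "Reset the user's password and cleared the cache.",
    2 => "Reset the user's password and cleared the cache.",
    3 => "Replaced the toner cartridge in the printer.",
];
$pairs = findSimilarPairs($entries);
```

The number of comparisons grows quadratically with the number of entries, so for thousands of entries this is probably better run as a batch job, or only against entries from the same user or time window, than on every page load.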

七婞 2024-11-15 01:37:44

You may want to consider giving Solr a try. While your final schema will likely contain many different fields, the main field would be of the type "text" and would contain the text of the helpdesk entry. The default Solr schema (requiring no modification) automatically tokenizes the data in the text field and indexes it in such a way that searches also match synonyms and related word forms, so "cities" will match "city", etc.

In the end, using Solr, you will end up with a scalable solution both from a performance standpoint and a functional standpoint.

悲凉≈ 2024-11-15 01:37:44

First off, why do you care? If it's a common issue that can be answered with a copy and paste, why is that not the right thing to do? It sounds like you're generating more work for the sake of work.

Second, if the other options presented here don't suffice, you could look into something like w-shingling:
http://en.wikipedia.org/wiki/W-shingling
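
For reference, a rough sketch of the w-shingling idea in PHP: each entry is reduced to the set of its overlapping w-word sequences (shingles), and two entries are scored by the size of the intersection of their sets divided by the size of the union. The function names shingles() and resemblance() and the default w = 3 are assumptions made for this example, and the tokenizer assumes plain English text.

```php
<?php
// Build the set of w-word shingles for a text.
// The set is stored as array keys so duplicates collapse automatically.
function shingles(string $text, int $w = 3): array
{
    // Assumption: lowercase ASCII words and digits are enough for tokenizing.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $set = [];

    for ($i = 0; $i + $w <= $total; $i++) {
        $set[implode(' ', array_slice($words, $i, $w))] = true;
    }

    return $set;
}

// Resemblance of two shingle sets: |A intersect B| / |A union B|.
function resemblance(array $a, array $b): float
{
    $union = count($a + $b); // array union keeps each distinct key once
    if ($union === 0) {
        return 0.0; // both texts were shorter than w words
    }

    return count(array_intersect_key($a, $b)) / $union;
}

// Usage with made-up sample notes.
$score = resemblance(
    shingles('The printer is out of toner'),
    shingles('The printer is out of paper')
);
```

Because each entry's shingle set can be computed once and stored, this avoids re-scanning the full text for every pair, and a threshold such as 0.8 on the score could stand in for the "80%" mentioned in the question.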
