Hi everyone all over the world,
Background
I am a final-year Computer Science student. I've proposed my Final Double Module Project, which is a Plagiarism Analyzer, using Java and MySQL.
The Plagiarism Analyzer will:
- Scan all the paragraphs of an uploaded document and report, for each paragraph, what percentage was copied and from which website.
- Highlight, in each paragraph, only the words that were copied exactly, along with the website they came from.
My main objective is to develop something like Turnitin, improved if possible.
I have less than 6 months to develop the program. I have scoped the following:
- Web crawler implementation. I will probably use the Lucene API or develop my own crawler (which one is better in terms of development time and usability?).
- Hashing and indexing, to improve searching and analysis (a minimal Lucene indexing sketch follows this list).
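For reference, adding a crawled page to a Lucene index takes only a few lines. This is a minimal sketch assuming Lucene 5+ (note that Lucene is an indexing/search library; the crawling itself is usually handled by a separate tool such as Apache Nutch):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PageIndexer {
    /** Adds one crawled page to an on-disk Lucene index ("plagiarism-index" is a placeholder path). */
    public static void indexPage(String url, String text) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("plagiarism-index")), config)) {
            Document doc = new Document();
            doc.add(new StringField("url", url, Field.Store.YES));   // stored verbatim, not tokenized
            doc.add(new TextField("content", text, Field.Store.NO)); // tokenized for search
            writer.addDocument(doc);
        }
    }
}
```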
Questions
Here are my questions:
- Can MySQL store that much information?
- Did I miss any important topics?
- What are your opinions concerning this project?
- Any suggestions or techniques for performing the similarity analysis?
- Can a paragraph be hashed, as well as words?
Thanks in advance for any help and advice. ^^
3 Answers
Have you considered another project that isn't doomed to failure on account of the lack of resources available to you?
If you really want to go the "Hey, let's crawl the whole web!" route, you're going to need to break out things like HBase and Hadoop and lots of machines. MySQL will be grossly insufficient. TurnItIn claims to have crawled and indexed 12 billion pages. Google's index is more like [redacted]. MySQL, or for that matter any RDBMS, cannot scale to that level.
The only realistic way you're going to be able to pull this off is if you do something astonishingly clever and figure out how to construct queries to Google that will reveal plagiarism of documents that are already present in Google's index. I'd recommend using a message queue and accessing the search API synchronously. The message queue will also let you throttle your queries down to a reasonable rate. Avoid stop words, but you're still looking for near-exact matches, so queries should look like:
"* quick brown fox jumped over * lazy dog"
Don't bother running queries that end up like: "* * went * * *"
And ignore results that come back with 94,000,000 hits. Those won't be plagiarism; they'll be famous quotes or overly general queries. You're looking for either under 10 hits, or a few thousand hits that all have an exact match on your original sentence, or some similar metric. And even then, this should just be a heuristic: don't flag a document unless there are lots of red flags. Conversely, if everything comes back with zero hits, the author is being unusually original. Book search typically needs more precise queries. Sufficiently suspicious material should trigger HTTP requests for the original pages, and final decisions should always be the purview of a human being. If a document cites its sources, that's not plagiarism, and you'll want to detect that. False positives are inevitable, and will likely be common, if not constant. Be aware that the TOS prohibit permanently storing any portion of the Google index.
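A minimal sketch of that query-construction step in Java (the stop-word list is a tiny illustrative stand-in; a real one would be much larger):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class QueryBuilder {
    // Tiny illustrative stop-word list; a real one would be much larger.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "and", "or", "of", "to", "in", "is", "was"));

    /** Builds a quoted near-exact-match query from one sentence,
     *  replacing stop words with the * wildcard as described above. */
    public static String toQuery(String sentence) {
        StringBuilder query = new StringBuilder("\"");
        for (String term : sentence.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            if (query.length() > 1) query.append(' ');
            query.append(STOP_WORDS.contains(term) ? "*" : term);
        }
        return query.append('"').toString();
    }

    public static void main(String[] args) {
        // Prints: "* quick brown fox jumped over * lazy dog"
        System.out.println(toQuery("The quick brown fox jumped over the lazy dog."));
    }
}
```

A query whose terms are mostly wildcards (like the "* * went * * *" example above) can be detected and skipped before it is ever sent.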
Regardless, you have chosen to do something exceedingly hard, no matter how you build it, and likely very expensive and time-consuming unless you involve Google.
1) Make your own web crawler? It looks like you could easily spend all your available time on that task alone. Try a standard solution instead: the crawler is not the heart of your program.
You'll still have the opportunity to write your own, or to try another one, afterwards (if you have time left!).
Your program should work only on local files so as not to be tied to a specific crawler/API.
You may even have to use different crawlers for different sites.
2) Hashing whole paragraphs is possible. You can just hash any string.
But of course that means you can only check for whole paragraphs copied exactly.
Maybe sentences would be a better unit to test.
You should probably "normalize" (transform) the sentences/paragraphs before hashing, to iron out minor differences like uppercase/lowercase.
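A minimal sketch of the normalize-then-hash idea (the normalization rules here, lowercasing plus stripping punctuation and extra whitespace, are an assumed starting point):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ParagraphHasher {

    /** Lowercase, drop punctuation, and collapse whitespace so that
     *  trivial differences don't change the hash. */
    static String normalize(String text) {
        return text.toLowerCase()
                   .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ") // drop punctuation
                   .replaceAll("\\s+", " ")                // collapse whitespace
                   .trim();
    }

    /** SHA-256 of the normalized text, hex-encoded. */
    static String hash(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(normalize(text).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same hash despite case and punctuation differences:
        System.out.println(hash("The quick brown fox."));
        System.out.println(hash("the QUICK brown fox"));
    }
}
```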
3) MySQL can store a lot of data.
The usual advice is: stick to standard SQL. If you discover you have far too much data, you will still have the option of moving to another SQL implementation.
But of course, if you have too much data, start by looking at ways to reduce it, or at least to reduce what goes into MySQL. For example, you could store hashes in MySQL but keep the original pages (if needed) in plain files.
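A sketch of that split, assuming a hypothetical paragraph_hash table and placeholder connection settings (the raw page lives on disk; MySQL stores only the hash, the source URL, and the file path):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class HashStore {
    // Hypothetical schema:
    //   CREATE TABLE paragraph_hash (
    //       hash       CHAR(64) PRIMARY KEY,   -- hex SHA-256 of the normalized paragraph
    //       source_url VARCHAR(2048) NOT NULL, -- where the paragraph was found
    //       page_file  VARCHAR(255)            -- path to the raw page saved on disk
    //   );

    public static void store(String hash, String sourceUrl, String pageFile)
            throws SQLException {
        String sql = "INSERT INTO paragraph_hash (hash, source_url, page_file) "
                   + "VALUES (?, ?, ?) ON DUPLICATE KEY UPDATE source_url = VALUES(source_url)";
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/plagiarism", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, hash);
            ps.setString(2, sourceUrl);
            ps.setString(3, pageFile);
            ps.executeUpdate();
        }
    }
}
```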
Online code is usually distributed under open-source licenses, and most of it is just tutorials. By your logic, copying anything from any website is plagiarism, which means you could not accept and use any answer you get here. If you really want to finish your project, just write a system that compares code from students in the same class and in previous classes. That is much more efficient. An example of such a system is MOSS (there's also a paper describing how it works). It is really effective, and it needs no web crawler at all.
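For context, the paper behind MOSS describes "winnowing": hash every k-gram of the normalized text, then keep the minimum hash from each sliding window of k-gram hashes, so that documents sharing a long enough passage are guaranteed to share a fingerprint. A simplified sketch (k and w are illustrative choices; the real algorithm also specifies tie-breaking, and MOSS normalizes identifiers before hashing):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class Winnowing {
    static final int K = 5; // k-gram length (illustrative choice)
    static final int W = 4; // winnowing window size (illustrative choice)

    /** Hash every k-gram of the whitespace-stripped text, then keep
     *  the minimum hash from each window of W consecutive k-grams. */
    static Set<Integer> fingerprints(String text) {
        String s = text.toLowerCase().replaceAll("\\s+", "");
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + K <= s.length(); i++) {
            hashes.add(s.substring(i, i + K).hashCode());
        }
        Set<Integer> prints = new LinkedHashSet<>();
        for (int i = 0; i + W <= hashes.size(); i++) {
            int min = hashes.get(i);
            for (int j = i + 1; j < i + W; j++) min = Math.min(min, hashes.get(j));
            prints.add(min);
        }
        return prints;
    }

    public static void main(String[] args) {
        Set<Integer> a = fingerprints("int sum = 0; for (int i = 0; i < n; i++) sum += i;");
        Set<Integer> b = fingerprints("int total = 0; for (int j = 0; j < n; j++) total += j;");
        Set<Integer> common = new LinkedHashSet<>(a);
        common.retainAll(b);
        // Shared fingerprints act as a crude similarity signal.
        System.out.printf("shared fingerprints: %d of %d%n", common.size(), a.size());
    }
}
```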