Hi everyone all over the world,
Background
I am a final-year Computer Science student. I've proposed my Final Double Module Project, which is a Plagiarism Analyzer, using Java and MySQL.
The Plagiarism Analyzer will:
- Scan all the paragraphs of an uploaded document and report, for each paragraph, what percentage was copied and from which website.
- Highlight, in each paragraph, only the words that were copied exactly, along with the website they came from.
My main objective is to develop something like Turnitin, improved if possible.
I have less than 6 months to develop the program. I have scoped the following:
- Web crawler implementation. I will probably use the Lucene API or develop my own crawler (which one is better in terms of development time and usability?).
- Hashing and indexing, to improve searching and analysis (a minimal Lucene indexing sketch follows this list).
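For reference, adding a crawled page to a Lucene index takes only a few lines. This is a minimal sketch assuming Lucene 5+ (note that Lucene is an indexing/search library; the crawling itself is usually handled by a separate tool such as Apache Nutch):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PageIndexer {
    /** Adds one crawled page to an on-disk Lucene index ("plagiarism-index" is a placeholder path). */
    public static void indexPage(String url, String text) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("plagiarism-index")), config)) {
            Document doc = new Document();
            doc.add(new StringField("url", url, Field.Store.YES));   // stored verbatim, not tokenized
            doc.add(new TextField("content", text, Field.Store.NO)); // tokenized for search
            writer.addDocument(doc);
        }
    }
}
```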
Questions
Here are my questions:
- Can MySQL store that much information?
- Did I miss any important topics?
- What are your opinions concerning this project?
- Any suggestions or techniques for performing the similarity analysis?
- Can a paragraph be hashed, as well as words?
Thanks in advance for any help and advice. ^^
3 Answers
Have you considered another project that isn't doomed to failure on account of the lack of resources available to you?
If you really want to go the "Hey, let's crawl the whole web!" route, you're going to need to break out things like HBase and Hadoop and lots of machines. MySQL will be grossly insufficient. TurnItIn claims to have crawled and indexed 12 billion pages. Google's index is more like [redacted]. MySQL, or for that matter any RDBMS, cannot scale to that level.
The only realistic way you're going to be able to pull this off is if you do something astonishingly clever and figure out how to construct queries to Google that will reveal plagiarism of documents that are already present in Google's index. I'd recommend using a message queue and accessing the search API synchronously. The message queue will also let you throttle your queries down to a reasonable rate. Avoid stop words, but you're still looking for near-exact matches, so queries should look like:
"* quick brown fox jumped over * lazy dog"
Don't bother running queries that end up like: "* * went * * *"
And ignore results that come back with 94,000,000 hits. Those won't be plagiarism; they'll be famous quotes or overly general queries. You're looking for either under 10 hits, or a few thousand hits that all have an exact match on your original sentence, or some similar metric. And even then, this should just be a heuristic: don't flag a document unless there are lots of red flags. Conversely, if everything comes back with zero hits, the author is being unusually original. Book search typically needs more precise queries. Sufficiently suspicious material should trigger HTTP requests for the original pages, and final decisions should always be the purview of a human being. If a document cites its sources, that's not plagiarism, and you'll want to detect that. False positives are inevitable, and will likely be common, if not constant. Be aware that the TOS prohibit permanently storing any portion of the Google index.
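A minimal sketch of that query-construction step in Java (the stop-word list is a tiny illustrative stand-in; a real one would be much larger):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class QueryBuilder {
    // Tiny illustrative stop-word list; a real one would be much larger.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "and", "or", "of", "to", "in", "is", "was"));

    /** Builds a quoted near-exact-match query from one sentence,
     *  replacing stop words with the * wildcard as described above. */
    public static String toQuery(String sentence) {
        StringBuilder query = new StringBuilder("\"");
        for (String term : sentence.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            if (query.length() > 1) query.append(' ');
            query.append(STOP_WORDS.contains(term) ? "*" : term);
        }
        return query.append('"').toString();
    }

    public static void main(String[] args) {
        // Prints: "* quick brown fox jumped over * lazy dog"
        System.out.println(toQuery("The quick brown fox jumped over the lazy dog."));
    }
}
```

A query whose terms are mostly wildcards (like the "* * went * * *" example above) can be detected and skipped before it is ever sent.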
Regardless, you have chosen to do something exceedingly hard, no matter how you build it, and likely very expensive and time-consuming unless you involve Google.
1) Make your own web crawler? It looks like you could easily spend all your available time on that task alone. Try a standard solution instead: the crawler is not the heart of your program.
You'll still have the opportunity to write your own, or to try another one, afterwards (if you have time left!).
Your program should work only on local files so as not to be tied to a specific crawler/API.
You may even have to use different crawlers for different sites.
2) Hashing whole paragraphs is possible. You can just hash any string.
But of course that means you can only check for whole paragraphs copied exactly.
Maybe sentences would be a better unit to test.
You should probably "normalize" (transform) the sentences/paragraphs before hashing, to iron out minor differences like uppercase/lowercase.
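A minimal sketch of the normalize-then-hash idea (the normalization rules here, lowercasing plus stripping punctuation and extra whitespace, are an assumed starting point):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ParagraphHasher {

    /** Lowercase, drop punctuation, and collapse whitespace so that
     *  trivial differences don't change the hash. */
    static String normalize(String text) {
        return text.toLowerCase()
                   .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ") // drop punctuation
                   .replaceAll("\\s+", " ")                // collapse whitespace
                   .trim();
    }

    /** SHA-256 of the normalized text, hex-encoded. */
    static String hash(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(normalize(text).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same hash despite case and punctuation differences:
        System.out.println(hash("The quick brown fox."));
        System.out.println(hash("the QUICK brown fox"));
    }
}
```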
3) MySQL can store a lot of data.
The usual advice is: stick to standard SQL. If you discover you have far too much data, you will still have the option of moving to another SQL implementation.
But of course, if you have too much data, start by looking at ways to reduce it, or at least to reduce what goes into MySQL. For example, you could store hashes in MySQL but keep the original pages (if needed) in plain files.
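A sketch of that split, assuming a hypothetical paragraph_hash table and placeholder connection settings (the raw page lives on disk; MySQL stores only the hash, the source URL, and the file path):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class HashStore {
    // Hypothetical schema:
    //   CREATE TABLE paragraph_hash (
    //       hash       CHAR(64) PRIMARY KEY,   -- hex SHA-256 of the normalized paragraph
    //       source_url VARCHAR(2048) NOT NULL, -- where the paragraph was found
    //       page_file  VARCHAR(255)            -- path to the raw page saved on disk
    //   );

    public static void store(String hash, String sourceUrl, String pageFile)
            throws SQLException {
        String sql = "INSERT INTO paragraph_hash (hash, source_url, page_file) "
                   + "VALUES (?, ?, ?) ON DUPLICATE KEY UPDATE source_url = VALUES(source_url)";
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/plagiarism", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, hash);
            ps.setString(2, sourceUrl);
            ps.setString(3, pageFile);
            ps.executeUpdate();
        }
    }
}
```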
Online code is usually distributed under open-source licenses, and most of it is just tutorials. By your logic, copying anything from any website is plagiarism, which means you could not accept and use any answer you get here. If you really want to finish your project, just write a system that compares code from students in the same class and in previous classes. That is much more efficient. An example of such a system is MOSS (there's also a paper describing how it works). It is really effective, and it needs no web crawler at all.
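For context, the paper behind MOSS describes "winnowing": hash every k-gram of the normalized text, then keep the minimum hash from each sliding window of k-gram hashes, so that documents sharing a long enough passage are guaranteed to share a fingerprint. A simplified sketch (k and w are illustrative choices; the real algorithm also specifies tie-breaking, and MOSS normalizes identifiers before hashing):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class Winnowing {
    static final int K = 5; // k-gram length (illustrative choice)
    static final int W = 4; // winnowing window size (illustrative choice)

    /** Hash every k-gram of the whitespace-stripped text, then keep
     *  the minimum hash from each window of W consecutive k-grams. */
    static Set<Integer> fingerprints(String text) {
        String s = text.toLowerCase().replaceAll("\\s+", "");
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + K <= s.length(); i++) {
            hashes.add(s.substring(i, i + K).hashCode());
        }
        Set<Integer> prints = new LinkedHashSet<>();
        for (int i = 0; i + W <= hashes.size(); i++) {
            int min = hashes.get(i);
            for (int j = i + 1; j < i + W; j++) min = Math.min(min, hashes.get(j));
            prints.add(min);
        }
        return prints;
    }

    public static void main(String[] args) {
        Set<Integer> a = fingerprints("int sum = 0; for (int i = 0; i < n; i++) sum += i;");
        Set<Integer> b = fingerprints("int total = 0; for (int j = 0; j < n; j++) total += j;");
        Set<Integer> common = new LinkedHashSet<>(a);
        common.retainAll(b);
        // Shared fingerprints act as a crude similarity signal.
        System.out.printf("shared fingerprints: %d of %d%n", common.size(), a.size());
    }
}
```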