How do I develop a plagiarism detector?

Posted on 2024-07-28 20:43:35

I am planning to build a plagiarism detector as my final-year Computer Science Engineering project, and I would like your suggestions on how to go about it.

I would appreciate it if you could suggest which fields of CS I need to focus on, as well as the most appropriate language to implement it in.


Comments (4)

旧竹 2024-08-04 20:49:58

You'd better try Python, because it's easy to develop a program with it. I'm also doing a project on a plagiarism detector. I suggest you tokenize the string first; it's actually complicated, but that's the way to go if you are trying to handle source code. If you are building a plagiarism detector for plain text files, use the cosine-similarity method, the LCS (longest common subsequence) method, or simply compare positions, as sketched below.
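To make the text-file case concrete, here is a minimal sketch of the cosine-similarity method over raw term counts (whitespace tokenization is assumed; a real detector would normalize punctuation and weight terms, e.g. with TF-IDF):

import math
from collections import Counter

def cosine_similarity(text1, text2):
    # Represent each text as a bag-of-words term-frequency vector.
    vec1 = Counter(text1.lower().split())
    vec2 = Counter(text2.lower().split())
    # Dot product over the shared vocabulary.
    dot = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# 1.0 means identical term distributions, 0.0 means no shared terms.
print(cosine_similarity('the quick brown fox', 'the quick red fox'))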

把人绕傻吧 2024-08-04 20:48:14

Here is a simple piece of code that scores the similarity percentage between two files, based on the Levenshtein edit distance:

import numpy as np

def levenshtein(seq1, seq2):
    # Classic dynamic-programming edit distance between two sequences.
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros((size_x, size_y))
    # Transforming a prefix to/from the empty string costs its length.
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # Substitution is free when the characters already match.
            cost = 0 if seq1[x - 1] == seq2[y - 1] else 1
            matrix[x, y] = min(
                matrix[x - 1, y] + 1,         # deletion
                matrix[x - 1, y - 1] + cost,  # match / substitution
                matrix[x, y - 1] + 1,         # insertion
            )
    return matrix[size_x - 1, size_y - 1]

# Strip newlines and spaces so only the visible characters are compared.
with open('original.txt', 'r') as file:
    str1 = file.read().replace('\n', '').replace(' ', '')
with open('target.txt', 'r') as file:
    str2 = file.read().replace('\n', '').replace(' ', '')

# Normalize the distance by the length of the longer string.
length = max(len(str1), len(str2))
print(100 - round((levenshtein(str1, str2) / length) * 100, 2), '% Similarity')

Create two files, "original.txt" and "target.txt", in the same directory and put some content in them.
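If you only need a quick similarity ratio and would rather not maintain the DP matrix yourself, the standard library's difflib offers a comparable measure (a minimal sketch; SequenceMatcher uses Ratcliff/Obershelp matching rather than Levenshtein, so the numbers will differ slightly):

import difflib

with open('original.txt', 'r') as file:
    str1 = file.read()
with open('target.txt', 'r') as file:
    str2 = file.read()

# ratio() returns a float in [0, 1]; scale it to a percentage.
ratio = difflib.SequenceMatcher(None, str1, str2).ratio()
print(round(ratio * 100, 2), '% Similarity')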

累赘 2024-08-04 20:46:56

I am making a plagiarism checker in Python as a hobby project.
The following steps are to be followed (a code sketch follows at the end of this answer):

  1. Tokenize the document.

  2. Remove all the stop words using the NLTK library.

  3. Use the GenSim library to find the most relevant words, line by line. This can be done by building an LDA or LSA model of the document.

  4. Use the Google Search API to search for those words.

Note:
You might choose to use the Google API and search the whole document at once. That works when you are dealing with a small amount of data. However, when building a plagiarism checker for sites and web-scraped data, you will need to apply the NLTK steps first.

The Google Search API will return the top articles containing the same words that the LDA or LSA from GenSim's library functions produced.

Hope it helps.
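A minimal sketch of steps 1-3 (it assumes the NLTK 'punkt' and 'stopwords' data have been downloaded; 'document.txt' and the topic/keyword counts are placeholder choices):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora, models

nltk.download('punkt')       # tokenizer model
nltk.download('stopwords')   # stop-word lists

with open('document.txt', 'r') as f:   # placeholder file name
    document = f.read()

# Step 1: tokenize the document.
tokens = word_tokenize(document.lower())

# Step 2: drop stop words (and punctuation) with NLTK.
stop = set(stopwords.words('english'))
filtered = [t for t in tokens if t.isalpha() and t not in stop]

# Step 3: build an LDA model with GenSim and pull out the key terms.
dictionary = corpora.Dictionary([filtered])
corpus = [dictionary.doc2bow(filtered)]
lda = models.LdaModel(corpus, num_topics=1, id2word=dictionary)
keywords = [word for word, _ in lda.show_topic(0, topn=10)]

# Step 4 would feed these keywords into a search API.
print(keywords)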

噩梦成真你也成魔 2024-08-04 20:45:43

The language is nearly irrelevant. Another question exists that discusses this a bit more. Basically, the method suggested there is to use Google: extract parts of the target text and search for them on Google, as in the sketch below.
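A minimal sketch of that extraction step (the file name is a placeholder, and the printed phrases stand in for whatever search-API call you wire up):

def shingles(text, size=8):
    # Split the text into consecutive `size`-word chunks to use as queries.
    words = text.split()
    return [' '.join(words[i:i + size])
            for i in range(0, max(len(words) - size + 1, 1), size)]

with open('target.txt', 'r') as f:
    for phrase in shingles(f.read()):
        # A real detector would send each phrase to a search API and
        # flag documents that contain it verbatim.
        print('would search for:', repr(phrase))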
