Displaying plagiarism results
I am developing a plagiarism detection framework. We first preprocess the documents by stemming, synonym replacement, and stop word removal, so the preprocessed document differs somewhat from the original.
When we pass the preprocessed documents to our plagiarism function, it returns the similar sentences.
In our GUI we then have to display the two documents and highlight the similar sentences.
To highlight text in Java, we need the indices of the words to highlight.
The problem is that the preprocessed text differs from the original document, so it is difficult to locate the similar sentences in the original document.
Can anyone help me with this problem?
Comments (1)
You'll have to store some sort of metadata with the preprocessed document that allows you to map its content back to the original document, such as a list of all the gaps that result from stop word removal, or information on where you replaced words with synonyms.
If you record every change made during preprocessing (location and replaced text), you should be able to find the original phrase in the original document.
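One way to record that metadata is to keep, for every token that survives preprocessing, the character offsets it occupied in the original text. The sketch below is only an illustration of that idea: the class `OffsetMapping`, its `Token` type, the tiny stop word list, and the crude suffix-stripping `normalize` are all hypothetical stand-ins for whatever your framework really uses for stemming and synonym replacement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; none of these names come from the original post.
final class OffsetMapping {

    /** A preprocessed token plus the span it came from in the original text. */
    static final class Token {
        final String normalized; // form handed to the plagiarism function
        final int start;         // inclusive offset in the original document
        final int end;           // exclusive offset in the original document

        Token(String normalized, int start, int end) {
            this.normalized = normalized;
            this.start = start;
            this.end = end;
        }
    }

    // Tiny illustrative stop word list (assumption, not the real one).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and"));

    // A word is a run of letters; everything else is treated as a separator.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Placeholder for real stemming / synonym replacement.
    static String normalize(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        return w;
    }

    /** Tokenizes the original text, drops stop words, and keeps the offsets. */
    static List<Token> preprocess(String original) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(original);
        while (m.find()) {
            String word = m.group();
            if (STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue; // removed word: no token, but later offsets stay correct
            }
            tokens.add(new Token(normalize(word), m.start(), m.end()));
        }
        return tokens;
    }

    /**
     * Maps a match covering tokens first..last (inclusive) back to
     * {start, end} character offsets in the original document.
     */
    static int[] originalSpan(List<Token> tokens, int first, int last) {
        return new int[] { tokens.get(first).start, tokens.get(last).end };
    }
}
```

If the plagiarism function reports a match as a range of token positions, `originalSpan` gives the character range to highlight, including any stop words that were removed in between. If the GUI happens to use Swing text components, those two offsets are what `Highlighter.addHighlight(start, end, painter)` expects.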