Comparing the textual content of websites
I'm experimenting a bit with textual comparison/basic plagiarism detection, and want to try this on a website-to-website basis. However, I'm a bit stuck in finding a proper way to process the text.
How would you process and compare the content of two websites for plagiarism?
I'm thinking something like this pseudo-code:
// extract text
foreach website in websites
    crawl website - store structure so pages are only scanned once
    extract text blocks from all pages - store these in a list
// compare
foreach text in website1.textlist
    compare with all text in website2.textlist
I realize that this solution could very quickly accumulate a lot of data, so it might only be possible to make it work with very small websites.
I haven't decided on the actual text comparison algorithm yet, but right now I'm more interested in getting the actual process algorithm working first.
I'm thinking it would be a good idea to extract all text as individual text pieces (from paragraphs, tables, headers and so on), as text can move around on pages.
I'm implementing this in C# (maybe ASP.NET).
I'm very interested in any input or advice you might have, so please shoot! :)
My approach to this problem would be to google for specific, fairly unique blocks of text whose copyright you are trying to protect.
Having said that, if you want to build your own solution, here are some comments:
You're probably going to be more interested in fragment detection. For example, lots of pages will have the word "home" on them, and you don't care. But it's fairly unlikely that very many pages will have exactly the same words across the entire page. So you probably want to compare pages and report the ones that share exact matches of 4, 5, 6, 7, 8, etc. words, with counts for each length. Assign a score to each length, weight them, and if the total exceeds your "magic number", report the suspected xeroxers.
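To make the fragment idea concrete, here is a minimal sketch in Python (not C#; the structure should port directly). It counts the distinct runs of 4 to 8 consecutive words that two texts share, then sums them with a weight equal to the run length. The function names, the weighting, and the `MAGIC_NUMBER` threshold are all assumptions made for illustration, not something the answer specifies.

```python
import re

def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Return the set of all runs of n consecutive words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def match_counts(text_a, text_b, lengths=range(4, 9)):
    """Count the distinct word runs of each length that occur in both texts."""
    a, b = words(text_a), words(text_b)
    return {n: len(ngrams(a, n) & ngrams(b, n)) for n in lengths}

def score(counts):
    """Weight longer matches more heavily (weight = run length, an arbitrary choice)."""
    return sum(n * count for n, count in counts.items())

MAGIC_NUMBER = 20  # hypothetical threshold; would need tuning on real pages

original = "The quick brown fox jumps over the lazy dog near the river bank"
suspect = "Yesterday the quick brown fox jumps over the lazy dog again"

counts = match_counts(original, suspect)
print(counts)
print("score:", score(counts), "suspicious:", score(counts) >= MAGIC_NUMBER)
```

Because a single long copied passage also contains every shorter run inside it, long matches are counted at several lengths; whether that double counting is desirable depends on how you want to weight things.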
For C#, you can use the WebBrowser control to get a page and fairly easily get its text. Sorry, no code sample handy to copy/paste, but MSDN usually has pretty good samples.
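For illustration only, here is a rough sketch of the "get a page and get its text" step. It is written in Python with the standard library, not in C# with the WebBrowser control, so treat it as a shape to port and not as the approach the answer has in mind. It splits a page into separate text blocks (title, headings, paragraphs, list items, table cells), which matches the question's idea of comparing individual pieces. The tag lists and the names `TextBlockParser`, `extract_text_blocks` and `fetch_text_blocks` are assumptions.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumed choice of tags that start or end a text block.
BLOCK_TAGS = {"title", "p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "th"}
SKIP_TAGS = {"script", "style"}  # their contents are never visible text

class TextBlockParser(HTMLParser):
    """Collect the text of each block-level element as a separate string."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []
        self._skip_depth = 0

    def flush(self):
        """Store the text gathered so far as one block, collapsing whitespace."""
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)

def extract_text_blocks(html):
    """Return the page's text blocks in document order."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text that sits outside a block tag
    return parser.blocks

def fetch_text_blocks(url):
    """Download a page and extract its blocks (falls back to UTF-8 if no charset is declared)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_text_blocks(response.read().decode(charset, errors="replace"))

sample = """<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Welcome</h1><p>First   paragraph
of text.</p><table><tr><td>Cell A</td><td>Cell B</td></tr></table>
<script>var x = 1;</script></body></html>"""
print(extract_text_blocks(sample))
```

This only sees the HTML as served; text that a page builds with JavaScript would need a real browser engine, which is one reason to reach for something like the WebBrowser control.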