Comparing the textual content of websites

Posted 2024-08-13 07:57:23


I'm experimenting a bit with textual comparison and basic plagiarism detection, and want to try this on a website-to-website basis. However, I'm a bit stuck on finding a proper way to process the text.

How would you process and compare the content of two websites for plagiarism?

I'm thinking something like this pseudo-code:

// extract text
foreach website in websites
  crawl website - store structure so pages are only scanned once
  extract text blocks from all pages - store them in a list

// compare      
foreach text in website1.textlist
  compare with all text in website2.textlist

I realize that this solution could very quickly accumulate a lot of data, so it might only be possible to make it work with very small websites.

I haven't decided on the actual text comparison algorithm yet, but right now I'm more interested in getting the actual process algorithm working first.

I'm thinking it would be a good idea to extract all text as individual text pieces (from paragraphs, tables, headers and so on), as text can move around on pages.
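
A minimal sketch of that per-block extraction, written in Python for brevity rather than C#, and using only the standard library's `html.parser`. The names `BlockExtractor`, `extract_blocks`, `BLOCK_TAGS` and `SKIP_TAGS` are made up for illustration, and the choice of which tags count as a "block" is an assumption:

```python
from html.parser import HTMLParser

# Elements treated as separate text blocks (an assumption; extend as needed).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "th"}
# Elements whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collects the visible text of each block-level element as one string."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self._parts = []      # text fragments of the block being read
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        # Join the fragments and collapse runs of whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)


def extract_blocks(html):
    """Return the page's text as a list of independent blocks."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a block element
    return parser.blocks
```

Because each block is stored on its own, a paragraph that moves to a different position on the page still yields the same string, which is what makes the later comparison order-independent.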

I'm implementing this in C# (maybe ASP.NET).

I'm very interested in any input or advice you might have, so please shoot! :)


给我一枪 2024-08-20 07:57:23

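My approach to this problem would be to Google for specific, fairly unique blocks of text whose copyright you are trying to protect.

Having said that, if you want to build your own solution, here are some comments: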

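Respect robots.txt. If they have marked the site as do-not-crawl, they are probably not trying to profit from your content anyway.
You will need to refresh the site structure you have stored from time to time, as websites change.
You will need to properly separate text from HTML tags and JavaScript.
You will essentially need to do a full-text search over the entire text of the page (with tags and scripts removed) for the text you wish to protect. There are good, published algorithms for this.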
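
Two of these points can be sketched roughly as follows, in Python rather than C# and with the standard library only. The function names are invented for illustration. The robots.txt content is passed in as a string so the example does no network access, and the search is a plain normalized substring test, not one of the published full-text algorithms:

```python
import re
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt content allows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def normalize(text):
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def contains_passage(page_text, protected_passage):
    """True if the passage occurs in the page text as a run of whole words,
    ignoring case, punctuation and spacing."""
    needle = normalize(protected_passage)
    haystack = normalize(page_text)
    # Padding with spaces keeps matches aligned to word boundaries.
    return bool(needle) and f" {needle} " in f" {haystack} "
```

Here `page_text` is assumed to be the page's text after tags and scripts have been stripped. An exact search like this only catches verbatim copies; reworded text would need a fuzzier comparison.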


梦旅人picnic 2024-08-20 07:57:23

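You're probably going to be more interested in fragment detection. For example, lots of pages will have the word "home" on them, and you don't care. But it is fairly unlikely that many pages will share exactly the same words across the whole page. So you probably want to compare pages and report the ones that have exact matches of 4, 5, 6, 7, 8, etc. words in a row, along with the count for each length. Assign a score and weight them, and if a page exceeds your "magic number", report the suspected copier.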
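
A rough sketch of that scoring idea, in Python rather than C#. The function names, the weighting (each match counts for its length in words) and the threshold value are all assumptions for illustration, not something prescribed here:

```python
import re


def words(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Set of all runs of n consecutive words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def match_counts(text_a, text_b, lengths=range(4, 9)):
    """For each run length, count the distinct word runs found in both texts."""
    a, b = words(text_a), words(text_b)
    return {n: len(ngrams(a, n) & ngrams(b, n)) for n in lengths}


def suspicion_score(counts):
    """Weight each match by its length so longer shared runs count for more."""
    return sum(n * count for n, count in counts.items())


MAGIC_NUMBER = 20  # hypothetical threshold; tune it on real pages


def is_suspect(text_a, text_b, threshold=MAGIC_NUMBER):
    """True if the weighted match score exceeds the threshold."""
    return suspicion_score(match_counts(text_a, text_b)) > threshold
```

Short common phrases such as navigation labels never reach four words in a row, so they contribute nothing to the score, while a single copied sentence produces matches at several lengths at once.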

For C#, you can use webBrowser() to fetch a page and get its text fairly easily. Sorry, no code sample handy to copy and paste, but MSDN usually has pretty good samples.
