Comparing the text content of websites

Posted 2024-08-13 07:57:23, 642 words, 11 views, 0 comments

I'm experimenting a bit with textual comparison/basic plagiarism detection, and want to try this on a website-to-website basis. However, I'm a bit stuck in finding a proper way to process the text.

How would you process and compare the content of two websites for plagiarism?

I'm thinking something like this pseudo-code:

// extract text
foreach website in websites
  crawl website - store structure so pages are only scanned once
  extract text blocks from all pages - store these in a list

// compare      
foreach text in website1.textlist
  compare with all text in website2.textlist

I realize that this solution could very quickly accumulate a lot of data, so it might only be possible to make it work with very small websites.

I haven't decided on the actual text comparison algorithm yet, but right now I'm more interested in getting the actual process algorithm working first.

I'm thinking it would be a good idea to extract all text as individual text pieces (from paragraphs, tables, headers and so on), as text can move around on pages.

I'm implementing this in C# (maybe ASP.NET).

I'm very interested in any input or advice you might have, so please shoot! :)


Comments (2)

给我一枪 2024-08-20 07:57:23

My approach to this problem would be to google for specific, fairly unique blocks of text whose copyright you are trying to protect.

Having said that, if you want to build your own solution, here are some comments:

  • Respect robots.txt. If they have marked the site as do-not-crawl, chances are they are not trying to profit from your content anyway.
  • You will need to refresh the site structure you have stored from time to time as websites change.
  • You will need to properly separate text from HTML tags and JavaScript.
  • You will essentially need to do a full-text search in the entire text of the page (with tags and scripts removed) for the text you wish to protect. There are good, published algorithms for this.
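
The last two points can be sketched roughly as follows. This is only a minimal illustration in Python (the question targets C#, but the structure should carry over), using the standard-library html.parser module. The names VisibleTextExtractor, normalize, extract_text and contains_protected_text are made up for this example, and a plain substring test stands in for whichever published search algorithm you end up choosing.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_text(html):
    """Return the visible text of an HTML document as one normalized string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Assumption: text nodes are joined with a space, so a word split by an
    # inline tag (e.g. <b>wo</b>rd) would come out as two words.
    return normalize(" ".join(parser.chunks))


def contains_protected_text(html, protected):
    """Naive full-text check: is the protected passage present in the page?"""
    return normalize(protected) in extract_text(html)
```

Calling contains_protected_text(page_html, "some fairly unique sentence") then tells you whether that passage appears in the visible text of the page, ignoring case and whitespace differences. Real-world HTML is messier than this handles, so a dedicated HTML parsing library is probably the better choice in practice.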

梦旅人picnic 2024-08-20 07:57:23

You're probably going to be more interested in fragment detection. For example, lots of pages will have the word "home" on them, and you don't care. But it's fairly unlikely that very many pages will have exactly the same words across the entire page. So you probably want to compare pages and report exact matches of length 4, 5, 6, 7, 8, etc. words, with a count for each length. Assign a score, weight the lengths, and if a page exceeds your "magic number", report the suspected xeroxers.
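
A rough sketch of that idea, in Python for brevity (it should translate to C# fairly directly). Everything here is an assumption made for illustration: the function names, the weighting (a shared run of n words simply counts n points) and the default threshold of 20 are placeholders, not tuned values.

```python
import re


def words(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the set of distinct runs of n consecutive words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_fragment_counts(text_a, text_b, lengths=range(4, 9)):
    """For each length n, count distinct n-word runs present in both texts."""
    tokens_a, tokens_b = words(text_a), words(text_b)
    return {n: len(ngrams(tokens_a, n) & ngrams(tokens_b, n)) for n in lengths}


def suspicion_score(counts):
    """Hypothetical weighting: each shared run of n words adds n points."""
    return sum(n * count for n, count in counts.items())


def is_suspicious(text_a, text_b, threshold=20):
    """Flag the pair when the weighted score reaches the 'magic number'."""
    return suspicion_score(shared_fragment_counts(text_a, text_b)) >= threshold
```

Short common strings such as "home" never reach the minimum length of 4 words, so they contribute nothing, while a long copied passage is counted at every length and pushes the score up quickly.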

For C#, you can use webBrowser() to get a page and fairly easily get its text. Sorry, no code sample handy to copy/paste, but MSDN usually has pretty good samples.
