Canonical links as a way to fight scrapers?

Posted on 2024-08-27 06:27:25


Let's say several external sites are scraping/harvesting your content and posting it as their own. Let's also say that you maintain a single unique/permanent URL for each piece of content, so that content aliasing (on your site) is never an issue.

Is there any value, from an SEO perspective, in including a canonical link in your header anyway, such that when your site is "scraped", the canonical indication is injected into whatever site is stealing your content (assuming they harvest the raw HTML rather than going in through RSS, etc.)?
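As a concrete illustration: a self-referential canonical is a single `<link>` tag in the page head. The sketch below is a hypothetical Flask view (the route, template, and BASE_URL are made up, not from the original post). Note that the href should be absolute, because a relative one would resolve against the scraper's own domain once the HTML is copied.

```python
# A minimal sketch, assuming a Flask app. The route, template, and
# BASE_URL below are hypothetical, not from the original post.
from flask import Flask, render_template_string

app = Flask(__name__)
BASE_URL = "https://example.com"  # assumed permanent origin

PAGE = """<!doctype html>
<html>
<head>
  <!-- Self-referential canonical: survives naive raw-HTML scraping -->
  <link rel="canonical" href="{{ canonical }}">
  <title>{{ title }}</title>
</head>
<body>{{ body }}</body>
</html>"""

@app.route("/articles/<slug>")
def article(slug):
    # Each piece of content has exactly one permanent URL, and the
    # page itself declares that absolute URL as its canonical.
    return render_template_string(
        PAGE,
        canonical=f"{BASE_URL}/articles/{slug}",
        title=slug,
        body="...",
    )
```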

I've heard different things about the behavior of cross-site canonical links, from "they're ignored" to "behavior undefined" to "it can't hurt" to "sure that's exactly what canonical is intended for". My impression was that canonical was a good way of dealing with intra-site but not necessarily inter-site aliasing.

Comments (1)

岁月蹉跎了容颜 2024-09-03 06:27:25


I can't answer your question directly.

You (or someone in your company) should contact the parties who are syndicating your content without permission and try to get them to do it with permission. You should clarify your policy on unauthorised syndication. This is, of course, a business decision, and your business development/process people and IP lawyers will probably have to get involved.

If they persistently continue to do it and you absolutely need to get them to stop, you can start serving junk to their robots. Detecting their robots may be nontrivial, as they will probably be forging a "real" user-agent header and using varying IP addresses (most miscreants seem to use EC2 these days); however, if you are successful, their web sites will become full of junk.
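To make the "serving junk" idea concrete, here is a rough sketch assuming a Flask app. The SCRAPER_NETWORKS list (a documentation-range CIDR) and the junk generator are hypothetical placeholders; as noted above, reliably identifying the bots is the hard part.

```python
# A rough sketch of "serve junk to their robots", assuming a Flask
# app. SCRAPER_NETWORKS and the junk generator are hypothetical;
# user-agent headers can be forged, so this matches on IP ranges.
import ipaddress
import random

from flask import Flask, request

app = Flask(__name__)

# Assumed example: address ranges you have observed scraping you.
SCRAPER_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

def is_suspected_scraper(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SCRAPER_NETWORKS)

def junk_page() -> str:
    # Plausible-looking garbage, so the scraper republishes nonsense.
    return "<p>" + " ".join(random.choices(WORDS, k=200)) + "</p>"

@app.before_request
def poison_scrapers():
    # Returning a response from before_request short-circuits the
    # normal view, so suspected scrapers only ever see junk.
    if request.remote_addr and is_suspected_scraper(request.remote_addr):
        return junk_page()
```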

Once their web sites become full of junk (or worse), you can contact them again and ask whether they'd like to stop their obnoxious behaviour.
