Each blog post on my site -- http://www.correlated.org -- is archived at its own permalinked URL.
On each of these archived pages, I'd like to display not only the archived post but also the 10 posts that were published before it, so that people can get a better sense of what sort of content the blog offers.
My concern is that Google and other search engines will consider those other posts to be duplicate content, since each post will appear on multiple pages.
On another blog of mine -- http://coding.pressbin.com -- I had tried to work around that by loading the earlier posts as an AJAX call, but I'm wondering if there's a simpler way.
Is there any way to signal to a search engine that a particular section of a page should not be indexed?
If not, is there an easier way than an AJAX call to do what I'm trying to do?
Caveat: this hasn't been tested in the wild, but should work based on my reading of the Google Webmaster Central blog and the schema.org docs. Anyway...
This seems like a good use case for structuring your content using microdata. This involves marking up your content as a Rich Snippet of the type Article, like so:
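Something along these lines (the URLs and titles below are placeholders, not markup taken from the site):

```html
<!-- The archived post itself -->
<div itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="name">Title of the archived post</h1>
  <link itemprop="url" href="http://www.correlated.org/archive/this-post" />
  <div itemprop="articleBody">
    Full text of the archived post...
  </div>
</div>

<!-- One of the ten earlier posts, pointing back at its own permalink -->
<div itemscope itemtype="http://schema.org/Article">
  <h2 itemprop="name">Title of an earlier post</h2>
  <link itemprop="url" href="http://www.correlated.org/archive/earlier-post" />
  <div itemprop="articleBody">
    Text of the earlier post...
  </div>
</div>
```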
Note the use of itemscope, itemtype and itemprop to define each article on the page.
Now, according to schema.org, which is supported by Google, Yahoo and Bing, the search engines should respect the canonical URL described by the `itemprop="url"` above. So when marked up in this way, Google should be able to correctly ascribe which piece of content belongs to which canonical URL and weight it in the SERPs accordingly.
Once you're done marking up your content, you can test it using the Rich Snippets testing tool, which should give you a good indication of what Google thinks about your pages before you roll it into production.
P.S. The most important thing you can do to avoid a duplicate content penalty is to fix the titles on your permalink pages. Currently they all read 'Correlated - Discover surprising correlations', which will cause your ranking to take a massive hit.
I'm afraid I don't think it is possible to tell a search engine that a specific area of your web page should not be indexed (for example, a div in your HTML source). A solution would be to use an iframe for the content you do not want search engines to index, and then use a robots.txt file with an appropriate Disallow rule to deny access to the specific file loaded into the iframe.
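A rough sketch of what I mean (the /earlier-posts.html filename is just an example of how the earlier posts might be split into their own file):

```html
<!-- On the permalink page: pull the earlier posts in from a separate file -->
<iframe src="/earlier-posts.html" width="100%" height="600"></iframe>
```

```
# robots.txt at the site root: keep crawlers away from the file loaded in the iframe
User-agent: *
Disallow: /earlier-posts.html
```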
You can't tell Google to ignore portions of a web page, but you can serve up that content in such a way that the search engines can't find it. You can either place that content in an `<iframe>` or serve it up via JavaScript.

I don't like those two approaches because they're hackish. Your best bet is to completely block those pages from the search engines, since all of the content is duplicated anyway. You can accomplish that a few ways:
Block your archives using robots.txt. If your archives are in their own directory then you can block the entire directory easily. You can also block individual files and use wildcards to match patterns.
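For instance, something like this (the /archive/ directory and the wildcard pattern are only examples of how the archive URLs might be laid out):

```
# robots.txt at the site root
User-agent: *
# Block the whole archive directory:
Disallow: /archive/
# Or block individual files / wildcard patterns:
Disallow: /*?show=earlier
```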
Use the `<META NAME="ROBOTS" CONTENT="noindex">` tag to block each page from being indexed.

Use the `X-Robots-Tag: noindex` HTTP header to block each page from being indexed by the search engines. This is identical in effect to using the `<meta>` tag, although this one can be easier to implement since you can use it in a .htaccess file and apply it to an entire directory.
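For instance, on Apache with mod_headers enabled, a .htaccess file dropped into the archive directory could look like this (a sketch, assuming that server setup):

```
# .htaccess in the archive directory -- requires Apache with mod_headers
Header set X-Robots-Tag "noindex"
```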