Duplicate content in Google. SEO for Drupal
I have a Drupal site that is up and running. The site is not properly optimized for SEO, and a lot of duplicate content gets indexed in Google because of /category, /taxonomy, etc.
The structure is:
/var/www/appname/ This contains a custom-built application
/var/www/appname/drup This contains my Drupal installation
I went through the results of a Google search for site:appname.com and saw that there is a lot of duplicated content because of /content, /taxonomy, /node, etc.
My robots.txt in /var/www/appname already contains the following, but I am surprised that these pages are still getting indexed. Please advise.
User-agent: *
Crawl-delay: 10
Allow: /
Allow: /drup/
# Directories
Disallow: /drup/includes/
Disallow: /drup/misc/
Disallow: /drup/modules/
Disallow: /drup/profiles/
Disallow: /drup/scripts/
Disallow: /drup/themes/
# Files
Disallow: /drup/CHANGELOG.txt
Disallow: /drup/cron.php
Disallow: /drup/INSTALL.mysql.txt
Disallow: /drup/INSTALL.pgsql.txt
Disallow: /drup/install.php
Disallow: /drup/INSTALL.txt
Disallow: /drup/LICENSE.txt
Disallow: /drup/MAINTAINERS.txt
Disallow: /drup/update.php
Disallow: /drup/UPGRADE.txt
Disallow: /drup/xmlrpc.php
# Paths (clean URLs)
Disallow: /drup/admin/
Disallow: /drup/comment/reply/
Disallow: /drup/contact/
Disallow: /drup/logout/
Disallow: /drup/node/add/
Disallow: /drup/search/
Disallow: /drup/user/register/
Disallow: /drup/user/password/
Disallow: /drup/user/login/
# Paths (no clean URLs)
Disallow: /drup/?q=admin/
Disallow: /drup/?q=comment/reply/
Disallow: /drup/?q=contact/
Disallow: /drup/?q=logout/
Disallow: /drup/?q=node/add/
Disallow: /drup/?q=search/
Disallow: /drup/?q=user/password/
Disallow: /drup/?q=user/register/
Disallow: /drup/?q=user/log
4 Answers
You just need an XML sitemap that tells Google where all the pages are, rather than letting Google crawl the site on its own.
In fact, when Stack Overflow was in beta, they tried to let the crawler work its magic. However, on highly dynamic sites it's almost impossible to get adequate results that way.
Thus, with an XML sitemap you tell Google where each page is, what its priority is, and how often it changes.
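For reference, a minimal sitemap file following the sitemaps.org protocol looks like the sketch below; the URL is just a placeholder for one of your node pages, and in practice a Drupal module such as XML sitemap can generate and update the file for you.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per canonical page -->
  <url>
    <loc>http://appname.com/drup/node/123</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>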
There are several modules that take care of SEO and duplicated content.
I would first advise installing and going over http://drupal.org/project/seo_checklist
For duplicated content you can check http://drupal.org/project/globalredirect
Anyway, /taxonomy and /content are just listing pages; instead of disallowing them, you may want to override their paths with some sort of custom content and let crawlers know what they are looking at.
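If you have Drush available, a quick way to pull in and enable both modules is sketched below (this assumes a Drupal 6/7 site with Drush installed; the short names are the project names used on drupal.org):
drush dl seo_checklist globalredirect    # download both projects
drush en -y seo_checklist globalredirect # enable them without prompting
drush cc all                             # clear caches so the changes take effect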
You can disallow the directories that are showing duplicate content. As you explained, /content, /taxonomy and /node are the paths producing duplicates.
Add the following lines to the # Directories section of your robots.txt file to restrict search engines' access to these paths.
Disallow: /drup/content/
Disallow: /drup/taxonomy/
Disallow: /drup/node/
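Since your file also lists the non-clean-URL variant of each path, you would presumably want to mirror these rules under # Paths (no clean URLs) as well; a sketch following the same pattern as the existing entries:
Disallow: /drup/?q=content/
Disallow: /drup/?q=taxonomy/
Disallow: /drup/?q=node/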
Do you have the ability to verify ownership of the site with Google Webmaster Tools at:
http://www.google.com/webmasters/tools
If so, I'd recommend doing that, then trying "Fetch as Googlebot" under the "Diagnostics" category for that site. Your "Fetch Status" will indicate "Denied by robots.txt" if your robots.txt is working as expected.
Indexed pages can hang around for a while and still show up in Google search results after you've changed the robots.txt. But Fetch as Googlebot gives you a real-time indication of what happens when Googlebot comes knocking...
If the URLs that you don't want indexed are retrieved without a problem, then you'll need to focus on problems with the robots.txt itself: where it is located, its syntax, the paths listed, etc. I always suggest people retrieve it manually in the browser (at the root of their web site) to double-check for obvious goofs.
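A quick manual check from the command line works too; a sketch, with appname.com standing in for your real domain (the file must be served from the site root, not from /drup/):
curl -s http://appname.com/robots.txt | head -n 20   # fetch robots.txt exactly as a crawler would see it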