Drupal and Google Search Appliance (Google Mini)

Posted on 2024-08-11 10:06:55

I have a Drupal site with pages indexed by a Google Mini search appliance.

Earlier in the week I noticed that a bunch of links were marked as indexed but excluded, because each page had a 'print this page' link back to itself that carried rel="nofollow". I took the nofollow out and let the GSA reindex the site two days ago.

Now, the pages in question are marked as indexed inside the GSA, but they are not showing up in the search results of the site.

I can search at /search/google_appliance/TERM and they do not show up. When I search for other terms, they do show up. In other words, I know that GSA is working.

When I search at /search/node/TERM [Drupal default search], I get the Drupal results, which are different (pages containing the term do show up). This makes me pretty sure I'm hitting the GSA.

Any ideas on why the newly indexed pages aren't showing up in GSA search?

EDIT/Solved:
There were a couple of issues. Previously the search used an XSLT stylesheet to control how it displayed the page, and where it sent the queries when you hit submit (on the appliance, not the submit button on the site). The query string was passed to the site in the old format, which then gave a 404 (the same thing happens if you search bookstore.site.com or origin.site.com). It was more of a 'can't get there from here' sort of problem than anything to do with searching. I've removed the XSLT, so it now just uses the default Google look and feel, and lets us do nice, generic searches against the appliance's database.

However, there were still some weird search results coming back that the Drupal module could not parse, and the logs were getting hit with: simplexml_load_string() [function.simplexml-load-string]: ^ in \sites\all\modules\google_appliance\GoogleMini.php on line 318.

I experimented with some query string variables and with commenting out the line that sets the output encoding, and now all seems to work.
The line in question is in google_appliance.module on line 322:

$gm->setOutputEncoding('utf8');
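
The parse failure described above can also be checked outside Drupal. What follows is a minimal Python sketch, not the module's code: it builds a search URL from the appliance's `q`, `site`, `client`, `output` and `oe` query parameters, then tries to parse a response body and reports where parsing fails. The host name, the collection and front end names, and the helper names `build_search_url` and `parse_results` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_search_url(host, term, output_encoding=None):
    """Build an XML search URL; host, site and client are placeholder values."""
    params = {
        "q": term,
        "site": "default_collection",   # assumed collection name
        "client": "default_frontend",   # assumed front end name
        "output": "xml_no_dtd",         # raw XML results, no stylesheet applied
    }
    if output_encoding:
        params["oe"] = output_encoding  # leave out to use the appliance default
    return "http://%s/search?%s" % (host, urlencode(params))


def parse_results(xml_bytes):
    """Return the result URLs, or the position at which XML parsing failed."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        line, column = exc.position
        return {"ok": False, "error": str(exc), "line": line, "column": column}
    # In the appliance's XML output each result is an <R> element with its URL in <U>.
    return {"ok": True, "urls": [r.findtext("U") for r in root.iter("R")]}
```

Fetching the same term once with `oe` set and once without (for example with `urllib.request.urlopen(...).read()`) and passing both bodies to `parse_results` should show whether the encoding parameter is what makes the feed unparseable.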

Comments (2)

み青杉依旧 2024-08-18 10:06:56

I recently struggled with something similar.

One suggestion here -- pick a page which you know has the search term. Open the HTML of the page in your browser and make sure you see that term. Absolutely sure.

Next, take that URL and put it in as one of the starting pages in your crawl.

After the crawl, go into the Search Diagnostics and drill down to that page. Do you see it crawled? Okay, great, now go look at the cache of the page. Right below "Link to this page" ought to be a hyperlink called "Cached version." Look at that. You may be in for a surprise! I certainly was.
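
The live-versus-cached comparison can be made less error-prone with a small script. This is only a sketch in Python under stated assumptions: you save the live page source and the appliance's cached version yourself, and the helper name `term_in_visible_text` is made up for illustration. It checks whether a term occurs in the visible text of an HTML document, ignoring script and style content.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def term_in_visible_text(html, term):
    """Case-insensitive check for a term in the page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so terms split across line breaks still match.
    text = " ".join(" ".join(parser.chunks).split())
    return term.lower() in text.lower()
```

If the function returns True for the live HTML and False for the cached copy, the appliance crawled something other than what your browser shows (a login page or an error page, for instance).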

辞旧 2024-08-18 10:06:55

I am not 100% sure I got your question right. I am assuming here that:

  1. What GSA fails to index are the pages that contain the "print this page" link (rather than those pages being indexed and the problem lying in the indexing of their printable versions).
  2. The following bit means that you can find other pages which contain other terms, and not that you can find the missing pages if you search for them with another term.

I can search at /search/google_appliance/TERM and they do not show up. When I search for other terms, they do show up. In other words, I know that GSA is working

Please correct me if I misunderstood your question. If I got it wrong, please provide some more details about the terms you are using.

This is, however, what I would do to identify the source of the problem (although I would probably not do these in this precise order):

  1. I would try to understand which distinctive elements of the "bad pages" (if any) trigger the odd behaviour. It seems that you have already done some of this digging and consider the culprit to be the print link. Have you verified this by removing the link altogether and seeing whether the pages then get indexed correctly?
  2. I would check whether there is any rule in robots.txt that might interfere with the indexing. GSA honors that file, so, for example, if your pages' URLs begin with /admin/, those pages will be skipped.
  3. I would check whether my pages have some kind of access control restricting their view. If so, I would check that GSA has been configured for it. (The same applies to unpublished pages, of course, where you have to be an admin to see them or to index them with an external application.)
  4. I am not sure whether GSA uses sitemap.xml to perform the indexing. However, I would inspect the Drupal-generated sitemap.xml file (if any) to check for blatant errors, such as a priority set to 0. If you don't have such a file, and know that GSA uses it, I would try to generate one with the appropriate module and see if this solves the problem.
  5. I would inspect the sitemap generated by GSA to see whether it shows any blatant anomaly too. The sitemap itself would clearly not be the cause, but any kind of self-explanatory anomaly could put you on the right track.
  6. If the problem is not specific to the page structure (see point #1 of this list), I would begin to search systematically for the non-structural element that generates the error. Does a different theme solve the problem? Does deactivating a given module solve the problem? (Maybe the problem is with meta tags? Maybe with the "print this page" module? Maybe a module sets the language of those pages to a different language than the rest of the site?) All of these are rather unlikely possibilities, but before smashing the GSA with a sledgehammer I would try them too.
  7. I would go through (probably for the Nth time) all the settings of my GSA.
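
The robots.txt check from point 2 is easy to script. Here is a minimal sketch using Python's standard `urllib.robotparser`; the user agent string `gsa-crawler`, the example URLs and the helper name `blocked_urls` are assumptions for illustration, so substitute the user agent configured on your own appliance.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="gsa-crawler"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]
```

Feeding it the site's actual robots.txt together with the URLs of the pages missing from the search results shows whether any of them are excluded before the crawler even requests them.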

All of the above, if I had the chance, I would do with a peer. He or she could help rule out the "human factor" as the source of the problem (i.e. that little checkbox in the configuration panel that is paramount to him/her but that you never noticed before...).

If you manage to find any more hints about what is going on, report them back here. If it is a problem on the Drupal side, I'm pretty sure that I or one of the other excellent "drupalists" hanging around on SO will be able to help.

HTH!
