Drupal and Google Search Appliance (Google Mini)
I have a Drupal site with pages indexed by a Google Mini search appliance.
Earlier in the week I noticed that a bunch of links were marked as indexed but excluded, because there was a 'print this page' link back to the same page that carried rel="nofollow". I took the nofollow out and let the GSA reindex the site 2 days ago.
Now, the pages in question are marked as indexed inside the GSA, but they are not showing up in the search results of the site.
I can search at /search/google_appliance/TERM and they do not show up. When I search for other terms, they do show up. In other words, I know that GSA is working.
When I search at /search/node/TERM [Drupal default search], I get the Drupal results, which are different [pages with the term show up]. This makes me pretty sure I'm hitting GSA.
Any ideas on why the newly indexed pages aren't showing up in GSA search?
EDIT/Solved:
There were a couple of issues. Previously the search used an XSLT to handle how it displayed the page, and where it sent the queries on the page when you hit submit (on the appliance, not the submit button on the site). The query string was passed in the old format to the site, which then gave a 404 (the same thing happens if you do a search of bookstore.site.com and origin.site.com). More of a 'can't get there from here' sort of problem than anything having to do with searching. I've removed the XSLT, so it just uses the default Google look and feel, and lets us do nice, generic searches against the appliance's database.
However, there were still some weird search results coming back that the Drupal module could not parse, and the logs were getting hit with simplexml_load_string() [function.simplexml-load-string]: ^ in \sites\all\modules\google_appliance\GoogleMini.php on line 318.
I experimented with some querystring variables and commenting out the line that sets the Output encoding and all seems to work.
The line in question is in google_appliance.module on line 322:
$gm->setOutputEncoding('utf8');
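For anyone chasing a similar parse warning, here is a minimal Python sketch of how one might find where a saved appliance response stops being well-formed XML. The post does not confirm the root cause. The sketch assumes an encoding mismatch (bytes that do not match the declared encoding), which is only one possibility. The function name `diagnose_xml` and the sample payloads are hypothetical and are not part of the google_appliance module.

```python
import xml.etree.ElementTree as ET


def diagnose_xml(raw: bytes, context: int = 20):
    """Return None if `raw` parses as XML, else a dict describing the failure.

    Hypothetical helper: it only reports where the parser gave up, so the
    offending bytes of a saved response can be inspected by hand.
    """
    try:
        ET.fromstring(raw)
        return None
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        lines = raw.splitlines()
        bad_line = lines[line - 1] if 0 < line <= len(lines) else b""
        start = max(column - context, 0)
        return {
            "message": str(err),
            "line": line,
            "column": column,
            "snippet": bad_line[start:column + context],
        }


# Sample payloads (made up): the second declares UTF-8 but carries a
# Latin-1 encoded byte (0xE9), which is not valid UTF-8 at that position.
good = b'<?xml version="1.0" encoding="UTF-8"?><GSP><Q>caf\xc3\xa9</Q></GSP>'
bad = b'<?xml version="1.0" encoding="UTF-8"?><GSP><Q>caf\xe9</Q></GSP>'

print(diagnose_xml(good))
print(diagnose_xml(bad))
```

If the snippet shows bytes that do not match the declared encoding, that would point at the output encoding setting rather than at the search results themselves.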
Comments (2)
I recently struggled with something similar.
One suggestion here -- pick a page which you know has the search term. Open the HTML of the page in your browser and make sure you see that term. Absolutely sure.
Next, take that URL and put it in as one of the starting pages in your crawl.
After the crawl, go into the Search Diagnostics and drill down to that page. Do you see it crawled? Okay, great, now go look at the cache of the page. Right below "Link to this page" ought to be a hyperlink called "Cached version." Look at that. You may be in for a surprise! I certainly was.
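The "make sure you see that term" step can also be done mechanically. The sketch below is a rough Python check, not anything the appliance provides: it strips tags, scripts and styles from an HTML string and reports whether the term survives in the remaining text. The function name `term_in_visible_text` is hypothetical. Running it on both the live page source and the cached version would show whether the two actually differ.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def term_in_visible_text(html: str, term: str) -> bool:
    """Case-insensitive check for `term` in the non-script text of `html`."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so terms split across lines still match.
    text = " ".join(" ".join(parser.chunks).split())
    return term.lower() in text.lower()


# Made-up page: "bookstore" is in the body text, "widget" only in a script.
page = (
    "<html><body><p>Campus Bookstore hours</p>"
    "<script>var t = 'widget';</script></body></html>"
)
print(term_in_visible_text(page, "bookstore"))
print(term_in_visible_text(page, "widget"))
```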
I am not 100% sure I got your question right. I am assuming here that:
Please correct me if I misunderstood your question. Should I have got it wrong, please provide some more details about the terms you are using.
This is, however, what I would do to identify the source of the problem (although I would probably not do these in this precise order):
Check whether there are any rules in robots.txt that might interfere with the indexing. GSA honors that file, so for example if your pages' URLs begin with /admin/, those pages will be skipped.
I am not sure whether GSA uses sitemap.xml to perform the indexing. However, I would inspect the Drupal-generated sitemap.xml file (if any) to check for blatant errors like a priority set to 0, for example. If you don't have such a file, and know that GSA uses it, I would try to generate one with the appropriate module and see if this solves the problem.
All of the above - if I had the chance to - I would do with a peer. He or she could help rule out the "human factor" as the source of the problem (i.e. that little checkbox in the configuration panel that to him/her is so paramount but that you never noticed before...).
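The robots.txt and sitemap.xml checks above can be sketched in Python using only the standard library. This is an illustration under assumptions, not a description of how the appliance works. The user agent string "gsa-crawler", the example.com URLs and both function names are placeholders, so substitute whatever your appliance and site actually use.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_by_robots(robots_txt: str, user_agent: str, urls):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


def zero_priority_urls(sitemap_xml: str):
    """Return the <loc> of every sitemap entry whose <priority> is 0."""
    root = ET.fromstring(sitemap_xml)
    flagged = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        raw = url.findtext("sm:priority", default="", namespaces=SITEMAP_NS).strip()
        try:
            if raw and float(raw) == 0.0:
                flagged.append(loc)
        except ValueError:
            continue  # non-numeric priority: not the error this check targets
    return flagged


# Made-up inputs for illustration only.
robots = "User-agent: *\nDisallow: /admin/\n"
pages = ["http://example.com/admin/settings", "http://example.com/node/1"]
sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/node/1</loc><priority>0.0</priority></url>"
    "<url><loc>http://example.com/node/2</loc><priority>0.5</priority></url>"
    "</urlset>"
)

print(blocked_by_robots(robots, "gsa-crawler", pages))
print(zero_priority_urls(sitemap))
```

In practice you would feed these functions the real robots.txt and sitemap.xml downloaded from the site, plus the list of URLs that are missing from the search results.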
If you manage to find out any more hints on what is going on, report them back here. If it is a problem on the drupal side I'm pretty sure me or somebody else of the excellent "drupalists" hanging around on SO will be able to help.
HTH!