SharePoint FAST Search Crawler Problems with DokuWiki Pages
My level of frustration is maxing out over crawling DokuWiki sites.

I have a content source in FAST Search for SharePoint that I have set up to crawl a dokuwiki/doku.php site. My crawler rule is set to http://servername/*, with "match case", "include all items in this path", and "crawl complex URLs" enabled. Testing the content source against the crawl rules shows that it will be crawled. However, the crawl always lasts under two minutes and completes having crawled only the page I pointed it at, and no other link on that page. I have checked with the DokuWiki admin, and he has robots set to allow; when I look at the source of the pages, I see:

<meta name="robots" content="index,follow">
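For reference, a crawl rule with those options can also be created from PowerShell on the SharePoint side. This is only a sketch of the configuration described above; the service application name is an assumption:

```powershell
# Sketch: recreate the crawl rule described above (names and URL assumed).
$ssa = Get-SPEnterpriseSearchServiceApplication "FAST Query SSA"

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://servername/*" `
    -Type InclusionRule `
    -FollowComplexUrls $true   # follow URLs with query strings, e.g. doku.php?id=...
```

The "match case" option does not appear in this sketch; setting it on the rule through Central Administration, as described above, is assumed.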
So, in order to test that the other linked pages were not the problem, I added those links to the content source manually and recrawled. The example source page has three links:
- site A
- site B
- site C

I added the site A, B, and C URLs to the crawl source. The result of this crawl was 4 successes: the primary source page, plus the links A, B, and C that I had added manually.
So my question is: why won't the crawler crawl the links on the page? Is this something I need to address with the crawler on my end, or does it have to do with how namespaces are defined and links are constructed in DokuWiki?

Any help would be appreciated.
Eric
Comments (2)
Did you disable the delayed indexing options and rel=nofollow options?
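For context, both behaviors mentioned above are DokuWiki configuration options set in conf/local.php. A minimal sketch of disabling them, assuming a stock DokuWiki install:

```php
<?php
// conf/local.php (sketch): turn off delayed indexing and nofollow links
$conf['indexdelay']  = 0;  // emit <meta name="robots" content="index,follow">
                           // right away, instead of "noindex" for recently
                           // created or changed pages
$conf['relnofollow'] = 0;  // stop adding rel="nofollow" to external links
```

With a nonzero indexdelay, pages newer than the delay are served with a noindex robots meta, which would stop a crawler in exactly this way.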
The issue was around authentication, even though nothing in the FAST crawl logs suggested it was authentication.

The fix was adding a $freepass setting for the IP address of the search indexing server, so that Apache does not go through the authentication process on every page hit.
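The $freepass variable reads as site-specific rather than a stock Apache or DokuWiki setting, but the effect described, exempting one IP from authentication, can be sketched in plain Apache 2.4 configuration. The directory path, htpasswd file, and indexer address are assumed values:

```apache
# Sketch: let the search indexing server bypass Basic auth while all
# other clients still authenticate (Apache 2.4 syntax, values assumed).
<Directory "/var/www/dokuwiki">
    AuthType Basic
    AuthName "Restricted Wiki"
    AuthUserFile "/etc/apache2/.htpasswd"
    <RequireAny>
        Require ip 10.0.0.50    # FAST search indexing server
        Require valid-user      # everyone else logs in as before
    </RequireAny>
</Directory>
```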
Thanks for the reply
Eric