Nutch crawler does not index HTML content
I am trying to build a search feature where I enter a city name and get the weather conditions for that city.
I have set up Nutch-1.3 and Solr-3.4.0 on my system. I am crawling the website here and passing the index to Solr for searching. Now I want a query for delhi to return the information displayed on this link.
How can I achieve this? Do I need to write a plugin? The document currently indexed in Solr for that page has empty content and title fields:
<doc>
  <float name="score">1.0</float>
  <float name="boost">0.1879294</float>
  <str name="content"/>
  <str name="digest">d41d8cd98f00b204e9800998ecf8427e</str>
  <str name="id">http://www.imd.gov.in/section/nhac/distforecast/delhi.htm</str>
  <str name="segment">20111118153543</str>
  <str name="title"/>
  <date name="tstamp">2011-11-18T10:06:45.604Z</date>
  <str name="url">http://www.imd.gov.in/section/nhac/distforecast/delhi.htm</str>
</doc>
Comments (1)
Nutch basically crawls by following the links it finds on pages.
However, the India page has no link that leads to the Delhi page you mention.
So Nutch won't be able to navigate down to that page.
You can create your own dummy HTML page to act as the start URL for the crawl, containing links to all the pages you want Nutch to index.
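The dummy start page described above could be generated with a small script. This is only a sketch: the function name build_seed_page and the page title are made up for illustration, and the URL list contains just the Delhi URL from the question (other city pages would have to be added by hand). The resulting HTML would still need to be served from somewhere Nutch can fetch it.

```python
from html import escape


def build_seed_page(urls, title="Seed page for Nutch"):
    """Return a minimal HTML page with one link per URL (hypothetical helper)."""
    items = "\n".join(
        '    <li><a href="{0}">{1}</a></li>'.format(escape(u, quote=True), escape(u))
        for u in urls
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head><title>" + escape(title) + "</title></head>\n"
        "<body>\n"
        "  <ul>\n" + items + "\n  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


# Only the Delhi URL is known from the question; extend this list as needed.
city_urls = ["http://www.imd.gov.in/section/nhac/distforecast/delhi.htm"]
seed_html = build_seed_page(city_urls)
print(seed_html)
```

Keeping the link list in one place makes it easy to regenerate the page whenever another city URL is added.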
What's the default search field in your schema?
Usually it's the text field, and a query for delhi looks in that field for matches.
Since the query *:* returns the Delhi result but the query delhi does not, delhi is not matching the indexed tokens of the field being searched. What field type is defined for url in the schema?
You can copy that field to another field with text analysis, which would produce the delhi token; querying for url_copy:delhi should then return the result.
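To see why the field type matters, here is a rough simulation of the difference between an untokenized field and an analyzed copy of it. It is not Solr's actual analysis chain: the two tokenizer functions are simplified stand-ins invented for this sketch, and in Solr itself the copy would normally be declared with a copyField rule in schema.xml pointing at a text-type field such as the url_copy field named above.

```python
import re


def string_field_tokens(value):
    # Assumption: an untokenized (string-like) field is indexed as one single term.
    return [value]


def text_field_tokens(value):
    # Simplified stand-in for text analysis: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]


def matches(query, tokens):
    # A term query matches only if the query equals one of the indexed tokens.
    return query.lower() in tokens


url = "http://www.imd.gov.in/section/nhac/distforecast/delhi.htm"

print(matches("delhi", string_field_tokens(url)))  # False: the whole URL is one term
print(matches("delhi", text_field_tokens(url)))    # True: "delhi" is its own token
```

If the url field in your schema is already tokenized, the mismatch is more likely to come from which field is searched by default, so it is worth checking both.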