Solr and Nutch - how to control facets?
Sorry if this question might be too general. I'd be happy with good links to documentation, if there are any. Google won't help me find them.
I need to understand how facets can be extracted from a web site crawled by Nutch then indexed by Solr. On the web site, pages have meta tags, like <meta name="price" content="123.45"/> or <meta name="categories" content="category1, category2"/>. Can I tell Nutch to extract those and Solr to treat them as facets?
In the example above, I want to specify manually that the meta name "categories" is to be treated as a facet, but the content should be dynamically used as categories.
Does it make sense? Is it possible to do with Nutch and Solr, or should I rethink my way of using it?
Comments (2)
I haven't used Nutch (I use Heritrix), but at the end of the day Nutch needs to extract the "meta" tag values and index them in Solr (using SolrJ, for example) under separate Solr fields such as "price" and "categories".
Then you run a facet query on the "categories" field to get facets per category. The Solr documentation on faceting has the details.
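As a rough sketch of such a facet request (not the exact query from this answer, which did not survive on this page), the snippet below builds a Solr select URL from the standard `q`, `rows`, `wt`, `facet` and `facet.field` parameters. The host, port and core name `nutch` are assumptions; the field name `categories` comes from the question.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed Solr location and core name; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/nutch/select"


def build_facet_url(facet_fields, query="*:*", base_url=SOLR_SELECT_URL):
    """Return a select URL that asks Solr for facet counts on the given fields."""
    params = [
        ("q", query),       # match all documents by default
        ("rows", "0"),      # only the facet counts are of interest here
        ("wt", "json"),
        ("facet", "true"),
    ]
    # One facet.field parameter per field to facet on.
    params.extend(("facet.field", field) for field in facet_fields)
    return base_url + "?" + urlencode(params)


facet_url = build_facet_url(["categories"])
print(facet_url)
```

For the counts to come back per category, each category probably has to reach Solr as a separate value of a multi-valued field rather than as the single string `category1, category2`.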
One option is to use Nutch with the metadata plugin.
Although it is presented as an example, it ships with the distribution.
This assumes you already know the other steps of configuring Nutch and crawling data with it.
Before indexing, you need to configure Nutch to use the metadata plugin.
Edit conf/nutch-site.xml to enable it.
The meta tags that need to be indexed, such as price, are supplied in another property of the same file.
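The original configuration snippets are missing from this page, so the following is only a sketch of what such a fragment might look like. The plugin names (`parse-metatags`, `index-metadata`) and the properties `metatags.names` and `index.parse.md` are, to the best of my knowledge, the ones used by Nutch 1.x, but they are assumptions here and should be checked against the `nutch-default.xml` of your version. The small Python wrapper only verifies that the fragment is well-formed and lists its properties.

```python
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml fragment. The plugin.includes value should start
# from the default of your Nutch version, with "metatags" and "metadata" added;
# the property names below are assumptions based on Nutch 1.x.
NUTCH_SITE_XML = """\
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>metatags.names</name>
    <value>price,categories</value>
  </property>
  <property>
    <name>index.parse.md</name>
    <value>metatag.price,metatag.categories</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Parse a Hadoop-style configuration document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


properties = read_properties(NUTCH_SITE_XML)
for name, value in properties.items():
    print(f"{name} = {value}")
```

Depending on the plugin version, the indexed fields may carry a `metatag.` prefix rather than the plain name `price`; either way the Solr schema needs matching field definitions.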
Now you can run the Nutch crawl command. After crawling and indexing with Solr, you should see a field named price in the index. To use faceted search, add facet.field to your query.
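With the JSON response writer, Solr by default returns field facet counts as a flat list that alternates value and count under `facet_counts.facet_fields`. A minimal sketch of turning that list into a dictionary follows; the sample response is made up for illustration and is not output from a real index.

```python
import json

# Made-up sample response for illustration only; real counts come from your index.
SAMPLE_RESPONSE = """
{
  "response": {"numFound": 3, "start": 0, "docs": []},
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {"categories": ["category1", 2, "category2", 1]}
  }
}
"""


def facet_counts(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
    # Even positions hold the facet values, odd positions hold their counts.
    return dict(zip(flat[0::2], flat[1::2]))


counts = facet_counts(json.loads(SAMPLE_RESPONSE), "categories")
print(counts)
```

Returning an empty dictionary for a missing field keeps the caller simple when a query matches no documents or the field was not requested.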