I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, there are several menu structures across the site that get indexed and spoil the results of a query.
Each of these menus is clearly defined in a DIV, for example <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>
and several others.
I need to, at some point, delete the content in these DIVs.
I am guessing that the right place is during indexing by Solr, but I cannot work out how.
A pattern would look something like (<div id="calendar">).*?(<\/div>)
but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />
and I am not really sure where to put it in schema.xml.
When I do put that pattern in schema.xml, the file does not parse.
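
A likely reason for the parse failure: inside a double-quoted XML attribute, a literal < or " is not allowed, so the pattern has to be written with &lt; and &quot;. Also, a tokenizer pattern describes how to split text into tokens, while a char filter rewrites the text before it is tokenized, which is closer to what is wanted here. The fragment below is only a sketch under assumptions: the field type name text_no_menus is made up for illustration, the field is assumed to receive raw HTML (Nutch normally sends already-parsed text, in which case the tags are gone before Solr sees them), and the menu DIV is assumed to contain no nested DIVs, since the non-greedy match stops at the first closing tag.

```xml
<!-- Hypothetical field type for schema.xml; the name is an assumption. -->
<fieldType name="text_no_menus" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- (?s) lets . match newlines; the angle brackets and quotes of the
         original pattern are written as XML entities so the file parses. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;div id=&quot;calendar&quot;&gt;.*?&lt;/div&gt;"
                replacement=" "/>
    <!-- Strip whatever markup is left after the menu block is removed. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Char filters only change what gets indexed for searching; the stored value of the field is left as it was sent, so the menu text could still show up in stored content or highlighting.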
Comments (4)
Here is a patch for Solr that you can place in your indexing config to ignore the contents of tags you configure. It only works with XML, though, so if you can tidy your HTML, or you know that it is XHTML, then this would work, but it won't work with arbitrary HTML.
I think you have a few choices:
I would think the easiest way to do this is #2, if you have a consistent way of drawing the header (i.e., a skin or a common include). Then perhaps #1 and #4. I think #3 would be the hardest, but I might be wrong.
A new feature was introduced in Nutch 1.12: the Apache Tika parser can apply the Boilerpipe algorithm to strip header and footer content from HTML pages during the parsing stage itself.
We can use the following properties in nutch-site.xml to enable this:
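
The property block itself is missing from this copy of the post. As a sketch only, and assuming the property names and values listed for the Boilerpipe extractor in Nutch's nutch-default.xml are the ones meant, the override might look like this:

```xml
<!-- nutch-site.xml: assumed settings, not recovered from the original post -->
<property>
  <name>tika.extractor</name>
  <!-- "boilerpipe" turns the extractor on; the default is "none" -->
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- Other documented choices: DefaultExtractor, CanolaExtractor -->
  <value>ArticleExtractor</value>
</property>
```

Boilerpipe runs inside the parse-tika plugin, so HTML presumably has to be routed to that plugin rather than parse-html for these settings to take effect.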
It's working for me. Hope it will work for others as well... :)
For a detailed overview, you can refer to this ticket:
https://issues.apache.org/jira/browse/NUTCH-961
If you want to do that, I believe you should write a customized parser in Nutch, so that the data sent for indexing does not contain the menu content.
Basically, after parsing, the text data is raw text without any structure.