I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, there are several menu structures across the site that get indexed and spoil the results of a query.
Each of these menus is clearly defined in a DIV, for example <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>,
and several others.
I need to, at some point, delete the content in these DIVs.
I am guessing that the right place is during indexing by Solr, but I cannot work out how.
A pattern would look something like (<div id="calendar">).*?(<\/div>)
but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />
and I am not really sure where to put it in schema.xml.
When I do put that pattern in schema.xml, the file does not parse.
Comments (1)
Have you looked at the different HTML tokenizers available within Solr?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory
They should help you resolve this issue. You should not index the HTML tags themselves. However, if you need to uniquely identify certain tags, you will need to create individual fields and store the contents of those special tags in those fields.
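
As a rough illustration of where such a pattern could live in schema.xml, here is a minimal sketch, not a tested configuration. It assumes a Solr release that ships solr.PatternReplaceCharFilterFactory and solr.HTMLStripCharFilterFactory (as far as I know, newer releases favour char filters over the HTML-stripping tokenizers linked above). The field type name text_no_menus is made up for the example; the div id comes from the question. The schema most likely fails to parse because the pattern contains raw " and < characters inside an XML attribute, so here they are written as &quot; and &lt;.

```xml
<!-- Hypothetical field type; only the div id is taken from the question. -->
<fieldType name="text_no_menus" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Remove the whole calendar div before tokenizing.
         (?s) lets . match across line breaks. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;div id=&quot;calendar&quot;&gt;.*?&lt;/div&gt;"
                replacement=" "/>
    <!-- Then strip whatever HTML markup is left. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A few caveats with this approach: the non-greedy .*? stops at the first </div>, so a menu that contains nested divs would only be partly removed; char filters change what gets indexed, not the stored value of the field; and if Nutch has already stripped the markup before the text reaches Solr, the pattern will have nothing to match, in which case the menus would need to be dropped on the Nutch side instead.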