Removing menus from HTML during crawling or indexing with Nutch and Solr

Posted on 2024-10-31 15:20:30

I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, several menu structures across the site get indexed and spoil the results of a query.

Each of these menus is clearly defined in a DIV so <div id="RHBOX"> ... </div> or <div id="calendar"> ...</div> and several others.

I need to, at some point, delete the content in these DIVS.

I am guessing that the right place is during indexing by solr but cannot work out how.

A pattern would look something like (<div id="calendar">).*?(<\/div>), but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />, and I am not really sure where to put it in schema.xml.

When I do put that pattern in schema.xml, the file does not parse.
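
One likely reason the file fails to parse is that the pattern contains unescaped `"` and `<` characters inside an XML attribute value. The sketch below only illustrates that escaping issue with Python's standard library; the helper name `is_well_formed` is made up for this example and is not part of Solr.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The regex from the question, exactly as written.
pattern = r'(<div id="calendar">).*?(<\/div>)'

# Pasting it straight into the attribute: the inner quotes end the
# attribute early and the raw '<' is not allowed in attribute values.
raw = f'<tokenizer class="solr.PatternTokenizerFactory" pattern="{pattern}" />'

# Escaping &, <, > and the double quote makes the attribute legal XML.
safe_pattern = escape(pattern, {'"': "&quot;"})
escaped = f'<tokenizer class="solr.PatternTokenizerFactory" pattern="{safe_pattern}" />'


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single XML element (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(raw))      # the unescaped element is rejected
print(is_well_formed(escaped))  # the escaped element parses
print(escaped)
```

Escaping only fixes the parse error. A tokenizer pattern splits text into tokens rather than deleting matches, and, as a reply below notes, the text Nutch hands over after parsing is plain text, so the div markup may no longer be there for any Solr-side pattern to match.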

Comments (4)

Smile简单爱 2024-11-07 15:20:30

Here is a patch for SOLR that you can place in your indexing config to ignore the contents of tags you configure. It will only work with XML, though, so if you can tidy your HTML or you know that it is XHTML, then this would work, but it won't work with just any random HTML.

我乃一代侩神 2024-11-07 15:20:30

I think you have a few choices:

  1. extend the Nutch HTML parser, and add logic to strip the header out. (There might be better places to do this, like when you have the raw data but before the DOM is parsed)
  2. make your site smart enough to not draw the header when Nutch is crawling. This is pretty easy to do by just checking the User-Agent value in the request header. You might need to do a better job of seeding your crawl, since the links in the header won't be there to help Nutch find the other pages.
  3. Somehow get Solr to remove the header for the nutch data. I'm not sure how you'd do this, and I think this means you lose some of the Nutch/Solr synergies.
  4. Somehow edit the Nutch index (just a lucene index). In theory, you could just walk through all documents in the index and do a trimming on the correct property of each Document.

I would think the easiest way to do this is #2, if you have a consistent way of drawing the header (i.e. a skin or a common include). Then perhaps #1 and #4. I think #3 would be the hardest, but I might be wrong.
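
Option #2 could look roughly like the sketch below. It assumes the crawler's User-Agent string contains "nutch", which depends on how the agent name is configured for your crawl; `is_crawler` and `render_page` are hypothetical names, not part of any framework.

```python
from typing import Optional, Sequence


def is_crawler(user_agent: Optional[str], markers: Sequence[str] = ("nutch",)) -> bool:
    """Guess whether a request comes from our own crawler.

    Assumption: the crawler's User-Agent contains one of `markers`
    (case-insensitive). Adjust the markers to your configured agent name.
    """
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(marker in ua for marker in markers)


def render_page(body_html: str, menu_html: str, user_agent: Optional[str]) -> str:
    """Return the page, leaving the menu out when the crawler asks for it."""
    if is_crawler(user_agent):
        return body_html
    return menu_html + body_html


menu = '<div id="RHBOX">Home | Events</div>'
body = "<p>Real content</p>"
print(render_page(body, menu, "MyCrawler/Nutch-1.12"))
print(render_page(body, menu, "Mozilla/5.0"))
```

As the answer points out, pages reachable only through the hidden menus would then need to be added to the crawl seeds.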

只是在用心讲痛 2024-11-07 15:20:30

A new feature was introduced in Nutch 1.12: the Apache Tika parser can apply the Boilerpipe algorithm to strip header and footer content from HTML pages during the parsing stage itself.

We can use the following properties in nutch-site.xml to enable this:

<!-- parse-tika plugin properties -->
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>DefaultExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>

It's working for me. Hope it works for others as well... :)

For a detailed overview, you can refer to this ticket:
https://issues.apache.org/jira/browse/NUTCH-961

笛声青案梦长安 2024-11-07 15:20:30

If you want to do that, I believe you should write a customized parser in Nutch, so that the data sent for indexing does not contain the menu content.
Basically, after parsing, the text data is raw text without any structure.
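
The core logic of such a parser can be sketched outside Nutch. The example below is a Python illustration only, not Nutch plugin code: it walks the HTML, tracks nesting depth inside any div whose id is on a skip list, and keeps only the text outside those divs. The class and function names are invented for this sketch, and the ids are the two named in the question.

```python
from html.parser import HTMLParser


class MenuStrippingExtractor(HTMLParser):
    """Collect page text while skipping divs with the given ids (illustrative)."""

    def __init__(self, skip_ids):
        super().__init__(convert_charrefs=True)
        self.skip_ids = set(skip_ids)
        self.skip_depth = 0  # > 0 while inside a skipped div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            # Nested div inside a skipped block: go one level deeper.
            self.skip_depth += 1
        elif dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def extract_text(html, skip_ids=("RHBOX", "calendar")):
    """Return whitespace-normalized text with the listed divs removed."""
    parser = MenuStrippingExtractor(skip_ids)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = (
    '<html><body><div id="RHBOX"><div>Menu <a>Home</a></div> more</div>'
    '<p>Real content</p><div id="calendar">Oct</div>'
    "<div>Footer text</div></body></html>"
)
print(extract_text(page))
```

Counting depth matters because a non-greedy regex such as the one in the question stops at the first closing div tag, which leaves part of the menu behind whenever divs are nested. This sketch does not treat script or style content specially.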
