Why is Nutch parsing application/x-javascript files?

Posted on 2024-10-02 00:35:27

I configured Nutch with the following in my conf/nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>urlfilter-regex|protocol-(http|file)|parse-(text|html|pdf|msword)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>

Note the list of parsers: only text, html, pdf and msword. Yet I just discovered some application/x-javascript files in my index. Why would that be? Is Nutch using everything in the plugins directory and disregarding my plugin.includes?

Comments (1)

岁月静好 2024-10-09 00:35:27

I use Nutch 1.1 (not trunk) to parse RSS feeds with the parse-rss plugin. Feeds are parsed only when the plugin is activated; otherwise they are ignored. So to answer your question: yes, Nutch should only be using the plugins defined in plugin.includes.
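
One way to sanity-check the configuration is to test the plugin.includes expression against plugin directory names outside of Nutch. The sketch below is only an approximation: it assumes a plugin is enabled when its whole id matches the expression, it uses Python's `re` module rather than Nutch's own Java code, and the helper `is_included` and the sample plugin names are made up for illustration. The expression is the value from the question, joined onto one line.

```python
import re

# The plugin.includes value from the question, joined onto one line.
PLUGIN_INCLUDES = (
    "urlfilter-regex|protocol-(http|file)|parse-(text|html|pdf|msword)|"
    "index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|"
    "summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Hypothetical helper.

    Assumption: a plugin counts as enabled only if its entire id
    matches the plugin.includes regular expression.
    """
    return re.fullmatch(includes, plugin_id) is not None


# Sample plugin directory names, chosen for illustration only.
candidates = ["parse-html", "parse-pdf", "parse-js", "parse-rss", "index-basic"]

enabled = [p for p in candidates if is_included(p)]
print(enabled)  # ['parse-html', 'parse-pdf', 'index-basic']
```

Under that assumption, neither parse-js nor parse-rss is selected by this expression, which fits the observation that feeds are ignored unless parse-rss is activated. It does not by itself explain how the JavaScript files reached the index.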
