Apache Nutch: Manipulating the DOM before parsing
I want to remove specific elements from the page response before it is handed down to Nutch.
Specifically, I want to mark parts of my pages with, for example,
<div class="noindex">I shall not be indexed</div>
I want to remove them before Nutch parses the page, so that "I shall not be indexed" is not present in the NutchDocument afterwards. I plan to surround my navigation, header, and footer content with this, because right now they are present in every document in the index.
Thanks,
Paul
You have a couple of alternatives for doing that:
You can write a Nutch plugin that does this. This blog has an excellent example of writing a plugin for Nutch: http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
You can use a content extractor: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/ describes some algorithms. The best place to do this is probably also inside a Nutch plugin.
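For the plugin route, the core of the work is just a DOM walk that drops the marked elements. Below is a minimal sketch of that step using only the JDK's org.w3c.dom API. The class name NoIndexStripper and its methods are hypothetical, and the marker class "noindex" is taken from the question's example. How this is wired into Nutch is not shown: a parse-filter plugin would typically receive the page DOM from Nutch rather than parse a string itself, and whether removing nodes at that point also changes text that has already been extracted depends on the Nutch version and parser, so that needs checking against your setup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NoIndexStripper {

    // Marker class from the question's example; change it to whatever your pages use.
    public static final String MARKER_CLASS = "noindex";

    /** Removes every element below root whose class attribute contains MARKER_CLASS. */
    public static void stripNoIndex(Node root) {
        // Collect first, then remove, so child lists are not modified while being walked.
        List<Node> marked = new ArrayList<>();
        collect(root, marked);
        for (Node n : marked) {
            if (n.getParentNode() != null) {
                n.getParentNode().removeChild(n);
            }
        }
    }

    private static void collect(Node node, List<Node> marked) {
        if (node instanceof Element) {
            // getAttribute returns "" when the attribute is absent.
            String cls = ((Element) node).getAttribute("class");
            if (Arrays.asList(cls.trim().split("\\s+")).contains(MARKER_CLASS)) {
                marked.add(node);
                return; // the whole subtree goes away, no need to descend
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), marked);
        }
    }

    /** Demo helper: parses well-formed XHTML, strips marked elements, returns the remaining text. */
    public static String stripAndExtractText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        stripNoIndex(doc);
        return doc.getDocumentElement().getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class=\"noindex\">I shall not be indexed</div>"
                + "<p>Real content</p>"
                + "</body></html>";
        System.out.println(stripAndExtractText(page)); // prints "Real content"
    }
}
```

The string-parsing helper exists only to make the sketch runnable on its own; inside a plugin, stripNoIndex would be called on the DOM node that Nutch passes in, and real-world HTML would come through Nutch's HTML parser rather than an XML parser.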