Using Tika in a Nutch plugin
In Nutch, I'm implementing a plugin that fetches the content of web pages and processes it in a special way.
My main problem is that I want to convert web pages to plain text so that I can process them. I read that the Tika toolkit can do that,
so I found this code that uses Tika to parse URLs and put it in my filter method:
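public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
{
    // Metadata, Parser and ParseContext below are the org.apache.tika classes.
    byte[] raw = content.getContent();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    try {
        // Let Tika detect the type and write the body text into the handler.
        parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
    } catch (Exception e) {
        // parse() throws checked exceptions; log and hand back the result unchanged.
        LOG.error("Tika failed to parse " + content.getUrl(), e);
        return parseResult;
    }
    String plainText = handler.toString();
    LOG.info("Mime: " + metadata.get(Metadata.CONTENT_TYPE));
    LOG.info("content: " + plainText);
    return parseResult;
}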
The result of metadata.get(Metadata.CONTENT_TYPE) is text/html,
but handler.toString() is empty!
Update:
I also tried adding this line after the call to parser.parse:
LOG.info ("Status : "+ new ParseStatus().toString());
and I get this result:
Status : notparsed(0,0)
Comments (2)
Since version 1.1, Nutch includes a Tika plugin (see also NUTCH-766) that should cover your needs. I don't know if there's more comprehensive documentation available. You might want to ask the Nutch users mailing list for more details (or someone here on SO can fill in).
As Jukka Zitting said, Tika is already leveraged in Nutch. In the code that you pasted, nothing sets the metadata and ParseStatus on any Nutch-specific data structure, so you don't see the ParseStatus updated accordingly.
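To illustrate why the log line in the update prints notparsed(0,0), here is a minimal sketch using only the JDK. The Status class and the parse method are hypothetical stand-ins, not Nutch or Tika code; the sketch only assumes that a status object carries a major and a minor code that stay at zero until something sets them, which is consistent with the output shown in the question. A status created with a bare constructor knows nothing about a parse that ran on the line before it.

```java
public class ParseStatusSketch {

    // Hypothetical, simplified stand-in for a parse status: a major and a
    // minor code, both zero unless the caller sets them explicitly.
    static final class Status {
        static final String[] MAJOR_NAMES = {"notparsed", "success", "failed"};
        final int major;
        final int minor;

        // The no-argument constructor leaves both codes at their defaults.
        Status() {
            this(0, 0);
        }

        Status(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override
        public String toString() {
            return MAJOR_NAMES[major] + "(" + major + "," + minor + ")";
        }
    }

    // Hypothetical stand-in for a parse call: the outcome is only available
    // through the object it returns.
    static Status parse(String html) {
        boolean ok = html != null && !html.isEmpty();
        return ok ? new Status(1, 0) : new Status(2, 0);
    }

    public static void main(String[] args) {
        Status fromParse = parse("<html><body>hello</body></html>");
        Status fresh = new Status(); // unrelated to the call above

        System.out.println("Status from the parse call: " + fromParse);
        System.out.println("Freshly constructed status: " + fresh);
    }
}
```

In the plugin, the analogous step would be to read the status from whatever Nutch structure the parse populated, rather than constructing a new one just for logging.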