Get page content in Nutch with its original formatting
In Nutch, I'm looking for a way to get the content of a page formatted as it is (with lines, line breaks, and paragraphs).
The following code doesn't help because it removes all of the page's formatting:
    Parse parse = parseResult.get(content.getUrl());
    parse.getText();
Even
    BufferedReader br = new BufferedReader(new InputStreamReader(new
        ByteArrayInputStream(content.getContent())));
    String line;
    // Read each line exactly once; calling readLine() in both the loop
    // condition and the body would skip every other line.
    while ((line = br.readLine()) != null)
        LOG.info("After br: " + line);
is not a solution, since it returns the content with its line structure intact but still includes the HTML tags.
I really want the text in its original layout, so that I can pass it to a method that extracts the content I need.
Thanks
There is no direct way to do that.
Study
src\java\org\apache\nutch\segment\ContentAsTextInputFormat.java
and modify it to suit your needs.
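
One possible workaround, separate from the ContentAsTextInputFormat route above, is to post-process the raw HTML yourself: turn <br> and closing block-level tags into line breaks, then strip the remaining tags. The sketch below is only an illustration under stated assumptions. It uses plain JDK regular expressions rather than any Nutch API, the class name HtmlToFormattedText and the method toText are hypothetical, only a handful of entities are decoded, and whitespace inside <pre> is not preserved. A real HTML parser would be more robust on malformed pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): converts HTML to plain text
// while keeping line and paragraph breaks.
public class HtmlToFormattedText {
    // script/style elements are removed together with their bodies.
    private static final Pattern SCRIPT_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // <br> becomes a single line break.
    private static final Pattern LINE_BREAK = Pattern.compile("(?i)<br\\s*/?>");
    // Closing block-level tags become a paragraph break.
    private static final Pattern BLOCK_END =
        Pattern.compile("(?i)</(p|div|h[1-6]|li|tr|blockquote)\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    public static String toText(String html) {
        String s = SCRIPT_STYLE.matcher(html).replaceAll("");
        s = COMMENT.matcher(s).replaceAll("");
        // Newlines in the HTML source carry no layout meaning, so collapse
        // all whitespace before inserting the breaks we actually want.
        s = s.replaceAll("\\s+", " ");
        s = LINE_BREAK.matcher(s).replaceAll("\n");
        s = BLOCK_END.matcher(s).replaceAll("\n\n");
        s = ANY_TAG.matcher(s).replaceAll("");
        // Decode a few common entities; &amp; goes last so that
        // "&amp;lt;" ends up as the literal text "&lt;".
        s = s.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
             .replace("&quot;", "\"").replace("&amp;", "&");
        // Trim spaces around line breaks and cap runs of blank lines.
        s = s.replaceAll("[ \\t]*\\n[ \\t]*", "\n");
        s = s.replaceAll("\\n{3,}", "\n\n");
        return s.trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>"
            + "<p>First line<br>second line</p><p>Next &amp; last</p></body></html>";
        System.out.println(toText(html));
    }
}
```

Inside a Nutch plugin, the input string could be built from the fetched bytes, for example new String(content.getContent(), StandardCharsets.UTF_8), assuming the page is UTF-8 encoded. The result can then be passed to the extraction method.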