
Does Java have a library similar to lxml or Nokogiri?

Posted on 2024-08-18 21:24:22


淡紫姑娘！ 2024-08-25 21:24:22


There are dozens of screen-scraping libraries written in Java. To cite just a few:

TagSoup - a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
Jericho HTML Parser - Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions. It is neither an event-based nor a tree-based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched for the relevant characters of each search operation.
HTML Cleaner - HtmlCleaner reorders individual elements and produces well-formed XML from dirty HTML. It follows rules similar to those that most web browsers use to create the document object model. A user may provide a custom tag and rule set for tag filtering and balancing.
NekoHTML - NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
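
To make the "SAX interface" point from the TagSoup entry concrete, here is a minimal sketch of a SAX handler that collects link targets. It uses only the JDK's own XML parser, so it only works on well-formed markup; the class and method names (SaxLinkExtractor, LinkHandler, extractLinks) are made up for this illustration. The assumption, noted in the code, is that the same handler could be driven by TagSoup's XMLReader to cope with messy real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class SaxLinkExtractor {

    /** SAX handler that records the href of every <a> element it sees. */
    static final class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // qName is checked because the JDK parser is not namespace-aware by default.
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses the markup with a standard SAX XMLReader and returns all link targets. */
    static List<String> extractLinks(String markup) {
        try {
            // JDK parser: requires well-formed input.
            // Assumption: with TagSoup on the classpath, this line could instead be
            //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
            // and the rest of the method would stay the same.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LinkHandler handler = new LinkHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return handler.links;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p><a href=\"https://example.org/a\">A</a></p>"
                + "<p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(extractLinks(page));
    }
}
```

The handler knows nothing about which parser feeds it events, which is the practical benefit of a SAX-compatible HTML parser: existing XML tooling can be reused unchanged.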

There are many more listed at HTML Screen Scraping Tools written in Java, but these are, IMO, the best at dealing with any kind of content (read: all kinds of crap), as I mentioned in this previous answer. That might not be an issue for you, though.

Just in case, maybe check out the thread Nokogiri pure Java status.

Update: a new project, jsoup, was released on 2010-01-31; it offers a selector syntax for finding elements. See its website for more details and/or this answer from its author.


×纯※雪 2024-08-25 21:24:22

You can use Hpricot via JRuby. For details, see this question.


About the author

素衣风尘叹
