Parsing HTML linearly in Java

Posted 2024-12-20 07:02:32

OK, what I'm trying to do is parse something like <p>Hello <b>World</b> <u>Underlined</u></p> linearly.

I've searched through the net, and it always gives me a DOM parser, which is not what I need.

I need an output which would be a tree like this:

                            p
                           / \ \
                      Hello   b  u
                              |   \
                           World  underlined

Basically, I need to parse HTML in Java and interpret it (display a GUI from the HTML tags).

Thanks.

牵强ㄟ 2024-12-27 07:02:32

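You need to explain what you mean by "linearly". If you want to parse the document and get at every tag, a DOM parser is one option. For large HTML files, though, a DOM parser is slow, because it builds the whole tree in memory before you can use it. Alternatively, you can use a SAX parser and build your own tree from its events. Some DOM parsers are themselves built on top of a SAX parser internally.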
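As a rough illustration of the DOM option, here is a minimal sketch that uses the DOM parser bundled with the JDK and walks the resulting tree. The class and method names (`DomWalk`, `walk`, `outline`) are made up for this example, and it assumes the input is well-formed XML, which real-world HTML often is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: parses well-formed markup into a DOM and prints an outline.
class DomWalk {
    // Appends two spaces per nesting level.
    private static void indent(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) out.append("  ");
    }

    // Appends one line per element or non-blank text node, indented by depth.
    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            indent(out, depth);
            out.append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                indent(out, depth);
                out.append(text).append('\n');
            }
        }
        NodeList kids = node.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            walk(kids.item(i), depth + 1, out);
        }
    }

    // Parses the whole document first (the DOM is fully built), then walks it.
    static String outline(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline("<p>Hello <b>World</b> <u>Underlined</u></p>"));
    }
}
```

The whole tree exists in memory before `walk` runs, which is the cost mentioned above for large files.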


风铃鹿 2024-12-27 07:02:32

Using a SAX parser is probably what you want, because it reads through an input document from start to finish, calling your code for each node it finds.

SAX is an XML parsing API, and most HTML isn't well-formed XML. So to use SAX on HTML you need a parser that can handle imperfect input. I suggest TagSoup, which is free and Apache-licensed.

Then you write a class which implements org.xml.sax.ContentHandler. That class will receive callbacks from the SAX parser as the parser reads through your input. For the sample input, the parser should call methods in your code like this:

startElement("p")
characters("Hello")
startElement("b")
characters("World")
endElement("b")
startElement("u")
characters("Underlined")
endElement("u")
endElement("p")

(The actual methods on ContentHandler have a few more parameters, which I left out of the example for simplicity)
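To make that concrete, here is a minimal sketch of a handler that turns those callbacks into the kind of tree the question asks for. `TagNode` and `SaxTreeBuilder` are names invented for this example. It uses the SAX parser that ships with the JDK so it runs without extra libraries, which means it assumes well-formed input. For messy real-world HTML you would swap in TagSoup's `org.ccil.cowan.tagsoup.Parser` as the `XMLReader` (not shown here).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical tree node: an element name or a run of text, plus children.
class TagNode {
    final String label;
    final List<TagNode> children = new ArrayList<>();

    TagNode(String label) { this.label = label; }

    // Renders the subtree as nested parentheses, e.g. b(World).
    @Override
    public String toString() {
        if (children.isEmpty()) return label;
        StringBuilder sb = new StringBuilder(label).append('(');
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(children.get(i));
        }
        return sb.append(')').toString();
    }
}

// Builds a TagNode tree from SAX callbacks using a stack of open elements.
class SaxTreeBuilder extends DefaultHandler {
    private final Deque<TagNode> open = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    TagNode root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        TagNode node = new TagNode(qName);
        if (open.isEmpty()) root = node; else open.peek().children.add(node);
        open.push(node);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so buffer it until the next tag.
        text.append(ch, start, length);
    }

    // Attaches buffered, non-blank text to the innermost open element.
    private void flushText() {
        String s = text.toString().trim();
        text.setLength(0);
        if (!s.isEmpty() && !open.isEmpty()) open.peek().children.add(new TagNode(s));
    }

    static TagNode parse(String markup) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SaxTreeBuilder builder = new SaxTreeBuilder();
        reader.setContentHandler(builder);
        reader.parse(new InputSource(new StringReader(markup)));
        return builder.root;
    }

    public static void main(String[] args) throws Exception {
        // Prints: p(Hello, b(World), u(Underlined))
        System.out.println(parse("<p>Hello <b>World</b> <u>Underlined</u></p>"));
    }
}
```

The stack holds the elements that are currently open, so each new element or text run is attached to whatever is on top. A GUI builder could do the same thing but create widgets instead of `TagNode` objects.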
