NekoHTML SAX fragment parsing

Posted on 2024-12-02 13:46:09

I'm trying to parse a simple fragment of HTML with NekoHTML:

<h1>This is a basic test</h1>

To do so, I've set the specific Neko feature that should prevent any HTML, HEAD or BODY tag from triggering the startElement(..) callback.

Unfortunately, it doesn't work for me. I've certainly missed something, but I can't figure out what.

Here is some very simple code to reproduce my problem:

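 // Extends org.xml.sax.helpers.DefaultHandler so that only the callbacks of interest need to be overridden
 public static class MyContentHandler extends DefaultHandler {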

     public void characters(char[] ch, int start, int length) throws SAXException {
         String text = String.valueOf(ch, start, length);
         System.out.println(text);
     }

     public void startElement(String nameSpaceURI, String localName, String rawName, Attributes attributes) throws SAXException {
         System.out.println(rawName);
     }

     public void endElement(String nameSpaceURI, String localName, String rawName) throws SAXException {
         System.out.println("end " + localName);
     }
 }

And the main() to launch a test:

  public static void main(String[] args) throws SAXException, IOException {
       // This is org.cyberneko.html.parsers.SAXParser, not javax.xml.parsers.SAXParser
       SAXParser saxReader = new SAXParser();
       // set the feature as explained in the documentation: http://nekohtml.sourceforge.net/faq.html#fragments
       saxReader.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
       saxReader.setContentHandler(new MyContentHandler());
       // java.io.StringReader stands in here for the non-JDK StringInputStream helper
       saxReader.parse(new InputSource(new StringReader("<h1>This is a basic test</h1>")));
  }

The corresponding output:

HTML
HEAD
end HEAD
BODY
H1
This is a basic test
end H1
end BODY
end HTML

whereas I was expecting:

H1
This is a basic test
end H1

Any ideas?

Comments (1)

渔村楼浪 2024-12-09 13:46:09

I finally got it!

Actually, I was parsing my HTML string in a GWT application, to which I had added the gwt-dev.jar dependency. This jar bundles a lot of external libraries, such as xercesImpl. But the version of the embedded Xerces classes does not match the one required by NekoHTML.

As a (strange) result, it appears that the NekoHTML SAX parser didn't apply any custom feature when running against the Xerces version embedded in gwt-dev.

So I had to rework some code to remove the gwt-dev dependency, which, by the way, is not recommended as a dependency of any standard GWT project.
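
For anyone who suspects the same kind of conflict, here is a rough sketch of how one might check which jar a class is actually loaded from. The helper name ClassOrigin is made up for illustration, and the two class names passed in main() are only my assumption about which Xerces and NekoHTML classes are worth probing; adjust them to your setup.

```java
import java.security.CodeSource;

final class ClassOrigin {

    /** Describes where the named class is loaded from, without initializing it. */
    static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                // Classes that ship with the JDK itself typically report no code source
                return className + " -> (JDK runtime, no code source)";
            }
            // For classes loaded from a jar, this is the URL of that jar
            return className + " -> " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> (not loadable: " + e.getClass().getSimpleName() + ")";
        }
    }

    public static void main(String[] args) {
        // Assumed probe classes: one from Xerces, one from NekoHTML
        System.out.println(locate("org.apache.xerces.parsers.AbstractSAXParser"));
        System.out.println(locate("org.cyberneko.html.parsers.SAXParser"));
    }
}
```

If the Xerces class resolves to gwt-dev.jar rather than to a standalone xercesImpl jar, that would be consistent with the conflict described above.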
