TagSoup fails to parse an HTML document from a StringReader (Java)

Posted 2024-08-22 17:33:05 · 1,893 characters · 10 views · 0 comments


I have this function:

private Node getDOM(String str) throws SearchEngineException {
        // Needs: org.ccil.cowan.tagsoup.Parser, org.xml.sax.*, javax.xml.transform.*
        DOMResult result = new DOMResult();
        try {
                XMLReader reader = new Parser(); // TagSoup's HTML-tolerant SAX parser
                reader.setFeature(Parser.namespacesFeature, false);
                reader.setFeature(Parser.namespacePrefixesFeature, false);
                // Identity transform: pipes the SAX events into the DOMResult
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                transformer.transform(new SAXSource(reader, new InputSource(new StringReader(str))), result);
        } catch (Exception ex) {
                throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
        }
        return result.getNode();
}

It takes a String containing the HTML document sent by the HTTP server after a POST request, but fails to parse it properly - I only get about four nodes from the entire document. The string itself looks fine - if I print it out and copy-paste it into a text document, I see the page I expected.

When I use an overloaded version of the above method:

private Node getDOM(URL url) throws SearchEngineException {
        // Same setup as above, but the InputSource wraps the URL's raw byte
        // stream, so the SAX implementation can detect the encoding itself.
        DOMResult result = new DOMResult();
        try {
                XMLReader reader = new Parser();
                reader.setFeature(Parser.namespacesFeature, false);
                reader.setFeature(Parser.namespacePrefixesFeature, false);
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
        } catch (Exception ex) {
                throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
        }
        return result.getNode();
}

then everything works just fine - I get a proper DOM tree - but I need to somehow retrieve the POST response from the server.

Storing the string in a file and reading it back does not work - I still get the same results.

What could be the problem?


Comments (3)

在梵高的星空下 2024-08-29 17:33:05

Is it maybe a problem with the XML encoding?

2024-08-29 17:33:05

This seems like an encoding problem. In your code example that doesn't work, you pass the URL as a string into the constructor, which uses it as the systemId, and TagSoup has problems parsing the HTML. In the example that works, you pass the stream into the InputSource constructor. The difference is that when you pass in the stream, the SAX implementation can figure out the encoding from the stream itself.

If you want to test this you could try these steps:

  • Stream the HTML you're parsing through a java.io.InputStreamReader and call getEncoding on it to see which encoding it reports.
  • In your first example, call setEncoding on the InputSource, passing the encoding that the InputStreamReader reported.
  • See whether the first example, changed to explicitly set the encoding, now parses the HTML correctly.
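The first two steps can be sketched in plain JDK code like this (a byte array stands in for the HTTP stream here, and the sample markup is made up for the demo; note that without an explicit charset, InputStreamReader falls back to the platform default):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import org.xml.sax.InputSource;

public class EncodingCheck {

    /** Reports the charset an InputStreamReader actually decodes the bytes with. */
    public static String detectEncoding(byte[] bytes) throws Exception {
        try (InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(bytes))) {
            // getEncoding() returns the historical name of the charset in use
            // (the platform default, since none was passed to the constructor).
            return reader.getEncoding();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] html = "<html><body>ok</body></html>".getBytes(StandardCharsets.UTF_8);
        String encoding = detectEncoding(html);
        System.out.println("reader encoding: " + encoding);

        // Declare that encoding on the InputSource before parsing the String,
        // so the parser does not have to guess.
        InputSource source = new InputSource(new StringReader("<html/>"));
        source.setEncoding(encoding);
        System.out.println("source encoding: " + source.getEncoding());
    }
}
```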

There's a discussion of this toward the end of an article on using the SAX InputSource.

金橙橙 2024-08-29 17:33:05

To get a POST response you first need to send a POST request; new InputSource(url.openStream()) probably opens a connection and reads the response of a GET request. Check out Sending a POST Request Using a URL.

Other possibilities that might be worth checking out for making POST requests and reading the response:
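One minimal sketch with plain java.net.HttpURLConnection follows; the local HttpServer below is a throwaway stand-in for the real endpoint, and the form body is made up. The returned String is what you would then hand to getDOM(String):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpServer;

public class PostExample {

    /** POSTs a form-encoded body to the given URL and returns the response body. */
    public static String post(String urlStr, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // switches the connection to a request with a body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny local server standing in for the real endpoint.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            byte[] reply = "<html><body>POST ok</body></html>".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, reply.length);
            exchange.getResponseBody().write(reply);
            exchange.close();
        });
        server.start();
        String html = post("http://localhost:" + server.getAddress().getPort() + "/", "q=test");
        System.out.println(html); // this String can then be passed to getDOM(String)
        server.stop(0);
    }
}
```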
