Tagsoup cannot parse an HTML document from a StringReader (Java)
I have this function:
private Node getDOM(String str) throws SearchEngineException {
    DOMResult result = new DOMResult();
    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(new StringReader(str))), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }
    return result.getNode();
}
It takes a String containing the HTML document sent by the HTTP server after a POST request, but fails to parse it properly - I only get about four nodes from the entire document. The string itself looks fine - if I print it out and copy-paste it into a text document, I see the page I expected.
When I use an overloaded version of the above method:
private Node getDOM(URL url) throws SearchEngineException {
    DOMResult result = new DOMResult();
    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }
    return result.getNode();
}
then everything works just fine - I get a proper DOM tree, but I need to somehow retrieve the POST response from the server.
Storing the string in a file and reading it back does not work - still getting the same results.
What could be the problem?
3 Answers
Is it maybe a problem with the xml encoding?
This seems like an encoding problem. In the code example of yours that doesn't work, you're passing the url as a string into the constructor, which uses it as the systemId, and you get problems with Tagsoup parsing the HTML. In the example that works you're passing the stream into the InputSource constructor. The difference is that when you pass in the stream, the SAX implementation can figure out the encoding from the stream.
If you want to test this you could try these steps:
There's a discussion of this toward the end of an article on using the SAX InputSource.
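For example, here is a minimal sketch of such a test, assuming the page is UTF-8 encoded (the charset, the byte round-trip, and the helper name are assumptions, not part of the original answer): hand Tagsoup the raw bytes and declare the encoding explicitly on the InputSource, so nothing has to be guessed from a plain String.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;

import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical test helper: parse the HTML from raw bytes with an explicit charset
// instead of from a StringReader, to check whether the encoding is what breaks parsing.
private static Node getDomFromBytes(String html) throws Exception {
    XMLReader reader = new Parser();
    reader.setFeature(Parser.namespacesFeature, false);
    reader.setFeature(Parser.namespacePrefixesFeature, false);

    // Wrap the bytes rather than the String, and declare the charset on the InputSource.
    InputSource source = new InputSource(
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
    source.setEncoding("UTF-8"); // assumed charset; use whatever the server actually sends

    DOMResult result = new DOMResult();
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(new SAXSource(reader, source), result);
    return result.getNode();
}

If this version parses the full page while the StringReader version does not, that points at the encoding as the culprit.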
To get a POST response you first need to do a POST request;
new InputSource(url.openStream())
probably opens a connection and reads the response from a GET request. Check out Sending a POST Request Using a URL. Other approaches for doing POST requests and getting the response might also be interesting to check out.
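For example, here is a minimal sketch assuming java.net.HttpURLConnection and a url-encoded form body (the helper name, form parameters, and charset are placeholders, not from the original answer): send the POST yourself, then wrap the response stream in an InputSource just like the working overload does with url.openStream().

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: perform the POST and return the response body as a stream,
// which can then be wrapped in an InputSource exactly like the URL-based overload does.
private static InputStream postForStream(URL url, String formData) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true); // we are sending a request body
    conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

    // Write the url-encoded form parameters, e.g. "query=foo&page=1" (placeholder values).
    try (OutputStream out = conn.getOutputStream()) {
        out.write(formData.getBytes(StandardCharsets.UTF_8));
    }

    return conn.getInputStream(); // body of the POST response
}

// Usage sketch: new InputSource(postForStream(url, "query=foo")) can then be fed to the
// same SAXSource/Transformer code as in the working getDOM(URL) overload.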