Why are UTF-8 characters not rendering correctly in this web page (generated with JSoup)?
I'm having trouble dealing with charsets while parsing and rendering a page using the JSoup library. Here is an example of a page it renders:
http://dl.dropbox.com/u/13093/charset-problem.html
As you can see, wherever there should be a ' character, a ? is rendered instead (even when you view the source).
This page is generated by downloading a web page, parsing it with JSoup, and then re-rendering it after making some structural changes.
I'm downloading the page as follows:
final Document inputDoc = Jsoup.connect(sourceURL.toString()).get();
When I create the output document I do so as follows:
outputDoc.outputSettings().charset(Charset.forName("UTF-8"));
outputDoc.head().appendElement("meta").attr("charset", "UTF-8");
outputDoc.head().appendElement("meta").attr("http-equiv", "Content-Type")
.attr("content", "text/html; charset=UTF-8");
Can anyone offer suggestions as to what I'm doing wrong?
Edit: Note that the source page is http://blog.locut.us/ and, as you'll see, it appears to render correctly.
Comments (2)
Question marks are typical whenever you write characters to the response's output stream that are not covered by the response's character encoding. You seem to be relying on the platform default character encoding when serving the response. The Content-Type response header of your site confirms this: it has no charset attribute.
Assuming that you're using a servlet to serve the modified HTML, you should call HttpServletResponse#setCharacterEncoding() to set the character encoding before writing the modified HTML out.
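To illustrate where the question marks come from, here is a minimal sketch that uses only the Java standard library. It assumes the missing character is the curly apostrophe U+2019 (a guess, the question does not say which character it is) and that the platform default behaves like ISO-8859-1; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encodes the text with the given charset and decodes it again with the
    // same charset, mimicking a writer and a reader that agree on the encoding.
    static String roundTrip(String text, Charset charset) {
        // getBytes(Charset) replaces unmappable characters with the charset's
        // replacement byte, which is '?' for ISO-8859-1.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: the affected character is U+2019 (right single quotation mark).
        String text = "it\u2019s";

        // ISO-8859-1 cannot represent U+2019, so it is lost during encoding.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1).equals("it?s"));

        // UTF-8 covers every Unicode character, so the text survives intact.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
    }
}
```

In a servlet the same substitution happens inside the response writer, which is why the encoding has to be set (for example with response.setCharacterEncoding("UTF-8")) before response.getWriter() is called.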
The problem is most likely in reading the input page: you need to use the correct encoding for the source too.
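As a rough sketch of what a wrong input encoding does, the example below decodes the same bytes with two different charsets using only the Java standard library. The sample text and the hypothetical DecodeDemo class are assumptions for illustration; with JSoup itself, the Jsoup.parse(InputStream, String charsetName, String baseUri) overload is one place where the source charset can be stated explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes raw bytes using the charset the reader assumes the source is in.
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        // Assumption: the source page is served as UTF-8 and contains U+2019.
        byte[] raw = "it\u2019s".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the original text is recovered.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals("it\u2019s"));

        // Wrong charset: each of the three UTF-8 bytes (E2 80 99) becomes its
        // own character, so the decoded string is longer than the original.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```

Once the text has been decoded incorrectly, setting the output charset to UTF-8 cannot repair it, so the input and output encodings both need to be right.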