HtmlCleaner fails to decode non-English characters correctly on Android
I'm using HtmlCleaner to scrape an ISO-8859-1 encoded web site in Android. I've implemented this in an external jar file that I import into my Android app.
When I run the unit tests in Eclipse, it handles the Norwegian letters (æ, ø, å) correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.
If I attach the debugger to my Android app, I can see that these letters are wrong in exactly the same places where they were correct when running the unit tests from Eclipse, so it's not a display/render/view issue in the Android app.
When I copy the text from the debuggers I get these results:
Java Process (Unit Test): «Blårek», «Benny»
Android Process (In emulator): «Bl�rek», «Benny»
I would expect these Strings to be equal, but notice how the "å" is replaced by an inverted question mark in Android.
I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure if that would have made a difference.
Here is the code I run:
HtmlCleaner htmlCleaner = new HtmlCleaner();
// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );
// navigate through some TagNodes, getting the ContentNode
ContentNode cn = rootNode...
// This String contains the incorrectly decoded characters on Android.
// Good in Oracle JVM though..
String value = cn.toString().trim();
Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.
Thanks,
Geir
Comments (1)
HtmlCleaner can't tell what encoding to use; you are passing only the body of the response in the InputStream, but the encoding is in the "content-type" header.
You can set the character encoding on the properties of the HtmlCleaner to the correct encoding from the HTTP connection. But that would require you to parse the correct parameter from the content-type header. Alternatively, you can pass a URL instance to HtmlCleaner and let it manage the connection. Then, it will have access to all the information it needs to decode properly.
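As a rough illustration of the "parse the parameter from the content-type header" route, here is a minimal sketch that uses only the JDK. The class name ContentTypeCharset and the method fromHeader are hypothetical names made up for this example, and the hard-coded header string stands in for whatever URLConnection.getContentType() returns for the real site. It also shows why the symptom appears: the single ISO-8859-1 byte for "å" is not valid UTF-8, so decoding it as UTF-8 (which I believe is Android's default charset) yields the replacement character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of HtmlCleaner.
class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback when the header is
    // null, has no charset parameter, or names a charset the JVM does not know.
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // "Bl?rek" as an ISO-8859-1 server would send it: a-ring is the single byte 0xE5.
        byte[] body = {'B', 'l', (byte) 0xE5, 'r', 'e', 'k'};

        // Assumed header value; in the real code it would come from the connection.
        Charset declared = fromHeader("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);

        String right = new String(body, declared);               // decoded with the declared charset
        String wrong = new String(body, StandardCharsets.UTF_8); // decoded with a mismatched charset

        System.out.println(right.equals("Bl\u00e5rek"));     // true: a-ring survives
        System.out.println(wrong.indexOf('\uFFFD') >= 0);     // true: replacement character appears
    }
}
```

In the scraping code, the detected charset would then have to be handed to HtmlCleaner. The versions I am aware of offer a clean(InputStream, String charset) overload as well as clean(URL), but check the API of the version you ship before relying on either.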