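iOS HTML Unicode to NSString?
I'm porting an Android app to iOS and have hit a small roadblock. I'm pulling HTML-encoded data from a web page, but some of the data is presented in Unicode to display foreign characters... so characters in Russian (Лети за мной) come through as "&#1051;&#1077;&#1090;..."
In Android I was able to get past this by calling Html.fromHtml(). Is there something similar in iOS?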
Comments (3)
It's pretty easy to write your own HTML entity decoder. Just scan the string looking for &, read up to the following ;, then interpret the results. If it's "amp", "lt", "gt", or "quot", replace it with the relevant character. If it starts with #, it's a numeric entity. If the # is followed by an "x", treat the rest as hexadecimal, otherwise as decimal. Read the number, and then insert the character into your string (if you're writing to an NSMutableString you can use [str appendFormat:@"%C", thechar]). NSScanner can make the string scanning pretty easy, especially since it already knows how to read hex numbers.

I just whipped up a function that should do this for you. Note, I haven't actually tested this, so you should run it through its paces:
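A minimal sketch of the scanning approach described above, written for illustration rather than taken from the answer. The function names (StringFromCodePoint, DecodeEntityBody, DecodeHTMLEntities) are hypothetical, and the sketch only knows the four named entities listed above. Unrecognized sequences are copied through unchanged, and code points above U+FFFF are written as a surrogate pair because %C only takes a single unichar.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turn a Unicode code point into an NSString.
// Returns nil for 0, surrogate values, and anything above U+10FFFF.
static NSString *StringFromCodePoint(uint32_t cp) {
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return nil;
    }
    if (cp <= 0xFFFF) {
        unichar c = (unichar)cp;
        return [NSString stringWithCharacters:&c length:1];
    }
    // Outside the BMP: encode as a UTF-16 surrogate pair.
    cp -= 0x10000;
    unichar pair[2] = { (unichar)(0xD800 + (cp >> 10)), (unichar)(0xDC00 + (cp & 0x3FF)) };
    return [NSString stringWithCharacters:pair length:2];
}

// Hypothetical helper: decode the text between '&' and ';'.
// Returns nil when the body is not something this sketch understands.
static NSString *DecodeEntityBody(NSString *body) {
    NSDictionary<NSString *, NSString *> *named =
        @{ @"amp": @"&", @"lt": @"<", @"gt": @">", @"quot": @"\"" };
    if (![body hasPrefix:@"#"]) {
        return named[body];   // nil for unknown names
    }
    NSScanner *num = [NSScanner scannerWithString:[body substringFromIndex:1]];
    num.charactersToBeSkipped = nil;
    unsigned int value = 0;
    BOOL ok = NO;
    // NSScanner is case-insensitive by default, so this matches "x" and "X".
    if ([num scanString:@"x" intoString:NULL]) {
        ok = [num scanHexInt:&value];
    } else {
        int decimal = 0;
        ok = [num scanInt:&decimal] && decimal > 0;
        value = (unsigned int)decimal;
    }
    // Require that the whole body was consumed, e.g. reject "#12abc".
    return (ok && num.isAtEnd) ? StringFromCodePoint(value) : nil;
}

// Hypothetical entry point: decode numeric character references and the
// four named entities; everything else is copied through as-is.
static NSString *DecodeHTMLEntities(NSString *input) {
    NSMutableString *result = [NSMutableString stringWithCapacity:input.length];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    scanner.charactersToBeSkipped = nil;   // whitespace is significant here

    while (!scanner.isAtEnd) {
        NSString *text = nil;
        if ([scanner scanUpToString:@"&" intoString:&text]) {
            [result appendString:text];    // plain text before the next '&'
        }
        if (![scanner scanString:@"&" intoString:NULL]) {
            break;                         // no more '&' in the input
        }
        NSUInteger afterAmp = scanner.scanLocation;
        NSString *body = nil;
        NSString *decoded = nil;
        if ([scanner scanUpToString:@";" intoString:&body] &&
            [scanner scanString:@";" intoString:NULL]) {
            decoded = DecodeEntityBody(body);
        }
        if (decoded != nil) {
            [result appendString:decoded];
        } else {
            // Not a recognized reference: keep the '&' and rescan after it.
            [result appendString:@"&"];
            scanner.scanLocation = afterAmp;
        }
    }
    return result;
}
```

Calling DecodeHTMLEntities(@"&#1051;&#1077;&#1090;&#1080;") is intended to give back the Cyrillic text, while a bare ampersand such as the one in "AT&T" is left alone.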
The &#(number); construct in HTML (and XML) is known as a character reference. It's not Unicode-specific, other than in that all characters in HTML are defined in terms of Unicode, whether included verbatim or encoded as a character or entity reference. (Entity references are the named ones that look like &eacute; or &amp;, and if you are scraping an HTML page you will certainly have to deal with those as well.)

There isn't a function in the standard library for decoding character or entity references. See this question for approaches to decoding HTML text content. If you only have character references and the standard XML entities like &amp;, you can get away with leveraging NSXMLParser to parse an <element>+yourstring+</element>, but this won't handle HTML-specific entities like &eacute;.

In general, screen-scraping is best done using a proper HTML parser, rather than string-hacking. This will convert all text content into text nodes, converting the character and entity references as it goes. However, again, there is no HTML parser available in the standard library. If the target page is well-formed standalone XHTML you can again use NSXMLParser. Otherwise you might like to try libxml2, which offers an HTML parser as well as XML. See this question for some background.
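The NSXMLParser trick mentioned in the answer above could look roughly like the sketch below. It assumes ARC, and both the TextCollector class and the DecodeXMLReferences function are hypothetical names introduced for illustration. The string is wrapped in a dummy element and the parser's character callbacks are concatenated; if the input is not well-formed XML, the function returns nil.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate that accumulates the text NSXMLParser reports.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
// Called (possibly several times) with decoded character data.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

// Hypothetical helper: decode character references and the predefined XML
// entities by parsing "<element>fragment</element>". Returns nil on a
// parse error (for example, malformed markup in the fragment).
static NSString *DecodeXMLReferences(NSString *fragment) {
    NSString *wrapped = [NSString stringWithFormat:@"<element>%@</element>", fragment];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];

    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TextCollector *collector = [[TextCollector alloc] init];
    collector.text = [NSMutableString string];
    parser.delegate = collector;   // the delegate is not retained by the parser

    BOOL ok = [parser parse];      // synchronous; callbacks fire during this call
    return ok ? [collector.text copy] : nil;
}
```

Because the fragment is parsed as XML, any tags inside it must be well-formed too, which is one reason a real HTML parser is the more robust choice for scraped pages.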
If you get data from a website you will have an NS(Mutable)Data object as your receiving buffer. You just have to transform that NSData into an NSString via:
NSString *myString = [[NSString alloc] initWithData:myRecvData encoding:NSUnicodeStringEncoding];
if your server is sending in Unicode (NSUnicodeStringEncoding means UTF-16). If your server is sending UTF-8 or something else, then you have to adjust the string encoding in your receiving code as well.
Here is a list of all supported string encoding types.
Edit:
Take a look at this SO thread.
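As a small illustration of the conversion above, the sketch below tries the encoding the caller believes the server used and falls back to ISO Latin 1 if the bytes are not valid in that encoding. The function name StringFromResponseData and the fallback policy are assumptions made for this example, not part of the answer. Note that this only turns bytes into text; character references such as &#1051; still need to be decoded separately.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decode a received buffer with the declared encoding,
// falling back to ISO Latin 1 when the bytes are invalid for that encoding.
static NSString *StringFromResponseData(NSData *data, NSStringEncoding declaredEncoding) {
    // initWithData:encoding: returns nil if the bytes cannot be decoded.
    NSString *string = [[NSString alloc] initWithData:data encoding:declaredEncoding];
    if (string == nil) {
        // Latin 1 maps every byte to a character, so this acts as a last resort.
        string = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    }
    return string;
}
```

For a server that sends UTF-8, the call would be StringFromResponseData(myRecvData, NSUTF8StringEncoding), where myRecvData is the buffer from the answer above.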