iOS HTML Unicode to NSString?

Posted on 2024-12-07 00:37:35

I'm in the process of porting an Android app to iOS and I've hit a small roadblock. I'm pulling HTML-encoded data from a webpage, but some of the data is presented in Unicode to display foreign characters... so characters in Russian (Лети за мной) will be parsed out as "&#1051;&#1077;&#1090;..."

In Android I was able to get around this by calling Html.fromHtml(). Is there anything similar in iOS?

Comments (3)

奢华的一滴泪 2024-12-14 00:37:36

It's pretty easy to write your own HTML entity decoder. Just scan the string looking for &, read up to the following ;, then interpret the results. If it's "amp", "lt", "gt", or "quot", replace it with the relevant character. If it starts with #, it's a numeric entity. If the # is followed by an "x", treat the rest as hexadecimal, otherwise as decimal. Read the number, and then insert the character into your string (if you're writing to an NSMutableString you can use [str appendFormat:@"%C", thechar]). NSScanner can make the string scanning pretty easy, especially since it already knows how to read hex numbers.

I just whipped up a function that should do this for you. Note, I haven't actually tested this, so you should run it through its paces:

- (NSString *)stringByDecodingHTMLEntitiesInString:(NSString *)input {
    NSMutableString *results = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];
    while (![scanner isAtEnd]) {
        NSString *temp;
        if ([scanner scanUpToString:@"&" intoString:&temp]) {
            [results appendString:temp];
        }
        if ([scanner scanString:@"&" intoString:NULL]) {
            BOOL valid = YES;
            unsigned c = 0;
            NSUInteger savedLocation = [scanner scanLocation];
            if ([scanner scanString:@"#" intoString:NULL]) {
                // it's a numeric entity
                if ([scanner scanString:@"x" intoString:NULL]) {
                    // hexadecimal
                    unsigned int value;
                    if ([scanner scanHexInt:&value]) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                } else {
                    // decimal
                    int value;
                    if ([scanner scanInt:&value] && value >= 0) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                }
                if (![scanner scanString:@";" intoString:NULL]) {
                    // not ;-terminated, bail out and emit the whole entity
                    valid = NO;
                }
            } else {
                if (![scanner scanUpToString:@";" intoString:&temp]) {
                    // &; is not a valid entity
                    valid = NO;
                } else if (![scanner scanString:@";" intoString:NULL]) {
                    // there was no trailing ;
                    valid = NO;
                } else if ([temp isEqualToString:@"amp"]) {
                    c = '&';
                } else if ([temp isEqualToString:@"quot"]) {
                    c = '"';
                } else if ([temp isEqualToString:@"lt"]) {
                    c = '<';
                } else if ([temp isEqualToString:@"gt"]) {
                    c = '>';
                } else {
                    // unknown entity
                    valid = NO;
                }
            }
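            if (valid && c > 0x10FFFF) {
                // beyond the Unicode range, treat it as invalid
                valid = NO;
            }
            if (!valid) {
                // we errored, just emit the whole thing raw; savedLocation sits
                // just past the & that was already consumed, so restore it first
                [results appendString:@"&"];
                [results appendString:[input substringWithRange:NSMakeRange(savedLocation, [scanner scanLocation]-savedLocation)]];
            } else if (c > 0xFFFF) {
                // %C takes a 16-bit unichar, so emit a UTF-16 surrogate pair
                unichar pair[2] = { (unichar)(0xD800 + ((c - 0x10000) >> 10)),
                                    (unichar)(0xDC00 + ((c - 0x10000) & 0x3FF)) };
                [results appendString:[NSString stringWithCharacters:pair length:2]];
            } else {
                [results appendFormat:@"%C", (unichar)c];
            }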
        }
    }
    return results;
}
硪扪都還晓 2024-12-14 00:37:36

The &#(number); construct in HTML (and XML) is known as a character reference. It's not Unicode-specific, other than in that all characters in HTML are defined in terms of Unicode, whether included verbatim or encoded as a character or entity reference. (Entity references are the named ones that look like &eacute; or &amp;, and if you are scraping an HTML page you will certainly have to deal with those as well.)

There isn't a function in the standard library for decoding character or entity references. See this question for approaches to decoding HTML text content. If you only have character references and the standard XML entities like &amp;, you can get away with leveraging NSXMLParser to parse an <element>+yourstring+</element>, but this won't handle HTML-specific entities like &eacute;.
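A minimal sketch of the NSXMLParser approach described above, assuming ARC and that the input is well-formed once wrapped. The class name TextCollector and the function name DecodeXMLReferences are hypothetical, introduced only for illustration:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper class: accumulates the text that NSXMLParser reports.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

// May be called several times for one text node, so append each chunk.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

// Hypothetical function: wraps the input in a dummy element and lets the XML
// parser resolve character references and the predefined XML entities.
// Returns nil if the wrapped string is not well-formed XML, which includes
// any input containing HTML-only entities such as &eacute;.
static NSString *DecodeXMLReferences(NSString *input) {
    NSString *wrapped = [NSString stringWithFormat:@"<element>%@</element>", input];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TextCollector *collector = [[TextCollector alloc] init];
    parser.delegate = collector;
    return [parser parse] ? [collector.text copy] : nil;
}
```

Because the parser treats the input as markup, any stray < or bare & in the string makes the parse fail, so check for a nil result before using it.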

In general, screen-scraping is best done using a proper HTML parser, rather than string-hacking. This will convert all text content into text nodes, converting the character and entity references as it goes. However, again, there is no HTML parser available in the standard library. If the target page is well-formed standalone XHTML you can again use NSXMLParser. Otherwise you might like to try libxml2, which offers an HTML parser as well as XML. See this question for some background.
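As a rough sketch of the libxml2 route, the function below parses an HTML fragment and returns its text content with references decoded. It assumes the project links against libxml2 and has its headers on the search path; the function name PlainTextFromHTML is hypothetical:

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical helper: returns the concatenated text content of the parsed
// HTML, or nil if libxml2 could not build a document from the input.
static NSString *PlainTextFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    // RECOVER lets the parser cope with tag soup; the other two flags
    // silence diagnostics that would otherwise go to stderr.
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }
    NSString *result = nil;
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL) {
        // xmlNodeGetContent returns a newly allocated UTF-8 string that the
        // caller must release with xmlFree.
        xmlChar *content = xmlNodeGetContent(root);
        if (content != NULL) {
            result = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return result;
}
```

Unlike the XML-wrapping trick, this path goes through an HTML parser, so HTML-specific named entities are resolved along with numeric character references.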

久随 2024-12-14 00:37:36

If you get data from a website, you will have an NS(Mutable)Data object as your receiving buffer. You just have to transform that NSData into an NSString via:
NSString *myString = [[NSString alloc] initWithData:myRecvData encoding:NSUnicodeStringEncoding];
if your server is sending UTF-16 (which is what NSUnicodeStringEncoding means). If your server is sending UTF-8 or another encoding, you have to adjust the string encoding in your receiving code accordingly, for example to NSUTF8StringEncoding.
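One way to avoid hard-coding the encoding is to derive it from the charset name the server reports (for instance the textEncodingName of the NSURLResponse). The sketch below assumes ARC and falls back to UTF-8 when no usable charset is given; the function name StringFromResponseData is hypothetical:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: converts received bytes to an NSString using the IANA
// charset name reported by the server, defaulting to UTF-8 (an assumption).
// Returns nil if the bytes are not valid in the chosen encoding.
static NSString *StringFromResponseData(NSData *data, NSString *charsetName) {
    NSStringEncoding encoding = NSUTF8StringEncoding;
    if (charsetName != nil) {
        CFStringEncoding cfEncoding =
            CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charsetName);
        if (cfEncoding != kCFStringEncodingInvalidId) {
            encoding = CFStringConvertEncodingToNSStringEncoding(cfEncoding);
        }
    }
    return [[NSString alloc] initWithData:data encoding:encoding];
}
```

Note that this only decodes the bytes into characters. Character references such as &#1051; that appear in the resulting text still need a separate decoding step.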

Here is a list of all supported string encoding types.

Edit:
Take a look at this SO thread.
