iOS HTML Unicode to NSString?

Posted on 2024-12-07 00:37:35

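I'm porting an Android app to iOS and have hit a small roadblock. I'm pulling HTML-encoded data from a web page, but some of the data is presented as Unicode character references in order to display foreign characters, so Russian text (Лети за мной) gets parsed as "&#1051;&#1077;&#1090;..."

On Android I can get around this by calling Html.fromHtml(). Is there something similar on iOS?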

奢华的一滴泪 2024-12-14 00:37:36

It's pretty easy to write your own HTML entity decoder. Just scan the string looking for &, read up to the following ;, then interpret the result. If it's "amp", "lt", "gt", or "quot", replace it with the relevant character. If it starts with #, it's a numeric entity. If the # is followed by an "x", treat the rest as hexadecimal, otherwise as decimal. Read the number, and then insert the character into your string (if you're writing to an NSMutableString you can use [str appendFormat:@"%C", thechar]). NSScanner can make the string scanning pretty easy, especially since it already knows how to read hex numbers.

I just whipped up a function that should do this for you. Note, I haven't actually tested this, so you should run it through its paces:

- (NSString *)stringByDecodingHTMLEntitiesInString:(NSString *)input {
    NSMutableString *results = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];
    while (![scanner isAtEnd]) {
        NSString *temp;
        if ([scanner scanUpToString:@"&" intoString:&temp]) {
            [results appendString:temp];
        }
        if ([scanner scanString:@"&" intoString:NULL]) {
            BOOL valid = YES;
            unsigned c = 0;
            NSUInteger savedLocation = [scanner scanLocation];
            if ([scanner scanString:@"#" intoString:NULL]) {
                // it's a numeric entity
                if ([scanner scanString:@"x" intoString:NULL]) {
                    // hexadecimal
                    unsigned int value;
                    if ([scanner scanHexInt:&value]) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                } else {
                    // decimal
                    int value;
                    if ([scanner scanInt:&value] && value >= 0) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                }
                if (![scanner scanString:@";" intoString:NULL]) {
                    // not ;-terminated, bail out and emit the whole entity
                    valid = NO;
                }
            } else {
                if (![scanner scanUpToString:@";" intoString:&temp]) {
                    // &; is not a valid entity
                    valid = NO;
                } else if (![scanner scanString:@";" intoString:NULL]) {
                    // there was no trailing ;
                    valid = NO;
                } else if ([temp isEqualToString:@"amp"]) {
                    c = '&';
                } else if ([temp isEqualToString:@"quot"]) {
                    c = '"';
                } else if ([temp isEqualToString:@"lt"]) {
                    c = '<';
                } else if ([temp isEqualToString:@"gt"]) {
                    c = '>';
                } else {
                    // unknown entity
                    valid = NO;
                }
            }
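            if (!valid) {
                // we errored, just emit the whole thing raw; savedLocation was
                // recorded after the "&" was consumed, so put the "&" back first
                [results appendString:@"&"];
                [results appendString:[input substringWithRange:NSMakeRange(savedLocation, [scanner scanLocation]-savedLocation)]];
            } else {
                [results appendFormat:@"%C", (unichar)c];
            }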
        }
    }
    return results;
}
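
One thing to watch when inserting the decoded character: %C formats a single 16-bit unichar, so a numeric reference above U+FFFF (an emoji, for example) would need to be written as a UTF-16 surrogate pair instead. The following is only a sketch of that conversion; the helper name StringFromCodePoint is made up for illustration and is not part of the answer above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (name is an assumption): turns one Unicode code point
// into an NSString, or returns nil if the value is not a valid scalar value.
static NSString *StringFromCodePoint(uint32_t cp) {
    // Reject values past the Unicode range and lone surrogate code points.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return nil;
    }
    if (cp <= 0xFFFF) {
        // Basic Multilingual Plane: one UTF-16 code unit is enough.
        unichar unit = (unichar)cp;
        return [NSString stringWithCharacters:&unit length:1];
    }
    // Supplementary planes: split into a high and a low surrogate.
    uint32_t v = cp - 0x10000;
    unichar pair[2] = {
        (unichar)(0xD800 + (v >> 10)),
        (unichar)(0xDC00 + (v & 0x3FF))
    };
    return [NSString stringWithCharacters:pair length:2];
}
```

In a decoder like the one above, the result could be appended with appendString: in place of the %C format, treating a nil return as an invalid entity.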

硪扪都還晓 2024-12-14 00:37:36

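The &#(number); construct in HTML (and XML) is known as a character reference. It's not Unicode-specific, other than in that all characters in HTML are defined in terms of Unicode, whether included verbatim or encoded as character or entity references. (Entity references are the named ones that look like &eacute; or &amp;, and if you are scraping an HTML page you will certainly have to handle them as well.)

There is no function in the standard library for decoding character or entity references. See this question for approaches to decoding HTML text content. If you only have character references and the standard XML entities such as &amp;, you could leverage NSXMLParser by wrapping your string in an element and parsing the result, but this won't handle HTML-specific entities such as &eacute;.

In general, screen-scraping is best done with a proper HTML parser rather than string hacking. A parser turns all text content into text nodes, converting character and entity references along the way. However, again, there is no HTML parser available in the standard library. If the target page is well-formed standalone XHTML, you can use NSXMLParser again. Otherwise, you might want to try libxml2, which offers an HTML parser as well as an XML one. See this question for some background.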
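
The NSXMLParser route could look roughly like the sketch below. The TextCollector class, the DecodeXMLReferences function and the wrapper element name x are all assumptions made for illustration; only NSXMLParser and its delegate callback parser:foundCharacters: come from Foundation. The sketch assumes ARC and input that contains no markup of its own.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate (not from the answer): accumulates the character
// data that NSXMLParser reports, with references already resolved.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

// May be called several times for one text run, so append each piece.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

// Hypothetical function: wraps the input in a dummy <x> element, parses it,
// and returns the collected text, or nil if the parse fails.
static NSString *DecodeXMLReferences(NSString *input) {
    NSString *wrapped = [NSString stringWithFormat:@"<x>%@</x>", input];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TextCollector *collector = [[TextCollector alloc] init];
    parser.delegate = collector; // delegate is not retained; collector stays alive locally
    return [parser parse] ? [collector.text copy] : nil;
}
```

Because the input is parsed as XML, a stray < or an HTML-only entity such as &eacute; should make the parse fail, in which case this sketch returns nil and the caller would need another strategy.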

久随 2024-12-14 00:37:36

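If you fetch the data from a website, you will have an NS(Mutable)Data object as your receive buffer. You can simply convert that NSData to an NSString with:
NSString *myString = [[NSString alloc] initWithData:myRecvData encoding:NSUnicodeStringEncoding];
if your server sends Unicode (NSUnicodeStringEncoding is UTF-16). If your server sends UTF-8 or some other format, you also have to adjust the string encoding in your receive code, for example to NSUTF8StringEncoding.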
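
As a rough illustration of adjusting the encoding, the sketch below picks the NSStringEncoding from a charset name such as the one a server reports in its Content-Type header (NSURLResponse exposes it as textEncodingName). The function name StringFromData and the UTF-8 default are assumptions, not something the answer prescribes.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes received bytes using an IANA charset name
// (e.g. "utf-8", "utf-16", "windows-1251"). Falls back to UTF-8 when the
// name is missing or unknown; that default is an assumption of this sketch.
static NSString *StringFromData(NSData *data, NSString *charsetName) {
    NSStringEncoding encoding = NSUTF8StringEncoding;
    if (charsetName != nil) {
        CFStringEncoding cfEncoding =
            CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charsetName);
        if (cfEncoding != kCFStringEncodingInvalidId) {
            encoding = CFStringConvertEncodingToNSStringEncoding(cfEncoding);
        }
    }
    // Returns nil if the bytes are not valid in the chosen encoding.
    return [[NSString alloc] initWithData:data encoding:encoding];
}
```

A call might look like StringFromData(myRecvData, response.textEncodingName), where response is whatever NSURLResponse your receive code already holds. Note that this only fixes byte-to-string decoding; character references such as &#1051; in the resulting string still need to be decoded separately.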

Here is a list of all supported string encoding types.

Edit:
Have a look at this SO thread.