Unicode test strings for unit tests
I need some UTF-32 test strings to exercise some cross-platform string manipulation code. I'd like a suite of test strings that exercise the UTF-32 <-> UTF-16 <-> UTF-8 encodings, to validate that characters outside the BMP can be transformed properly from UTF-32, through UTF-16 surrogates, through UTF-8, and back again.
And I always find it a bit more elegant if the strings in question aren't just composed of random bytes, but are actually meaningful in the (various) languages they encode.
Although this isn't quite what you asked for, I've always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site offers this
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
... which gives equivalents of the English "quick brown fox" text for a variety of languages, each exercising all the characters that language uses. This page refers to a larger list of pangrams which used to be on Wikipedia, but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/
https://github.com/noct/cutf/tree/master/bin
Includes the following files:
To really test all possible conversions between formats, as opposed to character conversions (i.e. towupper(), towlower()), you should test all characters. The following loop gives you all of those. That way you can make sure you don't miss anything (i.e. a 100% complete test). This is only 1,112,064 characters, so it will run very fast on a modern computer.
Note that for basic conversions between encodings my loop above is more than enough. However, there are other features in Unicode which would require testing character pairs that behave differently when used together; that is really not necessary here.
Also, I now have a separate C++ libutf8 library to convert characters between UTF-32, UTF-16, and UTF-8. Its tests use loops as shown above, and also verify that invalid character codes are caught properly.
Hmmm
You could find a lot of incidental data by googling (and see the right column for questions like these on SO...)
However, I recommend you pretty much build your test strings as byte arrays. It is not really about 'what data'; it is about whether Unicode gets handled correctly.
E.g. you will want to make sure that identical strings in different normalized forms (i.e. even if not in canonical form) still compare equal.
You will want to check that string length detection is robust (and recognizes single-, double-, triple- and quadruple-byte characters). You will want to check that traversing a string from beginning to end honours the same logic. And you will want more targeted tests for random access of Unicode characters.
These are all things you knew, I'm sure. I'm just spelling them out to remind you that you need test data catered to exactly the edge cases, the logical properties that are intrinsic to Unicode.
Only then will you have proper test data.
Beyond this scope (technically correct Unicode handling) is actual localization (collation, charset conversion, etc.). I refer to the Turkey Test.
Here are helpful links:
You can try this one (there are some sentences in Russian, Greek, Chinese, etc. to test Unicode):
http://www.madore.org/~david/misc/unitest/