Unicode test strings for unit tests
I need some UTF-32 test strings to exercise some cross-platform string manipulation code. I'd like a suite of test strings that exercise the UTF-32 <-> UTF-16 <-> UTF-8 encodings, to validate that characters outside the BMP can be transformed properly from UTF-32, through UTF-16 surrogates, through UTF-8, and back again.
And I always find it a bit more elegant if the strings in question aren't just composed of random bytes, but are actually meaningful in the (various) languages they encode.
Although this isn't quite what you asked for, I've always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site offers this
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
... which gives equivalents of the English "quick brown fox" text for a variety of languages, each exercising all the characters that language uses. This page refers to a larger list of pangrams which used to be on Wikipedia, but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/
https://github.com/noct/cutf/tree/master/bin
Includes the following files:
To really test all possible conversions between formats, as opposed to character conversions (i.e. towupper(), towlower()), you should test all characters. The following loop gives you all of those. That way you can make sure you don't miss anything (i.e. a 100% complete test). This is only 1,112,064 characters, so it will run very fast on a modern computer.
Note that for basic conversions between encodings my loop above is more than enough. However, there are other features in Unicode which would require testing character pairs that behave differently when used together; that is really not necessary here.
Also, I now have a separate C++ libutf8 library to convert characters between UTF-32, UTF-16, and UTF-8. Its tests use loops as shown above, and also verify that invalid character codes are caught properly.
Hmmm
You could find a lot of incidental data by googling (and see the right column for questions like these on SO...)
However, I recommend you pretty much build your test strings as byte arrays. It is not really about 'what data'; it is about whether Unicode gets handled correctly.
E.g. you will want to make sure that identical strings in different normalized forms (i.e. even if not in canonical form) still compare equal.
You will want to check that string length detection is robust (and recognizes single-, double-, triple- and quadruple-byte characters). You will want to check that traversing a string from beginning to end honours the same logic. And you will want more targeted tests for random access of Unicode characters.
These are all things you knew, I'm sure. I'm just spelling them out to remind you that you need test data catered to exactly the edge cases, the logical properties that are intrinsic to Unicode.
Only then will you have proper test data.
Beyond this scope (technically correct Unicode handling) is actual localization (collation, charset conversion, etc.). I refer to the Turkey Test.
Here are helpful links:
You can try this one (there are some sentences in Russian, Greek, Chinese, etc. to test Unicode):
http://www.madore.org/~david/misc/unitest/