C: Converting special ASCII characters ÄÖÜ
I'm reading text from a website with Curl. All the raw data is returned character by character with
return memEof(mp) ? EOF : (int)(*(unsigned char *)(mp->readptr++));
My problem is that all the special characters, such as ÄÖÜäöüß, come out wrong and look very cryptic. I'm currently correcting them manually by adjusting their values using this table:
http://www.barcoderesource.com/barcodeasciicharacters.shtml
I was wondering whether there is a more elegant way to do this, and how others approach these kinds of issues.
Comments (2)
This is an encoding issue. If you read data byte by byte, you can correctly and easily handle only single-byte encodings (like the ISO-8859 "family" and many more), provided you have a way to convert them correctly to a target encoding, if you need to. With encodings like UTF-8 you are less lucky, since to get the right code point you need to read 1 byte, or maybe 2, or maybe 3... If you stream the bytes into a string and print the string as a whole, and the output device's expected encoding is the same as the input encoding, you get the right characters shown anyway.
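To make the "1 byte, or maybe 2, or maybe 3" point concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence tells you how many bytes belong to the character. The function name `utf8_seq_len` is made up for illustration, and the check is deliberately loose: it does not reject overlong or out-of-range forms.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes in the UTF-8 sequence
   that starts with `lead`, or 0 if `lead` cannot start a sequence
   (for example, a continuation byte of the form 10xxxxxx).
   Loose check only: overlong and out-of-range forms are not rejected. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                 /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;                 /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;                 /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;                 /* 11110xxx */
    return 0;                     /* continuation or invalid lead byte */
}
```

A reader that hands out one byte at a time could use the returned length to decide how many more bytes to collect before treating them as one character.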
If that does not happen, and you are sure you are not printing each byte as if it were a symbol of its own, then the output device's expected encoding does not match the one the string is written in.
If the output looks OK once you print the string as a whole, then the problem is that you are interpreting each byte as a single character when it is not one (you have a multibyte encoding for characters like the special ones you mentioned; it is likely UTF-8, but it might not be).
If you get the same results in both cases (when you print each byte one by one and when you output a string that keeps the whole byte sequence), then the output device's expected encoding is a single-byte encoding, like the input encoding, but the two do not match.
To say more, I would need to know how you collect the bytes you read before printing them and concluding that they look cryptic.
An example.
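The example listing that belonged here is not preserved on this page. What follows is only a sketch of the kind of test being described, assuming the text èòà stored as UTF-8 bytes; the names `SAMPLE_UTF8`, `print_bytewise` and `print_whole` are made up for illustration.

```c
#include <stdio.h>

/* Assumed sample: "èòà" encoded as UTF-8, two bytes per letter. */
const char SAMPLE_UTF8[] = "\xc3\xa8\xc3\xb2\xc3\xa0";

/* Writes every byte of s on its own line, i.e. treats each byte
   as if it were a complete character. */
void print_bytewise(FILE *out, const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++)
        fprintf(out, "%c\n", s[i]);
}

/* Writes the byte sequence unbroken, so multibyte characters
   reach the output device intact. */
void print_whole(FILE *out, const char *s)
{
    fprintf(out, "%s\n", s);
}
```

Calling `print_bytewise(stdout, SAMPLE_UTF8)` followed by `print_whole(stdout, SAMPLE_UTF8)` gives the two outputs to compare.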
You get different results if the output device encoding is UTF-8; if it is a single-byte encoding, you get the same output (newlines aside), but it is "wrong" with respect to what I wrote, i.e. èòà.
The "same" text in Latin1 is "\xe8\xf2\xe0". Latin1 is a single-byte encoding, so the discussion above applies. If printed on a terminal that expects UTF-8, you can get something like �� ...
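If the input does turn out to be Latin1 and the output device expects UTF-8, the conversion is simple enough to do by hand, because the Latin1 byte values coincide with the first 256 Unicode code points. A minimal sketch, where `latin1_to_utf8` is a made-up name and the caller is assumed to supply a large enough buffer:

```c
#include <stddef.h>

/* Hypothetical helper: converts n Latin1 bytes from `in` to UTF-8 in `out`.
   Assumption: `out` has room for 2 * n + 1 bytes.
   Returns the number of bytes written, not counting the terminating NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = c;                                   /* ASCII stays one byte */
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte: 110000xx */
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation: 10xxxxxx */
        }
    }
    out[o] = '\0';
    return o;
}
```

This only covers Latin1 (ISO-8859-1); other single-byte encodings need a lookup table or a conversion library.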
So encodings matter, the output encoding of the device or format matters too, and you must be aware of both in order to handle and display text properly. (As for formats, an example is HTML, where you can specify the encoding of the content; you must be consistent, and then everything displays fine.)
I guess you have to use an external library like iconv to create a wchar_t string which contains the data. This depends on the character encoding used.
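As a rough illustration of the iconv route, here is a sketch using the POSIX `iconv_open`, `iconv` and `iconv_close` calls. Note that it converts to UTF-8 rather than to a wchar_t string, since UTF-8 is a target name that iconv implementations commonly accept. The wrapper name `convert_to_utf8` and its parameters are made up for illustration, and the source encoding is something you have to know or detect yourself.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: converts inlen bytes of text in encoding `fromcode`
   (e.g. "ISO-8859-1") to UTF-8, writing a NUL-terminated result to `out`.
   Returns the number of bytes written (excluding the NUL),
   or (size_t)-1 on any failure, including a too-small output buffer. */
size_t convert_to_utf8(const char *fromcode, const char *in, size_t inlen,
                       char *out, size_t outsize)
{
    if (outsize == 0)
        return (size_t)-1;

    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* encoding name not supported */

    char *inptr = (char *)in;           /* iconv() takes char **, input is only read */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outsize - 1;       /* keep one byte for the NUL */

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;              /* invalid input or buffer too small */

    *outptr = '\0';
    return (size_t)(outptr - out);
}
```

On some systems iconv lives in a separate library and the program has to be linked with `-liconv`.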