How do I convert a C string (char array) into a Python string when there are non-ASCII characters in the string?
I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?
The Python string should in general be of type unicode; for instance, a 0x93 in Windows-1252 encoded input becomes u'\u201c'.
I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *py_string;

    Py_Initialize();

    py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
    if (!py_string) {
        PyErr_Print();
        return 1;
    }
    return 0;
}
The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.
The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:
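#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *raw, *decoded;

    Py_Initialize();

    /* Wrap the undecoded bytes in a Python str object. */
    raw = PyString_FromString(c_string);
    if (!raw) {
        PyErr_Print();
        return 1;
    }
    printf("Undecoded: ");
    PyObject_Print(raw, stdout, 0);
    printf("\n");

    /* Equivalent to raw.decode("windows_1252") in Python. */
    decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
    Py_DECREF(raw);
    if (!decoded) {
        PyErr_Print();
        return 1;
    }
    printf("Decoded: ");
    PyObject_Print(decoded, stdout, 0);
    printf("\n");
    Py_DECREF(decoded);
    return 0;
}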
Comments (3)
PyString_Decode does this:
In other words, it does basically what you're doing in your second example: it converts to a string, then decodes the string. The problem here arises from PyString_AsDecodedString, rather than PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object into a string object using the default encoding (for you, that looks like ASCII). That's where it fails.
I believe you'll need to make two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:
I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.
You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?
Just use PyString_FromString. That's all. Now you have a Python str object. See the docs here: https://docs.python.org/2/c-api/string.html
I'm a little bit confused about how to specify "str" or "unicode." They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_DecodeString is a good place to start.
Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.