How do I convert a C string (char array) into a Python string when there are non-ASCII characters in the string?

Posted 2024-07-06 17:06:26

I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?

The Python string should in general be of type unicode. For instance, a 0x93 in Windows-1252 encoded input becomes u'\u201c'.
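To make that expected mapping concrete, here is a minimal sketch in plain C (no Python involved) of what decoding a single Windows-1252 byte means. The function name `cp1252_to_codepoint` is hypothetical and not part of any library; bytes that Windows-1252 leaves unassigned are mapped to U+FFFD here, on the assumption that a "replace" style policy is wanted.

```c
#include <stdint.h>

/* Hypothetical helper: map one Windows-1252 byte to a Unicode code point.
 * Bytes 0x00-0x7F and 0xA0-0xFF keep their numeric value (as in
 * ISO 8859-1); only 0x80-0x9F need a lookup table. Unassigned bytes
 * yield U+FFFD, mirroring a "replace" error policy. */
uint32_t cp1252_to_codepoint(unsigned char byte)
{
    static const uint16_t high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };

    if (byte >= 0x80 && byte <= 0x9F)
        return high[byte - 0x80];
    return byte;
}
```

With this table, the byte 0x93 from the examples in this question maps to U+201C (left double quotation mark), which is the value the decoded Python object is expected to hold.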

I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string;

     Py_Initialize();

     py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     return 0;
}

The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.

The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *raw, *decoded;

     Py_Initialize();

     raw = PyString_FromString(c_string);
     printf("Undecoded: ");
     PyObject_Print(raw, stdout, 0);
     printf("\n");
     decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
     Py_DECREF(raw);
     printf("Decoded: ");
     PyObject_Print(decoded, stdout, 0);
     printf("\n");
     return 0;
}

可爱咩 2024-07-13 17:06:26

PyString_Decode does this:

PyObject *PyString_Decode(const char *s,
              Py_ssize_t size,
              const char *encoding,
              const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}

In other words, it does basically what you're doing in your second example: it converts the bytes to a string object, then decodes that string. The problem arises because it calls PyString_AsDecodedString rather than PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object using the default encoding (for you, that looks like ASCII). That's where it fails.
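The re-encoding step is the one that raises. As a rough plain-C model (this is not the CPython implementation, and `encode_ascii` is a hypothetical name), an ASCII encoder can only emit code points below 128, so the U+201C produced by the Windows-1252 decode is rejected at position 0:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an ASCII encoder: copy code points below 128 into
 * `out` and stop at the first one that does not fit. Returns the index of
 * the offending code point, or -1 if everything was encoded. */
long encode_ascii(const uint32_t *codepoints, size_t count, char *out)
{
    for (size_t i = 0; i < count; i++) {
        if (codepoints[i] >= 128)
            return (long)i;   /* "ordinal not in range(128)" */
        out[i] = (char)codepoints[i];
    }
    return -1;
}
```

Calling it on a one-element array holding 0x201C returns 0, which matches the "in position 0" part of the error message in the question.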

I believe you'll need to do two calls - but you can use PyString_AsDecodedObject rather than calling the python "decode" method. Something like:

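#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string, *py_unicode;

     Py_Initialize();

     /* Wrap the raw bytes in a str object (no decoding yet). */
     py_string = PyString_FromStringAndSize(c_string, 1);
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     /* Decode to a unicode object without re-encoding it afterwards. */
     py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
     Py_DECREF(py_string);
     if (!py_unicode) {
          PyErr_Print();
          return 1;
     }
     Py_DECREF(py_unicode);

     return 0;
}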

I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.

兮子 2024-07-13 17:06:26

You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?

Just use PyString_FromString:

char *cstring;
PyObject *pystring = PyString_FromString(cstring);

That's all. Now you have a Python str() object. See docs here: https://docs.python.org/2/c-api/string.html

I'm a little bit confused about how to specify "str" or "unicode." They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_Decode is a good place to start.
