Convert or strip out "illegal" Unicode characters

Posted on 2024-08-26 01:19:15

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.

However, for some characters it explodes. I get complaints like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)

Is there some way I can convert the chars to proper unicode versions? Or strip them out?

Comments (2)

遥远的绿洲 2024-09-02 01:19:15

Once you have the byte string s, instead of using it as a unicode object directly, convert it explicitly with the right codec, e.g.:

u = s.decode('latin-1')

and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).

As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
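To illustrate the "decode right after input" rule, here is a minimal sketch in Python 3 syntax (where `bytes.decode` plays the role of the `str.decode` call above). The helper `to_text`, the `notes` table, and the choice of `cp1252` are all assumptions made for illustration; the real encoding of the MSSQL column has to be checked. Byte 0x97 is an em dash in cp1252 but an invisible control character in latin-1, so the choice of codec matters even when both decode without error.

```python
import sqlite3


def to_text(raw, encoding="cp1252"):
    """Hypothetical helper: decode a value read from the source database.

    `encoding` is an assumption; replace it with the encoding the
    MSSQL column actually uses.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw  # already text (or None): pass through unchanged


raw = b"price \x97 10"  # contains the byte 0x97 from the traceback

# Decode once at the boundary, then work only with text.
text = to_text(raw)

# In-memory SQLite database with a made-up table, just for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", (text,))
stored = conn.execute("SELECT body FROM notes").fetchone()[0]

print(ascii(stored))                 # cp1252 maps 0x97 to U+2014
print(ascii(raw.decode("latin-1")))  # latin-1 maps 0x97 to U+0097
```

If the decoded text looks wrong (for example, dashes turn into unprintable characters), that is usually a hint that the assumed codec is not the one the data was written with.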

不忘初心 2024-09-02 01:19:15

When you decode, just pass 'ignore' to strip those characters.

There are a few other error handlers for stripping or converting such characters:

'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd' 

'ignore': ignore malformed data and continue without further notice 

'backslashreplace': replace with backslashed escape sequences (for encoding only) 

Test

>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
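
The session above is Python 2. As a rough equivalent in Python 3, where byte strings are `bytes` objects, the same error handlers can be compared side by side. This is only a sketch; the variable names are illustrative.

```python
data = b"abcd\x97"  # same bytes as in the test above

# 'strict' is the default handler and raises on the first bad byte.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

stripped = data.decode("ascii", "ignore")   # silently drops the bad byte
replaced = data.decode("ascii", "replace")  # substitutes U+FFFD for it

print(ascii(stripped))
print(ascii(replaced))
```

Note that 'ignore' discards data without any trace, so 'replace' can be the safer choice when you later need to find out which rows were affected.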