Convert or strip out "illegal" Unicode characters

Posted 2024-08-26 01:19:15

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.

However, it blows up on some characters. I get errors like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)

Is there some way I can convert the chars to proper unicode versions? Or strip them out?

遥远的绿洲 2024-09-02 01:19:15

Once you have the string of bytes s, instead of using it as a unicode obj directly, convert it explicitly with the right codec, e.g.:

u = s.decode('latin-1')

and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).
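
To see why the choice of codec matters, here is a small sketch in Python 3 (where the byte string is a bytes object and .decode() returns str). Byte 0x97 is the one from the traceback; the sample string and the guess of cp1252 (Windows-1252, which often turns up in data from Windows systems) are assumptions, not something the question confirms.

```python
# Sample byte string containing 0x97, the byte named in the traceback.
raw = b"price \x97 quantity"

# ASCII has no mapping for bytes above 0x7F, which is what triggers the error.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# latin-1 never fails, but it maps 0x97 to U+0097, an invisible C1 control character.
as_latin1 = raw.decode("latin-1")

# cp1252 (an assumption here) maps 0x97 to U+2014 EM DASH, which is readable text.
as_cp1252 = raw.decode("cp1252")

print(ascii_error)
print(repr(as_latin1))
print(repr(as_cp1252))
```

If the decoded text shows odd control characters where punctuation should be, that is a hint the wrong codec was picked.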

As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
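
A minimal sketch of that rule applied to the porting job, again in Python 3. The rows list is a hypothetical stand-in for whatever the MSSQL driver returns, the notes table and the to_text helper are made up for illustration, and SOURCE_ENCODING is an assumption you would replace with the real encoding of the source column.

```python
import sqlite3

# Assumption: the real encoding of the MSSQL text column goes here.
SOURCE_ENCODING = "cp1252"


def to_text(value, encoding=SOURCE_ENCODING):
    """Decode byte strings to text; pass through values that are already text."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical rows standing in for the data read from MSSQL.
rows = [(1, b"plain ascii"), (2, b"dash \x97 here")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

# Decode right after input, so only text objects reach the SQLite layer.
conn.executemany(
    "INSERT INTO notes (id, body) VALUES (?, ?)",
    [(row_id, to_text(body)) for row_id, body in rows],
)
conn.commit()

stored = [body for (body,) in conn.execute("SELECT body FROM notes ORDER BY id")]
print(stored)
```

Keeping the decode step in one helper means there is a single place to change once the actual source encoding is known.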

不忘初心 2024-09-02 01:19:15

When you decode, just pass 'ignore' to strip those characters.

There are a few other error handlers for stripping or converting them:

'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd' 

'ignore': ignore malformed data and continue without further notice 

'backslashreplace': replace with backslashed escape sequences (for encoding only)

Test

>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
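
For comparison, a sketch of the same test in Python 3, where the input is a bytes literal and the result is a plain str. As far as I know, from Python 3.5 onward 'backslashreplace' is also accepted when decoding, so the "for encoding only" note above applies to the Python 2 session shown.

```python
raw = b"abcd\x97"

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("ascii", "replace")

# 'backslashreplace' keeps the byte visible as an escape sequence (Python 3.5+).
escaped = raw.decode("ascii", "backslashreplace")

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
```

Note that 'ignore' loses data without any trace, so 'replace' or 'backslashreplace' may be a safer choice when you want to find the bad rows later.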
