Make Python replace unencodable characters with a string by default

Posted on 2024-08-15 05:23:27

I want Python to ignore characters it can't encode by simply replacing them with the string "<could not encode>".

E.g., assuming the default encoding is ASCII, the command

'%s is the word'%'ébác'

would yield

'<could not encode>b<could not encode>c is the word'

Is there any way to make this the default behavior across my whole project?

趁年轻赶紧闹 2024-08-22 05:23:27

The str.encode function takes an optional argument defining the error handling:

str.encode([encoding[, errors]])

From the docs:

Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.

In your case, the codecs.register_error function might be of interest.

[Note about bad chars]

By the way, note that when using register_error you will likely end up replacing not just individual bad characters but whole groups of consecutive bad characters with your string, unless you are careful: the error handler is called once per run of bad characters, not once per character.
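
To illustrate the point about runs, here is a minimal Python 3 sketch of such a handler. The handler name "placeholder" and the function name are made up for this example. It emits one marker per bad character by repeating the marker for the length of the run it receives, so it does not matter whether the codec reports bad characters singly or in groups.

```python
import codecs

PLACEHOLDER = "<could not encode>"

def replace_with_placeholder(exc):
    # Only handle encoding failures; re-raise anything else.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object[exc.start:exc.end] is the run of characters that failed.
    bad_run = exc.object[exc.start:exc.end]
    # One placeholder per bad character, then resume after the run.
    return (PLACEHOLDER * len(bad_run), exc.end)

# "placeholder" is a hypothetical name chosen for this sketch.
codecs.register_error("placeholder", replace_with_placeholder)

encoded = "\xe9b\xe1c is the word".encode("ascii", errors="placeholder")
print(encoded)  # b'<could not encode>b<could not encode>c is the word'
```

The handler still has to be named in each encode call (or wherever an errors argument is accepted); registering it does not by itself change the default 'strict' behavior.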


太傻旳人生 2024-08-22 05:23:27

>>> help("".encode)
Help on built-in function encode:

encode(...)
S.encode([encoding[,errors]]) -> object

Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.

So, for instance (in Python 2, where str has a decode method):

>>> x
'\xc3\xa9b\xc3\xa1c is the word'
>>> x.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> x.decode("ascii", "replace")
u'\ufffd\ufffdb\ufffd\ufffdc is the word'

Add your own callback to codecs.register_error to replace with the string of your choice.
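
As a rough Python 3 counterpart to the session above, a decode-side callback could look like the sketch below. The handler name "decode_marker" and the marker text are assumptions made for illustration, not part of the original answer.

```python
import codecs

MARKER = "<could not decode>"

def mark_undecodable(exc):
    # Only handle decoding failures; re-raise anything else.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # One marker per undecodable byte, then resume after the bad range.
    return (MARKER * (exc.end - exc.start), exc.end)

# "decode_marker" is a hypothetical name chosen for this sketch.
codecs.register_error("decode_marker", mark_undecodable)

raw = b"\xc3\xa9b\xc3\xa1c is the word"  # UTF-8 bytes, decoded as ASCII on purpose
text = raw.decode("ascii", errors="decode_marker")
print(text)
```

Each of the four non-ASCII bytes is replaced by the marker, mirroring the four U+FFFD characters that 'replace' produces above.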


打小就很酷 2024-08-22 05:23:27

A minimal example for codecs.register_error, based on this answer (which is more verbose):

#!/usr/bin/env python3

import codecs

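def some_handler(exception):
    # Return the replacement and the index at which encoding should resume.
    return (b"-", exception.end)

# Make the handler available under the name "some_handler".
codecs.register_error("some_handler", some_handler)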

s = 'A\uff1aB' # \uff1a = Fullwidth Colon
_bytes = s.encode("latin1", errors="some_handler")
print(repr(_bytes)) # b'A-B'
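
One way to get closer to a project-wide default is to attach a registered handler to a text stream, so that every write through that stream uses it. The following is a sketch under that assumption. The name "dash_demo" is hypothetical, and an in-memory io.BytesIO stands in for a real file or socket.

```python
import codecs
import io

def dash_handler(exc):
    # Only handle encoding failures; re-raise anything else.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # One dash per unencodable character, then resume after the run.
    return ("-" * (exc.end - exc.start), exc.end)

# "dash_demo" is a hypothetical name chosen for this sketch.
codecs.register_error("dash_demo", dash_handler)

raw = io.BytesIO()  # stand-in for a real binary file
stream = io.TextIOWrapper(raw, encoding="latin1", errors="dash_demo")
stream.write("A\uff1aB")  # \uff1a = Fullwidth Colon, not encodable in latin1
stream.flush()

written = raw.getvalue()
print(written)  # b'A-B'
```

The built-in open() accepts the same errors argument, so files opened in text mode can use a registered handler in the same way.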
