How do I strip extended ASCII using Python?

Posted on 2024-08-10 10:20:18

In trying to fix up a PML (Palm Markup Language) file, it appears as if my test file has non-ASCII characters which is causing MakeBook to complain. The solution would be to strip out all the non-ASCII chars in the PML.

So in attempting to fix this in python, I have

import unicodedata, fileinput

for line in fileinput.input():
    print unicodedata.normalize('NFKD', line).encode('ascii','ignore')

However, this results in an error that line must be "unicode, not str". Here's a file fragment.

\B1a\B \tintense, disordered and often destructive rage†.†.†.\t

Not quite sure how to properly pass line in to be processed at this point.

Comments (4)

夏花。依旧 2024-08-17 10:20:18

Try print line.decode('iso-8859-1').encode('ascii', 'ignore') -- that should be much closer to what you want.
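
A minimal sketch of how that one-liner might fit into the question's fileinput loop (Python 2, matching the print statement used above):

import fileinput

for line in fileinput.input():
    # Decode each byte string from ISO-8859-1, then drop anything non-ASCII.
    print line.decode('iso-8859-1').encode('ascii', 'ignore')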

不必了 2024-08-17 10:20:18

You would like to treat line as ASCII-encoded data, so the answer is to decode it to text using the ascii codec:

line.decode('ascii')

This will raise errors for data that is not in fact ASCII-encoded. This is how to ignore those errors:

line.decode('ascii', 'ignore')

This gives you text, in the form of a unicode instance. If you would rather work with (ascii-encoded) data rather than text, you may re-encode it to get back a str or bytes instance (depending on your version of Python):

line.decode('ascii', 'ignore').encode('ascii')
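
For illustration, a small self-contained round trip in Python 2, using a made-up byte string modelled on the file fragment from the question:

# Hypothetical byte string containing UTF-8 encoded daggers (the † characters
# from the file fragment in the question).
raw = 'rage\xe2\x80\xa0.\xe2\x80\xa0.\xe2\x80\xa0.'

text = raw.decode('ascii', 'ignore')   # unicode instance; non-ASCII bytes dropped
data = text.encode('ascii')            # back to a plain ASCII str
print repr(text), repr(data)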

难以启齿的温柔 2024-08-17 10:20:18

To drop non-ASCII characters use line.decode(your_file_encoding).encode('ascii', 'ignore'). But you would probably be better off using PML escape sequences for them:

import re

def escape_unicode(m):
    return '\\U%04x' % ord(m.group())

non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)

line = u'\\B1a\\B \\tintense, disordered and often destructive rage\u2020.\u2020.\u2020.\\t'
print non_ascii.sub(escape_unicode, line)

This outputs \B1a\B \tintense, disordered and often destructive rage\U2020.\U2020.\U2020.\t.

Dropping non-ASCII and control characters with regular expression is easy too (this can be safely used after escaping):

regexp = re.compile('[^\x09\x0A\x0D\x20-\x7F]')
regexp.sub('', line)
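
A short follow-up sketch, assuming the non_ascii, escape_unicode and regexp definitions from the snippets above, showing how the two steps might be chained:

# Chain the two steps: escape non-ASCII characters first, then strip any
# remaining control characters.
escaped = non_ascii.sub(escape_unicode, line)
print regexp.sub('', escaped)
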
夢归不見 2024-08-17 10:20:18

When reading from a file in Python you're getting byte strings, aka "str" in Python 2.x and earlier. You need to convert these to the "unicode" type using the decode method. eg:

line = line.decode('latin1')

Replace 'latin1' with the correct encoding.
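
Applied to the question's original script, the fix might look something like this sketch ('latin1' is only a guess at the file's real encoding, as noted above):

import unicodedata, fileinput

for line in fileinput.input():
    # Decode the raw bytes to unicode first (encoding is a guess), then
    # normalize and drop whatever cannot be represented in ASCII.
    line = line.decode('latin1')
    print unicodedata.normalize('NFKD', line).encode('ascii', 'ignore')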
