How can I remove non-ASCII characters but leave periods and spaces?
I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
You can filter all characters from the string that are not printable using string.printable, like this:
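A minimal sketch of this approach, assuming Python 3. The helper name keep_printable and the sample string are illustrative and not taken from the answer:

```python
import string

# A set makes the membership test cheap; string.printable itself is a plain str.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return "".join(filter(lambda ch: ch in PRINTABLE, text))

sample = "Caf\u00e9 au lait. \u4f60\u597d!"
cleaned = keep_printable(sample)
print(cleaned)
```

Spaces and periods are both part of string.printable, so they survive the filter.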
string.printable on my machine contains:
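The listing is not reproduced here, but in CPython string.printable is built from four other constants of the string module, so its contents can be inspected as in this short sketch (the variable name printable_parts is illustrative):

```python
import string

# string.printable is the concatenation of these four constants, in this order.
printable_parts = (
    string.digits
    + string.ascii_letters
    + string.punctuation
    + string.whitespace
)

print(repr(string.printable))
print(len(string.printable))
```

Note that string.whitespace contributes tab, newline, carriage return, vertical tab and form feed in addition to the space character.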
EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:
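Applied to the question's get_my_string, a Python 3 version might look like the sketch below. The encoding parameter and its UTF-8 default are assumptions, since the question does not say how the file is encoded:

```python
import string

def get_my_string(file_path, encoding="utf-8"):
    """Read a text file and return its printable characters, lowercased."""
    with open(file_path, "r", encoding=encoding) as f:
        data = f.read()
    # On Python 3, filter() returns a lazy iterator of characters,
    # so the pieces have to be joined back into a str.
    filtered_data = "".join(filter(lambda ch: ch in string.printable, data))
    return filtered_data.lower()
```

Calling .lower() directly on the result of filter() would raise AttributeError on Python 3, which is why the join comes first.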
An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
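A minimal sketch of the round trip on Python 3, where the sample sentence is illustrative. The "ignore" error handler silently drops every character the ASCII codec cannot represent:

```python
text = "Hallå världen."

# str -> bytes (dropping non-ASCII characters) -> str
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)  # Hall vrlden.
```

Spaces and periods are ASCII, so they pass through unchanged.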
Edit:
Python3: str -> bytes -> str
Python2: unicode -> str -> unicode
Python2: str -> unicode -> str (decode and encode in reverse order)
According to @artfulrobot, this should be faster than filter and lambda:
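The answer's code is not shown here. A regular-expression sketch in that spirit could look like this, where the pattern and the name strip_non_ascii are assumptions rather than the answer's exact code:

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def strip_non_ascii(text):
    """Remove every non-ASCII character from text."""
    return NON_ASCII.sub("", text)

print(strip_non_ascii("na\u00efve caf\u00e9. ok"))
```

Whether this is actually faster than filter with a lambda depends on the input, so it is worth timing on real data.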
See more examples in the question "Replace non-ASCII characters with a single space".
You may use the following code to remove non-English letters:
This will return
Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ but includes several others e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be
'this is line 1this is line 2'
... is that what you really want?
A greater solution would include:
1. a better name for the filter function than onlyascii
2. recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
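A sketch of such a predicate on Python 3. The name is_wanted and the decision to keep newlines are assumptions made for illustration:

```python
def is_wanted(char):
    """Return True for characters to keep: space through tilde, plus newline."""
    return char == "\n" or 32 <= ord(char) <= 126

data = "This is line 1\nCaf\u00e9 is line 2."
# filter() keeps only the characters for which is_wanted returns a truthy value.
filtered_data = "".join(filter(is_wanted, data)).lower()
print(filtered_data)
```

Returning a boolean makes the intent clearer than returning either the character or an empty string, and keeping the newline preserves the line structure of the file.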
Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
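The one-liners themselves are not shown here. Two comprehension-based sketches in that style, with illustrative variable names and sample data, might be:

```python
import string

data = "Fluent Python, 2\u00aa edi\u00e7\u00e3o."

# Keep every ASCII character (code points 0 through 127).
only_ascii = "".join([ch for ch in data if ord(ch) < 128])

# Keep only an explicit whitelist: letters, digits, space and period.
allowed = set(string.ascii_letters + string.digits + " .")
only_allowed = "".join([ch for ch in data if ch in allowed])

print(only_ascii)
print(only_allowed)
```

The whitelist form makes it easy to say exactly which punctuation should survive, which is what the question asks for.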
If you want printable ASCII characters, you should probably correct your code to:
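A sketch of the corrected function, assuming the fix is to widen the range to code points 32 (space) through 126 (tilde). The set named kept is only there to show which characters pass:

```python
import string

def onlyascii(char):
    # Keep printable ASCII only: space (32) through tilde (126).
    if ord(char) < 32 or ord(char) > 126:
        return ''
    return char

# Every ASCII character that the function lets through.
kept = {chr(i) for i in range(128) if onlyascii(chr(i))}
print(len(kept))  # 95
```

Both the space (32) and the period (46) now fall inside the accepted range.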
This is equivalent to string.printable (see the answer from @jterrace), except for the absence of tabs, newlines and returns ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn't correspond to the range in your question.
This is the best way to get ASCII characters with clean code; it checks for all possible errors.