How can I remove non-ASCII characters but leave periods and spaces?

Posted on 2024-12-23 17:45:41

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.

删除会话 2024-12-30 17:45:41

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator rather than a string. The correct way to get a string back is:

''.join(filter(lambda x: x in printable, s))
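
Applied to the question's use case, the same idea might look like the following Python 3 sketch. The function name keep_printable and the sample text are illustrative assumptions, not part of the original answer. Space and period are already in string.printable, so they survive the filter.

```python
import string

# Characters to keep: everything in string.printable,
# which already includes the space and the period.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

# Hypothetical sample: two accented letters and a NUL byte get dropped.
sample = "Caf\xe9 M\xfcnchen. Two words\x00."
print(keep_printable(sample).lower())
```

A set is used for the membership test because lookups in a set do not scan the whole string.printable string for each character.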

汐鸠 2024-12-30 17:45:41

An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
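
Since the question reads its text from a file, the same error handling can probably be applied at read time. The sketch below assumes Python 3 and relies on the encoding and errors parameters of the built-in open(). The helper name read_ascii_only and the temporary demo file are assumptions made for illustration.

```python
import os
import tempfile

def read_ascii_only(file_path):
    """Read a file as text, silently dropping bytes that are not ASCII."""
    # errors="ignore" discards any byte the ascii codec cannot decode.
    with open(file_path, encoding="ascii", errors="ignore") as f:
        return f.read()

# Demonstration on a temporary file holding UTF-8 encoded text.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Hej d\u00e5. Good bye.".encode("utf-8"))

demo_text = read_ascii_only(tmp.name)
os.remove(tmp.name)
print(demo_text)
```

Spaces and periods are ASCII, so they pass through unchanged. Only the two bytes that encode å are dropped.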

俯瞰星空 2024-12-30 17:45:41

According to @artfulrobot, this should be faster than filter and lambda:

import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

See more examples in the related question "Replace non-ASCII characters with a single space".
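
If control characters should be removed as well, the character class can be narrowed to printable ASCII plus newline. This is a sketch under that assumption. The names NON_PRINTABLE_ASCII and strip_non_printable and the sample string are made up for illustration.

```python
import re

# Matches anything that is not printable ASCII (space through tilde) or a newline.
NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7e\n]')

def strip_non_printable(text):
    """Remove every character matched by NON_PRINTABLE_ASCII."""
    return NON_PRINTABLE_ASCII.sub('', text)

# The accented letters, the tab and the BEL character are removed;
# spaces, periods and the newline are kept.
cleaned = strip_non_printable("na\xefve caf\xe9.\tline one\nline two\x07")
print(repr(cleaned))
```

Compiling the pattern once avoids repeating the pattern lookup when the function is called many times.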

影子的影子 2024-12-30 17:45:41

You can use the following code to remove non-ASCII characters, such as the Chinese characters in this example:

import re
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', '', text)
print(result)

This prints:

123456790 ABC#%? .()

秋千易 2024-12-30 17:45:41

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?

A greater solution would include:

a better name for the filter function than onlyascii

recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later (joining works on both Python 2 and Python 3, where filter returns an iterator):
filtered_data = ''.join(filter(filter_func, data)).lower()
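
To see the effect on newlines, here is a small Python 3 sketch that runs this kind of filter over two lines of text. The sample data, with an ellipsis character and a NUL byte added, is an assumption made for the demonstration.

```python
def filter_func(char):
    # Keep newlines and printable ASCII (space through tilde).
    return char == '\n' or 32 <= ord(char) <= 126

# Hypothetical input: an ellipsis character and a NUL byte should be dropped.
data = "This is line 1\u2026\nThis is line 2.\x00\n"

# filter() yields the kept characters one by one, so join them into a string.
filtered_data = ''.join(filter(filter_func, data)).lower()
print(repr(filtered_data))
```

Because '\n' is kept explicitly, the two lines stay separate instead of running together.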

你的呼吸 2024-12-30 17:45:41

Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 128])
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
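
On Python 3.7 or later, str.isascii() can stand in for the ord() comparison. The sketch below is a variation under that assumption. The sample string and the variable names are illustrative.

```python
# Hypothetical sample containing three non-ASCII characters.
data = "Fluent Python, 2\xaa edi\xe7\xe3o."

# str.isascii() is True for code points 0 through 127.
only_ascii = ''.join(ch for ch in data if ch.isascii())

# Keep only ASCII letters plus the space and the period.
letters_spaces_periods = ''.join(
    ch for ch in data if ch.isascii() and (ch.isalpha() or ch in ' .')
)

print(only_ascii)
print(letters_spaces_periods)
```

The isascii() check comes first in the second expression because isalpha() alone is also true for accented letters.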

笛声青案梦长安 2024-12-30 17:45:41

If you want only printable ASCII characters, you should probably change your condition to:

if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to string.printable (see the answer from @jterrace), except that it excludes the whitespace control characters ('\t', '\n', '\x0b', '\x0c' and '\r'). Note that it does not match the range in your question.
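
The difference between the two character sets can be checked directly. A minimal sketch, with variable names chosen for illustration:

```python
import string

# Printable ASCII by code point: 32 (space) through 126 (tilde).
by_range = {chr(code) for code in range(32, 127)}

# Characters that string.printable contains but the 32..126 range does not.
extra = set(string.printable) - by_range
print(sorted(extra))
```

The only characters left over are the five whitespace control characters, so the range test keeps spaces and periods but drops tabs and newlines.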

夏末的微笑 2024-12-30 17:45:41

This approach keeps only printable ASCII characters and accepts both bytes and str input:

from string import printable

def getOnlyCharacters(texts):
    # Remember whether the caller passed bytes so the same type is returned.
    is_bytes = isinstance(texts, bytes)
    if is_bytes:
        # Drop byte sequences that are not valid UTF-8.
        texts = texts.decode('utf-8', 'ignore')

    # Keep only the characters found in string.printable.
    result = ''.join(ch for ch in texts if ch in printable)

    if is_bytes:
        result = result.encode('utf-8')

    return result

text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)

print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri
