Remove non-ASCII *lines* from a file?

Posted on 2024-12-29 08:43:36

Is there a way I can remove non-ascii lines (not characters) from a file? So given something like this:

Line 1 (full ASCII character set)
Line 2 (contains unicode characters)
Line 3 (full ASCII)
Line 4 (contains unicode characters)

I want:

Line 1
Line 3

I know I can use iconv to remove non-ASCII characters, but I want to delete any line that contains non-ASCII characters. Is there a utility or Pythonic way to do this?

Comments (5)

帅的被狗咬 2025-01-05 08:43:36

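If you want to eliminate lines that contain any non-ASCII characters:

def ascii_lines(iterable):
    # Yield only the lines whose characters are all ASCII (code point < 128).
    for line in iterable:
        if all(ord(ch) < 128 for ch in line):
            yield line

# Python 3 syntax; the utf-8 encoding is an assumption about the input file.
with open('somefile.txt', encoding='utf-8') as f:
    for line in ascii_lines(f):
        print(line, end='')  # each line already ends with a newline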
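
On Python 3.7 or later, the same filter could be written with str.isascii() and applied to a whole file. This is only a minimal sketch: the names ascii_only_lines and filter_file are made up for illustration, and the utf-8 input encoding is an assumption about the data.

```python
def ascii_only_lines(lines):
    """Yield the lines that consist solely of ASCII characters."""
    for line in lines:
        if line.isascii():  # str.isascii() requires Python 3.7+
            yield line


def filter_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, dropping every line with a non-ASCII character.

    The default input encoding is an assumption; pass the real one if it differs.
    """
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding="ascii") as dst:
        dst.writelines(ascii_only_lines(src))
```

Calling filter_file('somefile.txt', 'ascii_only.txt') would then write only the all-ASCII lines to the second file (both file names are placeholders).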
肤浅与狂妄 2025-01-05 08:43:36

Given a string like this (Python 2, where str is a byte string):

>>> s = "asd\n\xaa\xfa\xaf\nqwe"
>>> print s
asd
╙З╞
qwe

You can simply filter out the characters that fail your criterion:

>>> s = filter(lambda x: ord(x) < 128, s)
>>> s
'asd\n\nqwe'
>>> print s
asd

qwe

You can also get the same result by decoding the string as ASCII and ignoring errors:

>>> str(s.decode('ascii', 'ignore'))
'asd\n\nqwe'

To remove the empty lines this leaves behind, I'd use re.sub('\n+', '\n', s).
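
The filtering above strips individual characters and leaves an empty line where the non-ASCII content was. To drop the whole line instead, the strict ASCII decode can be applied line by line. A minimal Python 3 sketch working on bytes; the function name drop_non_ascii_lines is an assumption for illustration:

```python
def drop_non_ascii_lines(data: bytes) -> bytes:
    """Return data without the lines that contain any byte >= 0x80."""
    kept = []
    for line in data.splitlines(keepends=True):
        try:
            line.decode("ascii")  # strict decoding raises on any non-ASCII byte
        except UnicodeDecodeError:
            continue  # skip the whole line, not just the offending bytes
        kept.append(line)
    return b"".join(kept)


print(drop_non_ascii_lines(b"asd\n\xaa\xfa\xaf\nqwe"))  # prints b'asd\nqwe'
```

Because whole lines are skipped, no blank lines are left over and the re.sub cleanup is not needed.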

红玫瑰 2025-01-05 08:43:36

In practice you'll want to do something with the data, and you'll need to parse it further. If your file, named test, looks like this:

http://example.com dog
http://example.com/å%20ä%20ö/ foo
http://google.com bar

a pyparsing script would remove the bad lines like so:

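from pyparsing import *

# Treat only spaces and tabs as skippable whitespace so newlines stay significant.
ParserElement.setDefaultWhitespaceChars(" \t")
EOL = LineEnd()
# Printable, non-space ASCII characters (Python 3: chr/range replace unichr/xrange).
ascii  = ''.join(chr(x) for x in range(33, 127))
words  = Word(ascii)
good_line = Group(ZeroOrMore(words) + EOL)
bad_line  = SkipTo(EOL, include=True)

# A line is either fully ASCII (kept) or skipped to its end and suppressed.
blocks = good_line | bad_line.suppress()
grammar = ZeroOrMore(blocks) + StringEnd()

P = grammar.parseFile("test")
print(P)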

Which would give as output:

[['http://example.com', 'dog', '\n'], ['http://google.com', 'bar']]

The advantage over the other methods (which work fine and answer the question) is that you now have a nice parse tree for manipulating the data further. The idea is to write a grammar, not a parser, for any task that has the potential to become more complicated than it was when first started.

爱的十字路口 2025-01-05 08:43:36
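for line in fin:
    try:
        # Fails if the line contains any non-ASCII character.
        fout.write(line.encode('ASCII'))
    except UnicodeError:
        # UnicodeError covers both UnicodeEncodeError (text input) and
        # UnicodeDecodeError (Python 2 byte strings), so bad lines are skipped.
        pass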
戏蝶舞 2025-01-05 08:43:36
LC_ALL=C grep -v $'[^\t\r -~]'

grep -v prints all lines that don't match the pattern. LC_ALL=C sets the locale to "C". $'[^\t\r -~]' is a pattern that, in the C locale, means "contains a character that is not a horizontal tab, a carriage return, a space, or an ASCII glyphic character". ($'...' is a Bash notation: it's equivalent to '...', except that it processes backslash-escapes like \t and \r. [^...] is a "negative character class", meaning "any character that isn't listed in ...". Inside a character class, - matches a range: in this case, the range from space to tilde. The C locale is necessary to make sense of this "range".)
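
For comparison, the same character class can be used from Python with a bytes regular expression, which sidesteps locale issues because each byte is matched as-is. This is a rough sketch rather than an exact grep replacement; the names NOT_PLAIN_ASCII and grep_v are assumptions for illustration:

```python
import re

# Same class as the grep pattern: any byte that is not a tab, a carriage
# return, or in the range from space (0x20) to tilde (0x7E).
NOT_PLAIN_ASCII = re.compile(rb"[^\t\r -~]")


def grep_v(lines):
    """Yield the byte lines that contain no byte matched by NOT_PLAIN_ASCII."""
    for line in lines:
        body = line.rstrip(b"\n")  # like grep, test the line without its newline
        if not NOT_PLAIN_ASCII.search(body):
            yield line
```

Feeding it a file opened in binary mode, for example `for line in grep_v(open('somefile.txt', 'rb'))`, keeps the surviving lines byte for byte. Like the grep pattern, it also drops lines containing ASCII control characters other than tab and carriage return.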
