Remove non-ASCII lines from a file?

Posted 2024-12-29 08:43:36

Is there a way I can remove non-ascii lines (not characters) from a file? So given something like this:

Line 1 (full ASCII character set)
Line 2 (contains unicode characters)
Line 3 (full ASCII)
Line 4 (contains unicode characters)

I want:

Line 1
Line 3

I know I can use iconv to remove non-ASCII characters, but I want to delete any line that contains non-ASCII characters. Is there a utility or a Pythonic way to do this?

帅的被狗咬 2025-01-05 08:43:36

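If you want to eliminate lines that contain any non-ASCII characters:

def ascii_lines(iterable):
    """Yield only the lines made up entirely of ASCII characters."""
    for line in iterable:
        if all(ord(ch) < 128 for ch in line):
            yield line

# Assumes a UTF-8 text file; change the encoding to match your data.
with open('somefile.txt', encoding='utf-8') as f:
    for line in ascii_lines(f):
        print(line, end='')  # each line already ends with a newline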
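
On Python 3.7 and later, str.isascii() can stand in for the per-character ord() check. The following is a minimal sketch under that assumption; the helper name filter_ascii_lines and the in-memory sample data are illustrative and not part of the original answer.

```python
import io

def filter_ascii_lines(src, dst):
    """Copy only the pure-ASCII lines from src to dst; return how many were kept."""
    kept = 0
    for line in src:
        # str.isascii() (Python 3.7+) is True when every character is below 128.
        if line.isascii():
            dst.write(line)
            kept += 1
    return kept

# In-memory stand-ins for real files (hypothetical sample data).
src = io.StringIO("Line 1\nLine 2 \u00e9\nLine 3\nLine 4 \u4e2d\n")
dst = io.StringIO()
kept = filter_ascii_lines(src, dst)
print(dst.getvalue(), end="")
```

With real files, the same function can be passed two objects returned by open(), one opened for reading and one for writing, each with an explicit encoding.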


肤浅与狂妄 2025-01-05 08:43:36

Given a string like this (a Python 2 session, where s is a byte string):

>>> s = "asd\n\xaa\xfa\xaf\nqwe"
>>> print s
asd
╙З╞
qwe

You can simply filter it by your criterion:

>>> s = filter(lambda x: ord(x) < 128, s)
>>> s
'asd\n\nqwe'
>>> print s
asd

qwe

You can also get the same result by decoding the string as ASCII and ignoring errors:

>>> str(s.decode('ascii', 'ignore'))
'asd\n\nqwe'

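Note that both approaches strip the non-ASCII characters and leave an empty line behind, rather than deleting the line itself. To remove the empty lines I'd use re.sub('\n+', '\n', s).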
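
The session above relies on Python 2 byte strings. A rough Python 3 equivalent is sketched below, assuming the data is held as bytes; the variable names are illustrative. Like the original, it removes characters rather than whole lines, so the emptied line is collapsed afterwards.

```python
import re

data = b"asd\n\xaa\xfa\xaf\nqwe"

# Drop every byte outside the ASCII range (iterating over bytes yields ints).
filtered = bytes(b for b in data if b < 128)

# Same result via decoding: bytes that are not valid ASCII are skipped.
decoded = data.decode("ascii", "ignore")

# Collapse the runs of newlines left behind by emptied lines.
collapsed = re.sub(r"\n+", "\n", decoded)

print(filtered)
print(repr(collapsed))
```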


红玫瑰 2025-01-05 08:43:36

In practice you'll want to do something with the data and will need to parse it further. If your file, named test, looks like this:

http://example.com dog
http://example.com/å%20ä%20ö/ foo
http://google.com bar

a pyparsing script would remove the bad lines like so:

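from pyparsing import *

# Treat only spaces and tabs as skippable whitespace so newlines stay significant.
ParserElement.setDefaultWhitespaceChars(" \t")
EOL = LineEnd()
# All non-space printable ASCII characters (codes 33 to 126).
ascii_chars = ''.join(chr(x) for x in range(33, 127))
words = Word(ascii_chars)
good_line = Group(ZeroOrMore(words) + EOL)
bad_line = SkipTo(EOL, include=True)

# Keep lines made only of ASCII words; suppress everything else.
blocks = good_line | bad_line.suppress()
grammar = ZeroOrMore(blocks) + StringEnd()

P = grammar.parseFile("test")
print(P)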

Which would give as output:

[['http://example.com', 'dog', '\n'], ['http://google.com', 'bar']]

The advantage over the other methods (which work fine and answer the question) is that you now have a nice parse tree for manipulating the data further. The idea is to write a grammar, not a parser, for any task that could become more complicated than it was when you started.


爱的十字路口 2025-01-05 08:43:36

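# fin and fout are assumed to be already-open input and output files.
for line in fin:
    try:
        line.encode('ascii')  # raises if the line has non-ASCII characters
    except UnicodeError:      # covers both encode and decode errors
        continue
    fout.write(line)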
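
When the file's encoding is unknown or mixed, reading it in binary mode avoids decode errors before the check even runs. The following is a minimal sketch of the same try/except idea, assuming binary streams; the function name copy_ascii_lines and the sample data are hypothetical.

```python
import io

def copy_ascii_lines(fin, fout):
    """Copy lines from a binary stream, skipping any that are not pure ASCII."""
    for raw in fin:
        try:
            raw.decode("ascii")  # raises UnicodeDecodeError on any byte >= 0x80
        except UnicodeDecodeError:
            continue
        fout.write(raw)

# In-memory stand-ins for files opened with mode 'rb' and 'wb'.
fin = io.BytesIO("plain\ncaf\u00e9\nalso plain\n".encode("utf-8"))
fout = io.BytesIO()
copy_ascii_lines(fin, fout)
print(fout.getvalue())
```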

戏蝶舞 2025-01-05 08:43:36

LC_ALL=C grep -v $'[^\t\r -~]'

grep -v prints all lines that don't match the pattern. LC_ALL=C sets the locale to "C". $'[^\t\r -~]' is a pattern that, in the C locale, means "contains a character that is not a horizontal tab, a carriage return, a space, or a visible ASCII character". ($'...' is a Bash notation: it's equivalent to '...', except that it processes backslash escapes like \t and \r. [^...] is a "negative character class", meaning "any character that isn't listed in ...". Inside a character class, - denotes a range: in this case, the range from space to tilde. The C locale is necessary to make sense of this "range".)
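
For comparison, the same character class can be applied from Python with a bytes regular expression. This is only a sketch of the idea, not a replacement for grep; the names NON_PLAIN and grep_v and the sample data are made up for illustration. Note that the class is stricter than a plain "below 128" test, since it also rejects control characters other than tab and carriage return.

```python
import re

# Same character class as the grep pattern: anything other than tab,
# carriage return, or the range from space to tilde. The data is split
# on "\n" first, as grep does, so line feeds never reach the pattern.
NON_PLAIN = re.compile(rb"[^\t\r -~]")

def grep_v(data):
    """Return the lines of data (bytes) that contain no match for NON_PLAIN."""
    return [line for line in data.split(b"\n") if not NON_PLAIN.search(line)]

# Hypothetical sample: a tab is allowed, a UTF-8 letter and a bell are not.
sample = "ok line\ntab\there\nna\u00efve\nbell\x07\nlast".encode("utf-8")
print(grep_v(sample))
```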