Remove non-ASCII *lines* from a file?
Is there a way I can remove non-ascii lines (not characters) from a file? So given something like this:
Line 1 (full ASCII character set)
Line 2 (contains unicode characters)
Line 3 (full ASCII)
Line 4 (contains unicode characters)
I want:
Line 1
Line 3
I know I can use iconv to remove non-ASCII characters, but I want to delete any line that contains non-ASCII characters. Is there a utility or Pythonic way to do this?
If you want to eliminate lines that contain any non-ascii characters:
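The code that accompanied this answer did not survive on this page. As a minimal sketch of the idea, assuming Python 3.7+ (for `str.isascii()`) and text already decoded to `str`, a line filter might look like the following. The helper name `keep_ascii_lines` and the in-memory sample are illustrative, not from the original answer.

```python
import io

def keep_ascii_lines(src, dst):
    """Copy lines from src to dst, skipping any line with a non-ASCII character.

    Hypothetical helper for illustration; src and dst are text file objects.
    """
    kept = 0
    for line in src:
        # str.isascii() is True only if every character is in U+0000..U+007F
        if line.isascii():
            dst.write(line)
            kept += 1
    return kept

# In-memory stand-ins for real files (sample data is made up)
sample = io.StringIO("Line 1\nLine 2 \u00e9\nLine 3\nLine 4 \u4e2d\u6587\n")
out = io.StringIO()
kept = keep_ascii_lines(sample, out)
print(out.getvalue(), end="")
```

With real files, the same function could be called on objects returned by `open(path, encoding="utf-8")`, assuming the input is valid UTF-8.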
Given a string like the following:
You may simply filter it by your criteria:
Also, you may achieve the same result by converting to unicode:
To remove empty lines, I'd use re.sub('\n+', '\n', s)
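The snippets for this answer are missing here, so the following is only a sketch of the three steps it describes, written for Python 3, where the "convert to unicode" trick becomes an attempt to encode to ASCII. The sample string `s` and the helper names are assumptions made for illustration.

```python
import re

# Made-up sample: two clean lines, two lines with non-ASCII text, one blank line
s = "Line 1\nLine 2 \u2603\n\nLine 3\nLine 4 caf\u00e9\n"

def is_ascii(line):
    # Every code point must fall in the 7-bit ASCII range
    return all(ord(ch) < 128 for ch in line)

# Step 1: filter the lines by the criterion
filtered = "\n".join(line for line in s.split("\n") if is_ascii(line))

def encodes_as_ascii(line):
    # Alternative check: a line is ASCII if it can be encoded as ASCII
    try:
        line.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

filtered_alt = "\n".join(line for line in s.split("\n") if encodes_as_ascii(line))

# Step 2: collapse runs of newlines so the blank line disappears
collapsed = re.sub(r"\n+", "\n", filtered)
print(collapsed, end="")
```

Both checks accept and reject the same lines. The encoding variant is closer in spirit to the conversion approach the answer mentions.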
.
In practice you'll want to do something with the data, and need to parse it further. If your file test looks like:
A pyparsing script would remove the bad lines like so:
Which would give as output:
The advantage over the other methods (which work fine, and answer the question) is that you now have a nice parse tree to further manipulate the data. The idea is to write a grammar, not a parser, for any task that has the potential to become more complicated than when first started.
grep -v prints all lines that don't match the pattern.
LC_ALL=C sets the locale to "C".
$'[^\t\r -~]' is a pattern that, in the C locale, means "contains a character that is not a horizontal tab, a carriage return, a space, or an ASCII glyphic character". ($'...' is a Bash notation: it's equivalent to '...', except that it processes backslash-escapes like \t and \r. [^...] is a "negative character class", meaning "any character that isn't listed in ...". Inside a character class, - matches a range: in this case, the range from space to tilde. The C locale is necessary to make sense of this "range".)
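The full command line is not shown on this page. From the pieces explained above it was presumably something like LC_ALL=C grep -v $'[^\t\r -~]' file, but that is an assumption. Since the question also asks for a Pythonic route, here is a rough Python sketch of the same character class using `re`. The function name `grep_v` and the sample lines are illustrative only.

```python
import re

# Same class as the grep pattern: anything that is NOT a tab, a carriage
# return, or a character in the range from space (0x20) to tilde (0x7E)
NOT_PLAIN_ASCII = re.compile(r"[^\t\r -~]")

def grep_v(lines):
    """Keep only lines with no character outside the allowed class.

    grep tests each line without its trailing newline, so it is stripped
    before matching here.
    """
    return [line for line in lines
            if not NOT_PLAIN_ASCII.search(line.rstrip("\n"))]

lines = ["Line 1\n", "Line 2 \u00e9\n", "tab\tseparated\n", "bell\x07\n"]
kept = grep_v(lines)
print("".join(kept), end="")
```

Note that this class is stricter than "is ASCII": a line containing an ASCII control character other than tab or carriage return (the bell character in the sample) is dropped as well.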